Research Ideas and Outcomes: Conference Abstract
Corresponding author: Amirpasha Mozaffari (a.mozaffari@fz-juelich.de)
Received: 14 Sep 2022 | Published: 12 Oct 2022
© 2022 Amirpasha Mozaffari, Niklas Selke, Martin Schultz
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Mozaffari A, Selke N, Schultz M (2022) Advancing caching and automation with FDO. Research Ideas and Outcomes 8: e94856. https://doi.org/10.3897/rio.8.e94856
Introduction: The geosciences rely on big data that are constantly updated, modified, and extended by an ever-growing stream of new measured, modelled, and accumulated data.
Case study: TOAR database: The Tropospheric Ozone Assessment Report (TOAR) database provides harmonised surface ozone observations and derived statistical products, and serves as the use case for the caching and automation concepts described here.
Two specific challenges must be overcome in the design of an automated workflow with an FDO-enabled caching system: ensuring that a query stays connected to the correct data, and establishing a schedule for pre-calculating the most frequently used statistical aggregates. In the following, we discuss these two challenges in more detail.
Caching system: Other data providers exist in the field, but they commonly focus on archiving the measurement data. We want an analysis tool with the fastest possible response times for users. Furthermore, we want to use FDOs to preserve queries and make them reusable and traceable. For the caching itself, it is essential that the cache key created for a query allows verification that the data used when the query was first computed have not changed when the cached result is reused. In our conceptual work, we want to develop a concept and a demonstrator of an atmospheric data analysis cache. This includes choosing the underlying technical solution (e.g. PostgreSQL, MongoDB, Redis), defining data structures and hash codes, designing a mechanism for triggering re-calculations, defining a schedule for automated cache updates, and addressing various aspects of query documentation and reproducibility of results. Technical obstacles caused by the expected size of up to 0.8 Terabytes for the TOAR database, and the scalability issues that can arise from it, must be considered in any solution. Ideally, the caching system should be agnostic of the underlying database/server choice to enhance portability.
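A minimal sketch of such a cache key, assuming the query is described by a parameter dictionary and the underlying data state by a version identifier (both names are illustrative, not part of the TOAR API): the key combines a canonical form of the query with the data version, so that the same query on changed data yields a different key and the stale result is never reused.

```python
import hashlib
import json

def make_cache_key(query_params: dict, data_version: str) -> str:
    """Build a cache key that ties a query to the exact data state.

    `data_version` stands in for a hash or revision identifier of the
    underlying time series; if the data change, the key changes, so a
    stale cached result can never be returned for the new data.
    """
    # Canonicalise the query so that parameter order does not matter.
    canonical = json.dumps(query_params, sort_keys=True)
    digest = hashlib.sha256(f"{canonical}|{data_version}".encode("utf-8"))
    return digest.hexdigest()

# Identical query and identical data state -> same key (cache hit),
# even if the parameters are given in a different order.
key1 = make_cache_key({"station": "XYZ", "metric": "mean"}, "v2.3")
key2 = make_cache_key({"metric": "mean", "station": "XYZ"}, "v2.3")
assert key1 == key2

# Same query on updated data -> different key (forces re-computation).
key3 = make_cache_key({"station": "XYZ", "metric": "mean"}, "v2.4")
assert key3 != key1
```

The design choice here is that data validity is encoded in the key itself rather than checked separately at lookup time, which keeps the cache backend (PostgreSQL, MongoDB, Redis, etc.) interchangeable.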
Automated workflow: The second challenge is to combine the envisioned caching system with a flexible workflow scheme. Such a workflow setup enables pre-computing the most frequently used statistical aggregates ahead of user demand. Queries can either be triggered by a user (demand-driven) or by an automatic system that computes commonly used queries without a user having to trigger them (provider-driven), so that as many query results as possible are ready before users request them and no one has to wait for results after sending a request. User requests are categorised according to the availability of the statistical products and the required computation effort. Some results may already have been calculated and stored as FDOs, and can be quickly reloaded and processed further. Other queries may be new but still cheap enough to compute on the fly, with responses delivered on a near-real-time basis. In contrast, more intensive statistical aggregations require HPC resources. We believe an automated FDO-enabled caching system will exploit metadata and FDOs to serve on-demand data requests and reduce repetitive computation. It paves the way for intelligent computation that can be scheduled at different times of the day based on priority and resource availability, reducing the energy consumption and carbon footprint of computing.
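The three-way triage described above can be sketched as follows; the routing function, the cost estimate, and the threshold are assumptions for illustration, not part of the planned system's interface.

```python
from enum import Enum

class Route(Enum):
    CACHED = "serve pre-computed result stored as an FDO"
    ON_THE_FLY = "compute on the web service, near real time"
    HPC = "schedule as a batch job on HPC resources"

def route_request(cache: dict, key: str, estimated_cost: float,
                  hpc_threshold: float = 100.0) -> Route:
    """Illustrative triage of an incoming query.

    `estimated_cost` is an abstract effort estimate and `hpc_threshold`
    an arbitrary cut-off; a real system would derive both from query
    metadata and resource availability.
    """
    if key in cache:                     # result already stored as an FDO
        return Route.CACHED
    if estimated_cost < hpc_threshold:   # cheap enough for near real time
        return Route.ON_THE_FLY
    return Route.HPC                     # heavy aggregation -> batch queue

# A cached result is served directly; a cheap miss is computed on the
# fly; an expensive miss is deferred to HPC.
cache = {"abc123": "stored FDO"}
assert route_request(cache, "abc123", 500.0) is Route.CACHED
assert route_request(cache, "def456", 10.0) is Route.ON_THE_FLY
assert route_request(cache, "def456", 500.0) is Route.HPC
```

Provider-driven pre-computation then amounts to running the expected high-frequency queries through the same path in advance, so that subsequent user requests fall into the first branch.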
Outlook and next steps: In our conceptual work and demonstrator, we aim to use FDOs not only to ensure long-term preservation but also to create a use case where FDOs enable practical reusability in the daily operation of the database. An automated FDO-enabled caching system requires multiple components to work in concert. In the coming months, we will focus on creating a demonstrator for the caching system, adopting an FDO typing that best fits the planned tasks, and building a workflow management system that can support such a dynamic setup with interfaces to API-enabled web services, cloud computing resources, and conventional HPC resources.
Keywords: FAIR Digital Object (FDO), database, caching system, automated workflow
Presenting author: Amirpasha Mozaffari
Presented at: First International Conference on FAIR Digital Objects, presentation
Affiliation: Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich, Germany