Research Ideas and Outcomes :
Research Article
Corresponding author: Anne Fouilloux (annef@simula.no)
Academic editor: Francisco Andres Rivera Quiroz
Received: 28 Jun 2023 | Accepted: 08 Aug 2023 | Published: 05 Sep 2023
© 2023 Anne Fouilloux, Elisa Trasatti, Federica Foglini, Alejandro Coca-Castro, Jean Iaquinta
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Fouilloux A, Trasatti E, Foglini F, Coca-Castro A, Iaquinta J (2023) FAIR Research Objects for realising Open Science with the EOSC project RELIANCE. Research Ideas and Outcomes 9: e108765. https://doi.org/10.3897/rio.9.e108765
The numerous benefits of Open Science (OS) and of the four FAIR foundational principles - Findable, Accessible, Interoperable and Reusable - are increasingly valued in academia, although what OS and FAIR entail is still largely misunderstood. In such conditions, putting into practice OS and applying the FAIR principles is challenging and underrated. However, realising OS is perfectly within our grasp provided that an infrastructure supporting the management of the research lifecycle is available. ROHub (https://www.rohub.org/) is a Research Object (RO) management platform implementing three complementary technologies: Research Objects, Data Cubes and Text Mining services. ROHub enables researchers to collaboratively manage, share and preserve their research while they are still working on it (rather than after the work is finished). In this paper, three communities from Earth Sciences, namely Geohazards, Sea Monitoring and Climate Change, demonstrate how ROHub helped them to understand each other and to work openly and, more importantly, how communities of practice play an important role in facilitating reuse and interdisciplinary collaboration. These findings are illustrated with several use cases from these various communities.
research object, reproducibility, replicability, reusability, interdisciplinary, open science practices, environmental sciences
Open Science (OS) emphasises collaboration, transparency and sharing of ideas, data, software, workflows and methods (
This paper extends what was presented at the 1st international conference on FAIR digital Objects (
The RELIANCE project (REsearch LIfecycle mAnagemeNt for Earth Science Communities and CopErnicus users in EOSC) delivers a suite of innovative and interconnected services that extend European Open Science Cloud (EOSC)’s capabilities to support the management of the research lifecycle within Earth Science Communities and Copernicus Users. The project has delivered three complementary technologies: Research Objects (ROs), Data Cubes and AI-based Text Mining.
ROHub (https://www.rohub.org/) is a Research Object management platform that implements these three technologies: it has been developed to enable researchers to collaboratively manage, share and preserve their research work. ROHub implements the full RO model and paradigm: resources associated to a particular research work are aggregated into a single digital entity (the Research Object) and metadata relevant for understanding and interpreting the content is represented as semantic metadata that is user and machine readable.
By using ROHub, practitioners can ensure that their research work is well-organised and easily accessible to collaborators, while also being preserved for future use. The fact that ROHub implements the RO model and paradigm is especially significant, since this means that the platform is designed to meet the highest standards of data management and sharing. The use of contextual metadata is also a valuable feature, as it ensures that important contextual information about the research work is well preserved and can be easily understood by both humans and machines. Overall, ROHub is a valuable tool for anyone looking to improve data management and sharing practices and, more generally, to work according to Open Science principles.
RO-Crate (
RO-Crate enables the inclusion of additional metadata fields and the use of different metadata standards, depending on the requirements of the project. RO-Crate is also designed to be compatible with existing data and metadata standards, making it easy to integrate with repositories.
The benefits of using RO-Crate and Research Objects in general are, amongst others, increased transparency and reproducibility in research, improved data management and sharing and the ability to more easily reuse and build upon existing research. By providing a standardised format for creating and sharing research objects, RO-Crate can facilitate new collaborations, data reuse and knowledge discovery, leading to more efficient and effective scientific research practices.
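To make the packaging idea more concrete, the following is a minimal sketch of how an RO-Crate metadata document could be assembled, assuming the RO-Crate 1.1 layout (a JSON-LD file named ro-crate-metadata.json with a metadata descriptor and a root dataset). The helper name build_ro_crate_metadata, the crate title, the file paths, the date and the licence are hypothetical placeholders chosen for illustration; this is not how ROHub itself generates its metadata.

```python
import json


def build_ro_crate_metadata(name, description, date_published, license_url, files):
    """Return a minimal RO-Crate 1.1 style metadata document as a dict.

    `files` maps a relative path inside the crate to a human-readable label.
    All values passed in are illustrative placeholders.
    """
    graph = [
        # Metadata descriptor: identifies this JSON file and the root dataset.
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        # Root data entity: the research object as a whole.
        {
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "datePublished": date_published,
            "license": {"@id": license_url},
            "hasPart": [{"@id": path} for path in files],
        },
    ]
    # One data entity per aggregated resource.
    for path, label in files.items():
        graph.append({"@id": path, "@type": "File", "name": label})
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


# Hypothetical example crate with two aggregated resources.
metadata = build_ro_crate_metadata(
    name="Example executable research object",
    description="Illustrative crate aggregating a notebook and its input data.",
    date_published="2023-01-01",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    files={
        "tool/analysis.ipynb": "Analysis notebook",
        "input/sample.csv": "Sample input dataset",
    },
)
print(json.dumps(metadata, indent=2))
```

Because the metadata is plain JSON-LD, additional properties or vocabularies can be added to the entities without changing the overall structure, which is the extensibility described above.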
RO-Crate enables a high degree of interoperability within ROHub. Nevertheless, various disciplines have evolved their own procedures and description standards and the concept of FAIR Digital Objects (FDOs) has emerged. FDOs are independent from the metadata descriptions, allowing them to include various description standards. RO-Crate can be seen as one possible implementation of FDOs if used along with FAIR signposting (
An RO in ROHub commonly begins its life as an empty "Live RO". ROs aggregate new objects through their whole life-cycle (
In ROHub, the state of an RO can be preserved through snapshots, which reflect its status at a given point in time: the "original" RO is still available and can continue to evolve. Snapshots can have their own Digital Object Identifiers (DOIs), which facilitates tracking the evolution of the research. Eventually, an RO in ROHub can be published and archived (so-called "Archived RO") with a permanent identifier (DOI): it then becomes immutable. In ROHub, new Live ROs can be derived from an existing Archived RO, for instance, by forking it. Many ROs cited in this paper have not yet been archived because the associated research work is still ongoing and not yet published: ROHub and ROs are supporting FAIR and Open Science practices.
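The life-cycle described above (Live RO, snapshot, Archived RO, fork) can be illustrated with a small data model. This is only a conceptual sketch: the class ResearchObject, its methods and the resource paths are hypothetical names introduced here, not the ROHub API, and the sketch assumes that snapshots are frozen copies and that archiving freezes the RO in place.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    LIVE = "Live RO"
    SNAPSHOT = "Snapshot"
    ARCHIVED = "Archived RO"


@dataclass
class ResearchObject:
    """Toy model of an RO; not the ROHub data model."""

    title: str
    status: Status = Status.LIVE
    resources: List[str] = field(default_factory=list)
    derived_from: Optional["ResearchObject"] = None

    def aggregate(self, resource: str) -> None:
        # Only Live ROs keep aggregating new resources.
        if self.status is not Status.LIVE:
            raise ValueError(f"a {self.status.value} is immutable")
        self.resources.append(resource)

    def snapshot(self) -> "ResearchObject":
        # Frozen copy of the current state; the original keeps evolving.
        return ResearchObject(self.title, Status.SNAPSHOT, list(self.resources), self)

    def archive(self) -> None:
        # Assumption: archiving freezes this RO in place.
        self.status = Status.ARCHIVED

    def fork(self, title: str) -> "ResearchObject":
        # A fork is a new Live RO that remembers where it came from.
        return ResearchObject(title, Status.LIVE, list(self.resources), self)


live = ResearchObject("Hypothetical air quality study")
live.aggregate("tool/notebook.ipynb")
snap = live.snapshot()                  # state at this point in time
live.aggregate("output/figure.png")     # the Live RO continues to evolve
live.archive()                          # now immutable
fork = live.fork("Hypothetical derived study")
fork.aggregate("input/new-data.csv")    # the fork is live again
```

In this sketch, the snapshot still lists only the notebook, whereas the archived RO lists both resources and rejects further additions.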
To guide researchers, different types of Research Objects can be created from templates in ROHub:
Bibliography-centric: includes manuals, anonymous interviews, publications, multimedia (video, audio) and/or other material that support research.
Data-centric: refers to datasets which can be indexed, discovered and manipulated. Data cubes are particular data-centric ROs that can be discovered with data cube services such as the ADAM platform (
Executable: includes the code, data and computational environment along with a description of the research object and, in some cases, a workflow. This type of ROs can be executed via specific services and is often used for scripts and/or Jupyter notebooks.
Software-centric: also known as “Code as a Research Object”. Software-centric ROs include source codes and associated documentation. They often contain sample datasets for running tests.
Workflow-centric: contains workflow specifications, provenance logs generated when executing the workflows, information about the evolution of the workflow (version) and its components/elements and additional annotations for the entire workflow.
Basic: can contain anything and is used when the other types do not fully cover the creator's need.
To facilitate the understanding and the reuse of the ROs in ROHub, each of these types of ROs (except Basic RO) has a template folder structure that we recommend researchers use. For instance, an executable RO in ROHub has four folders:
"biblio": where researchers can aggregate documentations, scientific papers that support the development of the software/tool that is in the tool folder;
"input": where all the input datasets required for executing or reusing the RO are aggregated;
"output": where some or all the results generated by executing the RO are aggregated;
"tool": where the executable tool is aggregated. Typically, one aggregates a Jupyter notebook and/or executed workflows (Galaxy, Snakemake or Common Workflow Language workflows).
In addition to the different types of ROs and associated template structures, researchers can select the type of resources that constitutes the main entity of their RO: for instance, a Jupyter notebook can be selected as the main entity of an executable RO. As shown on Fig.
Example of executable Research Object with a Jupyter notebook as a main resource. DOI: https://doi.org/10.24424/pf69-pg61.
The general overview of any type of Research Object is always the same, with mandatory metadata information such as the title, description, authors and collaborators, sketch (featured plots/images) and the content of the RO (with different structures depending on the type of RO). Additional information is displayed on the right panel, such as number of downloads, additional discovered metadata (automatically extracted from the text content of ROs by the RELIANCE text enrichment services), free keywords (added by the end-users) and citation. Regarding the text mining feature, an additional tab called "Enrichment" has been added to provide more comprehensive information. This additional feature has been requested by end-users. It is still under development and the information presented is sometimes difficult for newcomers to grasp, but it is nonetheless helpful for cross-disciplinary research. The toolbox and share sections allow end-users to download, snapshot and archive a given RO and/or share it. All the ROs in ROHub are digital objects that are FAIR and, for instance, findable in OpenAIRE Explore, including Live ROs.
The development of ROHub has been ongoing for several years (
Basic ROs are intended for cases where none of the other types of ROs in ROHub is fit for purpose or where only a small number of resources are to be aggregated. One common usage of Basic ROs is for aggregating videos and presentations delivered during conferences, workshops or other events. For example, the basic RO "AGU 2022 - Environmental Data Science Book: a community-driven resource showcasing open-source Environmental science" (
An example of Bibliography-centric RO is displayed in Fig.
Bibliographical Research Object entitled "Virunga Volcanoes Supersite Biennial Report: 2020- 2021" and containing a detailed report by INGV from the Virunga Volcano Supersite. This RO has a permanent identifier: https://w3id.org/ro-id/45841548-0362-4aea-80f2-ea71d81a691f.
Data-centric ROs are used to create FAIR datacubes (
By default, a data-centric RO would contain the following folders:
EU FAR - EU Funds by Area Results (
In the RELIANCE project, the concept of FAIR datacube (
Data-centric Research Object with datacube collection from the Copernicus Atmosphere Monitoring Service (CAMS) European air quality forecasts (
The RELIANCE ADAM platform has been integrated in ROHub, which simplifies the creation of datacubes in ROHub: all the metadata are automatically extracted and added to the RO. It is possible to add any type of datacube in ROHub, but at the moment, all the necessary metadata would need to be created manually, which makes it difficult for end-users. While this limitation could be lifted in the future, there would still be a need for users to create data-centric ROs with datasets they generated or derived from datacubes or other datasets and that may not be stored as datacubes (typically vector data).
As part of the RELIANCE project, a collaboration with the Norwegian Infrastructure for Research Data (NIRD) and the Polytechnic University of Madrid (UPM) has been established. UPM automatically created data-centric ROs from the datasets already stored in the NIRD archive (
Software-centric ROs were initially created for researchers to share software, for instance, Python packages, such as the "Volcanic and Seismic source Modelling (VSM)" (
Workflow-centric ROs allow the storage and sharing of the "process" used by researchers. This can be either an automated workflow using a Workflow Management System (Galaxy, Cylc, Snakemake, Nextflow etc.) or a simple script or text file detailing the list (and order) of tasks that need to be executed to reproduce the research results. For instance, Galaxy (
Executable ROs are very similar to workflow-centric ROs and, actually, many users consider them interchangeable. However, in that case, the workflow is executed on real datasets and not on a sample/test dataset; that is, the actual research outputs can be fully reproducible and reusable. In the section below, examples are provided for Galaxy workflows and interactive Jupyter notebooks. We then discuss the need for best practices when writing Jupyter notebooks to improve their re-usability beyond the state-of-the-art.
Another very "common" usage of executable ROs in ROHub is for curating computational notebooks where the main resource is simply a Jupyter notebook. Such Jupyter notebooks are widespread in many scientific disciplines and, in particular, among Earth Scientists. JupyterHub and/or Binder are often used by researchers to highlight the reproducibility of their work or part of it. The Binder Project (
Being able to re-execute a complex workflow is very important, for instance, to automate a repetitive pipeline relying on daily weather forecasts or as the basis for deriving new research work. The description of a workflow used in any standard Workflow Management System is often insufficient to understand how to reuse it. Examples of real-life workflow executions, with their inputs and the corresponding generated outputs, as well as links to documentation, papers or tutorials, are useful for end-users. An executable RO can be created to gather all the information related to the execution of a computational workflow. When using the Galaxy platform, Galaxy tools and workflows are fully annotated (
Along the same lines, executable ROs can be used to exemplify the usage of a given tool: for instance, "Galaxy CESM Tool Example" (
The integration of EGI notebook and EGI Binder in ROHub significantly increases the re-usability of an executable RO, in particular, Jupyter notebooks. The executable RO "Changes in air and water quality during the Covid-19 Lockdown in the Venice Lagoon" (
Reusing Jupyter notebooks for cross-disciplinary research is often challenging, but this becomes much easier with ROHub. First, the text mining enrichment service can help users to find relevant ROs. Second, the integration with EGI notebook and EGI Binder allows users to replay Jupyter notebooks from ROs (one simply has to right-click on the Jupyter notebook resource to be re-directed to the EGI notebook or Binder service). By default, users are redirected to the EGI notebook service where the user can select one of the available computational environments to execute the notebook. However, if the EGI notebook has been upgraded after the notebook's creation, there is no guarantee that the execution will be successful. To improve the "long-term" reproducibility, users can associate a customised computational environment with the notebook: when the RO contains a computational environment (such as Pip's requirements.txt or Conda's environment.yaml) that is linked to the notebook*
Reproducibility is the first and necessary step to build beyond the state-of-the-art (as well as proper licences, such as MIT licences). Then both communities started to work together and investigated the creation of a combined use case where both points of view, that is, atmospheric air quality and water quality, would be investigated over the Venice Lagoon: a new notebook was then derived. All team members described this step as much smoother than usual, thanks to ROHub and its integration with EGI notebook. Furthermore, data were already shared from the two original ROs; therefore, downloading was not an issue either.
The previous example (
The Environmental Data Science Book (EDS Book,
The quality of the published content is achieved by an open review policy supported by GitHub related technologies. Beyond the reproducibility that is ensured at the publication stage, the EDS book facilitates reuse. Let us take a popular notebook example from the EDS book: Fig.
Several of the ROs created and curated by the EDS book community have been reused. Overall, the feedback from the environmental science community is very positive; however, the need for understanding a specific programming language (Python, Julia, R) remains. This is clearly a barrier for inter-disciplinary research because researchers do not usually know many programming languages and each scientific discipline often makes use of a particular programming language. For instance, R is widely used amongst ecologists, whereas Python is not as well-known in that community. On the other hand, the situation is reversed for climate modellers. An idea that needs to be explored is the creation of "individual" modular containers, for example, canonical workflow building blocks (
While other platforms exist, such as WorkflowHub, Aperture Neuro or BioCompute Objects, none of them was meant to accommodate the specific needs of the Earth-Science communities. At the beginning of the RELIANCE project, ROs were still mostly created when the work was finished, for example, to aggregate results produced within a research project and for publication purposes only, since some journal editors started to make it mandatory to provide supplementary material alongside published papers. Then, at best, having ROs when starting a project and/or reusing existing ROs to create derivative works was seen as "useful" by researchers. However, when ROHub began to integrate EOSC services, such as EGI datahub, EGI notebook or EGI Binder, ROs became "live FAIR digital objects" that evolve at the same pace as the research work and with little additional effort from researchers. Gradually, it became "convenient", since it was very straightforward to make data and documents available for co-workers in a single location (instead of having copies) and to share Jupyter notebooks (including not only the source code, but also outputs), so that they could get feedback on the implemented methods, interpretation of results, alternative approaches etc. There are still several features in ROHub that are not fully exploited. For instance, most ROs in ROHub have permalinks, but are never archived or snapshotted: ROs in ROHub are indexed on OpenAIRE and most end-users do not understand the potential advantages of archiving ROs and, more importantly, creating snapshots. More training (self-paced material or online videos etc.) with concrete and real-life use cases exemplifying the advantages of each of the ROHub features would be helpful.
As they improved over time, based on user feedback, the text mining services now bring more information about the Research Objects, since they can access not only purely textual documents (papers etc.), but also other metadata and, as a novelty, the source code itself within Jupyter notebooks. This makes it possible to discover ROs potentially relevant to researchers who would not have looked into them, based on "ordinary" keywords only. In addition, the derived semantic metadata can be used to deliver more accurate search results and content-based recommendations with so-called "Collaboration Spheres" (
The number of ROs increases steadily with more than 3000 ROs and 150 users (March 2023): the vast majority of these ROs (about 2000) are bibliographical resources and basic ROs that contain reports, videos or other resources that would not be easily findable otherwise. Data-centric ROs are mostly datacubes which can be easily explained by the possibility to discover datacubes with the ADAM platform: for data providers, this is clearly a way to advertise their data platform and track the usage of datasets. Collecting statistics and tracking reuse of data-centric ROs could be a way for data providers to optimise their platform and develop a more user-centric roadmap. Executable ROs are becoming more and more popular since the EGI notebooks and EGI Binder have been integrated into ROHub: these EOSC services seamlessly allow us to reproduce and reuse Jupyter notebooks that can require significant computational and storage resources.
ROHub has played a central role in the early adoption of the Open Science and FAIR principles by several Earth Sciences communities dealing with Geohazards, Sea Monitoring and Climate Change. It provided an easy-to-use and accessible infrastructure where different types of FAIR Research Objects could be created by scientists and shared with their colleagues or with the rest of the world. The way ROHub itself was used has significantly evolved between the beginning of the RELIANCE project and its end. This demonstrated a change of mindset and the realisation that the products of research could be much more than mere communications and that collaborative work promotes creativity, innovation and cross-skilling (Open Science) that can significantly improve the quality of research outputs.
In the near future, with more compute and storage resources made available (GPUs, HPCs etc.) and with, for instance, "collaborative" Jupyter notebooks (where several contributors will be able to work simultaneously on the same piece of code, as is already done on text documents), exploiting platforms like ROHub will be a no-brainer to save time and energy from original ideas, to advance science, to involve more actors in the research process and/or exploitation of research products, all the while making clearly visible everybody's actual contributions. Once that is understood, researchers will be able to contribute more "casually" to the discussion on Open Science principles and how to apply these principles to their own discipline and in their respective communities. This is where communities of practice come into play, highlighting the importance of having spaces and "venues" to discuss these best practices.
The RELIANCE (REsearch LIfecycle mAnagemeNt for Earth Science Communities and CopErnicus users in EOSC) project has received funding from the European Union’s Horizon 2020 INFRAEOSC programme under grant agreement No 101017501. Alejandro Coca-Castro's work was supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/W006022/1, particularly the “Environment and Sustainability” theme within that grant and the Alan Turing Institute.
This requires the addition of metadata "Software Requirements" as well as the corresponding computational environment file to the notebook.