Research Ideas and Outcomes:
Conference Abstract
Corresponding authors: Stian Soiland-Reyes (soiland-reyes@manchester.ac.uk), Laurence Livermore (l.livermore@nhm.ac.uk)
Received: 01 Sep 2022 | Published: 12 Oct 2022
© 2022 Oliver Woolland, Paul Brack, Stian Soiland-Reyes, Ben Scott, Laurence Livermore
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Woolland O, Brack P, Soiland-Reyes S, Scott B, Livermore L (2022) Incrementally building FAIR Digital Objects with Specimen Data Refinery workflows. Research Ideas and Outcomes 8: e94349. https://doi.org/10.3897/rio.8.e94349
Specimen Data Refinery (SDR) is a developing platform for automating transcription of specimens from natural history collections (
We show our recent experiences with building SDR using the Galaxy workflow system and combining two FDO methodologies with open digital specimens (openDS) and RO-Crate data packaging. We suggest FDO improvements for incremental building of digital objects in computational workflows.
SDR workflows
SDR is realised as the workflow system Galaxy (
We implemented the use case De novo digitization in Galaxy (
Draft Galaxy workflow De Novo digitization (
Galaxy can visualise outputs of each step (Fig.
We are adding workflows for partial stages, e.g. detection of regions (
We are now ready to publish digital specimens as FAIR Digital Objects, with registration into DiSSCo repositories, PID assignment and workflow provenance. However, even at this early stage we have identified several challenges that need to be addressed.
FDO lessons
We highlight the De novo use case because this workflow exchanges partial FDOs: openDS objects that are not yet complete and have not yet been assigned persistent identifiers. Because openDS schemas are still in development, SDR uses a more flexible JSON schema in which only the initial metadata (populated from CSV) are required. Each step validates the partial FDO before passing it to the underlying command line tool.
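The validate-then-run pattern described above can be illustrated with a minimal Python sketch. It is not the SDR implementation: it replaces a real JSON Schema validator with a hand-written check, and the key names (catalogNumber, imageUrl, text, entities) and the helpers validate_partial and run_step are hypothetical, since the openDS schema is still in development. Only the regions key is mentioned in the text.

```python
# Hypothetical key names: the real openDS schema is still in development.
REQUIRED_INITIAL_KEYS = {"catalogNumber", "imageUrl"}  # assumed to come from the CSV
OPTIONAL_KEY_TYPES = {"regions": list, "text": str, "entities": list}


def validate_partial(ods):
    """Return a list of problems; an empty list means the partial object is acceptable."""
    problems = []
    # Only the initial metadata is mandatory.
    for key in sorted(REQUIRED_INITIAL_KEYS - set(ods)):
        problems.append(f"missing required key: {key}")
    # Later keys may be absent, but if present they must have the expected type.
    for key, expected in OPTIONAL_KEY_TYPES.items():
        if key in ods and not isinstance(ods[key], expected):
            problems.append(f"{key} should be of type {expected.__name__}")
    return problems


def run_step(ods, tool):
    """Validate the partial object, then hand it to the wrapped tool (any callable)."""
    problems = validate_partial(ods)
    if problems:
        raise ValueError("; ".join(problems))
    return tool(ods)
```

A step wrapper would call run_step with the incoming object and the function that invokes the command line tool, so that malformed partial objects are rejected before any expensive processing starts.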
Although workflow steps exchange openDS objects, they cannot be combined in an arbitrary order. For instance, named entity recognition requires digitised text in the FDO. We can consider these intermediate steps as sub-profiles of an FDO Type. Unlike hierarchical subclasses, these FDO profiles are more like duck typing. For instance, a text detection step may only require the regions key, but semantically there is no requirement for an OpenDSWithText to be a subclass of OpenDSWithRegion, as text can also be transcribed manually without regions.
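One way to picture duck-typed profiles is to describe each step by the keys it requires and the keys it provides, and then check a proposed ordering against those sets. The sketch below is an illustration under assumptions, not an FDO or openDS mechanism: the Step class, check_order and all key names other than regions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    requires: frozenset  # keys that must already be present in the partial FDO
    provides: frozenset  # keys the step adds


def check_order(steps, initial_keys):
    """Raise ValueError if any step runs before the keys it requires are available."""
    available = set(initial_keys)
    for step in steps:
        missing = step.requires - available
        if missing:
            raise ValueError(f"{step.name} is missing keys: {sorted(missing)}")
        available |= step.provides
    return available


# Hypothetical steps; only the "regions" key is named in the text.
segmentation = Step("segmentation", frozenset({"imageUrl"}), frozenset({"regions"}))
text_detection = Step("text detection", frozenset({"regions"}), frozenset({"text"}))
manual_transcription = Step("manual transcription", frozenset({"imageUrl"}), frozenset({"text"}))
entity_recognition = Step("named entity recognition", frozenset({"text"}), frozenset({"entities"}))
```

With these definitions, check_order accepts both [segmentation, text_detection, entity_recognition] and [manual_transcription, entity_recognition], which mirrors the point that an object with text need not also have regions, while [entity_recognition] alone is rejected.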
Similarly, we found that some steps can be executed in parallel, but this requires merging of partial FDOs. This can be achieved by combining JSON queries and JSON Schemas, but it suggests that it may be more beneficial to have FDO fragments as separate objects. Adding openDS fragment steps would, however, complicate workflows.
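To make the merging problem concrete, the following sketch merges the outputs of parallel branches with a plain shallow dictionary merge. This is a simplification swapped in for illustration, not the JSON query and JSON Schema approach mentioned above; merge_partials and the key names in the usage note are hypothetical.

```python
def merge_partials(base, *branches):
    """Shallow-merge branch outputs into a copy of the shared input object.

    Each branch is expected to return the input object plus its own new keys.
    Two branches writing different values to the same key is treated as an error.
    """
    merged = dict(base)
    for branch in branches:
        for key, value in branch.items():
            if key in merged and merged[key] != value:
                raise ValueError(f"conflicting values for key {key!r}")
            merged[key] = value
    return merged
```

For example, merging a base object {"catalogNumber": "X1"} with one branch that adds a regions key and another that adds a barcode key yields a single object with all three keys, while two branches that disagree on the same key raise an error rather than silently overwriting each other.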
Several of our tools process the referenced images, which are currently https URLs in openDS. We added a caching layer to avoid repeated image downloading, coupled with wiring of local file paths in the workflow. A similar challenge occurs when accessing image data using DOIP, which, unlike HTTP, has no caching mechanisms.
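A caching layer of this kind could look roughly like the sketch below, which maps each image URL to a local file named after a hash of the URL and downloads it only once. It is an assumption about one possible design, not the SDR implementation; cached_image_path, fetch_bytes and the cache directory layout are hypothetical, and cache expiry and concurrent access are not handled.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url):
    """Download the resource at url and return its content as bytes."""
    with urlopen(url) as response:
        return response.read()


def cached_image_path(url, cache_dir, fetch=fetch_bytes):
    """Return a local path for url, downloading only if it is not cached yet."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The file name is derived from the URL, so repeated calls map to the same file.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / digest
    if not path.exists():
        path.write_bytes(fetch(url))
    return path
```

The returned path is what a workflow step would pass to its command line tool in place of the URL. The fetch parameter is injectable so that the same cache could sit in front of a different transport, for instance a DOIP client, without changing the callers.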
RO-Crate lessons
Galaxy is developing support for importing and exporting Workflow Run Crates, a profile of RO-Crate (
Our prototype de novo workflow returns results as a ZIP file of openDS objects. End-users should also get copies of the referenced images and generated visualisations, along with workflow execution metadata. We are investigating ways to embed the preliminary Galaxy workflow history before the final step, so that this result can be an enriched RO-Crate.
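As a rough illustration of what turning the ZIP result into a crate involves, the sketch below writes openDS objects into a ZIP together with a minimal ro-crate-metadata.json that follows the base RO-Crate 1.1 structure. It does not implement the Workflow Run Crate profile and omits images, visualisations and execution metadata; build_crate_zip and the file names are hypothetical.

```python
import io
import json
import zipfile


def build_crate_zip(specimens):
    """Package openDS objects as a ZIP containing a minimal RO-Crate 1.1 metadata file.

    specimens maps a file name (e.g. "specimen-1.json") to an openDS object (dict).
    """
    names = sorted(specimens)
    graph = [
        {   # Metadata descriptor required by RO-Crate 1.1
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # Root dataset listing the packaged files
            "@id": "./",
            "@type": "Dataset",
            "hasPart": [{"@id": name} for name in names],
        },
    ]
    for name in names:
        graph.append({"@id": name, "@type": "File", "encodingFormat": "application/json"})
    metadata = {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("ro-crate-metadata.json", json.dumps(metadata, indent=2))
        for name in names:
            archive.writestr(name, json.dumps(specimens[name], indent=2))
    return buffer.getvalue()
```

An enriched crate as envisaged above would add further entities to the same graph, such as the referenced images and a description of the workflow run.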
Conclusions
SDR is an example of machine-assisted construction of FDOs, which highlights the need for intermediate digital objects that are not yet FDO compliant. The passing of such “local FDOs” is beneficial not just for efficiency and visual inspection, but also to simplify workflow composition of canonical workflow building blocks. At the same time we see that it is insufficient to only pass FDOs as JSON objects, as they also have references to other data such as images, which should not need to be re-downloaded.
Further work will investigate the use of RO-Crate as a wrapper of partial FDOs, but this needs to be coupled with more flexible FDO types as profiles, in order to restrict “impossible” ordering of steps depending on particular inner FDO fragments. A distinction needs to be made between open digital specimens that are in “draft” state and those that can be pushed to DiSSCo registries.
We are experimenting with changing the SDR components into Canonical Workflow Building Blocks (
Keywords: FDO, research object, RO-Crate, computational workflow, Galaxy, openDS, specimen, digitization
Presenting author: Stian Soiland-Reyes
Presented at: First International Conference on FAIR Digital Objects, poster
We acknowledge the SYNTHESYS+ and DiSSCo project members who have been invaluable in early evaluation and feedback on the development of SDR.
Author contributions to this article according to the Contributor Roles Taxonomy CASRAI CRediT: