Research Ideas and Outcomes :
Research Article
Corresponding author: Rossella Aversa (rossella.aversa@kit.edu)
Academic editor: Francisco Andres Rivera Quiroz
Received: 27 Jun 2023 | Accepted: 17 Jul 2023 | Published: 22 Aug 2023
© 2023 Nicolas Blumenröhr, Rossella Aversa
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Blumenröhr N, Aversa R (2023) From implementation to application: FAIR digital objects for training data composition. Research Ideas and Outcomes 9: e108706. https://doi.org/10.3897/rio.9.e108706
Composing training data for Machine Learning applications can be laborious and time-consuming when done manually. The use of FAIR Digital Objects, in which the data is machine-interpretable and -actionable, makes it possible to automate and simplify this task. As an application case, we represented labeled Scanning Electron Microscopy images from different sources as FAIR Digital Objects to compose a training data set. In addition to some existing services included in our implementation (the Typed-PID Maker, the Handle Registry, and the ePIC Data Type Registry), we developed a Python client to automate the relabeling task. Our work provides a Proof-of-Concept validation for the usefulness of FAIR Digital Objects on a specific task, facilitating further developments and future extensions to other machine learning applications.
FAIR Digital Objects, Metadata Schemas, Vocabularies, Linked Data, Operations, Machine Learning
In Machine Learning (ML), representative training sets (e.g.
Once suitable data sets are found and accessed, the scientist needs to analyze them further to evaluate their usability for training an ML model on the chosen task. In particular, supervised learning requires additional preprocessing, e.g. the assignment of labels in image recognition methods. As the data is collected from different sources, these labels need to be aligned, e.g. according to semantically similar categories; this way, images can be grouped together and relabeled. After a sufficient number of images has been collected and relabeled, the composed training data set can be further prepared for ML using other techniques, e.g. resizing or rescaling. These preprocessing steps can be laborious and time-consuming, as they are usually performed manually by scientists, preventing them from spending their valuable time on the actual ML task, i.e., model training and analysis of results.
A possible solution to overcome the heterogeneity of data and repositories, and to reduce the amount of manually performed actions, is to apply the FAIR principles (
In this work we present the application of FDOs to Scanning Electron Microscopy (SEM) images labeled with a term related to their content. In this context, we refer to metadata as either administrative (bundled in the information record and necessary to manage the data, e.g. to identify its format) or scientific (information about the data in the context of a specific scientific question, e.g. labels describing the image content, which are represented by FDOs themselves). We explain the design of the FDOs and the requirements for their implementation in this context. Finally, we discuss the benefits of representing SEM image data as FDOs to compose new training data sets for further ML applications.
The FDO concept provides a generic framework, but it is not directly applicable: it covers a broad scope and requires contextual interpretation as well as domain-specific expertise and decision-making. Therefore, it must be implemented in an architecture of components that enable its use. An essential aspect that has been established in all FDO implementations is the use of repositories as trustworthy data storage.
The approach of the Research Data Alliance (RDA) (
Further ways to model the FDO concept have been proposed: one of them is the FAIR Digital Object Framework (FDOF). Its implementation was shown in the frame of a test case from NFDI4DS, where ML data components (e.g. images as training data, source code, and publications) were connected to each other by representing them as FDOs (
In order to show the advantage of the FDO concept when the data is distributed in different repositories, we created two data sets starting from the NFFA-Europe – Majority SEM Dataset (
In the original data set (
We created several FDOs to represent the data landscape, i.e., images and labels, as well as their types (Fig.
To fill the information record of the image- and label-FDOs, we used the Helmholtz KIP (
In our implementation, each image- and label-FDO has a reference (i.e., PID) to a type-FDO, which contains information such as MIME-type, related metadata schema, and version. The type-FDO, which is based on the File Type KIP (
We implemented a Python client (
In our specific use case of the relabeling task, we provided the PIDs of the image FDOs to the client, which then performed the sequence of operations shown in Fig.
Without the FDO representation, the metadata attributes were seldom machine-readable, or were collected in metadata documents describing the images and the labels, which were only available in the metadata repository. Moreover, it was possible to reference the image location from the metadata repository, but not vice versa. With images and labels represented as FDOs, their administrative metadata are machine-readable and -interpretable, and include attributes relating the corresponding data objects to each other, regardless of their location. With this representation, a client can perform operations on them without requiring any changes to the original data, contributing to an automated composition of an ML training data set of SEM images.
Our client exploits the FDO data representations to retrieve the SEM images and to relabel them, i.e., to assess the relations between different label terms, based on the machine interpretation of the UNESCO Thesaurus concept definitions. To successfully perform this task, the information record of the FDO must contain at least the following attributes: "checksum" as a means of verification; "type" to call the appropriate method; "license" to evaluate whether it is allowed to use the data; "topic" to assess the relevance of the data for a given task before downloading it or further processing its corresponding information record; "location" to access the SEM images and the JSON documents; "hasMetadata" to retrieve the information record of the label-FDO; and "isMetadataFor", in turn, to point to the image-FDO.
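A check for these required attributes can be sketched as follows. This is only an illustrative sketch, not the client described in this work: the attribute names come from the list above, while the flat dictionary layout, the function name `missing_attributes`, the split of the two relation attributes by FDO role, and all record values are assumptions made for the example.

```python
# Attribute names are taken from the text. The flat-dict record layout and the
# role-specific split of the two relation attributes are assumptions.
COMMON_ATTRIBUTES = ("checksum", "type", "license", "topic", "location")
RELATION_ATTRIBUTE = {"image": "hasMetadata", "label": "isMetadataFor"}


def missing_attributes(record, role):
    """Return the required attribute names that are absent or empty in a record."""
    required = COMMON_ATTRIBUTES + (RELATION_ATTRIBUTE[role],)
    return [name for name in required if not record.get(name)]


# Hypothetical information record with placeholder values, for illustration only.
image_record = {
    "checksum": "sha256:placeholder",
    "type": "placeholder/type-pid",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "topic": "placeholder topic",
    "location": "https://example.org/image.tif",
}

# The relation attribute is missing, so the record is not yet usable.
print(missing_attributes(image_record, "image"))
```

A client could run such a check before downloading anything, so that FDOs based on other KIPs are accepted whenever they carry the same attributes.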
All attributes are part of, but not exclusive to, the Helmholtz KIP. Therefore, our client supports any FDO, including those based on other KIPs, provided that it contains the aforementioned required attributes. However, it must be noted that the relation assessment of the label terms was performed using the concept definitions from the UNESCO Thesaurus, an approach based on linked data. Metadata files that are based on other schemas and vocabularies will require additional implementations.
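To illustrate what a relation assessment between label terms can look like, the sketch below maps a source label to a target category whenever the target is the same term or a broader one. It is not the algorithm of the client described here, whose details are not given in this text; it assumes that the vocabulary exposes broader-term relations, and the hierarchy, the term names, and the functions `broader_terms` and `relabel` are all hypothetical.

```python
# Hypothetical concept hierarchy: each term maps to its directly broader terms.
# A real client would obtain these relations from the vocabulary itself.
BROADER = {
    "term-a": {"group-1"},
    "term-b": {"group-1"},
    "term-c": {"group-2"},
}
TARGETS = {"group-1", "group-2"}  # hypothetical labels of the composed data set


def broader_terms(term, broader):
    """Collect every term that is transitively broader than `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in broader.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def relabel(term, targets, broader):
    """Return a target label equal to or broader than `term`, or None."""
    if term in targets:
        return term
    matches = sorted(broader_terms(term, broader) & set(targets))
    # If several targets match, the alphabetically first one is chosen.
    return matches[0] if matches else None


for label in ("term-a", "term-b", "term-c", "term-z"):
    print(label, "->", relabel(label, TARGETS, BROADER))
```

Terms without a matching target return `None`, which leaves the decision about unmapped labels to the caller instead of guessing a category.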
Our work shows the feasibility of using FDOs in the context of an ML application case to automate the time-consuming relabeling task as part of the training data composition. It is worth remarking that our approach implements harmonized data descriptions using schemas and vocabularies to facilitate machine-readability and -interpretability. We decided to represent each data component, i.e., SEM image data sets and metadata documents containing the labels, as separate FDOs. This allows each FDO to be used independently of this application case, and new FDOs to be linked to the already existing ones through their PIDs. The latter matches our scenario, where scientific metadata was added after data curation. The introduction of a type-FDO makes it possible to enrich the information record with type-specific attributes, e.g., MIME-type and metadata schema. Moreover, it facilitates a one-to-many relation, where all data components of the same type point to the same type-FDO, i.e., to the same PID.
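The one-to-many relation can be pictured with a small sketch in which several records share one type reference and are grouped by it. The record layout, the function name `group_by_type`, and all PIDs are placeholders assumed for the example, not real Handles or the actual structure of our information records.

```python
from collections import defaultdict

# Hypothetical information records; the PIDs are placeholders, not real Handles.
records = [
    {"pid": "pid/image-1", "type": "pid/type-image"},
    {"pid": "pid/image-2", "type": "pid/type-image"},
    {"pid": "pid/label-1", "type": "pid/type-label"},
]


def group_by_type(records):
    """Group FDO PIDs by the PID of the type-FDO they reference."""
    groups = defaultdict(list)
    for record in records:
        groups[record["type"]].append(record["pid"])
    return dict(groups)


groups = group_by_type(records)
for type_pid, members in groups.items():
    print(type_pid, members)
```

Because every member of a group references the same type PID, a client would need to resolve each type-FDO only once to decide which method to call for the whole group.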
The strength of our design lies in its high flexibility: the attributes in the FDO information records are standardized and reusable, being defined in a DTR; other metadata schemas and vocabulary specifications can be implemented to extend the FDO content; and a client with modified features can easily be realized to perform tasks other than the one presented in this work.
As a future perspective, several questions remain to be explored: which particular attributes of the FDO information record are required to enable machine-actionable decisions in order to fulfill a given task? What is the most efficient level of granularity to represent the data in a given application case? Our FDO design can pave the way for further developments and applications that help to answer these questions and others beyond them.
This work has been supported by the research program 'Engineering Digital Futures' of the Helmholtz Association of German Research Centers and the Helmholtz Metadata Collaboration Platform. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 101007417 within the framework of the NFFA-Europe Pilot (NEP) Joint Activities. We acknowledge support by the KIT-Publication Fund of the Karlsruhe Institute of Technology.