Research Ideas and Outcomes :
Conference Abstract
Corresponding author: Nicolas Blumenröhr (nicolas.blumenroehr@kit.edu)
Received: 26 Aug 2022 | Published: 12 Oct 2022
© 2022 Nicolas Blumenröhr, Thomas Jejkal, Andreas Pfeil, Rainer Stotzka
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Blumenröhr N, Jejkal T, Pfeil A, Stotzka R (2022) FAIR Digital Object Application Case for Composing Machine Learning Training Data. Research Ideas and Outcomes 8: e94113. https://doi.org/10.3897/rio.8.e94113
The application case for implementing and using the FAIR Digital Object (FAIR DO) concept (
Data sets curated by different domain experts usually use non-identical label terms, so images with similar labels cannot easily be assigned to the same category. Using such data sets together as training data in machine learning (ML) therefore comes at the cost of laborious relabeling. To automate this process, the data needs to be machine-interpretable and machine-actionable, which is enabled by applying the FAIR DO concept. A FAIR DO is a representation of scientific data and requires at least a globally unique Persistent Identifier (PID) (
Storing typed information in the PID record demands a prior selection of that information. This includes mandatory metadata and a digital object type to enable machine interpretability and subsequent actionability. The information provided in the PID record refers to its PID Kernel Information Profile (PIDKIP), defined or selected by the creator of the FAIR DO. A PIDKIP is a standard that facilitates the definition and validation of the mandatory metadata attributes in the PID record. This information acts as a basis for a machine to decide if the digital object is reusable for a particular application. Part of that is also the digital object type, which enables a machine to work with the data represented by the FAIR DO. If more information is required, the data itself or other associated FAIR DOs need to be accessed through references in the PID record.
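The role of the profile as a basis for machine decisions can be illustrated with a minimal Python sketch. It is not the Helmholtz KIP or any existing validator: the attribute names, the type name, and the example location are hypothetical placeholders, and a real profile would be resolved from a type registry rather than hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelInformationProfile:
    """Hypothetical stand-in for a PIDKIP: a name plus its mandatory attributes."""
    name: str
    mandatory: frozenset


def missing_attributes(record: dict, profile: KernelInformationProfile) -> list:
    """Return the mandatory attributes that are absent or empty in the record."""
    return sorted(attr for attr in profile.mandatory if not record.get(attr))


def is_candidate(record: dict, profile: KernelInformationProfile, accepted_types: set) -> bool:
    """A record qualifies if it is complete and its digital object type is accepted."""
    complete = not missing_attributes(record, profile)
    return complete and record.get("digitalObjectType") in accepted_types


# Illustrative profile and record; every name and value here is an assumption.
PROFILE = KernelInformationProfile(
    name="example-profile",
    mandatory=frozenset(
        {"kernelInformationProfile", "digitalObjectType", "digitalObjectLocation"}
    ),
)

record = {
    "kernelInformationProfile": "example-profile",
    "digitalObjectType": "example/zip-of-png",
    "digitalObjectLocation": "https://example.org/data/set-a.zip",
}

if __name__ == "__main__":
    print(missing_attributes(record, PROFILE))
    print(is_candidate(record, PROFILE, {"example/zip-of-png"}))
```

In this sketch the decision uses only the record itself. Anything beyond the mandatory attributes would have to be fetched through the references stored in the record.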
Specifying the granularity of the data representation and of the metadata in the information record is not a fixed task but depends on the objective. Here, the FAIR DO concept is used for representing image data sets with their label metadata. Each data set contains multiple images, which refer to the same label term. One data set associated with a particular label is represented as one FAIR DO. A type that provides information about this entity covers the packaged format of the images and the image format itself. Further information about the label term and other metadata associated with the data set is provided or accessed through references in the PID record. For the PIDKIP, the Helmholtz KIP was chosen, following the RDA Working Group recommendations on PID Kernel Information (
The automated procedure for relabeling then looks as follows: A specialized client that can work with PIDs resolves the PID of a FAIR DO that represents an image data set and fetches its record. By analyzing its type, the client validates the usability of the data for composing an ML training data set. The PID of the image label FAIR DO referenced in the record is then resolved in the same way. By analyzing its PID record, the client identifies that it is relevant for getting information about the labels. The document represented by the image label FAIR DO is accessed via its location path provided in the PID record. To work with its content, a specialized tool is required that is compatible with its format and schema, i.e. its type. This tool identifies and analyzes the label term of the data set for mapping it to corresponding label terms of other image data sets.
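The procedure can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the client or mapping tool under development: PID resolution is replaced by an in-memory dictionary, and all PIDs, attribute names, type names, locations, and label terms are hypothetical.

```python
# In-memory stand-ins for a PID resolver and for the label documents.
# Every identifier, attribute name, and label term below is hypothetical.
PID_RECORDS = {
    "pid/dataset-a": {
        "digitalObjectType": "type/image-archive",
        "digitalObjectLocation": "loc/dataset-a.zip",
        "hasMetadata": "pid/labels-a",
    },
    "pid/labels-a": {
        "digitalObjectType": "type/label-document",
        "digitalObjectLocation": "loc/labels-a.json",
    },
    "pid/dataset-b": {
        "digitalObjectType": "type/image-archive",
        "digitalObjectLocation": "loc/dataset-b.zip",
        "hasMetadata": "pid/labels-b",
    },
    "pid/labels-b": {
        "digitalObjectType": "type/label-document",
        "digitalObjectLocation": "loc/labels-b.json",
    },
}
LABEL_DOCUMENTS = {
    "loc/labels-a.json": {"label": "kitty"},
    "loc/labels-b.json": {"label": "feline"},
}
# Toy mapping from source label terms to a common target term.
LABEL_MAPPING = {"kitty": "cat", "feline": "cat"}


def resolve(pid: str) -> dict:
    """Stand-in for resolving a PID and fetching its record."""
    return PID_RECORDS[pid]


def read_label(label_record: dict) -> str:
    """Stand-in for the type-specific tool that reads the label document."""
    if label_record["digitalObjectType"] != "type/label-document":
        raise ValueError("record does not describe a label document")
    return LABEL_DOCUMENTS[label_record["digitalObjectLocation"]]["label"]


def relabel(dataset_pid: str) -> tuple:
    """Return (data location, mapped label) for one image data set FAIR DO."""
    record = resolve(dataset_pid)
    if record["digitalObjectType"] != "type/image-archive":
        raise ValueError("data set type is not usable as training data")
    label_record = resolve(record["hasMetadata"])
    term = read_label(label_record)
    return record["digitalObjectLocation"], LABEL_MAPPING.get(term, term)


def compose(dataset_pids: list) -> dict:
    """Group data set locations by their mapped label term."""
    grouped = {}
    for pid in dataset_pids:
        location, label = relabel(pid)
        grouped.setdefault(label, []).append(location)
    return grouped


if __name__ == "__main__":
    print(compose(["pid/dataset-a", "pid/dataset-b"]))
```

Both type checks run on the PID records alone. The label document is opened only after its record has identified it as relevant, which mirrors the order of steps described above.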
This specification of FAIR DOs enables the relabeling of entire image data sets for application in ML. However, the current granularity of data representation is insufficient for other machine-based decisions and actions on single images. A related aspect is increasing the information in the PID record to enable more machine-actionable decisions. This requires reconsidering the granularity of metadata in the PID record and needs to be balanced against the aim of fast record processing. Changing the content of the PID record also requires deriving a new PIDKIP or extending an existing one. Metadata tools that are applied in conjunction with the FAIR DO concept and use the label information in the documents of the metadata FAIR DOs need further specification. One requirement for their implementation is a standardized data description for the metadata document, using schemas and vocabularies.
The machine actionability of FAIR DOs described above enables the automated relabeling of data sets. This leaves more time for the ML user to concentrate on model training and optimization. The development of FAIR DO-specific clients and metadata mapping tools is the subject of current research. The next step is to implement such software in order to carry out the proposed concept on a large scale.
This work has been supported by the research program 'Engineering Digital Futures' of the Helmholtz Association of German Research Centers and the Helmholtz Metadata Collaboration Platform (
Persistent Identifier, Metadata, Image Data, Label
Nicolas Blumenröhr
First International Conference on FAIR Digital Objects, presentation