Research Ideas and Outcomes :
Research Article
|
Corresponding author: Ulrich Schwardmann (uschwar1@gwdg.de)
Academic editor: Francisco Andres Rivera Quiroz
Received: 28 Jun 2023 | Accepted: 10 Aug 2023 | Published: 30 Oct 2023
© 2023 Ulrich Schwardmann, Tibor Kálmán
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Schwardmann U, Kálmán T (2023) How FDO attributes can support machine- and human-readability? - a description along three examples. Research Ideas and Outcomes 9: e108737. https://doi.org/10.3897/rio.9.e108737
|
Based on the notion of a FAIR Digital Object (FDO) record, which consists of key-value pairs as attributes that are precisely defined in a Data Type Registry and selected in a profile, we show three examples of FDOs from different viewpoints how FDO records can be implemented as Handle PID records. As references to the attribute definitions, the keys determine the value space of the attribute. In the first two examples, the profiles enable human-readable keys and legacy digital objects to be integrated into FDO records. How legacy metadata from IANA media types that can be transformed into structured metadata of appropriate attribute definitions that then can be applied in profiles and FDO records, is described in the third example.
FAIR Digital Object, FDO record, FDO profile, PID for Instruments, DARIAH repository, MIME type
FAIR Digital Objects (FDOs) are typed by a well-described set of attributes, where attributes are key value pairs with a key that refers to a clear specification of the value. On one hand, such exact descriptions of the attribute keys allow machines to interpret values and, thus, they are crucial for machine actionability. On the other hand, a proper description of attributes can be a way to also enable human-readability of the used keys. Nevertheless, machine- and human-readability do not have to be mutually exclusive. Based on the outcome of the RDA working group on Data Type Registries (
This article is structured in the following way. First, we provide a brief overview of the FDO Record and FDO Profile utilised to structure attributes, as well as the generic concepts behind them. Following that, we shift our focus to three examples of FDO types with their attributes. Each of the examples highlight different viewpoints and approaches to build up FDOs.
Here, we introduce the prerequisites that are needed across all sections. We describe the FDO Record and FDO Profile related to structure attributes. We also present some generic concepts behind them.
The well-described set of attributes is integrated into an FDO Record, which is the core data abstraction that allows a machine to process any FDO in a deterministic manner. An FDO record is the structure that is returned when resolving a reference via a Persistent Identifier (PID).
The FDO Record consists of a set of attributes, each consisting of a key value pair (see Fig.
The structure of the FDO record needs to be described in a way that machines know what they can expect in the record and how they should operate on it. This description is called an FDO profile. It contains a list of needed and allowed attribute keys for which the FDO record provides values.
For machine actions also, this FDO profiles have schemata, the so-called profile schemata, which can be found in a schema repository as part of the FDO framework. Like the attribute definitions, both the FDO profiles, as well as the schemata for them, are again FDOs.
The following attributes should be provided by an FDO record for machine actionability:
A mandatory FDO profile attribute that describes the valid attributes in the FDO record.
A Data attribute allowing the client to access the data associated with the FDO.
An FDO Type attribute that enables the client to quickly determine the type of the data associated with the FDO.
A metadata attribute that allows the client to obtain the metadata describing the FDO record and the data with which it is associated.
Other optional community specific attributes: these are the FDO attributes that a client would look for with respect to the specifications given in the profile. It allows the client to obtain specific metadata that has an important role for community specific operations and selection processes.
Some of this community specific metadata information will be used for describing access control and licensing information. It is possible that multiple metadata attributes will be used to specify different aspects of the FDO record.
Even if not all of the requirements above are fulfilled by the examples below, we will show what has to done additionally to fulfil them.
In this article, we present three examples of FDO types with their attributes. Each of the examples highlight different viewpoints and approaches to build up FDOs.
The first use case (PIDs for instruments) and the second use case (DARIAH PID structure) are examples, where the values of all used attributes are provided inline in the FDO record itself. It is shown how one can integrate legacy attribute sets, providing inside repositories for the description of their digital objects and that are based on human-readable keys, into a machine readable setting.
The third use case (media types) is an example, where the values of the FDO record are provided as references to media type instances stored in a type registry. The differences of this approach and the automated ways of distinction between these two approaches are then discussed in the following. section.
We point out that, across the examples, there are some common characteristics selected with respect to the underlying technology:
the Handle system is used for the persistent identifiers.
the FDO record is provided by the Handle record of the PID.
all the provided attributes can be found here as type-data pairs in the phrasing of the Handle system.
In the following, we introduce the three examples in details.
The PID for instrument example goes back to the development of kernel metadata, which is seen as minimally required to reference and describe scientific instruments (
An example of such an attribute definition is that of a single manufacturer of an instrument*
The example we provided uses references to the attribute definitions as keys for the values which are often lists or objects. The Handle Record of a PID for Instruments*
The structure of this Handle record complies to the description an FDO record in Fig.
However, since the relationship between the uniquely-defined attribute definitions and their semantical names is also provided by this profile, we can use these names in the Handle record as keys. In this way, an FDO record*
In both cases, the full instrument descriptions are completely stored in the Handle database of the Handle PID service. The PID itself is a metadata object and can be seen as an FDO of its own.
This example fulfils almost completely the requirements for an FDO record. The profile almost completely describes the existing Handle record. The only attribute that is missing in the profile derived from the metadata for instruments in [3] is that with key URL. URL is a generic Handle attribute for resolution, here used for the reference to a landing page with additional metadata about the instrument. It points to an XML file, but this type notification is missing in the record as well. Without the URL attribute, the Handle does not redirect to a resource, but the requirements for an FDO record would be fulfilled.
Another option is to store the metadata objects of instrument descriptions in repositories. In this case, a schema is needed to describe the metadata elements that are needed for the description. The existing attribute definitions can be bundled into a single complex type definition and, for this definition, the profile for instruments above can be used. Even if profiles are conceptually different from attribute definitions, they are syntactically similar and, therefore, exploitable by the same services.
From this complex profile - interpreted as a type definition - one can derive a schema for the repository entries. The result of such a schema derivation can then be fed into the ingest module of a repository, for example, as Properties4Instruments schema*
As a next step, PID for instrument objects can be defined in the repository. (an example of such an object is here*
This example evolved in the digital humanities in the context of the German DARIAH project (
It uses key value pairs with human-readable keys as the type and provides relatively atomic values. The key here is a human-readable description for the value space that can be expected.
The use of human-readable keys, however, does not match the goal of machine readability of this desciption and additionally has the risk of uncertainty and ambiguity.
In order to make these attributes machine readable, attribute definitions for the allowed value spaces have been stored in the ePIC data type registries. The basic information type for an email address*
Attribute definitions for all the other attributes used in the DARIAH example are also provided by the ePIC data type registries.
In this way, one is able to define a profile for the legacy DARIAH Handle records*
If this profile is the known profile of all objects in the DARIAH repository, the named references to the keys in the profile disambiguate the human-readable form of the Handle record.
Usually - and as we have seen in the previous PID4Inst example - the profile of the FDO would be another attribute of the FDO. This would require an adaption of the attributes of all digital objects of the DARIAH repository. Since all digital objects of the DARIAH repository follow the same profile and all its digital objects have the same PID prefix, it would be sufficient to implement this additional attribute at the prefix level. Together with a rule that attributes on a lower level dominate attributes on a higher level, this additional prefix attribute would make FDOs out of legacy digital objects that have been defined a long time ago.
Additionally, this example fulfils almost completely the requirements for an FDO record. Again, the URL attribute is not covered by the profile and the type of the referenced data, in this case, an HTML landing page, is not specified.
This example is based on the Internet Assigned Numbers Authority IANA media types*
A IANA media type is given as a subtype of one of the content type directories application, audio, font, example, image, message, model, multipart, text or video and is described by a reference and a set of metadata attributes that are provided for each media type in a special file. Although this is an abuse of terminology, this file is called "template", because it is based on a template that describes a couple of key words that have to or should be filled in with unstructured text.
However, no template is provided for older formats like gif, jpeg, mpeg, plain or richtext. Here, the given references are pointing to RFC2045 and RFC2046 which are descriptions of Multipurpose Internet Mail Extensions (MIME). These media types are mentioned there, but they are neither standardised nor described.
The references for IANA media types point, in two thirds of the cases, to mail addresses of contact persons. Here the "template" is the only publicly available resource of information about the standard. Another less then a third of the cases refer to publicly available and human-readable RFCs. The small remainder refers to standards of other organisations without a directly referred resource of information about the standard.
Two options are available to provide attributes for IANA media types in FDO records. The first is a type definition as an enumeration of all IANA subtypes of the content types. An example for a controlled vocabulary*
Type definitions for the other content types can be described in a similar way. Here, a media type is given by an attribute, consisting of the reference to this type definition as key and a value out of the enumeration list of this type definition. This approach completely relies on the media type naming of IANA and does not provide information about any of the in-parts valuable additional metadata given by the IANA resources.
A second approach is a type definition for media types that covers the metadata as they are given by IANA, which is the information given by the "templates" and the IANA media type resource ([8]). In this case, a media type is an attribute consisting of the type definition for this metadata structure and a value that refers to a registered metadata record containing the information about a IANA media type. These references given as values are PIDs that represent the media type registered as FDO records in data type registries.
The metadata elements in IANA "templates" are described in details at RFC 6838*
However, if such a "template" is provided and it contained at least the keywords of the template, much of the information provided can be filtered out in an automated way (with some heuristics). This information can then be filled in a structured and machine-readable set of metadata elements for the media types that can be registered as an FDO representation of a IANA media type.
The structure for the registered metadata information is inspired by the keywords of the IANA template and some other information that can be extracted from the IANA media type presentation.
The content of this media-type metadada contains values for contact information, required or optional parameters, security, fragment identifier and interoperability considerations, applications, intended usage and restrictions on usage. There is additional information provided with deprecated alias names for this type, like magic number(s), file extension(s), Macintosh File Type Code(s) and Object Identifier(s) (OIDs). Additionally, in some cases, the encoding information is provided with four possible values: "7-bit", "8-bit", "binary" or "framed".
The structure is given by the type definition media-type-attributes*
All more than two thousand IANA media types are defined as types in the data type registry, based on this schema and can be viewed by a search on type:"media-type-schema"*
In this way, the second approach with a type definition, based on IANA type metadata, has the advantage that it provides a structured access to some useful additional information beyond the media (sub-)type name. For instance, related or describing RFCs and other standards, responsible authorities, used file extensions and dates for registration, publishing and updates and other provenance information, as well as some rudimentary information about the encoding, can be found there.
These IANA media types in the data type registry now can be used as values in attributes of FDO records, represented by the PID of their definition. The key of such attributes would be the reference to the definition of their type media-type-attributes with PID 21.T11148/bc0e376cd6a01a2f8071. Therefore, here the question whether the Handle record fulfils the requirements for an FDO record is not relevant, because the example does not describe an FDO record itsself.
This metadata information does, however, not describe the encoding in such a detail that it enables automated operations working on the bitstream structure of the typed data itself.
Automated processing of the data only works for specialised operations that are known to work on the basis of the media (sub-)type names or the PIDs referring to the registered metadata belonging to these media (sub-)type names. This is the case for both approaches of type definitions, either controlled vocabulary or registered metadata.
The question, which operations can and should be used for further processing, is answered in current user environments by special knowledge or user specific configuration of the software context. Browsers, for instance, often have specific configurable settings that determine the use of operations depending on file extensions.
A similar context-specific approach with preselected operations is possible for FDOs depending on media type attributes. The operations that work on data with a given media type attribute in the FDO record still need to be built by humans with the deeper understanding of the human-readable specification documents (RFCs) that describe the bitstream structure.
Ideally, it should also be possible to ask operation-repositories for suggestions of suitable operations depending on their media type attributes and the specifics of a certain software context. How such an operation infrastructure should be set up still has to be discussed and developed.
As described above, the media type is given in an FDO record as an attribute with key. The key is either type definition for the controlled vocabulary of media (sub-)type names or it is the type definition for the media type metadata. As value, it is either an instance of a media (sub-)type name or a PID reference to an instance of media type metadata. Here, we suggest to use the latter option, because it can provide more information. If additional information is needed on the reference, like the media (sub-)type name or used RFCs etc., then an additional attribute with appropriate inline values needs to be used.
In any case, the usage of either attribute needs to be announced in the profile for the FDO, such that an automated process knows in advance, what it can expect as information on the reference level and how it can achieve further information, if necessary.
The approach of a type definition for the media-type metadata can also be applied to similar data-use cases that do not have a IANA media type standard for its data encoding (see Fig.
In such cases, a couple of metadata keys should be defined as mandatory and especially a reference to the encoding description should be made.
In a further step into the direction of automation, the encoding description itself should be made explicitly machine readable, as it is already possible with a couple of such approaches for data serialisation systems, like FITS, ProtoBuf, Avro, Parquet, Arrow and HDF5.
We highlighted different approaches to build up FDOs with three examples. Each of the examples presented FDO types with their attributes. The “PIDs for instruments” use case and the “DARIAH PID structure” example provided the values of all used attributes inline in the FDO record itself. Furthermore, the “DARIAH” use-case also showed how FDOs can be made out of legacy digital objects, where legacy attribute sets were provided a long time ago. We also described, how machine- and human-readability can be supported by FDO types at the same time. Our third use-case, the “media types” example, described FDO types, where the values of the FDO record are provided as references to media type instances stored in a type registry.