Research Ideas and Outcomes :
Research Article
|
Corresponding author: Alex Hardisty (hardistyar@cardiff.ac.uk)
Received: 14 May 2020 | Published: 18 May 2020
© 2020 Alex Hardisty, Hannu Saarenmaa, Ana Casino, Mathias Dillen, Karsten Gödderz, Quentin Groom, Helen Hardy, Dimitris Koureas, Abraham Nieva de la Hidalga, Deborah Paul, Veljo Runnel, Xavier Vermeersch, Myriam van Walsum, Luc Willemse
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Hardisty A, Saarenmaa H, Casino A, Dillen M, Gödderz K, Groom Q, Hardy H, Koureas D, Nieva de la Hidalga A, Paul DL, Runnel V, Vermeersch X, van Walsum M, Willemse L (2020) Conceptual design blueprint for the DiSSCo digitization infrastructure - DELIVERABLE D8.1. Research Ideas and Outcomes 6: e54280. https://doi.org/10.3897/rio.6.e54280
|
DiSSCo, the Distributed System of Scientific Collections, is a pan-European Research Infrastructure (RI) mobilising, unifying bio- and geo-diversity information connected to the specimens held in natural science collections and delivering it to scientific communities and beyond. Bringing together 120 institutions across 21 countries and combining earlier investments in data interoperability practices with technological advancements in digitisation, cloud services and semantic linking, DiSSCo makes the data from natural science collections available as one virtual data cloud, connected with data emerging from new techniques and not already linked to specimens. These new data include DNA barcodes, whole genome sequences, proteomics and metabolomics data, chemical data, trait data, and imaging data (Computer-assisted Tomography (CT), Synchrotron, etc.), to name but a few; and will lead to a wide range of end-user services that begins with finding, accessing, using and improving data. DiSSCo will deliver the diagnostic information required for novel approaches and new services that will transform the landscape of what is possible in ways that are hard to imagine today.
With approximately 1.5 billion objects to be digitised, bringing natural science collections to the information age is expected to result in many tens of petabytes of new data over the next decades, used on average by 5,000 – 15,000 unique users every day. This requires new skills, clear policies and robust procedures and new technologies to create, work with and manage large digital datasets over their entire research data lifecycle, including their long-term storage and preservation and open access. Such processes and procedures must match and be derived from the latest thinking in open science and data management, realising the core principles of 'findable, accessible, interoperable and reusable' (FAIR).
Synthesised from results of the ICEDIG project ("Innovation and Consolidation for Large Scale Digitisation of Natural Heritage", EU Horizon 2020 grant agreement No. 777483) the DiSSCo Conceptual Design Blueprint covers the organisational arrangements, processes and practices, the architecture, tools and technologies, culture, skills and capacity building and governance and business model proposals for constructing the digitisation infrastructure of DiSSCo. In this context, the digitisation infrastructure of DiSSCo must be interpreted as that infrastructure (machinery, processing, procedures, personnel, organisation) offering Europe-wide capabilities for mass digitisation and digitisation-on-demand, and for the subsequent management (i.e., curation, publication, processing) and use of the resulting data. The blueprint constitutes the essential background needed to continue work to raise the overall maturity of the DiSSCo Programme across multiple dimensions (organisational, technical, scientific, data, financial) to achieve readiness to begin construction.
Today, collection digitisation efforts have reached most collection-holding institutions across Europe. Much of the leadership and many of the people involved in digitisation and working with digital collections wish to take steps forward and expand the efforts to benefit further from the already noticeable positive effects. The collective results of examining technical, financial, policy and governance aspects show the way forward to operating a large distributed initiative i.e., the Distributed System of Scientific Collections (DiSSCo) for natural science collections across Europe. Ample examples, opportunities and need for innovation and consolidation for large scale digitisation of natural heritage have been described. The blueprint makes one hundred and four (104) recommendations to be considered by other elements of the DiSSCo Programme of linked projects (i.e., SYNTHESYS+, COST MOBILISE, DiSSCo Prepare, and others to follow) and the DiSSCo Programme leadership as the journey towards organisational, technical, scientific, data and financial readiness continues.
Nevertheless, significant obstacles must be overcome as a matter of priority if DiSSCo is to move beyond its Design and Preparatory Phases during 2024. Specifically, these include:
Organisational:
Technical:
Scientific:
Data:
Financial
Strategically, it is vital to balance the multiple factors addressed by the blueprint against one another to achieve the desired goals of the DiSSCo programme. Decisions cannot be taken on one aspect alone without considering other aspects, and here the various governance structures of DiSSCo (General Assembly, advisory boards, and stakeholder forums) play a critical role over the coming years.
DiSSCo, Distributed System of Scientific Collections, Design, Blueprint, ICEDIG, Deliverable
Strategically, it is vital to balance multiple factors – technical and engineering, organisational and political, financial and legal, and operational and governance – against one another to achieve the desired goals of the DiSSCo programme. Decisions cannot be taken on one aspect alone without considering other aspects.
DiSSCo, the Distributed System of Scientific Collections, is a pan-European Research Infrastructure (RI) mobilising, unifying and delivering bio- and geo-diversity information connected to the specimens held in natural science collections to scientific communities and beyond. Bringing together 120 institutions across 21 countries and combining earlier investments in data interoperability practices with technological advancements in digitization, cloud services and semantic linking, DiSSCo makes the data from natural science collections available as one virtual data cloud in association with a wide range of end-user services that begins with finding data, accessing data, using data and improving data. New services, made possible by the expanding variety and volume of data, coupled with open access to that data and new ways of connecting and manipulating it, will transform the landscape of what is possible in ways that are hard to imagine today. In addition to serving as the framework for interpreting and validating species data, DiSSCo connects historical collection data – traditionally connected using scientific species names as primary identifiers – with data emerging from new techniques and not already linked to species names. These new data include DNA barcodes, whole genome sequences, proteomics and metabolomics data, chemical data, trait data, and imaging data (Computer-assisted Tomography (CT), Synchrotron, etc.) to name but a few. DiSSCo will deliver the diagnostic information required for novel approaches and technologies for accelerated field identification of species, contributing to the development of datasets at adequate scale to support regular environmental monitoring, trend analysis and future prediction. The human discoverability and accessibility of the DiSSCo knowledge base will enable researchers across disciplines to tap into a previously inaccessible pool of quality assured data, while the machine readability will enable users to automatically digest these datasets into analytical workflows and tools.
With approximately 1.5 billion objects to be digitized, bringing natural science collections to the information age is expected to result in 90 petabytes of new data over the next decades, used on average by 5,000–15,000 unique users every day. This requires new skills, clear policies and robust procedures to create, work with and manage large digital datasets over their entire research data lifecycle, including their long-term storage and preservation and open access. Such processes and procedures must match and be derived from the latest thinking in open science and data management, as epitomised by the two quotations at the beginning of this section concerned with realising the core principles of 'findable, accessible, interoperable and reusable' (FAIR).
The present document, the DiSSCo Conceptual Design Blueprint is based on the results of the ICEDIG project work. It has been synthesised from the essential outcomes and conclusions of that project. These are mostly (but not all) reported in other deliverable documents. The present document constitutes the essential background needed by the DiSSCo Prepare project to carry out its work to raise the overall maturity of the DiSSCo Programme across multiple dimensions (organisational, technical, scientific, data, financial) to achieve readiness to begin construction.
The present document is a ‘blueprint’ report covering the technological design innovations, organisational consolidation, partnership, governance and business model proposals for constructing the digitization infrastructure of DiSSCo. In this context, the digitization infrastructure of DiSSCo must be interpreted as that infrastructure (machinery, processing, procedures, personnel, organisation) offering Europe-wide capabilities for mass digitization and digitization-on-demand, and for the subsequent management (i.e., curation, publication, processing) and use of the resulting data.
By mass digitization, we mean digitizing entire collections or their major distinct parts at industrial scale (i.e., millions of objects annually at low cost (e.g., < c.€0.50 per item), characterised by improved workflows, technological and procedural frameworks based on automation (both hardware and software) and enrichment (link-building). This is critical within DiSSCo to mobilise the data from collections as rapidly as possible, so that these data can be more easily found and used; and can act as an anchor or ‘keyring’ for other data.
Alongside mass digitization, DiSSCo also recognises the need for digitization-on-demand; by which we mean processes, procedures and technologies that respond to and support a request for digital data about specimens that have not yet been through a mass digitization process or project-based digitization workflow. Such requests may sometimes be supported by prioritisation within a mass digitization workflow, or they can help institutions to refine and pilot new mass workflows. But such requests can also demand more data than the ‘normal workflow’ to meet specific needs of the users in each case or to deal with digitization of the most complex specimens and preservation types. Digitization-on-demand is a critical channel for digital access as envisaged for this research infrastructure, to allow truly global access to and use of collections. Mass digitization and digitization-on-demand are treated in detail in section 3.
The present document emphasises that the considerations and recommendations (104) in the following sections (and collated as a summary list for convenience in Appendix A for convenience), and the conclusions in section 7 are a balance of multiple factors in play (Fig.
Practically, this means focussing on the arrangements, processes and practices that need to be in place to achieve mass digitization, the architecture, tools and technologies to make digitization and management of the resulting data least-cost, and the culture, skills and capacity-building essential to operating in an efficient and sustainable manner. The present document is structured this way, and the recommendations coming from this practical focus contribute towards increasing overall implementation readiness*
The terms below are used with the meanings given:
Innovation: The alteration of what is established by the introduction of new elements or forms; a change made in the nature or fashion of anything; something newly introduced; a novel practice, method, etc.
Consolidation: The action of making solid, or of forming into a solid or compact mass (also figuratively); solidification; combination into a single body, or coherent whole; combination, unification.
DiSSCo Hub: The infrastructure of integrating services, information technology components (hardware and software), human resources, organisational activities, governance, financial and legal arrangements that collectively have the effect of unifying natural science collections through a holistic approach towards digitization of and access to the data bound up in those collections.
DiSSCo Facility: The geographically distributed collection-holding organisation(s) (i.e., natural science/history collection(s)) and related third-party organisations that deliver data and expertise to the DiSSCo Hub infrastructure, and which can be accessed by users via the DiSSCo Hub infrastructure.
A special kind of DiSSCo Facility could be a DiSSCo Centre of Excellence (DCE), specialised in one or more of researching, innovating, developing and operating/performing techniques and/or processes of digitization or other related facets, and disseminating information on same.
For the meaning of other terms and abbreviations, see the glossary (section 9).
Main recommendations of the report are contained in double-bordered boxes, like so:
Recommendation N: The words of the recommendation.
Supplementary information not integral to the recommendations of the present report but of interest to the reader is contained in single-bordered boxes, like so:
Supplementary information. The words of the supplementary information.
References to other documents include references to ICEDIG project deliverable documents of the form “ICEDIG project deliverable Dx.y”. These are referred to via endnotes and are listed in full in section 8.
References to other documents are cited in the text in square brackets []. These are listed in the References section.
Digital transformation of society affects all areas of human activity, and science is at the forefront of this development: “Digital science means a radical transformation of the nature of science and innovation due to the integration of ICT in the research process and the internet culture of openness and sharing”. Emphasis on sharing the results and data of publicly funded research has led to the concept of “Open Science” [
Natural science collections are an integral part of the global natural and cultural capital. They include 2-3 billion animal, plant, fossil, rock, mineral, and meteorite specimens. The European collections account for 55% of the natural sciences collections globally and represent 80% of the world’s bio- and geo-diversity. Data derived from these collections underpin countless innovations, including tens of thousands of scholarly publications and official reports annually (used to support legislative and regulatory processes relating to health, food, security, sustainability and environmental change); inventions and products critical to our bio-economy; databases, maps and descriptions of scientific observations; instructional material for students, as well as educational material for the public. Natural science collections, which exist in all the world’s countries, are some of the oldest Research Infrastructures (RI). Their collections have always been open for all scientists and form the hard core of biodiversity science that studies the existence of life on earth and geoscience that studies the earth itself. Life sciences pioneered open access through establishing the GenBank in 1982. In 2001, the OECD Megascience Forum established the Global Biodiversity Information Facility (GBIF) to share biodiversity data freely and openly. Initially, GBIF focussed on digitization of biological collections, but is now an important source of any kind of primary biodiversity observation data.
Digitizing biological collections, which have been gathered for more than 250 years, is a gargantuan task. Europe’s collections are home to about one and a half billion specimens, of which about 10% have been digitally catalogued. For the remaining collections, only the physical objects exist, containing – besides the biological material itself – highly useful biodiversity data attached to each specimen on paper labels. Only about 1-2% of the objects have been imaged. User demand for the available data is high. For example, in 2016, Royal Botanic Gardens, Kew’s herbarium catalogue (which at that time had over 900,000 specimen records available) was visited over 125,000 times, with more than 960,000 page-views.
The current situation hinders modern science, as contemporary research requires access to data digitally to address some of the biggest challenges of our time e.g., [
One part of the solution is already in place. The e-infrastructure for accessing biodiversity data already exists through GBIF, as well as through other related initiatives such as the Catalogue of Life (CoL), the Biodiversity Heritage Library (BHL), the Encyclopedia of Life (EOL), the International Barcode of Life (iBOL), and the Biodiversity Information Standards organisation (TDWG). These global organisations each tackle their own special data types and services. For instance, GBIF gives harmonised access to over 50,000 datasets with more than 1.3 billion biodiversity records*
There is a serious disconnect between the e-infrastructure and the physical infrastructure. Museums do have keen interest in digitizing their collections [
There is evidence that industrialising digitization of natural science collections is achievable [
Pioneered by the Herbarium of Muséum National d’histoire Naturelle (MNHN Paris), outsourced digitization activities have been undertaken in several innovative programmes, including those of Digitarium in Finland, Plantentuin Meise in Belgium, Naturalis in the Netherlands, and across France (ReColNat project) and Norway. Further major initiatives are being proposed, for instance in Germany, Sweden, and the UK, and similar approaches are also used in the USA for mass digitization, for example by the Smithsonian Museum of Natural History. Both inhouse and outsourcing approaches at scale have now been implemented and tried across several projects.
Total spending in known European projects is in the range of €10-20 million annually. In the USA, the Advancing the Digitization of Biological Collections (ADBC) program receives $10 million annually from the National Science Foundation for nationally distributed digitization of non-federal collections. Collections in the USA write grants to compete for these digitization funds. iDigBio receives about a fifth of those yearly funds to aggregate the digitized data in a central infrastructure. In addition, iDigBio works collaboratively with the worldwide community to build capacity development for digitization and data use, foster research use of these data, develop enhanced education and outreach materials using collections data, and lead efforts to increase inclusivity in museum collections. The majority rely on major public investments. A problem is that these projects are funded in the short-term. Sustainability can only be achieved when the results are anchored in long-lived RIs, and the new ways of work are internalised by stakeholders. These initiatives will need a common infrastructure to effectively share their technologies, data, and experiences, and jointly build the most demanding functions. This will further accelerate the process and bring in more resources from more countries, when the task becomes tangible. Furthermore, it is crucial to develop strong links between physical and virtual access programmes. Complementarity between physical and virtual access at European level enhances the overall capacity of researchers to discover and retrieve relevant information.
However, in most countries there are no systematic mass digitization programmes. There is lack of funding, lack of skills, workflows able to cope only with low throughput, and lack of suitable ICT systems. This makes the unit cost of digitization too high for rapid mass digitization. The existence of a large research infrastructure that tackles digitization could change this. In other words, to be viable, digitization requires large volumes to become more affordable.
It’s evident there are many understandings of what is meant by digitization, beyond the fundamental definition of being the process of converting analog information about physical specimens to digital format (which includes electronic text, images and other representations). Does it mean creating a database record in a collection management system (CMS)? Does it mean producing one or more photographs of a specimen, and if so, with what characteristics? Does it include transcription of label data into database records? Does it include more than just making the ‘what, where, when and who’ of the gathering event available as digital data? Different digitization initiatives can (and typically do) have varying aims and scope, leading to quite different outcomes in the characteristics of the resulting data sets. This can lead to different uses and benefits and therefore different assessments of usefulness (fitness for purpose) and cost-effectiveness. Small scale, deep digitization can certainly be cost-effective (i.e. offer good value for money in relation to project aims), but it may not be affordable at scale.
Additionally, as digitization proceeds and more varied uses are made of digital specimen data, we are beginning to find out more about what makes such data most useful. This itself is likely to affect what we consider essential data to digitize over time.
DiSSCo data management and stewardship will take place in the context of a wide and extraordinarily varied and complex landscape of data generation and capture, management and use initiatives that extend from the local practices of individual researchers/groups, through institutional and national activities to extensive European and other regional initiatives that collectively make up the global landscape of biodiversity, ecological, and geo-environmental informatics. The range of stakeholders is broad, with many missions and interests, sometimes overlapping. Nevertheless, at present (late 2019) there is hardly any coordinated management of digital specimen and collection data outside of the historical and traditional institutions in all countries charged with caring for collections of physical natural science specimens i.e., the natural science collection-holding institutions such as national and local museums.
The activities of the Global Biodiversity Information Facility (GBIF), whose main mission is the mobilisation of primary biodiversity data are of great significance for DiSSCo. Here, DiSSCo eventually acts as the key stakeholder representing European collections that contribute primary occurrence data to GBIF based on specimens preserved in collections*
Regional initiatives taking place elsewhere in the world are also of significance. These include:
The importance of these initiatives extends beyond exploiting their experience and best practice. Global level alliances and coordination, especially global coordination around persistent identification of Digital Specimens and Digital Collections is essential and beneficial; for example, in working towards:
Against the foregoing background, the general objective of the ICEDIG project has been to lay the ground work for such an organisation, namely DiSSCo, firstly by identifying and recommending the technological innovations that will be needed to efficiently digitize one and a half billion collection objects (including subsequent management of their data) in a foreseeable time, such as the next 30 years; and secondly to identify organisational consolidations needed to perform this task. When this has been achieved, the natural science community will be a fully enabled player in digital society, and the most fundamental scientific data on the diversity of our planet will be freely and openly available for all.
Table 1 – Table 4 below highlight the innovations and consolidations the various tasks of the ICEDIG project have identified and recommended in their detailed work in each of four areas of consideration. These areas are each considered in detail respectively in sections 3 – 6 below and are:
In Tables
Description of the recommended innovation or consolidation |
I/C |
A/W |
ICEDIG WP |
DiSSCo Prepare WPs |
Comment |
DiSSCo related Policy Analysis |
C |
A |
7 |
||
Related research infrastructure landscape |
C |
A |
7 |
||
Alignment of related research infrastructure |
C |
W |
7 |
||
Procedures for developing standards and their review and adoption |
I |
W |
4 |
||
Principles for engaging citizen scientists |
C |
W |
5 |
||
Human transcription of specimen label data remains the most accurate way to obtain structured data for each specimen, though automated systems show potential in large volumes and for certain types of data |
C |
A |
4 |
||
Collection preparation for digitization is an essential element in the digitization process, though it is easily overlooked and underestimated |
C |
A |
3 |
||
Keep digitization processes as lean as possible. Accommodating all exceptions and other collection preservation tasks hinders the rate of digitization |
C |
A |
3 |
||
Use imaging of labels to later enrich MIDS level from 0-1 to 2-3 and data validation |
C |
A |
3 |
See table 2 for entomology case. |
|
DiSSCo should keep an overview of ongoing development and innovations in digitization workflows and technology (knowledge base) |
I |
W |
1/9 |
||
Collection Digitization Dashboard (CDD) is an extremely useful tool for high-level decision-making processes. It facilitates catalyzing categories and different levels of granularity per user community |
I |
2 |
8 |
||
DiSSCo should develop a template to record cost and efficiency of digitization workflows to allow real comparisons |
I |
W |
8 |
Innovation/consolidation |
I/C |
A/W |
ICEDIG WP |
DiSSCo Prepare WPs |
Comment |
Adopt Digital Object Architecture (DOA) as basis for DiSSCo data management |
I |
A |
6 |
||
Nine characteristics to be protected throughout DiSSCo lifetime |
C |
A |
6 |
||
Store annotations separately |
I |
A |
6 |
Same principle can be applied to other object types/transactions of DiSSCo, such as ELViS transactions. |
|
Use Handles as persistent identifiers |
I |
W |
6 |
There is still an open issue as to whether these should be NSId, IGSN or something else. |
|
Separate authoritative and supplementary information about a specimen |
I |
A |
6 |
Notion that trusted/approved experts outside of owning institution can modify authoritative information. |
|
Root the definition of DiSSCo / openDS object model in Biological Collections Ontology (BCO) |
C |
W |
6 |
Implies extending BCO |
|
Base the definition of DiSSCo / openDS object model on ABCD 3.0 with EFG, ensuring alignment and compatibility with DwC |
C |
W |
6 |
Open issue is how to deal with alternate terms used by ABCD and DwC for the same concept. openDS specification should contain a mappings annex. Take note of ABCD/DwC convergence work proposed in TDWG. |
|
Support versions of Digital Specimen objects, with accompanying annotations and interpretations of verbatim data |
I |
W |
6 |
||
Pipeline for automated text digitization and entity recognition |
C |
W |
4 |
||
Review of automated georeferencing methods |
C |
W |
4 |
5 |
|
Recommendations for collection management systems |
C |
W |
4 |
6 |
|
Benchmark dataset of herbarium specimens |
I |
A |
4 |
5 |
|
Interoperability analysis of published specimen data |
C |
A |
4 |
5 |
|
Recommendations for improved standards for specimen data transcription |
C |
W |
4 |
5 |
|
Efficient semi-automated systems for entomology data and image capture |
I |
W |
3 |
In development but need further improvement to digitize the bulk of NH specimens. See D3.5. |
Innovation/consolidation |
I/C |
A/W |
ICEDIG WP |
DiSSCo Prepare WPs |
Comment |
Comparison of text digitization by humans |
I |
W |
4 |
||
Training in data management, such as the FAIR principles will help researchers take good decisions |
|||||
Citizen scientists can play and important role in the creation of data about collections. |
C |
W |
2, 5 |
||
Functional Units approach towards skills to more flexibly manage them by reorganising the competencies in respect to the existing capacities per institution |
I |
W |
8 |
3 |
See MS48, MS49 |
Recurrent training is urgently needed throughout the entire data cycle and digitization process to allow effective implementation |
C |
8 |
2 |
Digitization is the process of making data about physical objects digitally available, and the output of that process in the DiSSCo context is Digital Specimens and Digital Collections. Digital Specimens and Digital Collections are specific types of ‘digital objects’, which are the fundamental entities to be the subject of data management in DiSSCo. Each instance of a digital object collects and organizes core information about the physical things it represents. These identified objects are amenable to processing and to transport from one information system to another. A persistent link must be maintained from the Digital Specimen to the physical specimen it represents, and which acts as a voucher*
Each Digital Specimen or other digital object instance handled by the DiSSCo infrastructure must be unambiguously, universally and persistently identified by an identifier (Natural Science Identifier, NSId) assigned when the object is first created. Each DiSSCo Facility must be responsible for creating (minting) and managing their own NSIds in accordance with the DiSSCo policy for NSIds, and for registering their own Digital Specimens with the DiSSCo Hub infrastructure. Resolution of an NSId must always return the current version of an object’s content, as well as any interpretations and annotations associated with it.
Digital Specimen and Digital Collection objects are examples of the fundamental unit – a digital object – that is manipulated and managed by Digital Object Architecture (DOA), a powerful yet simple extension of the existing Internet. Such objects are treated as mutable objects with access control and object history (provenance) meaning they can be updated as new knowledge becomes available. Provenance data must be generated and preserved by operations acting upon DiSSCo data objects. Timestamped records of change allow reconstruction of a specific ‘version’ of a digital object at a date and time in the past.
Information about Digital Specimens and Digital Collections must be published and managed as part of the European Collection Objects Index (ECOI). Other services, such as Collection Digitization Dashboard (CDD), the European Loans and Visits System (ELViS) and the European Curation and Annotation System (ECAS) build on and work alongside the ECOI service to provide a portfolio of DiSSCo services.
Several characteristics, such as centrality, accuracy and authenticity of the Digital Specimen, protection of data, preservation of readability, traceability/provenance, and annotation history are essential for developing long-term community trust in DiSSCo. They are the protected characteristics of DiSSCo that must be protected throughout the DiSSCo lifetime. Thus, all design decisions (technical, procedural, organisational, etc.) must be assessed for their effect on the protected characteristics. Such decisions and changes must not destroy or lessen the protected characteristics.
The FAIR Guiding Principles (Findable, Accessible, Interoperable, Reusable) for managing scientific data [
Whilst adhering to FAIR principles, DiSSCo data management principles aim to be technology agnostic to the greatest extent possible, expecting that over the DiSSCo lifetime specific data management and processing technologies can evolve and will be replaced. A framework for data management must accommodate this and one such framework is Digital Object Architecture (DOA) [
An important concern in digitization is how much detail to digitize from each physical specimen. While a photographic image can be made quickly, transcribing and interpreting all the details from labels, enriching the data with external information, and making specific measurements of the specimen take more time and resources. The idea of ‘Minimum Information about a Digital Specimen’ (MIDS) has been conceived to capture and structure this complexity. In the present document we repeatedly refer to the MIDS levels. Briefly, they are:
At the time of writing (January 2020) the MIDS concept is still under development. While the definitive specification and published article about MIDS is not available, the reader is advised to refer to the discussion of the subject in the Open Access Implementation Guidelines for DiSSCo*
A related concept, Minimum Information about a Collection (MICS) capturing and structuring an approach to describing collections has also been proposed and is under study. The suggested MICS levels are:
The provisional Data Management Plan for DiSSCo infrastructure (hereinafter, ‘the DMP’) [
The intention in the present document is not to reproduce/duplicate all the information, data management principles and requirements documented in the DMP. Readers are referred to the original document for that [
Presently (2020), the task ahead of us, of mobilizing data from natural science collections is still enormous. Ninety percent (90%) of the collections still need to be mobilized. It is imperative for DiSSCo as it is in fact for all stakeholders and the community at large to tackle the backlog as quickly as possible, balancing this with a continued focus on impact and work to understand what makes data most useable and impactful. Arrangements, processes and practices, including organisational consolidation and use of selective partnerships must be initiated to efficiently digitize one and a half billion collection objects in a foreseeable time, at reasonable cost and to make this data publicly accessible. In addition, however, DiSSCo will need to recognise that digitization impact – the critical metric of success - depends on user needs; the aims of specific projects and research; and the research readiness of the data. Mass digitization workflows are still being developed and piloted for some collection types, and others may never be suitable for an ‘industrial’ approach (e.g., the very largest specimens). For these reasons, while focusing on mass digitization, DiSSCo will recognise and support the need for balanced portfolios of digitization activities that include digitization-on-demand capabilities driven and prioritised by researcher needs.
The need for greater alignment between digitization and biodiversity research agendas has been highlighted by both the Alliance for Biodiversity Knowledge (ABK) and in the Global Biodiversity Informatics Outlooks (GBIO). ABK specifically names one of its ambitions as working “with other research communities and infrastructures to achieve interoperability with earth observations, social science data and other resources.” Stakeholders at GBIC2 called for better coordination mechanisms between research infrastructures and coordination between scientific communities with different but parallel expertise. A clear international digital research agenda for collections will be important in determining precise criteria for both mass digitization (3.9 below) and digitization-on-demand (3.9.3 below).
Recommendation 1: It is important to set a clear international digital research agenda that can serve as a guideline within DiSSCo to determine what to prioritize in terms of digitization of collections and specimens in more detail.
The coordination is required both internally within DiSSCo and externally with other research infrastructures. Research priorities and policies related to collections, ICT and data management vary widely across DiSSCo institutions and have a significant impact on how institutions can collaborate and facilitate a collective agenda. The first step in overcoming policy barriers that limit research alignment is identifying these internal policies and potential sticking points. ICEDIG task 7.2 took the first step towards documenting the existence and subject of policies across institutions (3.2 below). This lays the initial foundations for assessing where potential policy gaps, overlaps and barriers need to be anticipated and overcome in developing a common research agenda – and policy framework that supports it – within DiSSCo institutions. This will be facilitated in part by a new DiSSCo service providing a new central index service (marketplace) of expertise and facilities that lists the sources and availability of taxonomic and other scientific expertise, equipment and facilities across the DiSSCo collection-holding institutions.
External coordination between DiSSCo and other research infrastructures is also necessary for expanding the impact of bio/geodiversity research through collaboration with parallel areas of expertise such as environmental sciences, digital humanities, molecular biology and many others. A clear understanding of the existing RI landscape will also ensure there are no duplication of efforts but rather provides the opportunity to take advantage of existing work and potential efficiencies for achieving biodiversity research objectives.
ICEDIG task 7.3 was a major first step in documenting existing research infrastructures and their potential contributions to the research priorities of DiSSCo. In collaboration with GBIF and the ABK, the development of a Research Infrastructure Database has been started, containing a list of 59 research infrastructures (RIs) like or adjacent to DiSSCo [
Recommendation 2: DiSSCo should promote that a global initiative, such as the Alliance for Biodiversity Knowledge or GBIF adopt and sustain the RI Database, ensuring that it is open, accessible and usable as a tool for identifying potential synergies and opportunities for collaboration with other research infrastructures.
Multiple national and European policy and legal issues can be identified as essentially affecting the successful delivery of DiSSCo. By identifying broad policy areas relevant to digitizing natural science collections and the proposed activities of DiSSCo (Table
1. Access and Benefit Sharing (ABS) 2. Data and digital media publication 3. FAIR / Open Data / Open Access 4. Freedom of information (FOI) 5. Intellectual Property Rights (IPR) 6. Data Standards 7. Personal data 8. Protection of sensitive collections data |
9. Public Sector Information 10. Responsible Research and Innovation (RRI) 11. Cloud services and storage 12. Information risk management 13. Information security 14. Collections access and information 15. Collections care, development and scope 16. Digitization strategy and prioritisation |
Each topic in Table
In some cases, such as freedom of information, intellectual property rights, personal data, public sector information, etc., policy areas are covered by national or European legislation, providing for a base level of harmonisation across all institutions. However, there is considerable variation in practical implementation of these policies, with much detail not sufficiently described or left to local interpretation. In comparison to internal, institution-specific policies, externally mandated policy is often generic, technical and not well-tailored to collection-holding institutions. Critically, the ICEDIG survey has revealed major gaps in policy coverage across all institutions, with some relying largely on national legislation that has been locally interpreted but arguably too abstracted from the details of practical implementation to provide comparison with other institutional policies.
Although policies might have been reported as existing, the amount of detail and formality represented by the available documentation varies widely. Furthermore, language barriers create significant potential for misinterpretation during the analysis, as policy documents were submitted in their original languages of Finnish, Estonian, English, French, German and Dutch. The remainder reported that their policies were either in progress, partially complete, developed external to the organisation, or were not in existence.
‘Collections access and information’ and ‘Collections care, development and scope’ have internal coverage in all institutions, which is likely to reflect that physical collections management policy is generally more mature than other elements of digital collections policy. Generic data and ICT policies such as ‘Information security’, ‘Personal data’ and ‘Intellectual property rights’ are also well represented; the legal and/or regulatory implication of these is likely to have been a driver for policy development in these areas. ‘Data standards’, as a more community-driven subject, appears to be less well embedded. Most institutions do not have a formal policy for ‘Cloud services and storage’ or ‘Public Sector Information’.
Another point to consider when reviewing policy coverage is the extent of policy that is followed with little or no internal documentation i.e., with reliance on external documents. For example, if an institution is a member of CETAF there are policies they will be party to (such as those relating to Stable Identifiers and Passports, for example) that are not specifically transcribed into local (institutional) policies. Similarly, when it comes to access and benefit sharing, institutions that are part of the European Union are obliged to follow the Nagoya Protocol, but this may not be stated as a written policy and in some areas (e.g., Digital Sequence Information) subject to significant variance in interpretation. Other examples include GDPR regulations and the CITES Convention.
Not all institutions are able to share their policies externally. Only one third of institutions stated that all their policy documentation submitted could be shared, whereas 60% of one institution’s documentation was not sharable. This is important to consider when creating common policy around digitization as DiSSCo moves toward a common research agenda.
In sum, the ICEDIG policy analysis faced significant challenges around collecting and interpreting policies from ICEDIG partner institutions. This was due to several factors, including difficulties in obtaining relevant policies, differing levels of policy making it harder to map policies to categories, and policy within different legal frameworks and language barriers that created the potential for misinterpretation. To address these challenges within DiSSCo, solutions should consider:
Such actions must be framed in the context of developing a common policy framework that addresses the gaps in provision currently experienced across DiSSCo institutions, as well as supporting local variation in how these policies are implemented.
Recommendation 3: Actions on common policy elements across DiSSCo institutions must be taken in the context of a common framework of policy definition and implementation that recognises the organisation of policies and responsibility for implementation at local level, and authorization for change within collection-holding institutions.
In natural science collection-based research, citizen scientists are most often engaged in transcribing specimen labels. The use of public, web-based transcription platforms, of which there are several available as open-source in the domain*
In general, there is no single best platform to recommend nor any need for DiSSCo to seek to build a single, universal DiSSCo volunteer platform. The community engaged around existing platforms or instances of platforms such as the DoeDat implementation of DigiVol by Meise Botanic Garden*
Integrating the diversity of platforms into digitization workflows is made possible by implementing a common data exchange protocol*
Recommendation 4: DiSSCo should exploit the diversity of available citizen science platforms (e.g., for specimen label transcription), taking advantage of their individual strengths (surrounding community interest, language specificity, etc.) as appropriate and should encourage such platforms to implement the data exchange format and protocol for transcription platforms (doi: 10.5281/zenodo.2598413), as well as supporting this format/protocol in digitization lines, workflows and collection management systems.
As noted above, a strength of existing transcription platforms is their surrounding community of volunteers engaged in collection transcription. Often though, the fact of citizen scientists’ involvement in data enrichment processes is hidden in the final versions of datasets. Giving proper credit to these people for their contribution is morally right and essential for motivating future contributions, as well as being ethically right for the curation and maintenance of collections (i.e., establishing provenance of data). Furthermore, integrating and exposing citizen science activity through DiSSCo dashboards could be a powerful incentive for increased mobilisation of volunteers in the future*
In the future, it is advisable to ensure that attribution details are preserved in transcription datasets, transferred to collection management and systems and published publicly when such data is published. Darwin Core and GBIF Metadata Profile standards allow to some degree to describe also volunteer involvement in data collection and enrichment. A new recommendation for the representation of attribution metadata recently published by the Research Data Alliance*
Recommendation 5: The involvement of citizen scientists in DiSSCo data work and activities must be properly acknowledged and attributed, for example using Research Data Alliance recommendations for the representation of attribution metadata (doi: 10.15497/RDA00029).
For properly engaging the public with collections and collection digitization, deep understanding of the relationships between collections, particularly digital collections; formal and informal education; museum-related citizen science; and the skills and knowledge that these can advance is needed. This is a complex area that ICEDIG has only touched upon briefly*
Recommendation 6: Recognising the likely future increase in citizen science involvement with natural science collections, DiSSCo should further develop a package of business model principles and guidance that collection-holding institutions can use to design and manage citizen science engagements and activities to their collections.
Strictly speaking, the word 'alliance' should be used instead of partnering or partnership because it's the more accurate and wide-ranging term conferring the idea of being united for a common purpose or for mutual benefit. Alliances match parties’ strength to strength and balance control with collaboration. They increase the capacity and capability of each of the involved parties without necessarily asking the parties to relinquish control from one to the other. From the perspective of DiSSCo needs (and discussed in each of the following three sub-sections 3.4.2 – 3.4.4) three main forms of alliance exist: customer-supplier relationships, strategic alliances and stakeholder investments.
Compared to these three kinds of alliance, true partnerships are much more about participation, pairing and merger of individual interests. Although they are concerned with collaboration, separate control is not retained. They not relevant for DiSSCo at present.
Customer-supplier relationships are relationships in which customers receive goods and services and suppliers receive monetary payments or other considerations. Such relationships are usually more tactical than strategic, although they often are important contributors towards some higher strategic goal. They can be straightforward and often involve a tendering process leading to contractual arrangements that usually include a service level agreement (SLA). Customer-supplier relationships can occur on many levels, from institutional through national and regional to pan-European. Choice is normally governed by the available suppliers having the correct solution to meet requirements at a value for money price. DiSSCo and its member institutions are subject to the requirements of EU legislation on public procurement*
DiSSCo has the need for many kinds of customer-supplier relationships, including the following:
Joint ventures, in which the parties/partners/beneficiaries commit resources to jointly pursue common goals can be either operational or strategic, depending on their purpose. Normally they are the latter, being essential to enhancing the value of each individual party and because the parties separately cannot achieve the desired goals on their own. Strategic joint ventures are normally long-term arrangements. When more than two parties contribute, this is a consortium.
Strategic alliances include development of interfaces that do not necessarily include provision of services or products from either side, but which can lead to mutual benefit for both parties. Strategic alliances can include technology, knowhow, and skills transfer. They are governed by bespoke agreements (MoU, consortium agreement, etc.) setting out the ambitions and obligations of each party in the context of the shared goals.
DiSSCo has the need for many kinds of joint venture, including the following:
There is a third kind of alliance – stakeholder investment – whereby one party makes an investment in a second party with a view to, for example ensuring the sustainability and longevity of the second party, being crucial to the first party's operations, or influencing the behaviour of the second party in a direction more favourable to the first party. The second party benefits by having its resources and future sustainability improved because of the investment. A topical example of this kind of alliance might be the situation whereby DiSSCo becomes a member of the International DOI Foundation or the DONA Foundation, or where investments are made for sustainable software development. This kind of alliance is considered further in the sub-section on shared liabilities (6.4.5 below).
The landscape within which DiSSCo is positioned is complex, and its mission and needs encompass:
On top of this, the broader landscape encompasses other related facilities, resources and services used by the scientific community to conduct research such as instruments, archives or structures for scientific information, computer-assisted tools and ICT infrastructure, such as cloud computing environments and Internet communications.
The question then is: How to organise alliances that allow DiSSCo to operate – to deliver what is mandated to the best of its ability and to deliver that better through alliances than DiSSCo could do alone?
Several Roundtables*
Seeking out complementary infrastructures, commercial organisations/industries and other initiatives, and identifying where joint efforts could exist; building on existing investments; and leveraging cross-infrastructures collaboration is all important work that must be continued within the overall DISSCo programme.
Technological innovation, efficient deployment, harmonized and collaborative infrastructure development will be critical. Potential innovators working in closely related fields such as optics, robotics, artificial intelligence, geo-localisation, imaging, lab instruments, data storage and many others are to be further engaged as necessary e.g., for joint development of collective technology and infrastructure that would each on its own require investment and competence beyond the capacity of a single RI, or with innovative or already existing technical capabilities such as Artificial Intelligence, or connective infrastructure such as Access and Authentication Infrastructure.
Organising alliances must occur in the framework of the common digital research agenda (3.1) to enhance identification of optimal potential partners, for example where third party data services can potentially augment the DiSSCo offering and produce mutual research benefit e.g., ELIXIR life sciences services. Quantitative assessment and due diligence must be carried out on relevant activities, services or components of organisations and infrastructures that can inform decisions on alliance making as the DiSSCo blueprint is further developed.
Recommendation 7: DiSSCo should estab lish criteria and procedures for assessment and due diligence of activities, services and components of relevance to any potential alliance that keep in mind the common digital research agenda of DiSSCo.
Convergence of the ESFRI RIs in the common landscape is necessary and should be continued. The different state of RIs and their timing do not help in this endeavour as different facets, like needs and solutions, develop at different speeds. This also applies to the relationship between RIs and the e-infrastructure providers. Nevertheless, RIs will have to invest efforts to identify interfaces with common lines of production from where then common services could emerge. This needs commitment from other RIs – the new ESFRIs, established ESFRIs, and ERICs – to continue this dialogue. Such a dialogue began with Roundtable 5 and will be continued.
Recommendation 8: DiSSCo should continue dialogue with representatives of complementary research infrastructures to ensure convergence towards a common approach, especially in the context of the European Open Science Cloud.
Building better and more effective coordination mechanisms for international cooperation across natural science collections globally is essential to overcome the challenges of an increasingly crowded and complex landscape. Specifically, duplication of effort must be avoided but much more importantly: divergence in approach to enabling global collections-based science must be avoided at all costs.
Recommendation 9: Working proactively with international partners DiSSCo should aim to avoid divergence in technical approach to the support of global collections-based science.
When it comes to joint venture, challenges arise from the distributed nature of the RI. DiSSCo will exist in a complicated landscape of: i) European and international obligations and initiatives (2.1.6) ; ii) legal constraints and regulatory implications (3.8); and iii) national interests. Each of these must be successfully navigated.
Alliances must be forged appropriately, along three main lines:
DiSSCo already has several strategic opportunities available to it that must be further developed during the Preparatory Phase for mutual benefit during the construction and operation phases.
One big strategic advantage for DiSSCo is its fundamental alliance with CETAF, the Consortium of European Taxonomic Facilities*
Another advantage available to DiSSCo is the already well-established network of GBIF national nodes and GBIF itself as a global initiative with the goal to mobilise the world’s primary biodiversity data. Note however, that this also represents an area of overlapping interests and thus potential conflict.
Digitization initiatives around the world (see 2.1.6) such as ADBC/iDigBio, NRCA Digital and National Specimen Information Infrastructure, each with similar aims to DiSSCo have much to offer in the way of mutual support and development in pursuit of seeking global solutions to common aims for digitized and extended natural science collections.
Recommendation 10: DiSSCo should further clarify, develop and nurture the joint ventures (strategic alliances) that will be important to its plans and operations, including with CETAF, GBIF, iDigBio (or equivalent), EOSC, etc.
A DiSSCo Centre of Excellence (DCE) is a designated DiSSCo Facility that specialises in one or more of researching, innovating, developing and operating/performing techniques and/or processes of digitization or other related facets, and disseminating information on same.
There are several ways by which DCE can be established, funded and governed, including for example by combining private sector technology, innovation and training with publicly funded digitization projects (i.e., private/public partnerships) involving the previously mentioned organisations such as Bioshare, Dinarda, and Picturae. (See also 3.9.5 below).
Thematic DCE (considered in detail in the [
The private sector has a potential role to play in several areas of DiSSCo operations, being mainly product or service supply and (where appropriate) maintenance and/or training in the following areas:
There are opportunities, for example in automated text digitization (4.6.2) for partnerships with global companies such as Google.
A further possibility arises when commercial companies use and benefit from the ‘free’ data, agreeing perhaps to give funding in return. Among geological collections in the USA, this is an opportunity that is just beginning to be tapped.
According to the [
Through the ICEDIG project, DiSSCo has worked closely with open providers of data storage and archival services at national level*
Recommendation 11: DiSSCo should exploit generic services for data storage and computation where possible and procure from service providers having the ambitions and aims of open science at the heart of their mission.
DiSSCo has welcomed and fully endorsed*
Recommendation 12: DiSSCo must adopt the FAIR Digital Object Framework (FDOF) and its realisation through Digital Specimen, Digital Collection and other relevant digital object types as the basis for complying with the FAIR Guiding Principles for natural sciences data management, and as the means of delivering FAIR compliant natural sciences data into the European Open Science Cloud (EOSC).
Note: At the time of writing (January 2020), several institutions active in the DiSSCo programme planning are committing towards a proposal for a new EC-funded project under work programme item H2020‑INFRAEOSC-03-2020*
The legal establishment and financing of a supranational research infrastructure (RI) such as DiSSCo can be lengthy, complex and difficult, taking several years to implement. The applicable laws are: community law, the law of the Member State of the statutory seat, and law of the Member States where operations are carried out. Several formalisms are possible and applicable at different stages of an RI’s lifetime.
3.8.1.1. Early stage arrangements
It’s possible during the early stages of an RI’s life (such as for investigation, planning and organisation) to be adequately served by a simple consortium agreement, or an arrangement based on a Memorandum of Understanding (MoU) signed by a few institutions. Indeed, this is the case currently for DiSSCo whereby stakeholders have committed via an MoU to support and contribute to the development of DiSSCo's mission and goals. MoUs are not legally binding but carry a degree of seriousness of intent and have mutual respect from the signatories. Often, MoUs are the first step towards a more permanent agreement with a legally binding basis.
An MoU acting as a framework for cooperation can be further strengthened when accompanied by a set of statutes. Such statutes can, to all intents and purposes be nearly identical to those that would exist in conjunction with more formal arrangements (see below), and as such can help to provide a smooth transition when necessary.
In addition, specific funded projects during the early stages of DiSSCo’s life, such as ICEDIG, SYNTHESYS+ and DiSSCo Prepare are normally organised based on formal consortium agreements between the beneficiaries, who are not necessarily all signatories to the MoU but nevertheless share some common interests and motivations.
3.8.1.2. More formal arrangements
On the other hand, creation of the final, long-term collaboration over many years expected of a supranational RI most likely requires a negotiated intergovernmental agreement. Being backed by governmental/ministerial decisions, negotiated agreements of this type have a significance and weight beyond other formalisms. Although these arrangements can take a long time to achieve, they can be stable and enduring once established.
To support such agreements in respect of European RIs, the European Commission, responding to requests from EU Member States and the scientific community, has proposed a community legal framework - a European Research Infrastructure Consortium (ERIC) - adapted to the needs of new European-level research facilities and of distributed infrastructures as well. An ERIC is an entity with a legal status recognised in all EU Member States. It meets the needs for recognition of the European identity on a non-economic basis, has a flexible internal structure to accommodate diverse types of infrastructures, and provides some privileges and exemptions (most notably, with respect to Value Added Tax (VAT)). Under certain conditions, an ERIC can also include non-EU partner countries.
Another approach is a public benefit foundation, established to carry out the mission of DiSSCo. However, with no members or shareholders, and no formal recognition in European law, such a foundation must be established in a single country. A foundation of this kind must be certain that it is guaranteed not only a reliable income stream for the long-term (e.g., by contributions from its supporting organisations to carry out their mandated activities) but also that the ongoing and long-term support of the stakeholders is certain. It is not always a reliable arrangement.
3.8.1.3. Importance of correct legal status
The choice of the correct legal status is also important from funding and commercial perspectives. A minimal solution such as the MoU approach described above that permits achieving the scientific goals with the smallest possible legal, administrative and financial complications, whilst administratively preferable, could underestimate the need for an adequate legal foundation. For example, with MoUs and other non-legal forms, a group of institutions cannot enter collectively into commercial contracts and other agreements with third parties (such as product or service suppliers). This must be done (and the risk assumed) by one of the institutions alone. Similarly, funding agencies often have formal eligibility requirements regarding the legal standing of entities that submit proposals or are the recipients of grants. Whether this is a single institution acting on behalf of a consortium (as, for example in European framework funding for research) or whether it is with an entity that is a representation and constitution of the institutions collectively, has significant administrative, taxation, employment and financial regulatory implications.
Recommendation 13: DiSSCo should propose the legal form required to achieve its aims and objectives and to administer and support its operations, keeping in mind the need for long-term viability and stability, the need to be able to enter into legal agreements with third-parties, and the need to assume responsibility for and mitigate risks and liabilities.
Many of the 16 policy areas studied by ICEDIG (see Table
Some examples of specific legislation applying directly to DiSSCo activities with implications on data management practices:
Further study is needed to compile guidance for how DiSSCo and its member institutions should implement and demonstrate compliance with legal requirements of such legislation.
In other areas, there is more discretion and flexibility. Legislation and regulations may not apply directly (Freedom of Information and Intellectual Property Rights are two example areas) but taking the appropriate legislation into account when designing data management practices and systems can make compliance much easier to achieve and police.
In both cases, it is sensible that DiSSCo and its member institutions adopt, as far as possible common approaches to data management to comply with such requirements. This is a topic for the data management plan of DiSSCo that is marked for further study in the present provisional data management plan [
Recommendation 14: For each broad policy area affecting DiSSCo activities and directly covered by mandatory legal and regulatory considerations for data management, DiSSCo must list the legislation and regulations that apply at national and/or European level and say how DiSSCo and its member institutions will comply with each of the mandatory requirements (for example, by indicating specific clauses in the DiSSCo Data Management Plan). The broad policy areas are: i) Access and Benefit Sharing (ABS); ii) Data and digital media publication; iii) FAIR / Open Data / Open Access; iv) Freedom of information (FOI); v) Intellectual Property Rights (IPR); vi) Data Standards; vii) Personal data; viii) Protection of sensitive collections data; ix) Public Sector Information; x) Responsible Research and Innovation (RRI); xi) Cloud services and storage; xii) Information risk management; xiii) Information security; xiv) Collections access and information; xv) Collections care, development and scope; and xvi) Digitization strategy and prioritisation.
Recommendation 15: For each of the broad policy areas mentioned in recommendation 14 affecting DiSSCo activities and affected by legislation indirectly, DiSSCo should state what practices it will adopt to make compliance easier to achieve and police.
Specific attention will need to be given to the topic of moving sensitive data about collections, specimens, personnel (e.g., collectors) and places across international borders in cases where DiSSCo collection-holding institutions are not located in Member State countries belonging to the European Union. In such cases, there are additional considerations, such as (for example) whether the European Commission has determined that the country in question provides an adequate level of data protection*
Recommendation 16: DiSSCo must give specific attention (perhaps by implementing a ‘compliance and moderation service’) to the rules governing the movement of sensitive data across international borders i.e., between European Union Member States and third countries (including defining specifically what is meant by ‘sensitive data’ in the context of the legislation affecting DiSSCo operations).
In 2012, the European Commission published a first recommendation, updated in 2018*
Implemented by extending the Open Research Data pilot into all thematic areas of Horizon 2020 funded work programmes in 2017, the Commission’s Recommendation applies also to DiSSCo and to its institutional members when they enter into EC-funded Grant Agreements for project-based work under Horizon 2020, and most likely the following research and innovation framework programme, Horizon Europe. Implementation and enforcement are generally contractually via model Grant Agreement clauses. The Guidelines to the Rules of Open Access to Scientific Publications and Open Access to Research Data in Horizon 2020*
This framework is strongly binding for all publicly funded collection-holding institutions, whether it is explicitly transformed into an institutional policy or not. It makes clear that natural sciences data collected in these institutions must be findable, accessible, interoperable and reusable (FAIR) by default. Thus, requirements and recommendations of EC Recommendations on access to and preservation of scientific information, including research data are implemented through DiSSCo’s open access guidelines (see 4.4 below) and measures for complying with the FAIR Guiding Principles (see 2.3.2).
For scientific collections we define mass digitization to be an activity where entire collections, or their distinct major parts, are digitized from one end to the other, without selecting individual specimens.
Mass digitization is characterised by improved technological and procedural frameworks based on automation (both hardware and software, e.g. use of conveyors, barcodes, machine learning approaches etc.) and enrichment (link-building), although currently and for the foreseeable future the role of human operators/digitizers remains paramount in many stages of digitization. Mass digitization means implementing workflows at industrial scale, i.e., processing millions of objects annually at relatively low cost. The large scale of European natural science collections provides the opportunity and need for such industrial approaches, but also makes it imperative to maintain downward pressure on costs if this work is to be affordable at the scale required (3.9.3).
Mass digitization often contrasts with demand-led/driven digitization (also known as digitization-on-demand) (3.10 below).
Several major factors affect the choice of mass digitization approach, including the size of the collection to be digitized; availability of staff and their time, availability of funding; availability of space to locate equipment; and whether transportation of the collection is viewed as safe and viable. These factors broadly can be used to decide between: i) in-house digitization with own equipment and by own staff; ii) in-house digitization carried out by a contractor using their own equipment installed in the premises; and iii) digitization out-sourced to a contractor, which requires transportation of the collection to an off-site facility run by the contractor*
The type of collection also is a factor. The technology of mass digitization is already well-established for herbarium sheets, which are more-or-less two-dimensional and other objects that can be placed flat (focus depth 10 cm) on (A3 sized) trays. Herbarium sheets are routinely being transported for digitization at remote facilities. Similar techniques are slowly becoming available for pinned insects as well. These are seldom shipped to remote facilities for digitization though, because of their perceived fragility and because the size of the required digitization equipment is smaller overall and often better fits into available space in museum buildings. These two collection types cover approximately 80% of all specimens. For the rest such as liquid samples, microscope slides, most vertebrate material, and non-biological specimens, mass digitization techniques still need to be better conceived; although some progress is evident e.g., [
Recommendation 17: DiSSCo strategy for mass digitization must focus on clearing the historical backlog of undigitized specimens in the next 20 – 30 years, whilst recognising that newly collected accessions, and small and private collections also each require their own organisation (workflow) of digitization to prevent new backlogs from forming. The specific collection type also dictates the appropriate technical approach and although herbarium sheet and pinned insect digitization is well-developed or advancing, greater emphasis must be placed on other collection types, including non-biological ones.
There are currently a wide range of per-specimen costs for digitization, which represent workflow differences, variation between institutions, as well as variation in the aims of projects and the level/extent of digital data generated. ICEDIG has proposed templates for collecting full economic costs of digitization and has begun to record these costs (5.3.3 below). Further evidence will need to be collected on an ongoing basis. Cost effectiveness or value depends on the balance between costs and benefits/impact and may often be positive even where per specimen costs are quite high, provided the reasons for these costs are proportionate to the desired outcomes. However, even if cost effective, higher costs will limit the affordability and achievability of mass digitization for the 90% of collections remaining, so there must be a focus on achieving lower average item costs overall. It is inevitable that, even as average costs reduce, some types of collections (e.g., the largest items, those where 2D imaging does not meet scientific needs, and many of those preserved in spirit) are likely to have higher per item costs. For many types of collections, however, mass digitization workflows exist and are being refined to reduce cost. This includes collection types that account for much of the volume to be digitized, including herbarium sheets and pinned insects. Imaging millions of these specimens should be viable at average target cost of around €0.20 per specimen, and transcription of their essential (MIDS-2 level) data should cost less than €0.30 each when properly supported by technological and automated approaches, and/or done in less expensive parts of the world.
Recommendation 18: DiSSCo should plan to achieve an average digitization cost of €0.5 or less per specimen, across major collections types to which mass workflows can be applied.
Digitization-on-demand offers one key form of prioritisation, leaving choices about digitization to be largely influenced by user needs and requests. Mass digitization, however, also requires prioritization.
Considering that backlogs of undigitized historical material, newly collected accessions, and small and private collections each require a different approach, and that the specific type of a collection to be digitized also dictates an appropriate technical approach, we can now try to formulate some questions to underpin decisions by DiSSCo on how best to move forward with arranging mass digitization. These questions, with some further discussion of each below are:
1 Inhouse vs outsourced: Centralised, specialist digitization teams have become the norm within many European collection-holding institutions [
Recommendation 19: DiSSCo should develop a decision support tool to assist institutions to decide on the optimal strategy for digitization of their collections in-house, outsourced, or mixed approach.
2 Specialisation: Developing new and more efficient digitization solutions requires experience and knowledge of new technologies. While the related research and experimentation can be done at collection-holding institutions, putting the results into operation at large scale requires knowledge of industrial engineering and methods not always found in such institutions (or taking time and resource to develop). It might be necessary for DiSSCo to establish some thematically specialised DiSSCo Centres of Excellence (DCE) in key areas (3.9.5). This could be organised through competitive bidding targeted to interdisciplinary consortia. Some of these centres could perhaps be combined with out-sourced digitization factories.
Recommendation 20: DiSSCo should launch early calls for consortia to establish specialised Centres for Excellence on mass-digitization, in readiness for entering the operational phase of DiSSCo.
3 Managing ‘flow’ versus addressing backlogs; and sharing costs: Institutions have different scales of mass digitization backlogs, and of acquisition rates. For the largest collections, clearing backlogs through mass digitization is an activity with no end yet in sight, and acquisition rates may also be at a scale requiring ongoing mass digitization approaches. For others, however, clearing their backlog may only last for a defined period of years, after which mass digitization equipment is likely to be depreciated, and eventually become redundant or only used at lower scale. Rates of acquisition may not be high enough to support outsourced approaches. The [
Another possibility, after backlog has been cleared at one institution, is to move the equipment to another with a backlog awaiting digitization. This allows costs to be shared and might work best if the equipment is not owned by an individual institution but either shared with other institutions or leased from a specialist company or DiSSCo Centre of Excellence who knows how to move such systems.
Recommendation 21: DiSSCo should promote re-use and/or cost-sharing of digitization equipment across institutions and projects where possible, particularly for smaller collections.
4 Small and private collections: There are many small and private collections across Europe that could, with appropriate support be encouraged and assisted to digitize. However, the approach is different from that for large collections. The incentives and support for this are considered in 3.11 below.
5 Decentralisation: Currently there are three facilities for outsourced mass digitization in Europe, run by private sector businesses: Picturae*
When DiSSCo goes into full operation, we can expect a digitization rate of about 40 million specimens annually. While much of that will be done in-house, more capacity from the established out-sourced facilities will be demanded, and new facilities might be established in other parts of Europe. Another option, as has been done by the Smithsonian is to actively cultivate a collaboration of imaging equipment suppliers working with photographic or manuscript imaging companies to build expertise in natural science and other specimen types – i.e., by taking a more active role, a marketplace for outsourced solutions might be encouraged sooner rather than waiting to see if these occur organically. There are, of course several companies in Europe that might be interested in such an approach.
6 DiSSCo’s role: DiSSCo will provide central infrastructure for data, and guidance and standards for digitization, but will DiSSCo own and operate mass digitization systems and services? This has not been decided yet. Many other RI’s do own and control their central facilities but in the DiSSCo case there are likely to be constraints on the movement of physical specimens, especially across national boundaries. In-house systems would clearly belong to their respective institutions, but it would be the DiSSCo Centres of Excellence and their digitization facilities (if any) whose ownership and role need discussion. Four service clusters (digitization, programme, infrastructure, data management) can be identified as necessary for DiSSCo to offer, as outlined in Table
Four service clusters of digitization-related services identified for DiSSCo.
Digitization |
Programme |
Infrastructure |
Data |
|
|
|
|
These service clusters fit at different organisational levels within DiSSCo – institutional, national, regional, or pan-European. The programme and data service clusters are best suited to being organised at the pan-European level, while digitization services would best fit at institutional level and in regional collaborations.
Recommendation 22: DiSSCo should design a portfolio of services and support fitting to several organisational levels that supports the ambition to organise and consolidate a distributed system of scientific collections across Europe.
Recommendation 23: DiSSCo should determine the required number, locations and specialisations of digitization (or related) facilities across Europe, including Centres of Excellence where appropriate.
7 Warehousing: Scientific collections are typically housed in museums located in city centres at expensive property. Although largely historical, this choice remains relevant today, as museum exhibitions must be located where large numbers of visitors can be easily attracted. On the other hand, combining exhibitions and collections under the same roof is not always strictly necessary. Collections take space. Industrial digitization equipment and logistics also need space and access. One alternative is to move collections of the ‘B’ category (e.g., of common species, a lot of specimens present in the collection, or specimens taking up a lot of space) to a dedicated warehouse after digitization. While digitization can sometimes increase demand for collections, it can also satisfy much of this demand digitally if digitization-on-demand can be offered to the appropriate levels of detail. A ‘digital by default’ access model would take this a step further by taking a needs-based approach to access. In such a model digital access is provided as the norm where possible and physical access is provided only for research etc. that cannot be accomplished digitally. These are all considerations when looking at where scientific collections can be effectively, securely and cost-effectively housed. ‘Warehouse’ type spaces for collections may also allow for improvements in storage unit standardisation, and in automated and robotic approaches. This has been demonstrated in libraries but not yet in natural science collections at scale. It may also be possible to combine demand-driven digitization with mass digitization at such facilities. It is recommended that DiSSCo explores the possibility to work with one or more relevant institutions to build and operate such a facility for an ‘out-of-town’ fully automated, industrial scale specimen storage and digitization facility.
Recommendation 24: DiSSCo should consider building an experimental facility (DiSSCo Centre of Excellence) for an ‘out-of-town’ fully automated, industrial scale specimen storage and digitization facility.
Answers to the above questions are not ‘yes’ or ‘no’ but require optimisation and achieving the correct balance. The questions and their answers are inter-dependent with one another, with different institutions perhaps coming to different conclusions as they balance off the factors.
DiSSCo must offer structured leadership and in digitization and data management approaches, helping institutions reach the right decisions for themselves whilst operating within a consistent framework of standards and best practices, and understanding the full range of choices available to them. This means aiming to develop and disseminate the highest standards of best practice, delivering training for staff, and helping institutions achieve the best digitization results they can in terms of quality (best, according to agreed specifications), time (highest throughput, fast), and cost (lowest, minimal per specimen) for each kind of digitization activity.
Recommendation 25: For each kind of digitization and collection type, DiSSCo should offer structured leadership in digitization approaches, proposing best practice approaches to its institutional members, and helping them to achieve the best digitization results in terms of quality (best, according to agreed specifications), time (highest throughput, fast), and cost (lowest, minimal per specimen) for each specific kind of digitization activity.
Such leadership can be achieved through a ‘centres of excellence’ model (considered in detail in [
Topics for DCEs can be found by looking at the areas where rapid progress is needed and is viable, given the current state of technology readiness. Topics requiring significant expertise pooling and resources would also be suitable, as would a focus on addressing workflows not yet developed to mass scale. DCEs could vary from offering training to active digitization support, or in any combination in between. The ICEDIG project was already organised along these lines, but as a design study ICEDIG did not build anything operational. However, gaps of knowledge and lack of solutions were found in several areas such as robotics and automation of 3D imaging (WP3); AI/ML in image/data analysis and transcription (WP4); involvement and integration of private collections (WP5); and very-large-scale data management (WP6). The next phase of DiSSCo must investigate these areas further with the aim of developing operational mass-digitization solutions. This is the role of specialised DCE consortia.
Intuitively, it’s not enough to just work to mass digitize an entire backlog of specimens, at a relatively shallow level of data mobilization (3.9). In contrast to mass digitization, demand-led/driven digitization (also known as digitization-on-demand) is based on specific requests for selected specimens, and/or on requests for a deeper level of data (e.g., additional images or full georeferencing).
Nevertheless, digitization-on-demand should where possible make use of mass workflows. It can also help to develop and refine new mass workflows; and if adopted as a ‘digital by default’ access route to collections can at some point reach mass scale (i.e., as the level of demand/requests rises) and become efficient, affordable and cost-effective.
Matching digitization to contemporary research needs by prioritising and selecting what to digitize is one very important mechanism for directing mass digitization and maximising the effectiveness of resource allocation. Making digitization available ‘on demand’ in response to specific requests is also essential to meet immediate needs of the community, and to manage access to collections in a world where collections (and user) resources are often very stretched. Similarly, digitization-on-demand should also be flexible enough to include dealing with newly collected material, which is often planned in association with current research projects in progress.
Demand-led and research-ready digitization thus has several facets to it, including:
These are each considered in the following subsections.
For digitization by DiSSCo to be successful it is essential to have a framework of transparent criteria that enables a demand-driven prioritization of the digitization of natural science collections.
Collection relevance, economic relevance, scientific relevance and social relevance are all important categories of criteria, with those of scientific relevance being the most frequently mentioned as used and important for prioritizing digitization of natural science collections*
Decision trees to assess feasibility, scoring methods to assess relative importance and expert panel review for well-informed decisions on relevance are all appropriate component tools for establishing and operating such a framework, to which a clear, harmonized common digital (scientific) research (see 3.1 above) is also an essential input.
Note: The planned SYNTHESYS+ Virtual Access (VA) work will build implementation experience in this area. The VA review panel will have to prioritise digitization-on-demand applications through a set of criteria to be determined by that project.
Recommendation 26: Based principally on scientific relevance but also considering collection, economic and societal relevance and feasibility / cost-effectiveness, DiSSCo must establish a framework of prioritisation criteria and a set of tools and procedures for making and objectively justifying consistent digitization prioritisation decisions.
A Digitization-on-Demand (DoD) service to deal with specific requests from scientists (both individual and groups) and others to digitize selected specimens will be an important high-value service that DiSSCo can offer. Such a service also offers a means of easily and quickly dealing with newly collected material. DoD services can sit alongside other (mass) digitization initiatives.
The SYNTHESYS+ Access programme is presently (during 2020) trialling a 'digitization-on-demand' (DoD) model that can form the eventual basis for DiSSCo DoD services and workflows*
Deciding what data to make available from DoD, or indeed designing how to accommodate a broad range of different kinds of data requests will be an important element of the design. Digital data from DoD services can potentially include basic/regular data (cf. MIDS levels 1, 2), digital photographic images, 3D scans and other complex image types, digital molecular or chemical data, etc. Valuable lessons will come from the previously mentioned SYSTHESYS+ Access programme work on which kinds of digital data are truly valuable.
Substituting DoD for loans of specimens is an important use case here. When a loan request arrives, DoD could be offered first. If that is not enough, only then would specimens be sent. Some institutions have already stopped loans of certain kinds of material (e.g., type specimens) owing to risk. A DoD practice would enable such material to be used digitally, as well as serve as a digital backup in case of loss or damage.
Recommendation 27: DiSSCo should design and promote Digitization-on-Demand services and workflows appropriate to different collection and specimen categories that can be adopted by collection-holding institutions to become part of their normal business of digitization, including for accession of newly collected materials.
Most major collections are organised by a biological taxonomy. This makes it easy to pull out selected specimens for DoD based on belonging to a specific taxonomic group. However, pulling out specimens on any other criterion, such as geography, time, collector, etc., is slow, difficult or even practically impossible.
Pulling selected specimens out of a collection for on-demand digitization creates a tracking challenge, both in the collection itself (cabinet, drawer, etc.) and the collection management system to know what has been digitized and what has not. This must be addressed as part of the introduction of DoD processes.
Establishing a good inventory across entire collections at MIDS-1 level would dramatically improve this situation (as well as providing a comprehensive cohort of Digital Specimen data for indexing by the ECOI). Such an inventory can be made in several ways, such as mass imaging with no manual transcription, or with minimal transcription exploiting OCR (Optical Character Recognition) and AI (Artificial Intelligence) assisted image analysis (see 4.6.3), imaging full drawers, rapid cataloguing of new accessions, and digitization of historical field notebooks. Such an inventory facilitates DoD. In fact, mass imaging and DoD are not opposites but facilitate each other.
Recommendation 28: DiSSCo institutions should consider quickly creating MIDS-1 level inventories of their entire holdings to facilitate access to specimens and planning of more detailed digitization activities, and to create a comprehensive cohort of Digital Specimen data.
There is no single location where information on the extent of digitization of European natural science collections is gathered; where it can be easily consulted and used by decision-makers and scientists. Such information can be helpful when planning digitization (either mass or DoD) of certain taxonomic or geographic parts of collections. The most suitable way to gather and present this information is as a visual dashboard supported by an underlying service that collects, transforms and collates the data.
Different kinds of visualisation are possible, with analytical dashboards having greatest complexity and needing large amounts of collected data to drive them. Initially, DiSSCo’s dashboard focus is on collection-level information. The Collection Digitization Dashboard (CDD) will be an online visual dashboard presentation that makes European natural science collections visible and discoverable. Highlighting institutional holdings and contributions, their strengths and weaknesses, it primarily presents high-level collection data (ref. 2.3.1) for general communication, for future digitization planning and as a data discovery tool. A CDD can also assist a collection to discover its own uniqueness.
Two prototype Collection Digitisation Dashboards were explored in the ICEDIG project, also in conjunction with a Roundtable discussion. The first is available via the DiSSCo website: https://www.dissco.eu/network/ (near the bottom of the page) and the second through The Netherlands country page on the DiSSCo website: https://www.dissco.eu/network*
The Biodiversity Information Standards (TDWG) organisation is presently working on a new standard for Collection Descriptions*
Recommendation 29: DiSSCo’s Collection Digitization Dashboard (CDD) service must be compatible with and implement TDWG Collection Description standard(s) when this becomes available.
Further work is needed now to fine-tune and improve the prototype dashboard implementations based on relevant feedback from the main user groups (which must be obtained), and to move from prototype to robust, sustained service. This includes automating the collection, transformation/collation and presentation of the data as much as is possible for efficiency and reliability reasons. However, for an interim period it is likely that a manual data entry mechanism will be needed whereby collection-holding institutions can login, enter and adjust their own data on an institutional page, which is then rolled-up to the aggregate dashboard pages.
Recommendation 30: The service underlying the Collection Digitization Dashboard should automate as much as possible the collection, transformation/collation and presentation of collection-level information from collection-holding institutions. For an interim period, manual data entry may be necessary to ensure early public availability of collection-level information (i.e., while work to complete automation of data collection is in progress).
Note: The SYNTHESYS+ project has a task dedicated to “integrate[ing] and expand[ing] institutional collection assessments”, where this work can be further pursued. SYNTHESYS+ can help by contributing to a design that is scalable, elegant, easy to link to collection management systems (CMS), and employs the use of people identifiers (ORCIDs) (and their roles) to enhance the usability and automation possible. These developments can also furthered by the MOBILISE Cost Action.
CDD styling (‘look and feel’) should adhere to DiSSCo common design specifications (see 3.12.1.5).
Results from a survey of over 1,000 private collection owners carried out across Europe in 2018*
65% of the respondents surveyed said that they already manage their collection data fully or partly electronically. Over 90% are interested in sharing their data in some way, preferably through a public website, and by listing metadata of their collection in a public register. Most private collection owners (55%) indicated they need tools, such as a dataset template or a web-based digitization platform, together with guidelines (36%) and physical equipment (27%).
Thus, small/private collections will need to receive more attention in the future to identify how these can be most easily incorporated within DiSSCo, both to benefit the community at large and the collections themselves. Potentially, European small/private collections could be digitized with very complete records within a matter of a small number of years (e.g., 5-7) if approached correctly and supported, say with CMS-as-a-service and an appropriately configured DCE (3.9.4 (2), 3.9.5).
Efforts to help private collection owners to digitize their collections should focus on providing information on how to get started with digitization and encouraging deposition/registration via the European Collection Objects Index (ECOI) service.
Recommendation 31: DiSSCo should provide guidelines for how private actors can digitize their collections and share data via the ECOI service and should ensure that the European Collection Objects Index (ECOI) service offers catalogues of private collections.
Among private collections owners, digitization and data sharing activities are considered important overall. These activities are already undertaken in some cases. For those owners thinking to digitize their collections, various kinds of new support from DiSSCo (e.g., tools, platform, training) would be of great help, and might act as an extra incentive to begin. Such support should be targeted initially towards those private collections of special significance and relevance. Lack of time is also often mentioned as a reason not to have started with digitization, suggesting that use of additional volunteers can offer an important additional means to encourage private collection owners.
Accounting for differences between collection types, their owners and the owners’ motivations, communication strategies towards private collections owners should focus on the benefits and mechanisms of digitization and sharing of data online. As well as informing and educating, communications should offer access to appropriate digitization guidance, training and the available supporting tools. In this context it is of utmost importance to be clear about the meaning of the term ‘digitization’, and what is involved and expected. This could be explained, for example as achieving successively more comprehensive levels of digitization over time by aiming for specific levels of minimum information published about and by a private collection (cf. MICS, MIDS for open access, section 4.4).
Recommendation 32: DiSSCo should develop a package of support measures (communication of benefits, education/training in digitization, digitization tools and facilities, access to data sharing platform, use of volunteers, etc.) targeted towards private collection owners in line with digitization prioritisation decisions, to increase digitization of these kinds of collections.
There are many variabilities and uncertainties about private collections, their owners and the extent of existing digitization so maintaining flexibility in communications and support will be essential. Because there is currently no complete, up-to-date list of private collections, it can be difficult to choose the most appropriate channels for communicating. It is also important to bear in mind that besides private collection owners there are other stakeholders (associations, museums, international organisations) that can play an important enabling role in bringing private collections online. Any plan to coordinate this effort should include policy development that discusses and develops strategy for moving private collections into institutions at some point when suitable. To capture only the data without planning for long-term care of the specimens as well is short-sighted. And, it risks giving the impression that the specimens aren’t needed once the data is captured.
As part of its remit to publish minimum information about available collections, DiSSCo should maintain an online inventory (e.g., a website) of available private collections and their characteristics, with an associated protocol for keeping this up to date. Note, that this could be implemented as a standalone webpage/site and/or as a specific filter/facet of entries in the European Collection Objects Index service (see 2.3.1 and 4.2).
Recommendation 33: As part of its remit to publish minimum information about available collections, DiSSCo should maintain an online inventory (e.g., a website) of available private collections and their characteristics, with an associated protocol for keeping this up to date.
3.12.1.1. Adopting off-the-shelf components *
The decision to depend on specific software is as important as any decision to depend on a specific scientific instrument and should be taken equally carefully. The ICEDIG decision to favour CORDRA object server software, the Handle system and the Digital Object Interface Protocol (DOIP) as key components for implementing the adopted architecture (4.1.1 below) is made with full knowledge of the track record and pedigree of the organisations (cnri.net, dona.net) behind these initiatives. This is backed also by evidence of success provided by several well-established leading-edge use cases that include journal article and dataset identification by the International DOI Foundation (doi.org) and the film and television industry supply chain (eidr.org).
Adopting off-the-shelf components enables quick bootstrapping towards prototype and eventually, workable systems, as is being shown through work implementing the Digital Specimen Demonstrator in the ICEDIG project and the ELViS system in the SYNTHESYS+ project. Valuable experience is being gained. Nevertheless, much new software, which is surely essential for many of the innovative functionalities and services that the DiSSCo vision foresees, is still needed. This is software that will persist, becoming key to the culture and practices of the working community it serves. As such, it must be engineered with care as it will be mission critical.
3.12.1.2. Investing for software development
Adequate investment, and time for development, testing, deployment and maintenance, must not be underestimated. Nor must it be thought that such software will just magically happen without active initiatives to create, train and enable the software engineering team that will be needed (3.12.2 below). Proper design and robust implementation within the overall steerage and constraints implied and imposed by the adopted architecture (4.1.1 below) will be essential.
We could provisionally estimate that 8-10 trained engineers full-time over 4 years (i.e., 32-40FTE) plus one senior technical manager/principal architect are needed to bootstrap DiSSCo with the software critical for its first two years of operation. The minimum cost of this is €4m; more (~€6.4m) if experienced, skilled engineers are to be used. Competing with the private sector for competent software engineers is expensive. Costs going forward beyond first two years of operation are expected to be similar i.e., no reduction in effort is expected due to ongoing maintenance and support needs, as well as the need to introduce new capabilities. Regardless of accuracy and confidence in such an estimate, it provides a clear enough view of the reality to allow DiSSCo leadership to make reasonable decisions and financial provisions about how to control the construction to meet its targets [
Recommendation 34: Adequate investment (c. €1.5m per annum) and time (4 years) for new software development, testing, deployment and maintenance must be made if the innovative functionalities and services foreseen by the DiSSCo vision are to be realised.
3.12.1.3. Strategy for development, maintenance and sustainability
Much has been written elsewhere on the strategies available for software development and deployment, ranging from buying or co-developing software with a commercial vendor, through engaging with relevant open-source projects to carrying out bespoke developments. For DiSSCo, a co-development approach around open-source Digital Object Architecture components as presently being nurtured by, for example the C2CAMP initiative*
Several mission-critical ‘engineering framework’ components remain to be identified as the basis for DiSSCo development of shared infrastructure. This subset must also include common engineering tooling for all developers. Such components, chosen from well-supported open source software must become the standard basis for all developments, thus avoiding that individual engineers choose their own ‘favourites’; which can introduce maintenance and compatibility difficulties later. Critical decisions about these software systems must be made carefully, by appropriately skilled engineers to avoid that substantial time/money is not wasted in the future. This is a key responsibility of the DiSSCo Technical Team, supported by the DiSSCo Technical Advisory Board (respectively, 3.12.2.2 below).
3.12.1.4. Timing of development
Overlapping development, deployment and operation will be needed, based on a minimum viable service offering definition. Note, that such development has already begun with the first-generation ELViS system being developed by the SYNTHESYS+ project. The present cost estimate excludes that work. Whereas the DiSSCo programme intends to begin a modest, soft-start to operations in 2024, then development of core components should begin no later than early in 2021.
Recommendation 35: Design and development of core software components needed by DiSSCo should begin no later than early 2021 to allow modest, soft-start to DiSSCo operations in 2024.
3.12.1.5. Common design standards
All software engineering should adhere to common design specifications (‘look and feel’) for DiSSCo services and components, especially where those have interactive user interfaces. These specifications must be developed.
Recommendation 36: DiSSCo should develop common design specifications, especially for ‘look and feel’ of interactive user interfaces, that software components and services should adhere to.
3.12.2.1. Resilience in a community endeavour
How to build a sense of team – of community endeavour and resilience – in a multinational, distributed engineering team? This is the main question facing the organisation of engineering development and operations (DevOps) in DiSSCo. Behind the question lie historical difficulties the biodiversity informatics community has had with weak cooperation across boundaries on software development and infrastructure operations. There have been multiple reasons for this that have included needs to generate personal research outputs as a means of career progression, lack of institutional buy-in to community endeavours and strategies, shortages of cash and resources, and difficulty of sharing funding between institutions. Not least has been lack of appreciation of the importance of using architecture, standards and robust engineering tooling to guide design and implementation coupled with a culture/attitude of ‘I want to do it my way’. This leads fundamentally to interoperability difficulties. DiSSCo is a once in a generation opportunity at the European level to change this for the better.
The entire DiSSCo DevOps team must be led to understand and buy into the common goals to be achieved. They must work from the beginning in close collaboration with non-IT staff that includes collection managers, curators, digitization technicians and scientists/researchers. There are many ways this can be achieved including, for example the use of shepherds/buddies for liaison, focus groups and secondments, and co-design methodologies such as user involvement through prototyping [
Development teams must understand the steerage and constraints implied and imposed by the adopted architecture of the DiSSCo ICT infrastructure (4.1.1 below), and the tools and selected components to be used. Individuality, competence and autonomy of team engineers are to be granted, but membership, collegiality and generosity must be demanded and strongly reinforced to encourage that effort is focussed as DiSSCo management desires and directs. Creating a strong, motivating ‘wiifm’ (what’s in it for me?) for team members is essential.
Example wiifm (what’s in it for me?) for a potential engineering team member: Well-architected, engineered and operated software makes it easier to support users and to introduce enhancements and new functionality as their needs evolve. While using standard tools and development/support practices that include respecting architecture decisions and guidelines to protect and enhance DiSSCo’s essential characteristics, team members have the freedom to express their individuality and competence as professionals to support and interact with users and develop new software, contributing autonomously and responsibly to collective endeavours to meet present and future needs.
The leadership approach to be taken must be based on developing the desired culture through appropriate staff appointments, training and tooling (including harmonisation of tooling) that facilitates teamwork. This will need investment of effort by DiSSCo to reduce costs over the longer term.
Recommendation 37: DiSSCo must establish, train and equip a motivated and cohesive engineering development and operations (DevOps) team where team members have the freedom to express their individuality and competence as professionals to support and interact with users and develop new software, contributing autonomously and responsibly to collective endeavours to meet present and future needs.
Software engineering, deployment and operations (DevOps) are overseen by the DiSSCo Technical Team (day-to-day) and Technical Advisory Board (strategically).
3.12.2.2. DiSSCo Technical Team
A DiSSCo Technical Team is responsible for developing the direction of DiSSCo technologies and plans, and for the day-to-day management and direction of software engineering, deployment and operations across the portfolio of projects in the DiSSCo Programme.
Key responsibilities of the DiSSCo Technical Team include:
Complementing the Technical Team, a DiSSCo Technical Advisory Board (TAB) (6.2.4.2 below) provides strategic and expert consultation services in all areas related to the technical sphere of operation of DiSSCo.
This section develops the maturity of the technical concept for DiSSCo digitization and data management infrastructure, and describes the technological innovations needed for constructing the digitization infrastructure of DiSSCo. It synthesises the various aspects, such as machinery, ICT architecture, imaging techniques, etc., describing possible design alternatives for each process and technical element.
The information and communication technology (ICT) infrastructure supporting DiSSCo data management principles is based on exploitation of three modern approaches to ICT architecture design that combine to create the DiSSCo ICT architectural concept we refer to as ‘DiSSCo Digital Specimen Architecture’ (DSArch). These approaches are:
These approaches have been chosen for several reasons:
4.1.2.1. Digital Object Architecture (DOA)
Several technical approaches were considered as the basis for DiSSCo ICT infrastructure, including the approaches of Semantic Web (Linked Open Data, RDF, Triples) and Object Reuse and Exchange (OAI-ORE, aggregations of Web resources described by resource maps). Historical and current patterns of infrastructure development show these as evolutionary steps in the technology of Web infrastructure. Much more interesting is an emergence of new data architectures. Such approaches include decentralised applications (d-apps) enabled by blockchain technologies, data intensive federations and marketplaces, and Digital Object Architecture (DOA). The last of these is considered from the research data perspective as a new kind of data fabric – the Internet of FAIR Data and Services. Differing from all other alternatives, it is a fundamental extension of the basic Internet architecture*
4.1.2.2. The FAIR Guiding Principles and FAIR Digital Objects
Two decades of scientific research based on ‘Big Data’, coupled with political movements towards open access to publicly funded research has led to recognition of the need to make scientific data increasingly findable, accessible, interoperable and re-usable (FAIR). These four attributes form the basis of the now widely adopted FAIR Guiding Principles [
4.1.2.3. Evolutionary architecture with protected characteristics
Several characteristics of DiSSCo data management are essential to protect throughout and ultimately beyond the lifetime of the DiSSCo data infrastructure for engendering community trust in the value, veracity and reliability of the data to be managed. These are described in detail in the provisional Data Management Plan for the DiSSCo infrastructure [
Recommendation 38: Nine characteristics (centrality of the digital specimen, accuracy and authenticity of the digital specimen, FAIRness, protection of data, preserving readability and retrievability, traceability (provenance) of specimens, annotation history, determinability (status and trends) of digitization and securability) must be protected throughout the lifetime of the DiSSCo research infrastructure.
Nevertheless, considering the expected lifetime of DiSSCo ICT infrastructure, it is inevitable that the infrastructure, its design and implementation will evolve over its lifetime – both to meet new needs from users and organisations but also as underlying technologies change. The ‘evolutionary architecture’ approach [
Protecting the essential characteristics mean that proposals for design decisions and changes (technical, procedural and organisational) must be assessed for their effect on those aspects. Ideally, all design decisions and changes must not destroy or lessen any of the protected characteristics and should aim to enhance one or more of the characteristics.
Recommendation 39: All design decisions (technical, procedural, organisational, etc.) must be assessed for their effect on the protected characteristics. Such decisions and changes must not destroy or lessen the protected characteristics.
This implies the need for a formal and responsive change management procedure that includes analysis and assessment of proposed design changes against each of the protected characteristics with sign-off by a delegated authority, such as a DiSSCo Technical Team. This should include, as explained by [
The current landscape of ICT infrastructures of the DiSSCo collection-holding institutions is a fragmented patchwork of heterogenous proprietary systems reflecting specific needs, policies and procedures of individual institutions. This includes extensive use of proprietary specimen identification schemes. Drawing these heterogeneous approaches together towards more harmonised infrastructure to support Europe-wide services cannot be a revolutionary ‘out with the old, in with the new’ style of change. A sympathetic but strategic approach based on continuous adaptation is needed. In the short-term, this should yield immediate big benefits for a few small changes, whilst minimising disruption to existing systems and procedures. Over the longer term it should encourage convergences towards similar ways of working across multiple institutions.
The starting point for DiSSCo construction recognises that many collection-holding institutions today are already engaged with:
In the future, the DiSSCo virtual collection of specimens will be indexed by the European Collection Objects Index (ECOI) service. The indexed data about collections and specimens can be served in multiple formats (human readable, JSON, XML, RDF, etc.) to meet the various needs of both native and non-native processing applications that will include new Europe-wide services such as the Collection Digitization Dashboard (CDD), European Loans and Visits System (ELViS), European Curation and Annotation System (ECAS), as well as many others. On creation, individual digital specimens will each be given a globally unique Natural Sciences Identifier (NSId) that acts as a permanent and unambiguous reference and an anchoring point on the Internet to all the information known about a specific specimen over decades. It will become increasingly possible to create and identify digital specimens at the point of digitization in digitization projects and programmes, at the time of accession into collections, and even earlier at the time and place of gathering in the field. Curation by the community experts, long-term preservation of the digital specimen data, and positive feedback to points of origin will become the norm, as will publishing to aggregators such as GBIF by DiSSCo on behalf of its member institutions.
The action steps of an implementation strategy and the phasing of that strategy must support both innovative collection-holding institutions that want to advance more quickly and those acting more conservatively in moving towards DiSSCo goals. For the near future, a key assumption is that museum catalogues remain as core business of the collection-holding institutions with retention of control over the authoritative information about specimens in their collections. However, within ten years, this is expected to change towards a model of curation by acknowledged and appropriately authorised community-experts outside of any specific institution.
Recommendation 40: Within ten years the institution-centric collection curation model should evolve to support complementary digital curation by appropriately authorised community-experts.
DSArch should thus be implemented aggressively to allow soft-start operations to commence at the earliest opportunity (3.12.1.4) via a multi-stage construction approach around two key lines of activity: i) hub infrastructure indexing Digital Specimen and other object types, and offering added value services such as ECOI, ELViS and ECAS; and ii) data coupling to populate the initial systems/services with relevant data.
Recommendation 41: DiSSCo Prepare should follow an aggressive ICT implementation strategy and construction plan based on two key lines of activity that include i) DiSSCo Hub implementation indexing Digital Specimen and other object types, and offering added value services such as ECOI, ELViS and ECAS; and ii) data coupling of DiSSCo Facilities to populate the systems and services with relevant data.
At launch, the DiSSCo Hub infrastructure is a platform*
Fig.
DiSSCo Hub infrastructure will be implemented with a system design based on multiple instances of the CORDRA*
Enrichment of Digital Specimens either/both at time of creation or/and later is made possible by the algorithms of the ‘Tahana link builder’, a software component that finds and exposes the third-party data associated with natural science specimens to build the DiSSCo knowledge graph. Services such as ECOI Elasticsearch make it possible to query and find data of interest across multiple object types/repositories. Each Digital Specimen will have its own human-readable Web ‘landing’ page, as well as being both machine-readable and machine-actionable*
Built-in DSOL behaviours (operations) act on Digital Specimen objects directly, for example to update/modify them, to enrich them with new data, to tag and classify them, to access their images and rights and to query across them. The built-in behaviours are to be standardised as extended operations in the specification for open Digital Specimens (see 4.3).
A DSOL API allows service developers to build demand-led applications that can make use of and extend the built-in behaviours to interact directly with Digital Specimens. One example of this is the eventual integration that can be made with the SYNTHESYS+ Specimen Data Refinery (SDR)*
Recommendation 42: DiSSCo Prepare should specify the Application Programming Interface(s), API needed to allow third-party software applications to be built on top of the DiSSCo Hub (core) infrastructure.
Data coupling begins with harvesting existing Darwin Core Archives or ABCD-A files published by collection-holding institutions (for example, using respectively the GBIF Integrated Publishing Toolkit, IPT or the BioCase Provider software), and transforming their content to create Digital Specimens from the records contained within those Archives*
Recommendation 43: DiSSCo should commence further development and hardening of the specimen data harvesting and transformation process as the means of creating Digital Specimens and populating DiSSCo data infrastructure.
Initially, this is a one-way publication or data sharing process based on proactive harvesting by DiSSCo Hub. It will need to accommodate expected changes to the way data is published by institutions, for example phasing out support of Darwin Core Archives in favour of Frictionless Data packages. Later, data coupling moves towards a more sophisticated bidirectional interfacing for circular near real-time updates of records in the CMS of collection-holding institutions, based on (for example) digital curation and annotation activities that take place around the Digital Specimen itself.
Bidirectional transactional updating will be piloted first between DiSSCo’s own CMS-as-a-Service and ECOI and later be packaged and rolled out as an interface component for adoption by DiSSCo Facilities.
Later in the evolution of the data infrastructure involves change from a centralised DiSSCo Hub deployment to a more decentralised deployment whereby collection-holding institutions become more involved and responsible players in operating the core elements of a distributed DiSSCo ICT infrastructure. Although many years away, this is mentioned here as the overall trajectory to keep in mind for the DiSSCo lifetime.
Multiple innovations are needed for the implementation and operation of the DiSSCo data infrastructure based on the proposed approach and direction (2.3). Many of these have been worked out at high level during the progress of the ICEDIG project. They are listed in Tables
Recommendation 44: DiSSCo should ensure and provision the further work needed (during the DiSSCo Prepare Preparatory Phase project) to bring necessary innovations needed for data management infrastructure to the required level of technical, organisational and financial readiness. These innovations include: i) NSId PID scheme, ii) openDS standard, iii) MIDS/MICS minimum information standards, iv) FAIR Digital Object Framework.
In addition to the key innovations mentioned above, further technical work is needed on many infrastructure building blocks that include (for example):
Maximising flexibility of DiSSCo to support unimagined uses of the data in the future implies an approach based on a foundational platform, APIs and an app-store for DiSSCo scientists/users. Visually, this can be represented as explained/illustrated in section 4.2.2 above.
A decision on adoption of an appropriate persistent identifier scheme, meeting the requirements set out in Appendix C is needed as soon as possible. Lack of decision constitutes a substantial barrier to beginning pilot operations of the DiSSCo infrastructure because of the implications of permanently and persistently assigning identifiers to digital objects that might be difficult to change later.
A proposal for a Handle-based scheme such as the Natural Science Identifier (NSId) scheme proposed by ICEDIG (5.3.2.7) must be made to the DONA Foundation for a persistent identifier scheme to be used by DiSSCo. This is imminent and urgent, as there are other initiatives in other places (IGSN, EUPS, Extended Specimen Network, etc.) with political implications.
A first draft specification of a standard for open Digital Specimens (openDS) is needed as soon as possible. A MOBILISE COST Action workshop took place in Warsaw, Poland 11th/12th February 2020. The outcomes of that workshop must be acted upon. Specifically, what is needed to support open Collections is an essential first step.
Standards covering the Minimum Information about a Digital Specimen (MIDS) and Minimum Information about a Digital Collection (MICS) are needed as soon as possible. A first draft MIDS specification document (v0.8, http://bit.ly/MIDSv08) exists but is incomplete. It needs further improvement, validation and agreement. Also see the related discussion in ICEDIG project deliverable D6.5. Little work exists so far for MICS, which also needs to align to both CDD and TDWG CD work. Also needs implementation and evaluation work. The current expectation is that version 1 of the TDWG CD standard might be available by September 2020, with a version of the GBIF World Collections Catalogue available by then as well. DiSSCo is also planning its own Collection Description Identifier Registry (CIDR) as part of its ECOI service.
Finalisation and stabilisation of the FAIR Digital Object Framework (FDOF) must be achieved during 2020, including addressing and solving the inherent security of digital objects, i.e., by ensuring relevant identification and authorisation mechanisms are embedded in FDOF. An important open issue is to ensure common understanding of the difference between being machine-actionable and being machine-readable, and to ensuring that FDOs (and by extension, Digital Specimens) are machine-actionable*
As noted in 3.8.3, the legal framework around open access to research data cannot be ignored by DiSSCo and its member institutions. A detailed analysis by the ICEDIG project of the existing data policies from several collection-holding institutions*
Minimum information standards (such as the proposed standards for Minimum Information about a Digital Specimen (MIDS) and Minimum Information about a Collection (MICS)) can assist with mobilising incomplete data. Even partially available data can be useful for some purposes and it is recommended that data and media be made openly accessible after minimal delay. Digital Specimen objects must be findable and accessible already at the lowest (MIDS-0) level. Collection information must be findable and accessible, even at the level of overview information (MICS-1 level).
Recommendation 45: Open access policies of DiSSCo and its collection-holding institutions should include that Digital Specimen objects must be findable and accessible, even at the lowest level of available information (MIDS-0 level). Collection level information should be findable and accessible, even at the level of an overview (MICS-1 level).
For long-term storage and archival of image data, which involves storage of many files with large size, institutions (and DiSSCo as a whole) have choices of institutional, national, EUDAT, Zenodo or other public repositories and choice can vary from one institution to another (see 3.7.1, and also 4.7). It is a matter for each DiSSCo institution to determine the best choice according to its own needs. Nevertheless, such choices must keep in mind the need to achieve and maintain FAIRness i.e., to comply with the FAIR Guiding Principles as set out by DiSSCo policy and the data management plan for the DiSSCo infrastructure [
For non-image data (meaning principally the authoritative data about specimens and collections, as well as pointers to supplementary data)*
Recommendation 46: Open access policies of DiSSCo and its collection-holding institutions should include that: i) image data and its immediate metadata should be deposited in a trusted public repository of the institution’s own choice, and ii) that other (non-image) data should be deposited in the European Collection Objects Index (ECOI).
In separating storage of image and non-image data, care must be taken with duplication of data/metadata in multiple places to avoid future synchronisation difficulties. Best practice is that media asset management systems used for long-term storage of image data should contain minimal metadata pertaining only to the characteristics of the image itself and should avoid metadata stating anything about the content of the image and what it may represent. When using public data repositories such as EUDAT, Zenodo, Wikidata and others, where often there is a requirement to submit metadata about the content of the item deposited (images, in this case) careful thought should be given to the reasons for choosing such repositories, the nature and status of the items to be deposited there and the likelihood of future modification of the metadata.
Recommendation 47: DiSSCo should develop guidelines for its member institutions for determining whether and when it is appropriate to deposit specimen data, such as images of specimens in long-term public repositories such as EUDAT, Zenodo, Wikidata and others, having regard both for the purpose of such depositions and for the stability of the metadata describing the content of the deposition (i.e., what, where, when, who).
As far as possible, projects must enable third parties to access, mine, exploit, reproduce and disseminate this data by using a copyright waiver such as CC0 or an open access licence such as CC-BY.
Recommendation 48: Open access policies of DiSSCo and its collection-holding institutions should include that as far as possible third-parties must be able to access, mine, exploit, reproduce and disseminate data by using a copyright waiver such as CC0 or an open access licence such as CC-BY.
However, there are essential legal and ethical reasons for restricting access to data. Thus, DiSSCo policy on open data access should be “as open as possible, as closed as (legally) necessary”.
Recommendation 49: Open access policies of DiSSCo and its collection-holding institutions should include that access be “as open as possible, as closed as (legally) necessary”.
Exceptions to the open as possible policy must be stated clearly and must be justified strictly according to objective criteria defined by national security, legislation or other regulatory compliance, sensitivity of collection information, and third-party rights (such as personal privacy). Restrictions that do not have a justification based on objective criteria in legislation are legally invalid and not permitted (see also 3.8.2).
Recommendation 50: Exceptions to the ‘as open as possible’ data access policy of DiSSCo and its collection-holding institutions must be justified based on objective criteria, stated clearly and strictly limited to reasons of national security, legal or regulatory compliance, sensitivity of collection information, and third-party rights.
Robust technical mechanisms must enforce an “as open as possible, as closed as necessary” policy. However, such mechanisms must not place onerous and convoluted obligations on data publishers nor must they decrease usability for users. Digital Specimen objects must be secure by design. The precise mechanism to be used remains for further study*
DiSSCo should undertake its service management responsibilities in compliance with a lightweight IT Service Management (ITSM) framework focussed on the holistic organisational/business perspective of service delivery. This will help to ensure that all the individual business aspects of delivering services portfolio in a professional, robust and reliable manner – including the upskilling of staff and increasing the digital capabilities of DiSSCo Facilities – are properly taken into account through defining, producing, and providing the required services, and responding to changes in the environment in which they are offered. Such a framework provides a mechanism to align ICT needs with governance and business model, provide a process-oriented approach with clearly identified process owners and managers, emphasise avenues for continual improvement, training and awareness. In short, provide a tool to manage a full-service life cycle from concept to deployment to daily operation.
There are two candidate frameworks appropriate to consider*
Recommendation 51: DiSSCo should adopt a lightweight ICT service management framework for the holistic delivery of its service portfolio.
4.6.1.1. Microscope and other slides (D3.2)
Microscope slides form part of natural science collections in herbarium, museums and many other collection-holding institutes. They are unusual compared to the other preservation types or collections covered in this section as they are rarely curated as separate collections but stored as supplementary collections alongside a range of “classical” collection categories including entomological (both as whole slide mounts and preparations of parts like genitalia), botany, zoology, palaeontology and mineralogy.
The specific preservation methods, labelling practices, dimensions and storage are very variable. It is probably due to these properties that there have been limited mass-imaging methodologies published and considered for slides in general. While there have been several pilot projects that have used specially modified histology slide scanners adapted for natural science specimens, they cannot accommodate damaged slides or slides with non-standard thickness or length - issues that can be frequent in natural science collections.
Within ICEDIG two mass-imaging workflows were developed to digitize microscope slides as discrete collections with adaptable methodologies for inventory (MIDS level 0-1) [
Recommendation 52: For mass digitization of microscope slides, the recommended approach in the first instance is to digitize on the lowest MIDS level (0,1) capturing an image of the whole slide including its labels. These images can be used later for extended data entry (MIDS 2,3), conservation assessment and subsequent research-grade imaging.
Like other collection types, the labels on microscope slides are well suited to automated image processing and information extraction techniques like OCR to capture label data. Due to the spatial constraints of the labels they are potentially more amenable to these techniques than other collection types (see 4.6.3).
4.6.1.2. Vertebrate and other dry three-dimensional collections (D3.3)
Mass-imaging of vertebrate material (skins, bones, nests, eggs, etc.) and other dry three-dimensional collections (molluscs, fossils, etc.) has had limited take-up to date. Currently only 2.9% of phylum Chordata preserved specimens in GBIF have an associated image (furthermore, a significant subset of these images is images of the catalogue, not the object itself). These collections are smaller in number and are not as easy to handle and position as herbarium sheets and microscopic slides. No existing automated imaging solutions have been found for these types of specimens, although conveyor-driven imaging lines commonly used for herbarium sheets could (in principle) be rigged to image anything that fits on a tray narrower than the conveyor belt (typically 60 cm) and covered by the camera focus depth of field (maximum 25-40 cm)*
Recommendation 53: For mass digitization of vertebrate and other dry three-dimensional collection objects, the recommended approach in the first instance is to digitize on the lowest MIDS level (0-1) combined with label imaging. These images can be used later for extended data entry (MIDS 2-3).
The extended potential of the images of labels lies in the use of optical character recognition (OCR) and handwritten text recognition (HTR) (and as training for those purposes). This approach also allows the use of thesauri for quicker data entry with less transcription errors, while still being able to check the data against the original label without the time-consuming task of retrieving the object.
Because this approach uses imaging for labels and subsequent data entry, inclusion of the specimen in the image is optional. Reasons to include the specimen comes from the added potential for curation, research and narrowing down loan requests, as well as future developments such as AI for species recognition.
It is acknowledged that there can be cases where imaging is so complex or where data entry can be done so fast that this approach is not efficient. The GBIF task force on accelerating discovery advised a tiered strategy with rapid and least expensive steps during the first phase, with a second phase for more detailed data capture and imaging [
There can be an efficiency gain from combining with other tasks: all handling actions have some risk for errors or damage to the specimen/label, so minimising the handling of specimens is crucial. Combining tasks such as barcoding, cleaning, repacking and tissue extraction with digitization efforts can lead to an efficiency gain in certain situations. On the other hand, more complex tasks as part of a larger process can produce more errors than when these tasks are done separately by specially trained workers.
Most imaging of these kinds of collections is done with a camera mounted on a copystand or tripod. For imaging for overview and inventory purposes some existing mass digitization systems can be adapted, such as SatScan type scanners and herbarium conveyor belts, taking into account the great depth of field requirement for these collections. Picturae, for example uses a configuration for imaging that photographs from two sides simultaneously. This could be adapted for imaging vertebrate and other dry three-dimensional specimens and their labels from two opposite sides.
Recommendation 54: Setting up imaging protocols and stations for digitizing any type of collection should involve a professional (or experienced) photographer to establish the proper lighting and camera settings for the collection.
4.6.1.3. Liquid preserved specimens (D3.4)
Few institutions hold wet collections on a ‘mass’ scale (as defined elsewhere in the present document) if compared with other collections types such as herbarium sheets or insects. Mass digitization of samples stored in liquids has rarely been done, as their handling and transportation requires special care, but also because of the limited view a two-dimensional image gives of these specimens. Handling of liquid specimens comes with several risks and impediments, including fragile containers, and dangerous chemicals. Imaging of liquid collections can essentially be divided into two categories: whole jar/container imaging, and specimen imaging by removal from the container. Limitations occur due to various factors, including (for closed container imaging) distortion by the container, discoloured liquid and specimen, obstructing labels, multiple specimens in a container (sometimes in separate vials) and so on. Liquid samples are present in all life science collections. Generally, almost all the different taxonomic collections within an institute have their own liquid collections, storage protocols and digitization status.
For mass (or rapid and affordable) digitization purposes, it is deemed unrealistic to open containers for imaging the specimen, unless containers are homogenous and easy to open. Label data capture can be achieved through whole container imaging, for a part of the collection. Whole container imaging can give an impression of the specimen. Further, the condition of the container can be recorded. Like with dry three-dimensional collections, the recommended focus is on capturing the label data in the first instance, and any related field books or ledgers. Data entry can be done directly from the object to the appropriate MIDS level. When enough data is legible on the labels without opening the container, the same approach as for dry three-dimensional specimens can be followed: to digitize on the lowest MIDS level (0-1) combined with label imaging. These images can be used later for extended data entry (MIDS 2-3) and allow for the potential uses described for dry three-dimensional objects.
Recommendation 55: For mass digitization of vertebrate and other dry three-dimensional collection objects, the recommended approach in the first instance is to digitize on the lowest MIDS level (0-1) combined with label imaging. These images can be used later for extended data entry (MIDS 2-3).
A test was done of imaging solutions to deal with curved containers. When labels are so large that they are not legible from a single photo because they curve around the jar, then multiple photos are required, and data entry is impeded by having to switch between multiple photos. Stitching of images to create a virtual “inverse panorama” or “rollout” would allow data entry from a single image. In the future, NHMUK plan to test the output with their ALICE software for label extraction [
A new approach to digitize wet specimens outside of their container has been developed using a flatbed scanner and 3D printed containers [
In conclusion, some hardware solutions have been proposed to speed up imaging, but efficiency must be mainly sought in workflow design and size of the project. In summary, a lean process is suggested as described for vertebrate digitization.
4.6.1.4. Pinned insects (D3.5)
The main challenges in mass digitizing insects stem from the facts that pinned insects, with their equally pinned labels, are i) basically three-dimensional objects and ii) their numbers in collections are huge; up to one billion objects in Europe. The fastest available digitization systems of today [
An example of one innovation towards this goal is the ENTODIG-3D prototype built by the ICEDIG project, which employs Tensorflow AI/ML image analysis, multiple webcams on rails to make focus-stacked images of insects and labels from various angles and builds 3D models of individual insects and visible parts of their labels [
The big question is: Do we really want to make pictures of 1 billion insect specimens? Would it be better to just transcribe the labels? This is really what most curators prefer, because in many insect groups a picture of the specimen is not enough to determine the species. Examination of its genitalia is required and that cannot be done from a whole specimen image. The question is not yet answered but it can be pointed out that transcription of the labels, if done without imaging of them, must necessarily be done in-house at the institution that owns the collection. This defies the possibility to do the transcription in other countries, where labour costs could be lower and local knowledge of language, geography, and taxonomy often is available. Transcription without imaging also means that users of this data are unable to view and verify the verbatim labels.
Ultimately, it is a question of cost. To reduce costs the number of human operators must be reduced as much as possible. A fully automated line where humans just bring the insect drawers to the digitization line would be ideal. The ENTODIG-3D solution mounted on a conveyor shows how this can work.
Another approach to reduce the costs could be to use inexpensive but fully automated imaging stations. If the capital cost of each is well below €10,000, dozens could be installed in parallel in each museum. Human operators would only bring one drawer to imaging once or twice a day. For a collection containing 50,000 drawers with up to one million unit-trays, ten imaging stations operating in parallel 24 hours per day for twenty years would do the job.
In these fully automated scenarios one problem remains: How to attach unique identifiers such as a QR-code to each physical specimen? It does not take much time to attach them, but still would mean that each insect is handled by a human for a few seconds. One possibility that could be explored is to make the image of the insect itself the unique identifier! An image taken from a standardised position could be hashed, for example into SHA256 file hash checksum, which becomes the practically unique identifier for the specimen. If we look closer, like human faces, no two insect specimens are identical. Great progress has already been made in automatically identifying insect species from each other automatically, e.g., [
Recommendation 56: Further explore combinations of new technologies such as robotics, 3D modelling, machine learning, etc., in novel ways to achieve imaging (including of the labels) of 5,000 insect specimens in 24 hours by one workstation and operator.
4.6.1.5. Herbarium specimens (D3.6)
The imaging of two-dimensional objects, such as herbarium specimens, is perhaps the most advanced and well understood area of collection digitization. Nevertheless, the preparation of herbaria for digitization is complex and requires considerable organization and planning. It requires a workforce that may need training to fulfil the needs of the digitization workflow and there are infrastructural needs for digitization that need consideration. For example, space is required for the imaging equipment. A staging area is needed for specimens and for decontamination – a walk-in freezer is the preferred method. Herbaria are housed in a wide variety of buildings, each with their own idiosyncrasies. The width of the doors, the number of floors and the height of shelves all need consideration.
A digitization project often starts with the estimation of the size of the collection and the different types of objects within it. Counts are needed of unmounted specimens, or those needing restoration, also non-standard specimens and those with envelopes that may need opening. Decisions must be made at this point about how digitization is going to be done, whether it is onsite or offsite, how much mounting is going to be done and the timetable for completion. All this information must be fed into the tendering process and attention to detail at this time will ensure the process will run smoother later on*
Some processes are peculiar to the preparation of a herbarium for digitization. Pest management must be specifically considered. As all the collection is going to be moved during digitization it makes sense to integrate pest management processes into the workflow. This also ensures that pest problems are not made worse by cross-contaminating collections and moving pests between rooms.
Each institution has a unique combination of infrastructure, user requirements, staffing and funding that mean that no two herbarium digitization projects are alike. However, underlying these projects is the need for a quality digital product, minimum damage to the collection and long-term needs for the collection’s preservation. Case studies from European herbaria*
The complexity of herbarium digitization can appear trivial until one starts to examine the complex decisions and dependencies in the process. Often it is easy to concentrate on the imaging equipment and the digital infrastructure. However, personnel and physical herbarium constraints are just as important. ICEDIG project work in this area*
Recommendation 57: DiSSCo should prepare and promote community-built guidelines and checklists for future digitization projects, to assist the detailed preparation, costing and management of such projects.
Fully automated conveyor-driven imaging can produce nearly a terabyte of data in a working day. This data must be quality controlled as part of or soon after the acquisition process, and then be further processed to detect barcodes and other pertinent details from the images. It takes a complex and powerful imaging and data processing system to do these tasks. Such systems currently are proprietary to the industrial service providers of mass digitization solutions. Data then must be transmitted to the institution that owns the collection, and imported in the institution’s CMS, possibly through a separate transcription system. DiSSCo Facilities and their service providers must automate this pipeline somehow. Eventually this requires standardisation so that the various steps can be seamlessly integrated, and so service providers can be switched as competitive tendering determines. Ideally, an automated mass imaging system would output MIDS compliant Digital Specimen objects to begin with. It might make sense to create an open-source package to process data from imaging pipelines to harmonise, simplify and bring down the cost of large-scale deployment of conveyor-driven mass digitization lines.
Recommendation 58: DiSSCo should investigate standardization of the interfaces of the components of mass digitization/imaging lines and their downstream data stores and processing elements (including quality control) and encourage open-source tools development in the area.
4.6.2.1. Available options
Most specimens are documented on paper before their details are entered into a database. In many cases, specimen data only exist as handwritten data on labels, in catalogues or in field notebook. This presents an interoperability challenge between digital and paper records. Accurate and efficient transcription is clearly an essential step towards making these data findable, accessible and useable. Different options exist for the transcription of these data, such as in-house transcription, transcription outsourced to a commercial company, and volunteer crowdsourcing from the general public. Each of these approaches comes with advantages and constraints.
Essentially, the accuracy of all human transcription methods proves to be similar and the choice between them comes down to the options open to the collection-holding institution*
4.6.2.2. Improving manual transcription
In furtherance of improving manual transcription, DiSSCo could provide additional cross-checking tools and aids. It is clear, for example that data quality can be improved with the use of lookup lists. Speed can be increased two to threefold by comparing the details of specimens being digitized with other previously digitized specimens from the same collecting events [
An interactive nomenclatorial reconciliation service would allow transcribers to verify that names are correctly spelt. However, being able to distinguish, after transcription has been completed between actual (spelling) errors in the data as it was originally recorded, and errors introduced during transcription is essential. Thus, in such discovered cases of original misspelling, the verbatim data must be annotated to highlight the fact. It is essential to retain the data verbatim, as well as to provide it in corrected/improved forms. The quality control and assurance plan of a transcription project or process must be clear on how to proceed in this and similar cases, and community-agreed best practices should be applied.
Giving attention to data standardization and total quality in manual transcription pays dividends, especially through the greater use of automation. Focussing on identifying the points in transcription projects, processes and workflows where errors can arise and designing for their prevention, combined with strict, enforced use*
Recommendation 59: DiSSCo should work with collection-holding institutions to improve the standardization of transcribed data in line with the emerging Minimum Information standards for Digital Specimens and Digital Collections (MIDS/MICS) and should seek to introduce community-agreed quality control and assurance plan and procedures for transcription.
In the longer-term, the European Collection Objects Index will prove to be an invaluable resource for the correction of errors and the overall improvement of data quality by the application of centralised analysis, dynamic visualization of current specimen data and curation by the community at large.
4.6.2.3. Field notebooks
A special case of manual transcription is digitization of field notebooks. Each major museum collection-holding institution has hundreds of them from the times before the 1950’s. With ink pen it was not possible to make small labels for insects, so only collector name and a catalogue number (collector´s field number) was attached to the specimen and all other details were written in the notebook. Only data equivalent to MIDS-1 level data is available from the specimens while the corresponding field notebook has not been digitized. Field notebook digitization projects have been carried out at least by the Smithsonian Institution*
Recommendation 60: Digitization of field notebooks should be a priority and should precede digitization of any related specimens.
The richness of information in field notebooks can be amazing. It might be a practice worth reintroducing by using electronic means, although it may not be applicable for all collection types. A physical specimen would only need a QR code label as its specimen label. All other data would be in an electronic field notebook, entered for example via a smartphone or tablet computer and almost immediately accessible through the DiSSCo infrastructure.
Recommendation 61: DiSSCo should consider developing and offering a digital field notebook service.
4.6.2.4. Georeferencing
To maximise the use of specimens in research it is helpful to be able to accurately determine the point of collection from the textual information on the label. This process is known as georeferencing and is largely carried out manually today, being time-consuming and costly. There are currently several tools available to assist with and to some extent automate georeferencing*
New techniques from natural language processing and geospatial information analysis can contribute to improved semi-automated and automated methods to speed up georeferencing. These would have impact not only on time and cost, but also on quality and utility of data and the georeferenced specimens that data represents
Recommendation 62: Adopting new techniques from natural language processing and geospatial information analysis, DiSSCo should investigate the improvement of georeferencing techniques for identifying more precise locations from natural language descriptions of locations that appear on specimen labels.
Optical Character Recognition (OCR) can be used to digitize text appearing on images of specimens, for example where labels appear in images*
Recommendation 63: AI-assisted image segmentation should be further developed by DiSSCo for routine use as a step in the digitization and transcription process.
Not all text on specimen labels is nicely printed. Handwritten text varies considerably and does not conform to standard shapes and sizes of individual characters. This poses an additional challenge to OCR but the application of techniques such as those utilized by Google Cloud’s Vision AI product*
Recommendation 64: DiSSCo should consider an agreement (operational or strategic) with, for example Google for AI-assisted text recognition of specimen labels, including handwritten labels at industrial scale.
Determining the countries and collectors of specimens has been one aim of previous automated text digitization research activities. An area of Natural Language Processing (NLP) known as Named Entity Recognition (NER) has matured enough to semi-automate this task, recognizing location names and person names within the text extracted from segmented images via Tesseract version 4.0.0. NER can be used in conjunction with other online services, such as those of the Biodiversity Heritage Library to correlate the named entities to places and people mentioned in the biodiversity literature. It can be used with new, emerging georeferencing techniques that can infer locality from sparse and vague descriptions on labels (0).
OCR, automatic language identification and NER, correlation and georeferencing are just a few of the main components needed for improving the efficiency and quality of transcription and the utility of transcription data products. Further work should be undertaken to stabilize appropriate techniques and bring these together into deployable robust, semi-automated and automated pipelines (workflows)*
Recommendation 65: As part of the DiSSCo programme portfolio, DiSSCo should sponsor a research and innovation project investigating novel approaches to label segmentation, transcription (i.e., text OCR and digitization) and interpretation (i.e., named entity recognition, georeferencing, people referencing, etc.) and synthesis and deployment of robust production pipelines (workflows) to improve efficiency, quality and utility of transcription data.
Many systems exist for performing three-dimensional (3D) imaging of cultural and natural heritage artefacts. From these, several rapid 3D methods are considered as potentially viable for integration into mass 3D digitization of natural science objects*
Viable rapid 3D methods include multi-plane photography, which is a technique that creates two-dimensional picture sets or stacks, that can subsequently be manipulated by software to give pseudo-3D views of real-life 3D objects. In contrast, other techniques directly create 3D computable models. Photogrammetry uses methods of image measurement and interpretation to derive the shape and location of an object from one or more photographs of that object. Structured Light Scanning uses the methods of stationary fringe projection based on a fixed grid of fringes generated by a projector and observed by one camera. Laser Scanning describes the three-dimensional measurement of the surface of an object by analysis of the reflected light from a laser beam which is scanned over the object surface. For further extensive details on these techniques and their use to date by collection-holding institutions, readers should consult the relevant report*
However, moving 3D techniques forward into mainstream mass digitization projects is not presently considered feasible within current resource constraints, and may not even be needed at large scale. It is most likely that specimens to be the subject of 3D imaging will be hand-picked in relation to specific research and exhibition project needs. Nevertheless, there are opportunities within the context of rapid 2D mass digitization pipelines to insert procedures that can identify suitable specimens for 3D treatment and to extract these for further special handling in a separate 3D imaging pipeline. This 3D pipeline itself should be designed to be achieve a balance of speed and efficiency whilst being as cost-effective as possible*
Method dependent digitization tasks with areas/opportunities for improving throughput
Workflow Step |
Workflow Task |
Improvement Areas |
Improvement Opportunities |
Acquisition |
Specimen handling |
Techniques and devices for specimen handling before and after image acquisition. |
Automated specimen handling and mounting can create faster end-to-end digitization workflows. |
Specimen mounting |
Techniques and devices used for specimen handling during image acquisition. |
||
Imaging station setup |
Techniques and devices for specimen scanning (rigs, platforms, lighting). |
Digitization stations can be implemented with commercial-off-the-shelf (COTS) equipment (e.g., photogrammetry) acquired from outside vendors or can combine different techniques (e.g., using images for colour texture acquisition for laser and CT scans). Optimise number of passes, determine best angles and appropriate lighting. Map the type of primary models to be created and the associated types of derivatives that can be obtained. |
|
Processing |
Scanned data processing |
Integrate and verify raw data sets. Prepare data sets for building 3D models e.g., removing outliers and duplicates. |
|
Build 3D model |
Models with higher quality increase the number and types of derivatives that can be created. |
||
Build model derivatives |
Derivatives clearly defined beforehand can streamline the curation and publishing tasks. |
Method independent digitization tasks with areas/opportunities for improving throughput.
Workflow Step |
Workflow Task |
Improvement Areas |
Improvement Opportunities |
Curation |
Model identification |
Identification of 3D models. |
Curation tasks add information on digital specimens. This can include adding data about the physical specimen, the digitization process, and the 3D reconstruction process. |
Information extraction |
Extraction of metadata from digitized specimen (from tags, barcodes, etc.) |
||
Specimen annotation |
Annotation of digital specimens, including model metadata and specimen data (from catalogues). |
||
Publishing |
3D model publishing |
Providing access to and long-time preservation of 3D models. |
The publishing processes for digitized materials must follow FAIR principles. |
3D model display |
Rendering and accessing 3D models. |
Supporting preview of the models and their associated metadata is among the goals of digitization. |
The ICEDIG work on rapid 3D capture methods has been distilled to propose a design for an ideal 3D digitization workflow*
Nevertheless, future directions for 3D imaging in DiSSCo are still dependent on the basic expectations among DiSSCo member institutions and the wider community about the requirements around and prospects for the use of 3D imaged specimens.
Recommendation 66: DiSSCo should identify and further study the scientific needs that demand the use of 3D imaged specimens, as well as the scientific opportunities opened up by the availability of appropriately imaged 3D specimens to inform future planning for introducing 3D imaging on a more widespread basis.
4.6.5.1. The definition of quality
Effective quality management is essential for producing digital data and images that are fit for purpose. DiSSCo should aim to treat quality management in a simple and straightforward manner, avoiding meaningless terms such as high-/low-quality or good/poor quality. Such terms are relative, vague and without a reference point, impossible to define. We want to treat quality objectively and in absolute terms by defining it as ‘conformance to requirements’ and not as goodness.
Against such a definition, DiSSCo should be concerned with applying quality management approaches in two fundamental areas, namely i) digitization quality management as preventative work, and ii) data quality improvement as curational work*
4.6.5.2. Digitization quality management as preventative work
First and foremost, DiSSCo must ensure that its digitization processes and other services deliver digital data (including images of specimens) fit for scientific purpose and meeting users’ needs. DiSSCo must do this without introducing new errors during such processes. Thus, an imaging process, for example should always deliver in-focus images and a transcription process must deliver accurate transcriptions; and both must relate to the requirements of users. This means properly understanding what these requirements are in relation to the purpose users need data for and aiming to satisfy those requirements. This requires proactive management, especially in relation to prioritisation of digitization (3.1) and designing processes to prevent the occurrence of errors in digital outputs. The aim is to produce the right digital specimen or other data product without defects at the right time, for the right cost whilst correctly meeting the users’ requirements. Producing outputs with defects (or producing the wrong outputs) leads to waste and rework, both of which have an associated cost.
Prevention (or quality assurance) activities are normally performed by the digitization team (digitization technicians, in-house digitization teams, and the service providers). Paying attention to prevention, continuously improving and automating the digitization process reduces and eventually minimises the need for re-digitization or for post-digitization corrective actions. This improves overall quality of the output and can lead to significant benefits in both collection management and use.
Analysis of the present capacities of collection-holding institutions (5 below) reveals that quality policies, procedures and standards are commonly lacking, minimal or ad hoc. It is an aspect of digitization that often appears under resourced and lacking in specific expertise.
Recommendation 67: DiSSCo should establish a common quality policy and standard for digitization, including adopting a prevention approach to digitization process and workflow design and appropriate training of personnel.
Philip B. Crosby’s four absolutes of quality are a useful framework for thinking about prevention in digitization. The first absolute is that the definition of quality is conformance to requirements. This means a digitization process is a ‘quality process’ when it is producing digitized outputs that are fit for the purpose the user requires. The second absolute is that the system of quality is prevention, which means digitization process designers and operators focus on preventing errors and mistakes from happening in the first place (‘do it right the first time’) rather than accepting that it’s ok for errors to creep in and to fix them after the fact. The third absolute says that the performance standard is zero defects, meaning not that mistakes never happen but that there is no acceptable number of errors that should be allowed to occur, and that effort should be put into reducing and removing opportunities for error. To achieve zero defects means monitoring and measuring non-conformances in the process and its outputs, and the fourth or final absolute is concerned with that; saying that the measurement of quality is the price of non-conformance i.e., that metrics of such measures should be expressed as a monetary cost i.e., €. Only by taking such an approach is it possible to drive mass digitization costs downwards. Crosby’s approach to quality is just one of several approaches to Total Quality Management (TQM) developed over the past fifty years that include: W. Edwards Deming’s Deming Cycle, Joseph Juran’s ‘the Vital Few and the Useful Many’ philosophy, Six Sigma, Continuous Improvement and Lean Manufacturing, among others. Although perhaps not so well known these days, Crosby’s approach has been shown to be practically effective and straightforward to implement, with the possibility to obtain impressive results.
4.6.5.3. Data quality improvement as curational work
In addition to ensuring that preventative quality management in digitization is effective (above), DiSSCo must provide mechanisms for assessing data to discover and correct errors that exist prior to digitization and/or to enhance elements after digitization takes place. The following are all examples of aspects of data quality that might be improved during curation work:
Improvement activities are normally performed by data curation and publishing teams (curators, collection managers, data publishers and data users, data aggregators). From first beginnings of establishing principles of data quality of GBIF data [
Recommendation 68: DiSSCo should adopt a conceptual framework for quality assessment and improvement of natural sciences data and should select and implement appropriate data quality tests and assertions.
There is scope for substantial further innovation in tools and processes for improving data quality, for example:
4.6.5.4. Image quality management
Digital specimens composed of images and other data can be considered, for many purposes to be viable research surrogates for the physical specimens from which they have been derived. To be suitable for research i.e., to be of research quality images must meet specific requirements/criteria in respect of a whole range of different characteristics that can include size, resolution, colour space, colour accuracy, focus/sharpness, cropping, metadata and more. Varying requirements can apply for different purposes as well as for different collection types but in general these serve to define the criteria to manage the expectations of digitization processes and guide the acquisition of equipment. Table
Overview of image elements. Legend: R = Required; C = Conditions apply; NR = Not required; O = Optional
Imaging Workflow |
Speci-men |
Back-ground |
Colour Chart |
Scale Bar |
Labels |
Bar-code |
Insti-tution Name |
Other Ele-ments |
Conditions |
Microscopy Slides |
C |
C |
NR |
NR |
C |
R |
NR |
O |
|
Skins and Vertebrate Material |
R |
C |
C |
C |
R |
R |
O |
O |
|
Liquid preserved specimens |
C |
C |
O |
O |
C |
C |
O |
C |
|
Pinned Insects |
R |
C |
R |
R |
R |
R |
O |
O |
|
Herbarium Sheets |
R |
NR |
R |
R |
C |
R |
R |
O |
|
3D Specimen Models |
R |
C |
C |
NR |
C |
C |
O |
O |
|
In several domains of heritage digitization, quality criteria (such as size, image resolution, colour, etc.) have been developed for digitization of printed materials (pictures, documents, books). To an extent these can be re-applied for collection materials (such as microscope slides, and herbarium sheets) that are close to two-dimensional (2D) representations, as well as to digitization of supplementary materials such as labels, physical registries/catalogues, and field/lab books, etc*
However, requirements/recommendations for imaging of three-dimensional (3D) objects are much more variable and less well harmonised. The current literature on quality management of 3D models is more focused on the model structure (point cloud, mesh, polygons) than on the colours and illumination of the object. Nevertheless, 2D quality management methods can be applied to the 2D source images to ensure that the derived model textures are accurate. More importantly, in the case of photogrammetry, 2D image quality methods are needed to ensure the accuracy of the photogrammetric 3D model.
Currently, image quality is predominantly verified by visually inspecting a sample of the images in a digitization batch. Besides taking place ‘after the event’, this is neither sufficient nor practical when thousands of images are produced daily.
As explained above (4.6.5.2) prevention must be practised. Imaging requirements and meeting them through accurate operation of digitization processes and pipelines avoids unnecessary work and cost later. The aim should be to prevent errors in the first place and to be able to detect when these are happening. Automated, computer-assisted checks using computer vision techniques – for example to detect inappropriate cropping, and to quantify image sharpness, contrast, evenness of lighting, colour balance, proper detection and reading of barcodes, etc. – combined with statistical process control can avoid critical process elements drifting outside established tolerances. Instrument calibration (camera exposure, scanning speed), illumination, conveyor speed, etc. can all be monitored and controlled these days.
Recommendation 69: DiSSCo should harmonise and establish quality requirements for image characteristics for common image classes expected from digitization and should aim to prevent quality defects arising in digitization processes through the use of automation, computer-vision and statistical process control techniques.
A DiSSCo DCE (3.5) might be proposed to specialise in this area.
Several case studies of automation from e-commerce and the library sector were explored as well as two examples of a robotic arm in the heritage sector to provide an understanding of the current state of robotics and automated hardware as relevant to natural science collections for digitization, handling and storage*
Warehousing automation can improve retrieval times from storage, space efficiency and climate control. However, implementation of automated warehousing solutions requires significant adaptations of existing storage space or designs for new builds substantially different from existing collection storage solutions. Automation is likely to be greater interest when new storage spaces are being built where, for example shelving units compatible with robot operations must be used.
Presently, a fully automated solution from storage to imaging and back to storage is not realistic for the complex context of natural science collections. However, by developing independent components (including storage and retrieval, transport, object picking, and imaging) that can be connected in the future, progress can already be made towards an end-to-end solution. For the imaging component, due to the great variety of natural science collections it is inevitable that multiple automated imaging systems are needed.
Recommendation 70: To work towards a future automated end-to-end digitization solution, development should focus on independent components (including storage and retrieval, transport, object picking, and imaging) which can be connected in the future.
The natural science sector will need to work with commercial partners to develop and integrate such components. This is most likely to involve working with SMEs*
If there is a real desire to use automation solutions in natural science collections for warehousing and/or digitization, then some steps are needed on the part of the natural sciences community. We propose that a series of pilot projects is established, which SMEs can participate in.
Recommendation 71: If there is a real desire to use automation solutions in natural science collections for warehousing and/or digitization, a series of pilot projects should be established, which companies can participate in and in which collection holders collaborate.
DiSSCo, and its Centres of Excellence, could play a further role in developing the expertise to better communicate with companies. Also, DiSSCo can lead a concentrated effort for research and development in this field, to ensure the various pilot projects are aligned.
Recommendation 72: DiSSCo should play a role in the development of expertise in automation, in communication with companies, and aligning efforts in automation of natural science collections.
Digitization of Natural science Collections (NHC) evolved from transcription of specimen catalogues in databases to web portals providing access to data, digital images, and 3D models of specimens. These increase global accessibility to specimens and help preserve the physical specimens by reducing their handling. The size of the NHC requires developing high-throughput digitization workflows, as well as research into novel acquisition systems, image standardisation, curation, preservation, and publishing. Nowadays, digitization workflows (and fast digitization stations) can digitize up to 6,000 specimens per day. Operating those digitization stations in parallel, can increase the digitization capacity. The high-resolution images obtained from these specimens, and their volume require substantial bandwidth, and disk space and tapes for storage of original digitized materials, as well as availability of computational processing resources for generating derivatives, information extraction, and publishing. While large institutions have dedicated digitization teams that manage the whole workflow from acquisition to publishing, other institutions cannot dedicate resources to support all digitization activities. This motivated the research into alternatives for long-term storage of digital collections. National and European e-infrastructures are viable alternatives supporting different parts of the digitization workflows. To this end, three different e-infrastructures providing long-term storage were analysed through three pilot studies: EUDAT-CINES, Zenodo, and National Infrastructures.
The EUDAT-CINES pilot centred on transferring large digitized herbarium collections from the National Museum of Natural History France (MNHN) to the storage infrastructure provided by the Centre Informatique National de l’Enseignement Supérieur (CINES), a European trusted digital repository, part of the EOSC infrastructures. The upload, processing, and access services are supported by a combination of services provided by the European Collaborative Data Infrastructure (EUDAT CDI) and CINES.
The Zenodo pilot included the upload of herbarium collections from Meise Botanic Garden (MBG) and other European herbaria into the Zenodo repository (hosted by CERN). The upload, processing and access services are supported by Zenodo services, accessed by MBG.
The National Infrastructures pilot facilitated the upload of digital assets derived from specimens of herbarium and entomology collections held at the Finnish Museum of Natural History (LUOMUS) into the Finnish Biodiversity Information Facility (FinBIF). This pilot concentrates on simplifying the integration of digitization facilities to Finnish national e-infrastructures, using services developed by LUOMUS to access the data hosted at the Finnish IT Centre for Science, CSC.
The data models employed in the pilots allow defining data schemas according to the types of collection and specimen images stored. For EUDAT-CINES, data were composed of the specimen data and its business metadata (those the institution making the deposit, in this case MNHN, considers relevant for the data objects being stored), enhanced by archiving metadata, added during the archiving process (institution, licensing, identifiers, project, archiving date, etc). EUDAT uses ePIC identifiers (ePIC 2019) to identify each deposit. The Zenodo pilot was designed to allow defining specimen data and metadata supporting indexing and access to resources. Zenodo uses DataCite Digital Object Identifiers (DOI) and the underlying data types as the main identifiers for the resources, augmented with fields based on standard TDWG vocabularies. FinBIF compiles Finnish biodiversity information to one single service for open access sharing. In FinBIF, HTTP URI based identifiers are used for all data, which link the specimen data with other information, such as images.
These pilots are described in design reports*
Each of the three services offer cloud service to access data on-line, and a long-term data preservation service where data is in a hierarchical storage accessible through grid technology (IRODS). These long-term services were not tested in the ICEDIG project.
The cloud services were tested and significant differences in data upload and download performance were found. However, the results were not carried out using standard datasets and therefore are not fully comparable. During the tests random errors were encountered in some cases. Despite that use in principle has been shown to be possible, further work is needed to evaluate these (and other) storage options.
Recommendation 73: DiSSCo should develop harmonised policies, procedures and best practices covering the different kinds of storage solution available for a wide range of anticipated data storage needs by collection-holding institutions.
Collection-holding institutions also have a requirement to be able to support long-term preservation of data sets used for research; for example, aggregations of selected specimens combined with other data sources, the intermediate and final results of analyses and details of the methods used i.e., ‘research objects. The RO-Crate initiative*
Technical advances outstrip the ability of social structures to adapt to changes. Personnel involved in collections to be digitized must be part of the process of change and must become enthusiastic and engaged; especially since collection agents are a large set of end-users of both the eventual DiSSCo research infrastructure and the data produced and stored within.
The present section firstly investigates social aspects (work, skills) of digitizing biodiversity by understanding the current practices, responsibilities and roles of the collections holders and considering the cultural differences between countries and organisations. Secondly, it considers current capacities of collections holders to perform digitization and makes suggestions for capacity building, including needed skills profiles in digitization, data management and analytics, and taxonomy. Thirdly, it explores how opening the natural science collections affects scientific knowledge exchange, collaborations, and interdisciplinary research. It will advise on applicable working methods and approaches to science, given expanding access to vast amounts of digital data.
Analysis of the results of an online survey sent by CETAF as part of ICEDIG project work to targeted natural sciences collection-holding institutions at the end of 2018 forms the basis of this present section 5 on culture, skills and capacity building. The survey results, representing a large pool (143) of participating European collection-holding institutions provide a snapshot of the current level of competencies and the need for capacity building to support collection digitization.
The collections mostly being digitized today are, unsurprisingly, the botanical collections (herbaria), followed closely by lichens and algae, mycological collections (fungi), entomological collections (insects) and vertebrate collections. Amongst the least digitized natural science collections are anthropological and ethnographic collections, archaeological collections, and documentation archives of various kinds. The type of collection or their storage method strongly influences the digitization effort (e.g., Herbarium specimens are easier to digitize than specimens stored in jars with liquid preservatives).
Documentation (books, field notes, documents and illustrations) is sometimes being digitized as part of specific departmental or collection-oriented projects as well but the libraries of collection-holding institutions and general documentary assets are also being digitized outside of those departments directly involved in natural sciences. The latter are often part of an overall institutional digitization initiative and can end up being treated as a separate digital collection.
The majority (83%) of European collection-holding institutions are active in digitizing their collections, spanning from large scale (systematic digitization of whole collections or sub-collections, 44%) to small scale digitization (digitization of a limited series of objects/specimens, on-demand digitization, 39%). Only 8% of those interviewed are currently not performing any digitization at all. Additionally, carrying out digitization of collections can mean very different things for each collection-holding institution. This goes from turning physical objects into complete digital objects with added metadata and georeferencing included, down to just creating a database record for a specimen with minimal data and without images.
Digitization is still mainly performed in-house, with about 40% performed solely by internal staff without any external assistance or support. In only 1% of the cases is the digitization effort completely outsourced to an external partner, and in 6% it was partly outsourced. Over half (54%) of digitization efforts also involve the participation of volunteers and other unpaid collaborators or temporarily paid helpers (such as job students), who can contribute in various tasks of the digitization project (e.g., transcribing label data and making images). On average, full time digitization staff members make up 36% of the total workforce, 30% are part time staff, 18% are in-house volunteers (not paid), 9% are temporary paid positions, and 6% are outsourced external collaborators.
5.1.3.1. Tendency towards on-site digitization
All the collection-holding institutions represented in the ICEDIG project consortium have significant proportions (>50%) of their collections that still require digitization. A lack of common terminology and the variance in defining the extent of digitization makes institutional comparisons challenging. In terms of established digitization workflows, all institutions can digitize herbarium sheets at small scale. This reflects the simplicity of digitizing sheets that are largely two-dimensional, digitization methodologies well established and that herbaria are a common collection type across institutes.
As noted above, most of the institutions are actively engaged in digitization at large scales, and there is a strong current tendency towards on-site digitization, even when employing the services of an external contractor. This suggests that all institutions have some suitable space on-site for mass digitization, but further conversations would be needed to assess how much those activities could be scaled up in those locations. It was also notable that outsourcing, both currently and historically, centred almost entirely on the digitization of herbarium sheets. Macrophotography setups are ubiquitous but other more expensive digitization equipment such as laser scanners, computed tomography equipment and scanning electron microscopes are less common. There is potential within DiSSCo to negotiate discounts with suppliers for bulk ordering of digitization equipment.
Opportunities should be made available to encourage wider sharing of digitization expertise by encouraging networks and disseminations channels that can be joined by digitization professionals.
Recommendation 74: DiSSCo should encourage and provide opportunities in various forms (newsletters, fora, blogs, networks, conferences, etc.) for sharing expertise and knowledge among its digitization professionals.
5.1.3.2. Specialist digitization teams
Specialist centralised digitization teams are now the norm within several institutions. They are mostly trained in either photography or scanning. These teams are sometimes trained to use other visible light imaging equipment and may have technical skills to use OCR software or development software or scripts to support digitization. As a community we will need more staff trained in these skills, along with the capability to evaluate and utilise new software and approaches to digitization.
Those institutions not digitizing yet will require training and capacity building in almost all aspects of digitization and data mobilization, especially for the skills needed to operate the specialist hardware and software, but also in fundamental ICT support-related skills such as databasing of digital data and management of the digital archives and collections. Providing data to a suitable standard (for example to aggregators such as GBIF) can be as challenging as imaging and requires training and support as well. There is clearly a need for either externally provided training or better knowledge exchange with institutions that are more experienced because of their own digitization activities.
Recommendation 75: DiSSCo must plan and implement a comprehensive training programme covering all aspects of modern digitization and data mobilization.
5.1.3.3. Internal documentation and tracking
In terms of internal documentation most of the institutions have specimen handling and digitization process guidelines that collectively cover imaging, georeferencing, data entry and transcription. While some of this documentation may be specific to an institution, there is undoubtedly value in DiSSCo collating this documentation and sharing it as best practice guidelines. This would also help to clarify terminology and definitions of digitization terminology.
Recommendation 76: DiSSCo must offer a platform for sharing documentation and best practice guidelines of workflows from the digitization projects of its Facilities, so that new projects can start faster and learn from each other, with the appropriate citation.
Barcoding specimens has become standard practice across the surveyed institutions for maintaining a traceable link between digital and physical specimens, at least for centralised, large scale digitization projects. Despite the large-scale deployment of barcodes, knowledge of the differences and the pros and cons of each of the various barcode standards*
Recommendation 77: Stop using one-dimensional barcodes and move over to using a standard two-dimensional matrix code, with standardised data content, that can be read automatically from digital label and other images .
The provenance of the specimens from the field and through the various collections to their present disposition can often be tracked by analysing in sequence the various labels and annotations made on the specimens. The value of such information can be debated, but at least it forms a part of cultural history and discussion about it could be encouraged. Tracking the location of specimens by electronic methods appears to be somewhat underdeveloped generally, although there are several good systems employed within various institutions to act as references for others. The natural science collections community has yet to routinely add provenance data to digitized specimens. Making such documentation more easily findable, accessible and part of specimen metadata would make specimen tracking much easier.
More so than cultural factors, the institution size, collection types to be digitized and the availability of digitization staff all play important roles as potential limiters of digitization. The cultural background or the country in which the collection-holding institution is located appears to be neither determinant nor decisive for the decision-making towards digitizing. However, there is one facet where cultural aspects can become a factor to consider. This has only been observed when referring to volunteer work and citizen science support. In northern and western Europe, there is a long history in volunteering, and this has become a sign of progress and involvement among the civil society*
Speaking generally, the more mature the collection-holding institution is and the more economically and socially developed the location country is, the larger the external commitment towards digitization appears to be. Large institutions in developed countries are preceded by their fame and reputation. They can count on a bigger and more stable external workforce that become attracted, among others, by the size of the collections, the easy access for collaborative work, and the possibility to gain credit for work done. This is also promoted by the collection-holding institution’s themselves through their communication and dissemination campaigns that can be stronger and more impactful locally. This increases motivation and engagement of volunteers.
Recommendation 78: DiSSCo should assist its collection-holding institution members to develop and strengthen their external profile (marketing) with funding agencies, professional and citizen scientist groups and local communities appropriate to their location and sphere of collections related operations (i.e., research, education and exhibition).
One third (36%) of digitization staff receive on-the-job training. Within this already small group, half of them only receive training initially at the beginning of their digitization activities without follow up or additional training to stay up to date. Only a minority (10%) receive training multiple times per year.
When training is provided, it covers skills in basic digitization work, such as handling of collections, use of digitization hardware (cameras, computers…), utilisation of digitization software and the use of a collection management system. About half of the trainees also receive training in more advanced aspects such as data management and archiving. Less than 20% of respondents receive training that deals with related critical activities such as ICT support, working with volunteers, and new developments in digitization. Finally, only 5% learn about implementation of automation and robotics. Specific aspects of training currently missing in many cases are knowledge and skills in technological aspects, working with shared or common data references and identifiers, and skills in manipulating and working with physical collection objects.
These results suggest that many staff employed to perform digitization tasks are not adequately trained to do so. This observation is supported by many respondents’ answers provided verbally, reflecting their own self-assessments. Lack of training to effectively operate the hardware and/or software used to digitize collections, basic general digitization skills and associated elementary collection actions such as handling specimens or a rudimentary knowledge of taxonomy and interpretation of taxonomic data are all missing from a large proportion of staff involved in digitization. There is an overall and urgent necessity for both general and specialised training for staff covering all aspects of digitization at all complexity levels and an urgent need for professionalization of such work.
A different approach was taken by Digitarium*
Recommendation 79: DiSSCo should organise a training curriculum for its member institutions covering: i) technological aspects, such as features and operation of equipment and software; ii) standards, i.e., museum and archival practices including data standards, in particular unique and persistent identifiers; iii) efficient digitization workflows in various situations, including quality management; and iv) for museum leadership .
Often mentioned, insufficient funding is the biggest limitation for digitization, principally restricting the ability to hire, train and keep experienced staff. Funding when available is often mostly project based and is rarely enough to hire staff that are either highly qualified (and who could then act as trainers to others) or that are fully dedicated to digitization for prolonged periods of time. Additionally, there are also insufficient dedicated funds for those costs of digitization that are not related to staff, such as the acquisition of the necessary technical equipment and the use of the most up-to-date and sophisticated software (which is often only available commercially). About 30% of the survey respondents mentioned these latter shortcomings.
Currently therefore, digitization activities are most often relying on external funding, which suggests that digitization is a priority for allocating core budget in collection-holding institutions only in a limited number of cases.
However, despite these apparent limitations there are opportunities, especially using European Structural Funds*
A national centre of digitization expertise in Finland, Digitarium was launched in 2010 and operated until 2017, funded by a series of grants from the European Structural Funds. Totalling €2.1 million, this funding covered 70-80% of costs, with the remainder coming from the host city and the two participating universities. The funding was used to build the technological base for mass digitization as well as the human capacities. Additional funding of €2.0 million was obtained from EU FP7 research projects, national research infrastructure projects, and commercial mass digitization services. This model of funding in eligible parts of Europe can be attractive for DiSSCo as a means of establishing the needed digitization factories and Centres of Excellence. It is worth noting that the ESFRI LifeWatch Research Infrastructure is being largely built on a similar basis.
Recommendation 80: DiSSCo institutions should look for opportunities to use EU structural and investment funds to build up the digitization capacities in eligible countries and regions. DiSSCo should centrally support this activity with application packages and support for proposal writing targeted specifically for these funding sources which are not research oriented but aim for economic and social development.
Mass digitization and Digitization on Demand have been described (3.9, 3.10) as processes (activities) both taking place today and foreseen to take place at larger scale in and across the collection-holding institutions that make up DiSSCo. Beyond one-off funded projects (albeit lengthy ones in some institutions) when does digitization become ‘business as usual’ and what is needed to make that happen? What is meant by digitization business as usual? What is ‘digital by default’? Indeed, how will digitization adapt and look like in the future as new innovations are introduced, as funding models changes, and as organisations collaborate more and co-organise in pursuit of digitizing and digitally exploiting Europe’s 1.5billion specimens?
Many museums are already organising or beginning to organise their digitization beyond specific projects by establishing specialised functions and teams. LUOMUS (Finland), for instance has two teams (botany, entomology) for mass digitization. Naturalis (The Netherlands) has coined the term “permanent digitization infrastructure” as everything required within an institution dealing with digitization on a day to day basis e.g., the digitization of specimens sent out on loan, a type specimen that needs imaging, volunteers that digitize a private collection, etc. That doesn’t include mass digitization and digitization-on-demand. In other collection-holding institutions, hi-resolution digitization (e.g., of specimen parts) and selective digitisation (e.g., of type specimens) becomes more focussed in specialised units with appropriate expertise and equipment.
Each collection-holding institution has a different balance between what it considers to be ‘business as usual’ and what is done by central teams, functions or projects as a separate workflow. Working collectively towards the DiSSCo aim of digitally unifying all European natural science assets under common curation and access policies and practices that make the data easily FAIR is bound to bring further and sweeping organisational changes and these should be studied to identify best practices.
Recommendation 81: DiSSCo should investigate and promote best practices for operating models within collection-holding institutions whereby digitization becomes business as usual and digital by default.
To determine the effects of opening natural heritage collections to wider access by means of digitization, we here consider the effect on collaborations and research (5.2.1), on the mobility of collections in terms of loans and visits (5.2.2), and on education, citizen science and private collections (5.2.3).
When considering the real need for digitization of natural science collections, one important argument is that the collections would be opened for a far broader usage, not only for the community of natural sciences researchers but for society at large.
In cases where digitization has already had a noticeable effect, that is considered overwhelmingly to be positive or very positive. In general, digitization broadens the impact and strongly facilitates many aspects of collection-based work, both internally and externally, with positive effects attributed to one of the following categories:
Recommendation 82: DiSSCo should deploy metrics (key performance indicators) to monitor impact and progress in collaboration and research facilitated by digitization and should publish the results annually.
One possible side effect of this success story that should not be ignored however is the changed workload (perhaps increased) on the shoulders of the personnel conducting digitization and those responsible for arranging and shepherding access to physical collections, which may increase as a consequence of increased external visibility and accessibility
Opening access to the collections also influences the mobility of collections in terms of the numbers of arranged loans and visits. Half of the interviewed respondents noticed there is a difference in terms of the number of loans and visits when comparing the current situation of (totally or even partly) digitized collections to times before digitization began. The effects are the following (expressed in rounded-up numbers, with the other 50% noticing no effects):
The very broad distribution of responses in terms of effect correlates with the completeness of the digitization as well as to the type of data provided (e.g., with or without images) in each institution. Collections that are fully digitized may experience fewer loans and visits, since researchers can check in advance if the collection is worth visiting and what individual specimens of interest are stored in the collections, and can often use the digital data e.g., for identification. The same is true for collections with an image even if they are not fully digitized. Still, the need to send material on loan seems to decrease. For collections that are available online, but without an image, both loans and visits seem to increase, indicating the need to verify the reliability of the available information, e.g., to check if the provided identification is correct, and the impact of saying what is in collections without providing further details (I.e. awareness increases but the digital data are not research-ready by themselves).
Additionally, when the entire collection is digitized, the requests are also more specific. When only the type specimens are digitized, there are more requests for specimens (more specimens per loan), but this does not impact the overall amount of loans or visits. Finally, digitization-on-demand is regarded as a good alternative to loans of physical specimens when no further examination of the specimen is required (e.g., dissections, taking detailed measurements, etc.).
Digitization, perhaps particularly when associated with other changes such as collections moves, can be an opportunity to revisit institutional processes more broadly. Given the uncertain impact of digitization on physical access and loans, DiSSCo may wish to pilot ‘digital by default’ access approaches which focus on digitization-on-demand first, followed by physical access or loans only where the digital data cannot meet the research need. DiSSCo may also wish to put in place metrics that show how digitization is spreading across all collections in a museum, as a new standard of practice.
Recommendation 83: DiSSCo should deploy metrics (key performance indicators) that show how digitization is spreading across collections and to monitor the changes in mobility and usage of collections. The impact of digitization should be assessed on a regular basis.
Collections are an invaluable resource for education, both for formal and informal education. Formal education is usually directly supporting school curricula and taking place on the school premises while informal education can take place virtually everywhere and is not bound to school curriculum.
As illustrated in Fig.
Digital content of natural science collections (images, maps, etc.) is already actively used in informal education*
Recommendation 84: As well as providing access to Digital Specimens for research purposes, DiSSCo should consider the additional and different aspects that can pertain to providing access for educational purposes.
Although private natural science collections play an important role in building material evidence for taxonomy worldwide, they go largely unnoticed as a form of citizen science, whereas observation-based initiatives such as iNaturalist gain high visibility and impact. Representation of collection-based projects in citizen science portals is rather modest*
Recommendation 85: DiSSCo should actively promote the role of private natural science collections as a form of citizen science.
These portals, of which there are many, often utilise metadata of citizen science projects for their search engine backbone. However, these search engines are missing appropriate data fields that would allow discovery of projects based on private collections. This is most likely because the development of relevant metadata standards by, for example the European Citizen Science Association (ECSA)*
Recommendation 86: In working to promote private natural science collections as a form of citizen science, DiSSCo should take the lead to ensure that the metadata definitions needed to make private collections more publicly visible becoming incorporated into appropriate citizen science metadata standards, such as PPSR-Core.
In current institutional practice, staff are normally organised around fixed roles, each having specific tasks attributed to them. Examples of such roles are ‘collection curator’, ‘collection technician’, ‘digitization agent’, etc. However, to properly undertake digitization activities according to this template means cooperation is needed between actors holding different roles. As the means to introduce a greater flexibility in the management of skills and competencies within the digitization process, we suggest the idea of working with so called functional units rather than with traditional roles. A functional unit (in the sense of organising competencies) is a collection of competencies needed to perform specific tasks within the different steps of the digitization process. We believe this approach can allow a more fluid distribution of tasks among digitization personnel. Staff often already think in that direction, shown by the fact that 68% chose to identify themselves with more than one role during the survey. A tentative structure for digitization related functional units has been suggested (Appendix 1 in ICEDIG MS49 report) including a proposed list of tasks that can be attributed to each potential functional unit. In a small institution one person could be qualified for, and combine, tasks from one or more functional units while in a large institution several people could be capable of performing several tasks within one functional unit. How functional units are distributed amongst the staff can depend on different factors, such as the capacities of the institution (i.e., number of available staff), or the desired scale of digitization to be undertaken. In some institutions, especially the smaller ones with a limited number of staff, some people are already performing tasks that transcend their habitual role and combine tasks that would otherwise belong to people who hold different roles. This phenomenon was clearly supported by some of the survey respondents who consider this to be common practice within their own organisation. It is, for example, possible that the collection curator also performs tasks that can be attributed to a collection technician and/or a digitization operator.
Once the DiSSCo consortium is fully implemented, collections-related work could evolve from a research institution-based approach to a distributed research infrastructure-based approach. This means that while currently most digitization activities are taking place in the context of single institutions, with DiSSCo this could evolve and increasingly include more coordination between institutions and countries. Better coordination of digitization strategies and priorities would efficiently create the best possible digital collection of bio and geo-diversity specimens.
This would involve new and improved unified methods to tackle the challenges of collection management and digitization of collection objects and their associated data. It could free organisations from following the strict definition of roles and could act as a facilitator for the mobility of work forces and researchers as well as the creation of international competence groups. At the very least, it would provide institutions a common vocabulary when discussing collaborative actions, common projects or very specific things like comparing annotation workflows and when hiring / or designing professional development.
This is a concept that falls in the realm of Business Process Re-Engineering which has been practiced in the commercial sector for about two decades. Introducing a new European RI should certainly introduce some new and harmonised working and business practices across the DiSSCo membership that will need further discussion and could only be developed so far in the context of ICEDIG. How the work in each department will be organised will of course always remain at the discretion of each collection-holding institution.
Recommendation 87: DiSSCo to develop a strategy for aligning and unifying the work practices across its facilities.
5.3.2.1. Interoperability of Collection Management Systems with DiSSCo Hub
Collection Management Systems (CMS) are database systems used by collection holders to keep track of the data they possess about their specimens. As the primary data sources for the DiSSCo Hub, they will play a critical role in the infrastructure. Specimen data will have to flow efficiently and in a standardized way from those systems to the hub, where the data can be consulted alongside data from other sources. This way, interplay with other data is also possible for enrichment and validation purposes. After such enrichment and/or validation services have been applied, data should also flow back seamlessly into the CMS to maintain the CMS as the central authority.
Surveys indicate many different CMS solutions are presently in use. Some solutions are bespoke, in-house developments, while others are commercial products or the result of national consolidation initiatives. Small collections may not use a dedicated system at all. Migration from one system to another is a time-consuming and difficult process, made worse by an imprecise picture of which systems are available/in use and how they compare to each other. A landscape analysis of potential systems is needed, with a list of minimal specifications – at least those related to the needs of the DiSSCo Hub. Local needs and preferences may be harder to generalize. This will facilitate migration from outdated systems, but also the adoption of a first CMS by smaller collections. It will also discourage the development of further bespoke systems, which is to be avoided.
Recommendation 88: DiSSCo should prepare a minimum specification of an ideal collection management system (CMS) and select/recommend preferred alternatives from the available product solutions to meet member institutions’ various needs.
Some recommendations for CMS developers were drafted in ICEDIG project deliverables D4.3*
A further obstacle to efficient data flow from the DiSSCo Hub back into local CMS is the problem of trusting these new or updated data, in conjunction with the lack of a proper version history mechanism in most CMS. Most CMS are also unable to systematically store annotations relating to specific records or specific data fields of records. The overhead of dealing with multiple versions or annotations to records remains a separate problem to solve, but data should not get stuck in between different nodes of the pipeline.
Some CMS aim to cover a wide range of data types, while others are tailored for very specific needs. The institutional organization also has an impact, as libraries are often managed separately from biological or geological collections. Because of this, they use different systems and even different standards, despite the potential links that exist between books and specimens, or the overlap in certain types of data such as those related to people. For technical reasons, multimedia objects of specimens are also managed in a different manner than specimen data. Multimedia objects suffer from similar problems as data, such as versioning, proper use of identifiers and interoperability with media from other sources. A push for better standardization might be needed here as well, using for example “triple-eye eff”*
Recommendation 89: DiSSCo should adopt the “triple-eye eff” (iiif.io) image interoperability framework as its basis for media management and interoperability.
Finally, not all data are currently published from CMS. This is often related to (perceived poor) data quality or availability, as well as sensitivity of data relating to threatened species or valuable specimens. Some data are not published as they fit poorly in the existing exchange standards – or not at all. This is frequently the case for verbatim renditions of data, which are typically considered obsolete after transcription and interpretation, but which nevertheless still have various viable use cases (Table
Use Case |
Examples |
Application notes |
|
1 |
Facilitating data cleaning and indicating the degree of interpretation in the standardized fields. |
Dates that are found to be unlikely or impossible can be easily checked for typos or erroneous transcription. |
If a digital image of the label is available, then there is less need to check a verbatim transcription for validation. |
2 |
Discovering information hidden in the typography of how text is presented on the label. |
The syntax of person names can be a clue to the writer’s identity and for linking related specimens. |
This is unnecessary for most specimens but is valuable for enriching poorly documented specimens. |
3 |
Increasing the findability of specimens. |
Where a word, such as a place name, can be read but not understood, then the text can still be found. |
Original text can be searched in the original language. |
4 |
Accommodating partial or uncertain transcriptions, which would otherwise clutter standardized, interpreted fields. |
The use of square brackets ([]) and ellipses to indicate uncertainty or a failure to read part of the text. |
Other transcribers can build on the initial attempt, and it will be clear that the information is present on the label. |
5 |
Providing training and validation source data for automated text capture methods. |
Automated reading of 19th century handwriting and recognition of symbols used on labels. |
Finding gold standard training data for algorithms is a common problem. |
6 |
Accommodating data that are not sufficiently standardized for the interpreted field or that fail to comply with the restrictions of the interpreted field. |
Dates that lack a year or data awaiting interpretation. |
It is common to find verbatim fields containing data in non-standard formats, yet they are not transcribed data either. |
7 |
Accommodating data following obsolete or bespoke standards. |
Grid system location codes. |
When a database is migrated from one system to another, then verbatim fields are used to store old formats. |
8 |
Preserving the original language when interpretation has included translation. |
Habitats can have some very specific meanings in different languages, and they are difficult to translate because there may not be a direct equivalent. |
This also improves the findability of specimens written in a different language. |
5.3.2.2. Support for automated data capture in Darwin Core data standard
The Darwin Core (DwC) data standard*
5.3.2.3. Dealing with blank fields in Darwin Core data standard
Currently, little regard is given to unknown and incomplete data in Darwin Core biodiversity records. It’s not possible to tell whether a blank field is blank because the data is not available, is not known, has not been digitized, has been withheld or is simply missing because someone didn’t bother to fill the field with a value. Digitized specimen label data are the result of a complex and multifaceted digitization workflow. This workflow is often focused on core information of the specimen label and much data from specimens remains to be digitized. Curators and those working directly with specific collections/specimens often know when data exist but are undigitized. For the wider user community this can be important information, though it is rarely communicated. In deliverable D4.3*
5.3.2.4. Data about people
Data about people are one of the most overlooked components of specimen data. Such data are generally represented only as character strings with no specific format. Nevertheless, in recent years unique identifiers for living researchers and authors, such as ORCID identifiers*
Recommendation 90: DiSSCo should encourage the use of unique persistent identifiers for people collecting and working in collections.
5.3.2.5. Geography
The geographic location where a specimen was collected is a very important piece of information. It allows the specimen to be associated with other data resources tied to location, such as data related to climate, vegetation, soil and other environmental variables. It is also key for species distribution modelling. Location data can be modelled in different ways and at different hierarchies, but specific point coordinates are commonplace nowadays. Coordinates are always tied to a coordinate reference system (for example, latitude/longitude in the WGS84 system*
Georeferencing of a specimen can also be performed post hoc for historical specimens, based on interpretations of locality descriptions or information noted down in field notebooks. Biological observations in the 20th century were regularly made with the aid of national grid systems. These can be converted to coordinate point data with an uncertainty radius, but this distorts the initial extent of the uncertainty. It also causes extra work and further inaccuracy can be introduced because (as in ecological research, for example) data are often transformed into grids again for modelling purposes.
Names of geographic locations suffer from similar problems as names of people: they can change, they can be different in other languages, they may need to be disambiguated, etc. Hence, here too the use of persistent identifiers linked to formal geographic name descriptions (such as those to be found in GeoNames*
Recommendation 91: DiSSCo should encourage the use of persistently identified geographic name descriptions from recognised sources.
5.3.2.6. Data migration
Data migration is the process of selecting, preparing, extracting, and transforming data and permanently transferring it from one computer storage system to another. Additionally, the validation of migrated data for completeness and the decommissioning of legacy data storage are considered part of the entire data migration process.*
As noted above, migration of data from one CMS to another is a difficult and time-consuming process. Yet it is one that must be undertaken as part of the overall rationalisation of CMS solutions in use across the DiSSCo collection-holding institutions if levels of support and training to institutions in this area are to be improved as part of the DiSSCo programme.
Another kind of data migration occurs when collections of specimens are moved from one institution to another. This is best illustrated by an example of a collection of herbarium specimens that are presently being digitized. Most of these specimens carry a 1D barcode attached by the herbarium originally the collection. This identifies only a catalog number. No institutionCode nor collectionCode appears. Many of the specimens in this collection also carry the barcode and catalog number of a second institution, reflecting a transfer of ownership sometime in the past. The present digitization process includes attachment of a new QR code label to each specimen that contains a universally unique identifier (uuid) with a purl.org prefix. Thus, already 3 identifiers, with potentially similar records in 3 different database systems. This is a problem that needs attention urgently in terms of standardising digitization procedures for the future such that future difficulties are reduced.
5.3.2.7. Identifying data with CETAF Stable Identifiers and Natural Science Identifiers (NSId)
CETAF Stable Identifiers*
Recommendation 92: DiSSCo should encourage the further adoption of CETAF Stable Identifiers for the local and persistent identification of physical specimens.
The proposed Natural Science Identifier (NSId) scheme*
Like in the journal publishing and film/TV entertainment industries, this identifier scheme is independent of the underlying assets to be identified and their owning organisations, as well as being independent of the specific technology of the World Wide Web.
The new NSId scheme to identify Digital Specimens*
Recommendation 93: DiSSCo should identify benefits from and opportunities for third-party, value-added services arising through adoption of a Handle-based persistent identifier scheme for Digital Specimens, presently proposed as Natural Science Identifiers (NSId).
DiSSCo’s requirements for persistently identifying Digital Specimen and other object types are set out in Appendix C.
5.3.3.1. Costing as a new practice
Keeping records of costs is a new practice that should be adopted. It should be possible to account for:
Optimal digitization cost is achieved when the volume and availability of specimens ready for digitization matches the capacity of the digitization facility. Having enough specimens ready means the digitization capacity can be effectively utilised and the highest throughput can be achieved, thus leading to the lowest cost (notwithstanding other factors contributing to cost and the assumption that the digitization facility is dimensioned sufficiently for the task). Too few specimens ready means the capacity is underutilised, meaning higher cost per specimen.
What an institution wants to know is: When can certain kinds of digitization be achieved for specific levels of investment? When does it become practical/economic to start digitizing a collection? What does it cost to invest for digitization and to reach a certain level for a collection e.g., one of the MIDS levels or dashboard goals? How much is it going to cost to maintain? Gathering cost information begins to inform answers to such questions.
Work in the ICEDIG project*
Recommendation 94: DiSSCo must evaluate, adopt and support modern alternative(s) to traditional spreadsheet approaches for gathering, collating, and analysing cost information and for budgeting and management of DiSSCo costs.
Different currencies (sterling and euro) have been used during the cost gathering and analysis work in the ICEDIG project. To make comparisons between gathered costs reasonable, an artificial currency (‘Purchasing Power Standard’, PPS*
Recommendation 95: In cost gathering, analysis and reporting, DiSSCo should convert, at the time of data entry from the currency of data entry to a standard currency for analysis and comparison purposes.
5.3.3.2. Costs versus charges
Costs must be treated separately from charges. A cost model is not the same as a charging or business model and cost calculations cannot be considered in isolation from a business/charging/organisational model, because of the influence of DiSSCo governance decisions and policy on requirements for digitization, data access and availability. Digitization can be required to a certain level. Some data may be more immediately available than other data, according to scientific demand and difficulty to retrieve (faster and easier versus slower and more time-consuming).
Analysis of potential business (and thus charging) models is tied closely to questions of DiSSCo organisation and governance (see 6). Nevertheless, it is likely that the DiSSCo business model should use the fundamental assumption that data must be ‘free at the point of use’ i.e., at no charge to the ultimate end-user. Within such a constraint, various charging models are conceivable, including for example: i) a research subscription model, whereby an institution or project wanting to provide its research staff with access to digital collections data pays a subscription for DiSSCo user membership; ii) an open-access model, whereby those demanding digitization of specimens pay for that e.g., through a funded digitization programme; iii) an extension/re-orientation of the current loans and visits model, whereby costs of organising loans/visits are re-allocated providing FAIR access to digital content; and iv) obtaining money from industry in exchange for “free” data.
Any business model must, however, take both depreciation of capital equipment and amortization of intangible assets i.e., data into account, such that these costs can be accounted for and recovered over the long-term.
Recommendation 96: The DiSSCo business model must take depreciation of capital equipment (tangible assets) and amortization of intangible assets (i.e., DiSSCo data) into account, such that these costs can be accounted for and recovered over the long-term.
Most personnel concerned with digitization in collection-holding institutions have a positive attitude towards the future and for the continuation of their current digitization efforts, with the hope to implement improvements and to upscale them. Nevertheless, current digitization efforts are mainly project based and depend almost entirely on volatile and temporary funding sources. In some cases, especially in the smaller institutions, the current situation is dire and digitization efforts are carried out by too few people who are involved in too many other tasks and roles within their institution. For them, without more dedicated effort, digitization of entire collections is either impossible or progress will soon come to a halt until a new project or opportunity is found. For those working in an collection-holding institution currently undertaking digitization with enough supporting resources (facilities, equipment, personnel and, more importantly, with a strategic vision) the effort needs to continue to complete the digitization of their collections, or even going further by tackling additional challenges such as the addition of collections outside the collection-holding institution (private collections, etc.).
Several crucial aspects must be taken into consideration when addressing the needs for capacity enhancement and for providing proper skills to the people involved in digitization. Still, it remains difficult to draw uniform conclusions a very diverse pool of institutions is involved, some of which have the means to perform large scale digitization and even outsource parts of the digitization activities, while others are very small collection-holding institutions with sometimes only a handful of people in service (occasionally just one person who combines all digitization tasks), trying to build up digitization actions with the little means they have.
Four priorities for capacity building have been identified:
For these, a capacity building pipeline is needed, and DiSSCo and institutional leadership must plan for that.
Recommendation 97: DiSSCo and institutional leadership must plan for capacity enhancement in i) training of digitization and allied personnel; ii) funding to hire digitization personnel on long-term and for the effort in general; iii) increase in dedicated digitization staff; and iv) investments in modern efficient equipment, software, CMS and data solutions.
Alongside these, there is a general need for standardisation and the systematised sharing of best practice and common approaches. These issues are relevant to both institutional, and individual staff, capability. To support institutions to understand and to improve their capability, DiSSCo should provide a self-assessment tool that identifies levels of ‘digitization readiness’ (including data mobilisation) i.e., a digitization maturity assessment model. This would enable organisations to understand their own needs and what next steps they might take, and to support their case for these steps e.g., to funders. For individuals, DiSSCo will need to continue work to identify digital skills and competences which can be matched to roles and functions as appropriate, setting out a framework and tools to help individuals to identify their own levels of capability and their development and training needs i.e., competency matrix and skills profiles*
Recommendation 98: DiSSCo should create focus on harmonised tools and frameworks to help institutions and individuals understand and develop their skills capabilities, needs and professionals (such as: digitization maturity assessment model, competency matrices and skills profiles, career paths) and should make the case to address these with each collection-holding institution.
Following the priorities detected above, participants in the survey identified needs for supporting activities, with capacity enhancement and training actions among them, that would help to increase the impact of collections digitization in science and society. These activities comprise an enhanced involvement of citizens that can contribute as citizen scientists or volunteers and a more targeted training in addition to the general training for the professional staff and volunteers involved in digitization activities to address possible shortcomings.
There is a strong advocacy to strengthen interdisciplinary research and to improve general public engagement by making the digitized data available on openly accessible and user-friendly platforms.
Training must be provided on a regular basis. Current training in digitization is generally inadequate, in terms of frequency, content, availability, professional credit and recognition (e.g., only 50% of staff receive initial training at the start of the job and even less enjoy recurring training). There are still substantial gaps in capacity building actions today and training at regular intervals is almost unheard of, which diminishes dramatically the chances to have staff working at leading edge with the state-of-the-art techniques. Keeping track of rapidly evolving techniques and acknowledging global standardisation endeavours require focused and expert training beyond the individual institutional approach, from mature organisations and highly experienced trainers. Even more critical is the fact that digitization staff often doesn’t have the adequate background to effectively operate the hardware and/or software that is used to digitize collections. Some of the digitization staff are also not trained enough in general digitization skills, even on a very basic level. There is an overall and urgent necessity for increasingly specialised training for dedicated staff that should cover all aspects of digitization at all complexity levels.
Training needs vary significantly from one institution to another and cannot easily be categorised since a very broad lack of digital and to a lesser extent, analogue skills has been detected, with data skills standing out as a specific need. This widespread need for capacity building not only impedes digitization efforts but also prevents innovation in this area.
Before that, collection-holding institutions need leadership who understand the care and feeding of data (for longevity, re-use) and can plan cyber/human infrastructure needs and training accordingly or they must have access to experts with that knowledge.
Recommendation 99: In alliance with an appropriate training provider, DiSSCo should develop and promote executive/senior level training in collection-holding institutions, with a specific focus on collection leadership, mobilisation and use in the digital information age.
When DiSSCo is implemented, collection related work is likely to evolve from an institutionally based approach to a distributed infrastructure-oriented approach. This means that while currently most digitization activities are taking place within single institutions, with DiSSCo this could evolve and at least include coordination between institutions and countries regarding digitization strategies and priorities to efficiently create the best possible European digital collection. This would involve new and improved unified methods to tackle the challenges of collection management and digitization of collection objects and their associated data. A working approach based on functional units could help to streamline the digitization efforts and improve efficiency by reorganising the competencies in respect to the existing capacities per institution. It could free organisations from following the strict definition of roles and could act as a facilitator for the mobility of work forces and researchers as well as the creation of international competence groups. At the very least, it would provide institutions a common vocabulary when discussing collaborative actions, common projects or very specific things like comparing annotation workflows.
To tackle this challenge, DiSSCo institutions must look beyond their own realm. Finland approached this challenge in 2010-2017 by establishing a national digitization centre funded by the European Structural Funds (ESF). The operation had to change its modus operandi several times but was largely successful.
Recommendation 100: DiSSCo institutions should form consortia that consolidate activities by launching national or regional centres for large scale digitization and offer out-sourced digitization services, training and other capacity building for in-house digitization at the institutions. These could be funded through European structural and investment funds.
Recommendation 101: DiSSCo centrally should launch thematic DiSSCo Centres of Excellence that support regional and national centres with technological innovations needed to ramp up the speed of digitization to the required levels of output. Such Centres may or may not be connected to digitization consortia/factories.
Looking beyond Europe, such an approach might also be developed in conjunction and/or alignment with institutions in other regions (e.g., USA), helping all to better develop the 21st century workforce.
Finally, there is an urgent need for parallel initiatives around i) further awareness raising and training, and ii) development and promotion of pilot applications and exemplars as a means of convincing those that need to engage and participate to DiSSCo of the value of doing so. The former would introduce DiSSCo to newcomers and assist participants to better understand the existing plans and implementation strategies and their potential benefits for users. The latter, through the availability of working examples of good solutions (as has already been shown with the ICEDIG Digital Specimen Demonstrator) greatly accelerates understanding and helps to convince future users of the potential benefits. The latter could include ways to help institutions with small collections to get started on their digitization journey. Both activities must contribute to showing how DiSSCo can support the research goals of individuals and how DiSSCo can actively support and enhance the work of specific stakeholder groups.
Recommendation 102: To communicate and demonstrate the value of participating in DiSSCo, the DiSSCo Coordination and Support Office should initiate: i) further awareness raising and training, and ii) development and promotion of pilot applications and exemplars. Both activities must contribute to showing how DiSSCo can support the research goals of individuals and how DiSSCo can actively support and enhance the work of specific stakeholder groups.
Such activities are important also to refresh and deepen DiSSCo’s understanding of requirements (and how those may be changing over time) and of the practices and activities of individuals working with the data of natural sciences collections.
By entering the ESFRI Roadmap in October 2018, DiSSCo initiates its Preparatory Phase (PP); a period that will conclude with the formation of the legal entity of DiSSCo. The PP governance proposed in the present document is considered here as part of a succession of models for each programme phase, as illustrated in Fig.
The Preparatory Phase is entirely oriented to guarantee DiSSCo reaches maturity to be constructed and fully operational afterward. Readiness in five specific dimensions (data, technical, financial, governance and scientific) demands a governance structure able to respond to new demands and requirements during the preparatory phase while guaranteeing a smooth transition towards a country-based research infrastructure.
It becomes necessary that the Steering Committee, the governance model of DiSSCo that sufficed during the first years (up to 2019) moves in 2020 to a more democratic and inclusive form of governance, a General Assembly (GA), that levels the playing field for national consortia involvement in the developments of the preparatory phase while securing the sustainability and operation of the Coordination and Support Office (CSO) activities.
It is necessary that this interim governance model guarantees the financial sustainability of the preparatory phase, anticipating the possible negative effects of the potential critical funding path (see 6.3.1 below) on the sustainability of the Coordination and Support Office activities and consequently, its direct impact in communication, engagement, the international positioning, the scientific and technical missions, service deployment & others domains.
Additionally, the interim model must ensure excellence in the way DiSSCo achieves all matters related to the scientific and technical mission and service objectives. In doing that, it is necessary to constitute several independent advisory bodies (6.2.4 below) to provide expert consultation services. These include the Scientific Advisory Board (SAB), a Technical Advisory Board (TAB), a Funders Forum (FF) and a Stakeholders Forum (SF).
During the Preparatory Phase, the research infrastructure will constitute an institutions-based General Assembly (GA) that advises and takes decisions on the implementation of the tasks defined through a portfolio of interlinked projects. At that stage, it is necessary the model guarantees the continuity of the community active in running the projects and other activities while stimulating the creation of new national consortia in those countries that have not set up one yet.
The new Preparatory Phase governance model pursues the empowerment of DiSSCo national nodes that progressively will take a key role in the engagement of DiSSCo country-level funders. The complexity of the task will be partially pared down by the Coordination and Support Office’s actions, including the formulation and implementation of relevant consulting bodies, the Funders Forum and the Stakeholders Forum.
Those forums, meant to be part of the DiSSCo Prepare project, will constitute key instruments for the sustainability of the research infrastructure and its position at national, regional and international scale during implementation. The way how these two bodies engage and operate will be crucial to reach successfully the implementation phase. In that sense, there are open questions, mostly related to engagement and operation.
DiSSCo will need to establish effective ways to engage authorities at this stage when DiSSCo is still an institution-driven research infrastructure and the bodies do not have any decision-power. The research infrastructure will need to be able to prove a level of maturity, at technical and governance level, which is unusual in such an early stage. At the same time, DiSSCo will need to develop a sound engagement with existing initiatives of strategic interest at national level, in order to attract the attention towards the potentialities of the services that will be provided in the future.
For operation, it will be necessary to decide which are the necessary instruments for the operation of these two bodies that guarantee the achievements of their goals.
Tasks definition, coordination with other advisory and decision- bodies to achieve real contributions and mechanisms to communicate the content of the discussions to a realm of different national realities have still to be defined.
Furthermore, the new governance model will need to guarantee that the nodes represented receive the support needed to adopt the best infrastructure possible for the national interest. Clear targets and guidelines on requirements and best practices in advocacy and communication are still to be defined.
A third domain that will need an equal investment of resources refers to scientific and technological excellence. How to position DiSSCo in an existing fast-paced environment at scientific and technical levels will be a main challenge and requires both, scientific and technical guidelines and, a trustful stakeholder network. It will be necessary to develop an engagement strategy beyond the environmental domain, investing in participation in international fora, collaboration in transdisciplinary projects, knowledge exchange and, above all, the independent expert guidance of the Scientific and Advisory Boards (SAB & TAB).
How the recommendations of the SAB and the TAB percolate down the different levels of execution and expertise across different activities (i.e. synchronisation groups, technical team, etc.) should be analysed to guarantee the alignment necessary to guide DiSSCo future developments.
Last, but not least how the actions towards positioning DiSSCo benefits from the interactions in the Stakeholders Forum and contribute to the SAB/TAB discussions, should be also investigated to build a communication that conveys key messages in both directions.
6.2.2.1. Scope
The General Assembly (GA) will be the decision-making body for DiSSCo during its preparatory and transition phase which means the period starting with the admission of DiSSCo onto the ESFRI roadmap and ending with the creation of a separate legal entity for DiSSCo. The financial contributions will define the levels of participation at the General Assembly. The mode of operation of the new body requires specific terms of procedure.
The GA aims to be a more inclusive form of governance and to provide active participation to all its members. Comprising representatives from all DiSSCo consortium (MoU) members, it provides the forum for multilateral discussions on DiSSCo developments during its Preparatory and Construction phases and the necessary policies and strategies to be implemented in the early stages of the Construction and Operation phases.
6.2.2.2. Membership
Different types of membership will be established, depending on the possibilities of members to contribute financially and their state of development towards a national DiSSCo consortium.
In general, DiSSCo aims for representation - on the national level - through a developed national consortium or node. Within each node, the partners will sign a national Memorandum of Understanding (MoU) to formalise their relationship and appoint a representing institution. This institution then signs a European MoU with DiSSCo in the name of the national node. Once the European DiSSCo MoU has been signed, the member can apply for membership to the GA with a written letter to the Chair and a copy to the Coordination Office.
However, DiSSCo recognises that not all countries are the same when it comes to the maturity of national networks of natural science institutions. Therefore, an exception to the desired representation mode of a national mode will be granted for the duration of the interim governance. This exception states that in such cases where a national node has not yet formally formed, natural science collection-based institutions can become a member of the DiSSCo GA.
Types of Membership:
6.2.2.3. Decision-making
As the decision-making body of DiSSCo, the GA will need to take decisions concerning:
These decisions will either be taken during GA meetings or intersessionally through a dedicated platform.
The Preparatory Phase of DiSSCo has started in 2018 and aims at the improvement of the overall Implementation Readiness Level (IRL) of the infrastructure across all the relevant dimensions of its future operation. To achieve this, DiSSCo is currently linking together a series of externally funded projects that distinctly contribute into one or more of those dimensions. These projects are part of a multi-partner multi-million work programme that includes tens of institutions and hundreds of contracted researchers, curators, software engineers and managers. As the executive body of the DiSSCo Research Infrastructure (RI) preparatory phase, the DiSSCo Coordination and Support Office (CSO), is acting as interim support office until the RI becomes operational. It holds the overall responsibility for the successful coordination of the pan-European projects linked to DiSSCo and the synchronisation with national activities.
The CSO undertakes multiple tasks, pursuant to the European DiSSCo MoU:
To a certain extent, several aspects of the above list of responsibilities are incorporated into the work programmes of the DiSSCo-linked projects. Part of the crucial list of activities of the DiSSCo CSO, however, are not (or cannot) be included in the externally funded DiSSCo-linked projects because of limitations in the scope of those projects. Those activities will be supported, for the duration of the Preparatory phase, by the resources contributed by the DiSSCo Consortium, and specifically the DiSSCo Governing Body full members. This new governance and financial model ensures the operation of the DiSSCo CSO during the preparatory phase with emphasis on capacity enhancement, outreach and engagement, alignment and coordination, training, business development and infrastructure piloting.
Over the last three years, the small team of three people gradually increased to a group of nine professionals (as of March 2020) working across the different operational aspects of the office. With the support of the winding-up Steering Committee and the DiSSCo nodes, the CSO has now access to an efficient toolkit, which allows it to navigate the strategic, technical and organisational complexities of the four-year Preparatory Phase
The advisory boards, formed by international experts, will play a key role in highlighting issues to consider, tabling risks to mitigate, or identifying specific new challenges to address. They will ensure that both areas – technological infrastructure and scientific coverage – are aligned and developed coherently.
In doing that, the advisory bodies may be supported by other consulting bodies and forums established during the preparatory phase.
6.2.4.1. Scientific Advisory Board (SAB)
DiSSCo is a Research infrastructure aiming at providing end‐user scientific services in support of frontier data‐intensive science. As such, its development and operation are heavily predicated upon its continuous ability to serve the needs of a diverse set of scientific, political and industrial users. As those needs change, DiSSCo needs to constantly develop its services through a permanent and productive dialogue with its users. To this end, the DiSSCo Scientific Advisory Board will take stock in this process by providing expert advice across the DiSSCo governance and executive structures.
6.2.4.2. Technical Advisory Board (TAB)
DiSSCo, as a research data infrastructure, is heavily investing in its ability to link together and serve data classes related to the European natural science collections. To this end, it plans to deploy a comprehensive data model, which enables the development of world-class e-Services for a diverse set of scientific, industry and policy audiences.
DiSSCo does not develop its technical architecture in isolation. Instead, it makes use and augments existing underlying and complementary data systems and services, whilst innovating where necessary to reach the required technical readiness level.
The Data Management Plan (DMP) of DiSSCo will provide overall guidance on the future implementation and operation of the infrastructure. The implementation of this plan will require a series of key technical decisions to be made, whilst the ever-changing technological landscape mandates further amendments to the existing DMP.
The DiSSCo Technical Advisory Board (TAB) mission will be to provide expert consultation services in all areas related to the technical sphere of operation of DiSSCo. It has the following objectives:
6.2.4.3. Funders Forum (FF)
During the DiSSCo Prepare activities, undertakings will be pursued to develop enough level of trust by national funders in the benefits DiSSCo will bring for their national efforts and scientific agendas. Discussions at national level on their involvement in the RI will contribute to further set up a refined governance and business structure for the subsequent phases of DiSSCo.
Representatives from national authorities will participate as representatives in the future governance model of the Research Infrastructure during its construction and operational phases. In its role, the mission of FF will be to provide recommendations on both, strategic and operational planning, that guide DiSSCo on how to set up a smooth transition from a project-based model to a sustainable organisational and financial model, well-embedded into national roadmaps.
The Coordination and Support Office activities together with DiSSCo Prepare Work package 8 team will set up and support the operation of the FF, ensuring a fluent communication with the interim General Assembly.
6.2.4.4. Stakeholders Forum (SF)
Ongoing discussions with initiatives working in similar areas of interest at European and International level led to DiSSCo’s involvement in a series of initiatives and projects, positioning DiSSCo as a key Research Infrastructure, a foundational layer for cross-cutting research.
During the DiSSCo Prepare project, it is planned to set up a Stakeholders Forum to ensure permanent engagement. This is intended to strengthen engagement with the stakeholders i.e., the institutions that support and develop the DiSSCo Research Infrastructure. The way stakeholding institutions relate to and work with DiSSCo differs from country to country. Most have constituted a National Task Force (NTF) with a leading spokes institution, while in other countries institutions operate individually.
The mission of the Stakeholders Forum will be to provide critical feedback from a scientific, technical, data, organisational and financial perspective during the Preparatory Phase.
The mode of operation of the Stakeholder Forum will be part of the developments within the DiSSCo Prepare project.
6.2.4.5. Industry Forum (IF)
To channel needs of the community to industry and to obtain technological and other perspectives on what is possible it can be helpful at some moment to establish an Industry Forum (IF).
In addition to the DiSSCo Coordination and Support Office (6.2.3), the Strategic Alignment of Projects (SAP) group and the Synchronisation Groups (SG) have been established to improve coordination within the DiSSCo Programme.
6.2.5.1. Strategic Alignment of Projects (SAP)
Running a programme of multiple simultaneous projects linked to the DiSSCo vision (e.g., ICEDIG, SYNTHESYS+, COST MOBILISE, DiSSCo Prepare) (Fig.
The Strategic Alignment of Projects (SAP) group consists of the Coordinators of the DiSSCo-linked projects and the DiSSCo Coordinator. Its mission is to guarantee the best use of the resources available and the achievement of the programme objectives.
SAP, as its title indicates, works towards the overall alignment of activities across the existing projects. SAP has recently set up five Synchronisation Groups (SG) to enhance the work led by SAP in that matter.
6.2.5.2. Synchronisation Groups (SGs)
Several thematic areas of related work have been identified as cross‐cutting topics relevant towards the readiness of the future DiSSCo Research Infrastructure. To enable efficient alignment and ensure adequate support between all the DiSSCo linked activities, five Synchronization Groups (SG) have been established:
The SGs act under the supervision of the SAP and comprise work package and task leaders from across the DiSSCo linked projects to guarantee the alignment of outcomes. The task of the SGs is to identify gaps, complementarities and/or overlaps among the different thematic streams of DiSSCo, ensuring that the work undertaken under different projects is sufficiently coordinated.
These SGs act as specialist groups advising across domains during the Preparatory Phase. Having accrued highly specialised knowledge during this phase and representing a highly specialised critical mass, the SGs will most likely continue in further developments during the implementation and operational phases. How that will evolve, which role they should adopt and how their relationship with the advisory bodies will be established are open questions that will need to be answered during the DiSSCo Prepare project.
ESFRI-endorsed European Research Infrastructures usually go through a succession of semi-standardised funding models. Those models can be summarised as follows:
Considering the different funding models, a RI needs to transition between them as it progresses from one development phase to the next. This creates a critical funding path that needs to be preserved throughout, for the RI to retain continuous funding. The critical funding path (Fig.
Like other European research infrastructures before it, DiSSCo transitions between the funding models A and B*
After discussions with national contact points and analysis of national decision-making processes towards RI funding commitments, a list of key criteria as perceived at the present time (spring 2020) can be summarised as the following:
Circumstantial criteria, such as internal political volatility or national fiscal capacity might temporarily also affect the position of a national government towards committing to the construction and operation of a European RI.
Considering the complexity and diversity of the criteria that might affect the decision for future funding of a RI, as well as, the lack of long-term financial commitment of countries towards RIs, it becomes apparent that the sustainability of a RI cannot be easily secured. Continuous adjustment of key operational and organisational parameters of the RI might be needed in order to retain the interest and commitment of national governments.
A crucial factor is the diversity of national funding flows towards research infrastructures, and specifically the lack of flexibility from governmental budgets in adjusting funding flows to the changing development requirements of the infrastructures.
The funding instruments and funding flow channels, which national governments opt for in support of the construction and operation of a European RI might be an important factor for the longevity of such commitments.
In this model, national governments directly commit resources to the RI. Commitments towards the RI usually last between from two to five years, depending on the development phase of the RI and the funding evaluation cycles of the national government. National governments can delegate the funding responsibility towards European RIs to corresponding governmental departments or scientific councils. This model (Fig.
Despite that this model continues to be the predominant one for the funding of RIs, a SWOT analysis (Fig.
Infrastructure funding is heavily dependent on the capacity of RIs to present strong science-policy narratives aligned with corresponding national-level agendas. To mitigate the risk of the volatile nature of government-level funding, RIs need either: i) to be able to diversify their funding streams, complementing national-level resources with other income; and/or ii) improve the stability of government-derived funding, securing funding cycles of a minimum of five years. Such longer-term funding cycles would enable the RIs to implement multi-year action plans, in accordance with typical 5-year science strategies.
The non-profit and public nature of European RIs can be perceived as a limitation to the capability of the RI to offer commercial services to relevant industry users. Additional limitations, regarding the source of income from commercial activities, also often derives from the legal vehicle that a RI has chosen to adopt. RIs that opt, for example for the adoption of the legal entity of the European Research Infrastructure Consortium (ERIC) are bound by certain limitations regarding the percentage of total income that derives from commercial activities. Similar limitations might be sometimes self-imposed by the governing bodies of the RIs, especially given that those governing bodies are usually comprised by representatives of the national funders that support the RI.
The above create a situation where RIs are perceived to have limited capabilities for commercial exploitation of their services, limiting the options to significantly diversify their income sources.
The increasing number of European RIs puts pressure to governments to invest in a multitude of RIs. The overhead for national governments linked to these commitments is not only relevant to the financial impact on national budgets but also to the administrative costs of monitoring and evaluating these investments and participating in their respective governance structures.
Consequently, the RIs often find themselves in a position in which they need to compete to secure funding. Furthermore, on many occasions, such competition is more prominent between RIs that operate within the same thematic area (e.g., environment, in the case of DiSSCo). That is because such RIs usually appeal to budgets owned by the same governmental departments or science councils. Such intra-domain competition for funding can, in turn, limit the potential for cross-infrastructure collaboration and subsequent interoperability.
A further factor influencing success in securing funding is strong national emphasis on specific areas of science as being most in the national economic interest. If an RI looking for funding is not addressing one of those areas, the opportunities to secure investment can be lower.
The increasing number of European RIs that apply for national level funding puts additional financial and administrative strain on national governments. This establishes distinct and usually isolated funding streams between funder and an RI. Such funding streams are difficult to establish and once in place they largely remain unchanged, in terms of funding size, unless the funder agrees to re-evaluate its contributions. Research infrastructures, however, are dynamic organisations that undergo significant changes in their operational and investment plans, as they evolve through their development phases. An RI at the beginning of its implementation phase typically requires significantly more financial resources (construction costs) than a RI already in its operational phase. Furthermore, these distinct funding flows do restrict governments from better evaluating how they can better balance funding allocated to a specific strategic direction or international reporting obligation (e.g., biodiversity monitoring) across the RIs relevant to a specific strategy or obligation. It becomes apparent that distinct and isolated funding flows from national governments directly towards RIs can reduce the level of funding responsiveness needed to meet the changing needs of the RIs.
To address this issue, several countries are now introducing an intermediate layer of coordination of scientifically related activities. In this model, a national bureau coordinates the funding allocations and flows towards the underlying international RIs and across the national participating facilities. This enables countries to retain a more agile way through which they can allocate funding towards relevant national efforts on one hand, and on the other hand, to the corresponding international infrastructures (Fig.
There are already successful examples that follow this model of hourglass funding for RIs.
In the Netherlands, NLBIF*
For more than 10 years, the European Commission has funded RI cluster projects linked to the identified major thematic domains in the ESFRI roadmap*
Such focused and demand-driven investments can be more productive and support RIs to achieve the level of needed cross-RI data and system interoperability. They will, furthermore, enable RIs to provide better, more focused and tailored to the needs, scientific services to their audiences and specifically support new users that investigate multi-disciplinary research objectives.
Frontloaded investments are a typical need for developing RIs, usually required either at the beginning of the implementation phase (construction) or during planned re-investment cycles for renewing equipment and enhancing software). Start-up and construction costs can add up to multiple times the projected annual operational costs of an RI and can represent a considerable commitment. However, the typical financing committed by national funders aims to mostly cover operational budgets of the RIs.
The challenge is especially acute for those RIs that typically do not procure or construct tangible assets (property, plant, equipment) that can be used as collaterals for securing financing. RIs such as DiSSCo are both distributed in nature and heavily reliant on process and software infrastructure to deliver their value. They require investments that do not traditionally lead to strong (tangible) asset value from the accounting point of view. Furthermore, the operation of distributed RIs is predicated on investments made across multiple legal entities that together are contributing towards achieving the DiSSCo vision. This further perplexes the ability of a RI legal entity to access finance.
To address this issue, RIs can investigate the opportunity to leverage complementary financial instruments, such as European Structural and Investment Funds and funds from the European Investment Bank (EIB). EIB has relevant programmes (e.g., EIB InnovFin and InnovFin Advisory programmes) that enable public organisations, and particularly research and innovation organisations, to benefit from accessing large scale financial instruments. InnovFin Advisory, for example assists RIs to improve their overall bankability i.e., their ability to access financing through commercial or public financial institutions. The bankability aspects of RIs is still a topic that is not well investigated and further work in this area would allow us to better understand how DiSSCo can make wider use of financing options available, in order to improve financial capacity.
Recommendation 103: DiSSCo should investigate what is needed to improve bankability (likelihood of financial success) against the range of financial/investment instruments (e.g., European structural and investment funds, European Investment Bank programmes) available to complement national government funding.
Further work is needed to understand what models need to come in place to enable RIs to access financing instruments. Such models can include for instance the extension of the financial liability of funders beyond the annual financial commitment. Shared liability models can significantly improve the overall bankability of the RI and allow for more effective financial planning. They require, however, a new approach in the way national funders see their responsibility towards these RIs at national and at European level.
Such work may lead to future consideration of alternative business models, especially in the way that software development, enhancement and maintenance and infrastructure operations are performed.
Recommendation 104: DiSSCo should investigate shared liability models for more effective financial planning and how these may lead to alternative business models for DiSSCo.
Collection digitization efforts have reached most collection-holding institutions across Europe. Much of the leadership and many of the people involved in digitization and working with digital collections wish to take steps forward and expand the efforts to benefit further from the already noticeable positive effects. The collective results of examining technical, financial, policy and governance aspects (summarised in the present document as the results of the EU ICEDIG project, grant agreement No. 777483) show the way forward to operating a large distributed initiative i.e., the Distributed System of Scientific Collections (DiSSCo) for natural science collections across Europe. Ample examples, opportunities and need for innovation and consolidation for large scale digitization of natural heritage have been described. Numerous (104) recommendations have been made to be considered by other elements of the DiSSCo Programme of linked projects i.e., SYNTHESYS+, COST MOBILISE, DiSSCo Prepare, and others to follow, and the DiSSCo Programme leadership as the journey towards organisational, technical, scientific, data and financial readiness continues.
However, several significant obstacles must be overcome as a matter of priority if DiSSCo is to move beyond its Design and Preparatory Phases during 2024. Specifically, these include:
Organisational:
Technical:
Scientific:
Data:
Financial:
Strategically, it is vital to balance these multiple factors – organisational and political, technical and engineering, scientific and data, financial and legal, operational and governance – against one another to achieve the desired goals of the DiSSCo programme. Decisions cannot be taken on one aspect alone without considering other aspects, and here the various governance structures of DiSSCo (General Assembly, advisory boards, and stakeholder forums) have a critical role to play over the coming years.
The following ICEDIG project deliverables have been referred to at various points in the present document via footnotes. They are tabulated here in their entirety for convenience (Table
Number |
Title |
Date |
DOI |
D1.2 |
Final report |
April 2020 |
|
D2.1 |
Inventory of criteria for prioritization of digitization of collections focussed on scientific and societal needs |
July 2018 |
|
D2.2 |
Prioritizing scientific and societal needs for data using small and private collections, Appendix 2, Appendix 5 |
October 2018 |
|
D2.3 |
March 2019 |
||
D2.3 |
March 2019 |
||
D3.1 |
April 2019 |
||
D3.2 |
State of the art and perspectives on mass imaging of microscopic and other slides |
April 2019 |
|
D3.3 |
State of the art and perspectives on mass imaging of skins and other vertebrate material |
May 2019 |
|
D3.4 |
State of the art and perspectives on mass imaging of liquid samples |
June 2019 |
|
D3.5 |
State of the art and perspectives on mass imaging of pinned insects |
July 2019 |
|
D3.6 |
Best practice guidelines for bulk imaging of herbarium specimens |
August 2019 |
|
D3.7 |
Rapid 3D capture methods in biological collections and related fields |
September 2019 |
|
D3.8 |
R&D in robotics with potential to automating handling of biological collections |
January 2020 |
|
D4.1 |
January 2019 |
||
D4.2 |
January 2019 |
||
D4.3 |
July 2019 |
||
D4.4 |
Interoperability with institutional collection management systems |
April 2019 |
|
D4.5 |
Cost analysis of transcription methods |
December 2019 |
|
D5.1 |
Recommendations for volunteer transcription systems and a source repository |
April 2019 |
|
D5.2 |
Improving the detection of collection-based citizen science projects |
May 2019 |
|
D5.3 |
June 2019 |
||
D5.4 |
January 2020 |
||
D6.1 |
March 2018 |
||
D6.2 |
February 2019 |
||
D6.3 |
July 2019 |
||
D6.4 |
Digitization infrastructure design for national open science clouds |
August 2019 |
|
D6.5 |
September 2019 |
||
D6.6 |
Provisional Data Management Plan for the DiSSCo infrastructure |
October 2019 |
|
D7.1 |
July 2019 |
||
D7.2 |
Common digital research agenda |
February 2020 |
|
D8.1 |
Conceptual design blueprint for the digitization infrastructure of DiSSCo |
March 2020 |
DOI to be assigned on publication by Pensoft |
D8.2 |
Cost Book of the digitization infrastructure of DiSSCo |
March 2020 |
DOI to be assigned on publication by Pensoft |
D9.1 |
March 2018 |
||
D9.2 |
Linking cultural heritage of natural sciences and humanities |
February 2020 |
|
D9.3 |
January 2020 |
||
D9.4 |
Positioning DiSSCo among other research infrastructures |
January 2020 |
Terms and abbreviations used in the present document have the meanings given below.
Annotations: Assertions made on or about the Physical or Digital Specimen, such as determination of the species and comments. One of the main data types managed by DiSSCo.
Authoritative data: Data that is authoritative about a specimen or collection and under the control of a curator. See also definitions of data and supplementary data. One of the main data types managed by DiSSCo.
Collection Digitization Dashboard (CDD): A system that collects and presents reliable, complete and up-to-date information on the taxonomic and geographic scope of collections as well as the degree and level of digitization already achieved and remaining.
Collection Management System (CMS): A system (typically a database) for recording and organising information about the objects in a museum or other collection.
CMS-as-a-Service (CMSaaS): A CMS provided as a service by an infrastructure provider/operator on behalf of and for the benefit of its community of users i.e., not provided and managed by the collection-holding institution itself for itself.
Content data: An alternative term for data when it is necessary to differentiate from other kinds of data (such as metadata). See also the definition of data. One of the main data types managed by DiSSCo.
Data: Data relating directly to describing collections and physical specimens, such as (in the latter case) images of those specimens, information from specimen labels (such as scientific name, location where collected, date collected, collector name, etc.), or measurements and other analyses of specimens. One of the main data types managed by DiSSCo. The term 'content data' is sometimes also used to refer to this kind of data; for example, to avoid confusion with metadata.
Note: An essential characteristic of this data is that it is authoritative about a specimen or collection. That is, the information that this data represents has been determined by the scientists and curators at the owning institution and it is they alone that hold authority to make changes to it as knowledge and understanding about specimens and collections evolves. When clarity is need, the term 'authoritative data' is sometimes used. Contrast with the definition of supplementary data.
Digital collection: A digital representation (surrogate) corresponding to a collection of identifiable natural science specimens. Cf. digital specimen
Digital collection object type (DCO): A collection of property definitions about a digital collection, the structure of which conforms to the requirements of the openDS specification (see 4.3.3).
Digital object (DO): A bit sequence with a persistent identifier (pid) and a type.
Note: This definition is provided by the DONA Foundation (https://www.dona.net/) as custodian of the Digital Object Architecture. The long definition is: "A sequence of bits, or a set of sequences of bits, incorporating a work or portion of a work or other information in which a party has rights or interests, or in which there is value, each of the sequences being structured in a way that is interpretable by one or more of the computational facilities, and having as essential elements an associated unique persistent identifier (pid) and a type." [
For all practical purposes, the concept of a digital object is the same as the notions of a digital object defined by the Society of American Archivists and the Research Data Alliance, and the same as the notion of digital entity defined in ITU-T Recommendation X.1255. A specific characteristic of digital objects is that they can possess methods that can be invoked upon their contents.
Digital Object Architecture (DOA): A logical extension of the Internet architecture supporting digital information management more generally than just conveying units of information from one place in the Internet to another. [
Digital Object Interface Protocol (DOIP): One of two standard communication protocols (the other being the Identifier Resolution Protocol) supporting the Digital Object Architecture, that specifies a standard way for software clients (applications and services) to interact with digital objects. [
Digital specimen: A digital representation (surrogate) corresponding to an identifiable physical specimen in a natural science collection. Cf. digital collection.
Digital specimen object type (DSO): A collection of property definitions about a digital specimen, the structure of which conforms to the requirements of the openDS specification (see 4.3.3).
Digitization: The process of converting analog information about physical specimens to digital format, which includes electronic text, images and other representations. See also mass digitization.
Digitization line/factory: The facilities (premises, personnel, equipment (hardware, software)), processes and procedures necessary for large-scale, mass digitization.
Digitization-on-demand (DoD) (also known as demand-driven digitization): An activity where selected specimens are digitized, often based on specific requests. Cf. mass digitization.
DiSSCo: See Distributed System of Scientific Collections.
DiSSCo Centre of Excellence (DCE): A designated DiSSCo Facility specialised in one or more of researching, innovating, developing and operating/performing techniques and/or processes of digitization or other related facets, and disseminating information on same.
DiSSCo Facility(ies): The geographically distributed collection-holding organisation(s) (i.e., natural science/history collection(s)) and related third-party organisations (such as DiSSCo Centre of Excellence) that deliver data and expertise to the DiSSCo Hub infrastructure, and which can be accessed by users via the DiSSCo Hub infrastructure. Cf. definition of DiSSCo Hub.
DiSSCo Hub: The infrastructure of integrating services, information technology components (hardware and software), human resources, organisational activities, governance, financial and legal arrangements that collectively have the effect of unifying natural science collections through a holistic approach towards digitization of and access to the data bound up in those collections. Cf. definition of DiSSCo Facility(ies).
Distributed System of Scientific Collections (DiSSCo): A pan-European Research Infrastructure mobilising, unifying and delivering bio and geo-diversity digital information to scientific communities.
European Collection Objects Index (ECOI): A searchable, electronic index of catalogued objects (typically, specimens and collections) held by and discoverable (findable, accessible) in natural science collections of the DiSSCo collection-holding partners.
European Curation and Annotation System (ECAS): A system enabling direct contributions to the curation and improvement of natural science data, with advanced annotation services that including recording of annotation history and management of object annotations.
European Loans and Visits System (ELViS): A unified pan-European system for managing loans and visits access to any collection for any authorised user under a consistent access policy (for restrictions, responsibilities, reporting, etc.).
FAIR: A set of four foundational principles (Findability, Accessibility, Interoperability, and Reusability) serving to guide data producers and publishers towards good data management and stewardship. See [
Handle System: An implementation of the Identifier and Resolution component of the Digital Object Architecture (DOA).
Mass digitization: An activity where entire collections, or their distinct major parts are digitized from one end to the other, without selecting individual specimens. Mass digitization is characterised by technological and procedural frameworks based on automation (hardware and software) and enrichment (link-building), with workflows at industrial scale, i.e., processing millions of objects at low cost. Cf. digitization-on-demand.
Metadata: Metadata is additional data that establishes a context for the content data to which it relates i.e., it is data about data. One of the main data types managed by DiSSCo.
MIDS (Minimum Information about a Digital Specimen): The minimum information standard for digital specimens specifies the mandatory and optional information elements that must be present in a digital specimen at various levels of digitization.
MICS (Minimum Information about a Digital Collection): The minimum information standard for digital collections specifies the mandatory and optional information elements that must be present in a digital collection at various levels of digitization.
Natural Science Identifier (NSId): A kind of persistent identifier for uniquely and universally identifying digitized natural science specimens (i.e., Digital Specimens) and other associated object types.
Persistent identifier (PID): A persistent identifier is a string (functioning as a symbol) that identifies a digital object. The identifier can be persistently resolved (digitally actionable) to meaningful metadata state information about the identified digital object. In the case of DiSSCo, Natural Science Identifiers (NSId) are the principal persistent identifiers used.
Provenance data: Data providing a traceable record about other data (e.g., content data, metadata), its origins and the processing actions applied to that. One of the main data types managed by DiSSCo.
Supplementary data: Other content data about a specimen, additional to the authoritative data that contributes to an overall understanding of the specimen. Supplementary data can be generated by specimen owners and/or by third parties and can include biodiversity literature, DNA sequence data, chemical composition data, acoustic recordings, and other information relating to specific specimens and collections. One of the main data types managed by DiSSCo.
The ICEDIG project consortium (see Acknowledgements section) makes 104 recommendations to the DiSSCo Coordination and Support Office, the DiSSCo General Assembly and to projects of the DiSSCo programme. For convenience these are presented as a summary list in this appendix. For each recommendation, a linked section number indicates where the recommendation is made and discussed in the main body of text.
Recommendation 1: It is important to set a clear international digital research agenda that can serve as a guideline within DiSSCo to determine what to prioritize in terms of digitization of collections and specimens in more detail ... §3.1
Recommendation 2: DiSSCo should promote that a global initiative, such as the Alliance for Biodiversity Knowledge or GBIF adopt and sustain the RI Database, ensuring that it is open, accessible and usable as a tool for identifying potential synergies and opportunities for collaboration with other research infrastructures ... §3.1
Recommendation 3: Actions on common policy elements across DiSSCo institutions must be taken in the context of a common framework of policy definition and implementation that recognises the organisation of policies and responsibility for implementation at local level, and authorization for change within collection-holding institutions ... §3.2
Recommendation 4: DiSSCo should exploit the diversity of available citizen science platforms (e.g., for specimen label transcription), taking advantage of their individual strengths (surrounding community interest, language specificity, etc.) as appropriate and should encourage such platforms to implement the data exchange format and protocol for transcription platforms (doi: 10.5281/zenodo.2598413), as well as supporting this format/protocol in digitization lines, workflows and collection management systems ... §3.3
Recommendation 5: The involvement of citizen scientists in DiSSCo data work and activities must be properly acknowledged and attributed, for example using Research Data Alliance recommendations for the representation of attribution metadata (doi: 10.15497/RDA00029) ... §3.3
Recommendation 6: Recognising the likely future increase in citizen science involvement with natural science collections, DiSSCo should further develop a package of business model principles and guidance that collection-holding institutions can use to design and manage citizen science engagements and activities to their collections ... §3.3
Recommendation 7: DiSSCo should establish criteria and procedures for assessment and due diligence of activities, services and components of relevance to any potential alliance that keep in mind the common digital research agenda of DiSSCo ... §3.4.5
Recommendation 8: DiSSCo should continue dialogue with representatives of complementary research infrastructures to ensure convergence towards a common approach, especially in the context of the European Open Science Cloud ... §3.4.5
Recommendation 9: Working proactively with international partners DiSSCo should aim to avoid divergence in technical approach to the support of global collections-based science ... §3.4.5
Recommendation 10: DiSSCo should further clarify, develop and nurture the joint ventures (strategic alliances) that will be important to its plans and operations, including with CETAF, GBIF, iDigBio (or equivalent), EOSC, etc ... §3.4.6
Recommendation 11: DiSSCo should exploit generic services for data storage and computation where possible and procure from service providers having the ambitions and aims of open science at the heart of their mission ... §3.7.1
Recommendation 12: DiSSCo must adopt the FAIR Digital Object Framework (FDOF) and its realisation through Digital Specimen, Digital Collection and other relevant digital object types as the basis for complying with the FAIR Guiding Principles for natural sciences data management, and as the means of delivering FAIR compliant natural sciences data into the European Open Science Cloud (EOSC) ... §3.7.2
Recommendation 13: DiSSCo should propose the legal form required to achieve its aims and objectives and to administer and support its operations, keeping in mind the need for long-term viability and stability, the need to be able to enter into legal agreements with third-parties, and the need to assume responsibility for and mitigate risks and liabilities ... §3.8.1.3
Recommendation 14: For each broad policy area affecting DiSSCo activities and directly covered by mandatory legal and regulatory considerations for data management, DiSSCo must list the legislation and regulations that apply at national and/or European level and say how DiSSCo and its member institutions will comply with each of the mandatory requirements (for example, by indicating specific clauses in the DiSSCo Data Management Plan). The broad policy areas are: i) Access and Benefit Sharing (ABS); ii) Data and digital media publication; iii) FAIR / Open Data / Open Access; iv) Freedom of information (FOI); v) Intellectual Property Rights (IPR); vi) Data Standards; vii) Personal data; viii) Protection of sensitive collections data; ix) Public Sector Information; x) Responsible Research and Innovation (RRI); xi) Cloud services and storage; xii) Information risk management; xiii) Information security; xiv) Collections access and information; xv) Collections care, development and scope; and xvi) Digitization strategy and prioritisation ... §3.8.2
Recommendation 15: For each of the broad policy areas mentioned in recommendation 14 affecting DiSSCo activities and affected by legislation indirectly, DiSSCo should state what practices it will adopt to make compliance easier to achieve and police ... §3.8.2
Recommendation 16: DiSSCo must give specific attention (perhaps by implementing a ‘compliance and moderation service’) to the rules governing the movement of sensitive data across international borders i.e., between European Union Member States and third countries (including defining specifically what is meant by ‘sensitive data’ in the context of the legislation affecting DiSSCo operations) ... §3.8.2
Recommendation 17: DiSSCo strategy for mass digitization must focus on clearing the historical backlog of undigitized specimens in the next 20 – 30 years, whilst recognising that newly collected accessions, and small and private collections also each require their own organisation (workflow) of digitization to prevent new backlogs from forming. The specific collection type also dictates the appropriate technical approach and although herbarium sheet and pinned insect digitization is well-developed or advancing, greater emphasis must be placed on other collection types, including non-biological ones ... §3.9.2
Recommendation 18: DiSSCo should plan to achieve an average digitization cost of €0.5 or less per specimen, across major collections types to which mass workflows can be applied ... §3.9.3
Recommendation 19: DiSSCo should develop a decision support tool to assist institutions to decide on the optimal strategy for digitization of their collections in-house, outsourced, or mixed approach ... §3.9.4
Recommendation 20: DiSSCo should launch early calls for consortia to establish specialised Centres for Excellence on mass-digitization, in readiness for entering the operational phase of DiSSCo ... §3.9.4
Recommendation 21: DiSSCo should promote re-use and/or cost-sharing of digitization equipment across institutions and projects where possible, particularly for smaller collections ... §3.9.4
Recommendation 22: DiSSCo should design a portfolio of services and support fitting to several organisational levels that supports the ambition to organise and consolidate a distributed system of scientific collections across Europe ... §3.9.4
Recommendation 23: DiSSCo should determine the required number, locations and specialisations of digitization (or related) facilities across Europe, including Centres of Excellence where appropriate ... §3.9.4
Recommendation 24: DiSSCo should consider building an experimental facility (DiSSCo Centre of Excellence) for an ‘out-of-town’ fully automated, industrial scale specimen storage and digitization facility ... §3.9.4
Recommendation 25: For each kind of digitization and collection type, DiSSCo should offer structured leadership in digitization approaches, proposing best practice approaches to its institutional members, and helping them to achieve the best digitization results in terms of quality (best, according to agreed specifications), time (highest throughput, fast), and cost (lowest, minimal per specimen) for each specific kind of digitization activity ... §3.9.5
Recommendation 26: Based principally on scientific relevance but also considering collection, economic and societal relevance and feasibility / cost-effectiveness, DiSSCo must establish a framework of prioritisation criteria and a set of tools and procedures for making and objectively justifying consistent digitization prioritisation decisions ... §3.10.2
Recommendation 27: DiSSCo should design and promote Digitization-on-Demand services and workflows appropriate to different collection and specimen categories that can be adopted by collection-holding institutions to become part of their normal business of digitization, including for accession of newly collected materials ... §3.10.3
Recommendation 28: DiSSCo institutions should consider quickly creating MIDS-1 level inventories of their entire holdings to facilitate access to specimens and planning of more detailed digitization activities, and to create a comprehensive cohort of Digital Specimen data ... §3.10.4
Recommendation 29: DiSSCo’s Collection Digitization Dashboard (CDD) service must be compatible with and implement TDWG Collection Description standard(s) when this becomes available ... §3.10.5
Recommendation 30: The service underlying the Collection Digitization Dashboard should automate as much as possible the collection, transformation/collation and presentation of collection-level information from collection-holding institutions. For an interim period, manual data entry may be necessary to ensure early public availability of collection-level information (i.e., while work to complete automation of data collection is in progress) ... §3.10.5
Recommendation 31: DiSSCo should provide guidelines for how private actors can digitize their collections and share data via the ECOI service and should ensure that the European Collection Objects Index (ECOI) service offers catalogues of private collections ... §3.11
Recommendation 32: DiSSCo should develop a package of support measures (communication of benefits, education/training in digitization, digitization tools and facilities, access to data sharing platform, use of volunteers, etc.) targeted towards private collection owners in line with digitization prioritisation decisions, to increase digitization of these kinds of collections ... §3.11
Recommendation 33: As part of its remit to publish minimum information about available collections, DiSSCo should maintain an online inventory (e.g., a website) of available private collections and their characteristics, with an associated protocol for keeping this up to date ... §3.11
Recommendation 34: Adequate investment (c. €1.5m per annum) and time (4 years) for new software development, testing, deployment and maintenance must be made if the innovative functionalities and services foreseen by the DiSSCo vision are to be realised ... §3.12.1.2
Recommendation 35: Design and development of core software components needed by DiSSCo should begin no later than early 2021 to allow modest, soft-start to DiSSCo operations in 2024 ... §3.12.1.4
Recommendation 36: DiSSCo should develop common design specifications, especially for ‘look and feel’ of interactive user interfaces, that software components and services should adhere to ... §3.12.1.5
Recommendation 37: DiSSCo must establish, train and equip a motivated and cohesive engineering development and operations (DevOps) team where team members have the freedom to express their individuality and competence as professionals to support and interact with users and develop new software, contributing autonomously and responsibly to collective endeavours to meet present and future needs ... §3.12.2.1
Recommendation 38: Nine characteristics (centrality of the digital specimen, accuracy and authenticity of the digital specimen, FAIRness, protection of data, preserving readability and retrievability, traceability (provenance) of specimens, annotation history, determinability (status and trends) of digitization and securability) must be protected throughout the lifetime of the DiSSCo research infrastructure ... §4.2.1.3
Recommendation 39: All design decisions (technical, procedural, organisational, etc.) must be assessed for their effect on the protected characteristics. Such decisions and changes must not destroy or lessen the protected characteristics ... §4.2.1.3
Recommendation 40: Within ten years the institution-centric collection curation model should evolve to support complementary digital curation by appropriately authorised community-experts ... §4.2.1
Recommendation 41: DiSSCo Prepare should follow an aggressive ICT implementation strategy and construction plan based on two key lines of activity that include i) DiSSCo Hub implementation indexing Digital Specimen and other object types, and offering added value services such as ECOI, ELViS and ECAS; and ii) data coupling of DiSSCo Facilities to populate the systems and services with relevant data ... §4.2.1
Recommendation 42: DiSSCo Prepare should specify the Application Programming Interface(s), API needed to allow third-party software applications to be built on top of the DiSSCo Hub (core) infrastructure ... §4.2.2
Recommendation 43: DiSSCo should commence further development and hardening of the specimen data harvesting and transformation process as the means of creating Digital Specimens and populating DiSSCo data infrastructure ... §4.2.3
Recommendation 44: DiSSCo should ensure and provision the further work needed (during the DiSSCo Prepare Preparatory Phase project) to bring necessary innovations needed for data management infrastructure to the required level of technical, organisational and financial readiness. These innovations include: i) NSId PID scheme, ii) openDS standard, iii) MIDS/MICS minimum information standards, iv) FAIR Digital Object Framework ... §4.3.1
Recommendation 45: Open access policies of DiSSCo and its collection-holding institutions should include that Digital Specimen objects must be findable and accessible, even at the lowest level of available information (MIDS-0 level). Collection level information should be findable and accessible, even at the level of an overview (MICS-1 level) ... §4.4.1
Recommendation 46: Open access policies of DiSSCo and its collection-holding institutions should include that: i) image data and its immediate metadata should be deposited in a trusted public repository of the institution’s own choice, and ii) that other (non-image) data should be deposited in the European Collection Objects Index (ECOI) ... §4.4.2
Recommendation 47: DiSSCo should develop guidelines for its member institutions for determining whether and when it is appropriate to deposit specimen data, such as images of specimens in long-term public repositories such as EUDAT, Zenodo, Wikidata and others, having regard both for the purpose of such depositions and for the stability of the metadata describing the content of the deposition (i.e., what, where, when, who) ... §4.4.2
Recommendation 48: Open access policies of DiSSCo and its collection-holding institutions should include that as far as possible third-parties must be able to access, mine, exploit, reproduce and disseminate data by using a copyright waiver such as CC0 or an open access licence such as CC-BY ... §4.4.3
Recommendation 49: Open access policies of DiSSCo and its collection-holding institutions should include that access be “as open as possible, as closed as (legally) necessary” ... §4.4.3
Recommendation 50: Exceptions to the ‘as open as possible’ data access policy of DiSSCo and its collection-holding institutions must be justified based on objective criteria, stated clearly and strictly limited to reasons of national security, legal or regulatory compliance, sensitivity of collection information, and third-party rights ... §4.4.3
Recommendation 51: DiSSCo should adopt a lightweight ICT service management framework for the holistic delivery of its service portfolio ... §4.5
Recommendation 52: For mass digitization of microscope slides, the recommended approach in the first instance is to digitize on the lowest MIDS level (0,1) capturing an image of the whole slide including its labels. These images can be used later for extended data entry (MIDS 2,3), conservation assessment and subsequent research-grade imaging ... §4.6.1.1
Recommendation 53: For mass digitization of vertebrate and other dry three-dimensional collection objects, the recommended approach in the first instance is to digitize on the lowest MIDS level (0-1) combined with label imaging. These images can be used later for extended data entry (MIDS 2-3) ... §4.6.1.2
Recommendation 54: Setting up imaging protocols and stations for digitizing any type of collection should involve a professional (or experienced) photographer to establish the proper lighting and camera settings for the collection ... §4.6.1.2
Recommendation 55: For mass digitization of vertebrate and other dry three-dimensional collection objects, the recommended approach in the first instance is to digitize on the lowest MIDS level (0-1) combined with label imaging. These images can be used later for extended data entry (MIDS 2-3) ... §4.6.1.3
Recommendation 56: Further explore combinations of new technologies such as robotics, 3D modelling, machine learning, etc., in novel ways to achieve imaging (including of the labels) of 5,000 insect specimens in 24 hours by one workstation and operator ... §4.6.1.4
Recommendation 57: DiSSCo should prepare and promote community-built guidelines and checklists for future digitization projects, to assist the detailed preparation, costing and management of such projects ... §4.6.1.5
Recommendation 58: DiSSCo should investigate standardization of the interfaces of the components of mass digitization/imaging lines and their downstream data stores and processing elements (including quality control) and encourage open-source tools development in the area ... §4.6.1.5
Recommendation 59: DiSSCo should work with collection-holding institutions to improve the standardization of transcribed data in line with the emerging Minimum Information standards for Digital Specimens and Digital Collections (MIDS/MICS) and should seek to introduce community-agreed quality control and assurance plan and procedures for transcription ... §4.6.2.2
Recommendation 60: Digitization of field notebooks should be a priority and should precede digitization of any related specimens ... §4.6.2.3
Recommendation 61: DiSSCo should consider developing and offering a digital field notebook service ... §4.6.2.3
Recommendation 62: Adopting new techniques from natural language processing and geospatial information analysis, DiSSCo should investigate the improvement of georeferencing techniques for identifying more precise locations from natural language descriptions of locations that appear on specimen labels ... §4.6.2.4
Recommendation 63: AI-assisted image segmentation should be further developed by DiSSCo for routine use as a step in the digitization and transcription process ... §4.6.3
Recommendation 64: DiSSCo should consider an agreement (operational or strategic) with, for example Google for AI-assisted text recognition of specimen labels, including handwritten labels at industrial scale ... §4.6.3
Recommendation 65: As part of the DiSSCo programme portfolio, DiSSCo should sponsor a research and innovation project investigating novel approaches to label segmentation, transcription (i.e., text OCR and digitization) and interpretation (i.e., named entity recognition, georeferencing, people referencing, etc.) and synthesis and deployment of robust production pipelines (workflows) to improve efficiency, quality and utility of transcription data ... §4.6.3
Recommendation 66: DiSSCo should identify and further study the scientific needs that demand the use of 3D imaged specimens, as well as the scientific opportunities opened up by the availability of appropriately imaged 3D specimens to inform future planning for introducing 3D imaging on a more widespread basis ... §4.6.4
Recommendation 67: DiSSCo should establish a common quality policy and standard for digitization, including adopting a prevention approach to digitization process and workflow design and appropriate training of personnel ... §4.6.5.2
Recommendation 68: DiSSCo should adopt a conceptual framework for quality assessment and improvement of natural sciences data and should select and implement appropriate data quality tests and assertions. §4.6.5.3
Recommendation 69: DiSSCo should harmonise and establish quality requirements for image characteristics for common image classes expected from digitization and should aim to prevent quality defects arising in digitization processes through the use of automation, computer-vision and statistical process control techniques ... §4.6.5.4
Recommendation 70: To work towards a future automated end-to-end digitization solution, development should focus on independent components (including storage and retrieval, transport, object picking, and imaging) which can be connected in the future ... §4.6.6
Recommendation 71: If there is a real desire to use automation solutions in natural science collections for warehousing and/or digitization, a series of pilot projects should be established, which companies can participate in and in which collection holders collaborate ... §4.6.6
Recommendation 72: DiSSCo should play a role in the development of expertise in automation, in communication with companies, and aligning efforts in automation of natural science collections ... §4.6.6
Recommendation 73: DiSSCo should develop harmonised policies, procedures and best practices covering the different kinds of storage solution available for a wide range of anticipated data storage needs by collection-holding institutions ... §4.7.1
Recommendation 74: DiSSCo should encourage and provide opportunities in various forms (newsletters, fora, blogs, networks, conferences, etc.) for sharing expertise and knowledge among its digitization professionals ... §5.1.3.1
Recommendation 75: DiSSCo must plan and implement a comprehensive training programme covering all aspects of modern digitization and data mobilization ... §5.1.3.2
Recommendation 76: DiSSCo must offer a platform for sharing documentation and best practice guidelines of workflows from the digitization projects of its Facilities, so that new projects can start faster and learn from each other, with the appropriate citation ... §5.1.3.3
Recommendation 77: Stop using one-dimensional barcodes and move over to using a standard two-dimensional matrix code, with standardised data content, that can be read automatically from digital label and other images ... §5.1.3.3
Recommendation 78: DiSSCo should assist its collection-holding institution members to develop and strengthen their external profile (marketing) with funding agencies, professional and citizen scientist groups and local communities appropriate to their location and sphere of collections related operations (i.e., research, education and exhibition) ... §5.1.4
Recommendation 79: DiSSCo should organise a training curriculum for its member institutions covering: i) technological aspects, such as features and operation of equipment and software; ii) standards, i.e., museum and archival practices including data standards, in particular unique and persistent identifiers; iii) efficient digitization workflows in various situations, including quality management; and iv) for museum leadership ... §5.1.5
Recommendation 80: DiSSCo institutions should look for opportunities to use EU structural and investment funds to build up the digitization capacities in eligible countries and regions. DiSSCo should centrally support this activity with application packages and support for proposal writing targeted specifically for these funding sources which are not research oriented but aim for economic and social development ... §5.1.6
Recommendation 81: DiSSCo should investigate and promote best practices for operating models within collection-holding institutions whereby digitization becomes business as usual and digital by default ... §5.1.7
Recommendation 82: DiSSCo should deploy metrics (key performance indicators) to monitor impact and progress in collaboration and research facilitated by digitization and should publish the results annually ... §5.2.1
Recommendation 83: DiSSCo should deploy metrics (key performance indicators) that show how digitization is spreading across collections and to monitor the changes in mobility and usage of collections. The impact of digitization should be assessed on a regular basis ... §5.2.2
Recommendation 84: As well as providing access to Digital Specimens for research purposes, DiSSCo should consider the additional and different aspects that can pertain to providing access for educational purposes ... §5.2.3
Recommendation 85: DiSSCo should actively promote the role of private natural science collections as a form of citizen science ... §5.2.3
Recommendation 86: In working to promote private natural science collections as a form of citizen science, DiSSCo should take the lead to ensure that the metadata definitions needed to make private collections more publicly visible becoming incorporated into appropriate citizen science metadata standards, such as PPSR-Core ... §5.2.3
Recommendation 87: DiSSCo to develop a strategy for aligning and unifying the work practices across its facilities ... §5.3.1
Recommendation 88: DiSSCo should prepare a minimum specification of an ideal collection management system (CMS) and select/recommend preferred alternatives from the available product solutions to meet member institutions’ various needs ... §5.3.2
Recommendation 89: DiSSCo should adopt the “triple-eye eff” (iiif.io) image interoperability framework as its basis for media management and interoperability ... §5.3.2
Recommendation 90: DiSSCo should encourage the use of unique persistent identifiers for people collecting and working in collections ... §5.3.2.4
Recommendation 91: DiSSCo should encourage the use of persistently identified geographic name descriptions from recognised sources... §5.3.2.5
Recommendation 92: DiSSCo should encourage the further adoption of CETAF Stable Identifiers for the local and persistent identification of physical specimens ... §5.3.2.7
Recommendation 93: DiSSCo should identify benefits from and opportunities for third-party, value-added services arising through adoption of a Handle-based persistent identifier scheme for Digital Specimens, presently proposed as Natural Science Identifiers (NSId) ... §5.3.2.7
Recommendation 94: DiSSCo must evaluate, adopt and support modern alternative(s) to traditional spreadsheet approaches for gathering, collating, and analysing cost information and for budgeting and management of DiSSCo costs ... §5.3.3.1
Recommendation 95: In cost gathering, analysis and reporting, DiSSCo should convert, at the time of data entry from the currency of data entry to a standard currency for analysis and comparison purposes ... §5.3.3.1
Recommendation 96: The DiSSCo business model must take depreciation of capital equipment (tangibl assets) and amortization of intangible assets (i.e., DiSSCo data) into account, such that these costs can be accounted for and recovered over the long-term ... §5.3.3.2
Recommendation 97: DiSSCo and institutional leadership must plan for capacity enhancement in i) training of digitization and allied personnel; ii) funding to hire digitization personnel on long-term and for the effort in general; iii) increase in dedicated digitization staff; and iv) investments in modern efficient equipment, software, CMS and data solutions ... §5.4
Recommendation 98: DiSSCo should create focus on harmonised tools and frameworks to help institutions and individuals understand and develop their skills capabilities, needs and professionals (such as: digitization maturity assessment model, competency matrices and skills profiles, career paths) and should make the case to address these with each collection-holding institution ... §5.4
Recommendation 99: In alliance with an appropriate training provider, DiSSCo should develop and promote executive/senior level training in collection-holding institutions, with a specific focus on collection leadership, mobilisation and use in the digital information age ... §5.5
Recommendation 100: DiSSCo institutions should form consortia that consolidate activities by launching national or regional centres for large scale digitization and offer out-sourced digitization services, training and other capacity building for in-house digitization at the institutions. These could be funded through European structural and investment funds ... §5.5
Recommendation 101: DiSSCo centrally should launch thematic DiSSCo Centres of Excellence that support regional and national centres with technological innovations needed to ramp up the speed of digitization to the required levels of output. Such Centres may or may not be connected to digitization consortia/factories ... §5.5
Recommendation 102: To communicate and demonstrate the value of participating in DiSSCo, the DiSSCo Coordination and Support Office should initiate: i) further awareness raising and training, and ii) development and promotion of pilot applications and exemplars. Both activities must contribute to showing how DiSSCo can support the research goals of individuals and how DiSSCo can actively support and enhance the work of specific stakeholder groups ... §5.6
Recommendation 103: DiSSCo should investigate what is needed to improve bankability (likelihood of financial success) against the range of financial/investment instruments (e.g., European structural and investment funds, European Investment Bank programmes) available to complement national government funding ... §6.4.5
Recommendation 104: DiSSCo should investigate shared liability models for more effective financial planning and how these may lead to alternative business models for DiSSCo ... §6.4.5
End of summary list of recommendations.
This appendix contains a facsimile of the text of version 1.02, November 2019 of the FAIR Digital Object Framework (FDOF) Technical Implementation Guideline (TIG). Full details and the latest version of the TIG can be obtained at http://bit.ly/FAIRDO.
"We need a set of principles that are sufficiently specific to be useful but sufficiently abstract to exclude specific software stacks, i.e., a document that will still make sense and still be useful ten years from now."
This document includes some generic guidelines to be met (section A.3), a normative part defining the FAIR Digital Object Framework (FDOF) at an abstract level which will develop over time (section A.4) and a glossary of terms (section A.5). Related documents such as implementation examples can be found at the Github site*
See Table
Version |
Date |
Intention |
Actors |
Version 1.0 |
October 2019 |
prepared for the consensus meetings in Washington and Paris in October 2019 |
created by Luiz Bonino and Peter Wittenburg |
Version 1.01 |
17.11. 2019 |
created after the consensus meeting in Paris at 28/29.10.2019 |
changes by Luiz Bonino and Peter Wittenburg |
Version 1.02 |
22.11. 2019 |
created after various comments incl. polishing and adding clarity |
changes by Peter Wittenburg, Bonnie Carroll, Alex Hardisty, Mark Leggott, Carlo Zwölf |
Changes from V1.0 to V1.01
Changes from V1.01 to V1.02
Throughout section 3, introduced the term ‘PID record’ as the correct term for the structured record returned by PID resoluti
Some overall guidelines need to be met by the FAIR DO Framework (FDOF).
G1: Show a path for infrastructure investments for many decades.
G2: Demonstrate trustworthiness to researchers and developers to become engaged.
G3: Offer compliance with the FAIR principles through measurable indicators of FAIRness.
G4: Support machine actionability, which includes referential integrity and explicitness of semantic relationships.
G5: Support the abstraction principle, i.e., abstract away from the details that are not needed at a specific level. At the object management level there is no difference to be made between data, metadata, software, semantic assertions, etc.
G6: Support stable binding between all informational entities that are required for machines to act.
G7: Support encapsulation, which means that specific operations can be associated with different types of FDOs.
G8: Support technology independence, allowing implementations using different technologies
G9: Comply with agreed standards (e.g., for exchange of FDOs between systems, for interacting with FDOs, etc.) so that machine-machine interoperability can be achieved across heterogeneous systems.
The requirements for FDOF describe rules that must be met by all implementation of the FDO framework.
Note: Requirements will evolve, dependent on insights obtained from implementation experience.
FDOF1: A PID, standing for a globally unique, persistent and resolvable identifier, is assumed to be the basis of the Internet of FAIR Data and Services.
FDOF2: A PID resolves to a structured record (PID record) with attributes that are semantically defined within a type ontology (which can have different forms).
FDOF3: The structured PID record includes at least a reference to the location(s) where the bit-sequences encoding the content of a FAIR-DO (FDO) and the type definition of the FDO can be accessed. The structured record may also contain a PID pointing to a metadata DO (itself an FDO) describing properties of the target FDO.
FDOF4: The PID record may include other attributes that are important to characterize specific types of FDO or that are required by applications. Additional attributes being used in PID records must be registered in a type registry.
FDOF5: Each FDO identified by a PID can be accessed or operated on using an interface protocol by specifying the PID of a registered operation and the PID of the access point.
FDOF6: This protocol offers standard Create, Read, Update, Delete (CRUD) operations on FDOs and a possibility to use extended/domain operations for specific applications.
FDOF7: The relations between FDO Types and operations are maintained in a type ontology.
FDOF8: Metadata descriptions being themselves FDOs and describing the properties of the FDO must be made available as semantic assertions, enabling machines to act.
FDOF9: Metadata assertions can be of different types such as descriptive, deep scientific, provenance, system, access permissions, transactions, etc.
FDOF10: Metadata schemas are maintained by communities of practice. FDOF requires that such metadata are FAIR.
FDOF11: A collection of FDOs is also an FDO and semantic assertions must be used to describe their construction, i.e., the relationships of their constituents.
FDOF12: Deletion of a FDO must lead to standardised and thus machine interpretable tombstone notes in metadata and PID records, i.e., PIDs, PID records and metadata should normally not be deleted, but should be modified to indicate that the FDO associated with a particular PID no longer exists.
A short glossary with explanations about crucial terms such as "repository", "encapsulation" etc. will help in clarifications, since some terms may be interpreted differently by the participants (Table
Term |
Explanation |
abstraction |
Abstraction is a conceptual process where general rules and concepts are derived from the usage and classification of specific examples. literal signifiers, first principles or other methods (Wikipedia) |
binding |
With binding we mean the possibility for humans and machines to find other relevant entities of a DO when being exposed to another, i.e., when an actor receives a PID of a DO it must find the PID of the corresponding metadata DO and the access rights information, since otherwise interpretation and access is impossible |
collection |
A collection is a complex DO consisting of other DOs, which have a PID and metadata. |
CRUD operations |
These are the usual primary type of operations such as create, read/retrieve, update and delete |
encapsulation |
Encapsulation is known from abstract data types and object oriented programming where internals of data objects are hidden to the user and where the user can only influence the internal state by using defined methods |
Note: in the FDO case DO types can be associated with registered operations that can be used to operate on DO's content |
|
FAIR Digital Object (FAIR-DO) |
FAIR Digital Objects can represent data, software, protocols or other research resources. They are accompanied by persistent identifiers (PID) and metadata rich enough to enable them to be reliably found, used and cited.(FAIR Implementation Report: doi: 10.2777/1524, and Wittenburg and Strawn 2019 doi: 10.23728/b2share.2317b12321764f669c92ebbcf7518164) |
FAIRness |
FAIRness is a characteristic exhibited by an infrastructure component when it maintains compliance with the principles of FAIR. Achievement of FAIRness is demonstrated, for example by achieving a score (passing a threshold) in an assessment against an agreed set of maturity indicators. |
machine actionability |
machine actionability means the capacity of computational systems to find, access, interoperate and reuse data and services without human intervention (GOFAIR) |
metadata |
Metadata descriptions of DOs are sets of assertions describing properties of DOs content which are required for finding, accessing, interpreting and reusing, these assertions can cover a wide range such as descriptive to support finding, deep scientific to support science, systemic to support management, rights to prevent unauthorized access, etc. |
Note: Yet the domain of metadata is not structured very well, i.e., terminology is not well-defined. |
|
Note: Basic interoperability assumptions are that the schemas are registered and the concepts defined and registered. |
|
referential integrity |
The idea that all PID references must resolve and be valid without temporal limitation |
semantic assertion |
The attachment (perhaps by reference to a defined vocabulary) of a specific meaning to a resource, attribute, property, etc. |
repository |
DO View: from the perspective of Digital Objects repositories are nothing else than a complex DO associated with a PID, metadata of different kinds and functions to offer DOs |
Common View: from the most common point of view repositories are entities that host data, metadata etc., apply trustworthy management procedures, offer a search and access interface, have a team of experts taking care and have a sustainability plan |
|
Note: repositories can be associated with research organisations, communities or projects, they can be small or big in terms of the collections they hold. |
|
type |
"Type" is an attribute of digital objects which tells computational actors how the content of the DO needs to be parsed, i.e., it defines the operations that can be done on the data, the meaning of the data, and the way values of that type can be interpreted |
Note: A MIME type is a standard that indicates the nature and format of a document, file, or assortment of bytes, i.e., it is a restricted concept of type. |
|
Note: A type of a DO implies a summary of otherwise complex metadata assertions describing the format, encoding etc. of a content. |
This appendix describes the key requirements for persistently identifying Digital Specimen and other object types.
The persistent identifier (PID) system to be adopted by DiSSCo will sit alongside other identifier schemes, such as institution specific ones and CETAF Stable Identifiers*
From these aims, specific requirements flow. Somewhat in priority order, these are:
The biodiversity science and geoscience domains have established vocabularies and metadata schemes that must be accommodated as the basis of registry metadata schemes. Kernel information profiles must support the kinds of pre-resolution services foreseen as necessary for the domain, including for example: finding all specimens related to specific collecting/sampling event, finding all sub-samples and preparations from any specific specimen, finding all specimens of a specific kind, etc.
The work reported and the recommendations made in the present document are the result of a concerted effort over the period January 2018 – February 2020 by numerous experts participating in and contributing to the ICEDIG project. This has led to many interim ‘milestone’ reports and deliverables that form the basis of the present blueprint and recommendations for DiSSCo.
This project, ICEDIG – "Innovation and consolidation for large scale digitization of natural heritage" has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 777483. The ICEDIG consortium consists of 12 partners from 7 countries:
Additionally, participation of an iDigBio expert was made possible by the Advancing Digitization of Biodiversity Collections Program DBI-1547229 (2016-2021) from US National Science Foundation (NSF).
ICEDIG – “Innovation and consolidation for large scale digitisation of natural heritage”, Grant Agreement No. 777483
Implementation readiness is measured by the Implementation Readiness Level (IRL), defined as the measure of the ability of the organisation (DiSSCo) to embark on specific implementation actions (construction project) based on clear, actionable guidelines with minimum risk and across the scientific, data, financial, technological and organisational dimensions of the infrastructure implementation. DiSSCo Prepare will act as the main vehicle through which DiSSCo RI will raise its overall maturity and set itself in a position to implement its construction programme by i) improving the overall IRL, and ii) delivering the DiSSCo Construction Masterplan. Meeting these two high-level objectives will ensure that DiSSCo embarks on its construction phase with minimal but well-understood risks, and a clear and detailed construction plan that leads to the commencement of operations of the infrastructure by 2025 (as currently projected).
As of January 2020.
Identifiable in GBIF data as dwc:basisOfRecord equals PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, etc.
A global 'biodiversity commons' is the notion of a defined community of use for some defined information space, with guaranteed free and unhindered access to data and information for that community in that information space (http://www.dlib.org/dlib/june02/moritz/06moritz.html).
A voucher specimen is a preserved, representative sample of a natural object class used for identification and as supporting evidence of information learned during the research process. It serves as a verifiable and permanent record because it preserves as much of the physical remains as possible.
ICEDIG project deliverable D6.5, doi: 10.5281/zenodo.3465285.
ICEDIG project deliverable D6.6 doi: 10.5281/zenodo.3532937.
For example, DigiVol, Les Herbonautes, Zooniverse, Notes from Nature. See ICEDIG project deliverable D5.1 (section ‘evaluation of existing volunteer transcription systems, milestone MS26’); as well as BioSPEX from the USA.
Such a protocol has been developed. See ICEDIG project deliverable D5.1 (section ‘Specification of data exchange format for transcription platforms, milestone MS28’; also, doi: 10.5281/zenodo.2598413).
Noting, of course the need to deal appropriately with the personal data of volunteer transcribers in the context of the General Data Protection Regulation (GDPR). This is covered in the DiSSCo Data Management Plan, section 12 [
See doi: 10.15497/RDA00029 for the attribution metadata recommendation itself and doi: 10.5334/dsj-2019-054 for an explanatory article.
See ICEDIG project deliverable D5.3, doi: 10.5281/zenodo.3364541 where this is explored from practical perspectives, leading to five general business model principles.
See ICEDIG project deliverable D9.3, doi: 10.5281/zenodo.3632535.
ICEDIG project deliverable D6.4, doi: 10.5281/zenodo.3469490.
ICEDIG project deliverable D6.2, doi: 10.5281/zenodo.3364533.
ICEDIG project deliverable D6.3, doi: 10.5281/zenodo.3346782.
https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/opportunities/topic-details/infraeosc-03-2020. “Integration and consolidation of the existing pan-European access mechanism to public research infrastructures and commercial services through the EOSC Portal”.
General Data Protection Regulation (GDPR), Regulation (EU) 2016/679. Latest consolidated version (as of 21st January 2020) here: http://data.europa.eu/eli/reg/2016/679/2016-05-04.
Directive 2007/2/EC, INSPIRE. Latest consolidated version (as of 21st January 2020) here: http://data.europa.eu/eli/dir/2007/2/2019-06-26.
Commission Recommendation (EU) 2018/790 of 25 April 2018 on access to and preservation of scientific information, http://data.europa.eu/eli/reco/2018/790/oj.
These different approaches are analysed (with some cost comparisons) in the ICEDIG project deliverables D3.2 – D3.6.
ICEDIG project deliverable D2.1, doi: 10.5281/zenodo.2579156 identifies more than 100 criteria collated across these four categories and investigated for their importance through a community survey during 2018.
COVID-19, Ebola, Plague, hantavirus diseases, etc.
At the time of completing the present document (March 2020), the Dutch country page on the DiSSCo website was not yet available, pending the agreement of all the national institutions in the Netherlands. In the meantime, a temporary link to the mentioned CDD is this one: https://app.powerbi.com/view?r=eyJrIjoiNDQ1YmQzMzMtNmY5YS00MDQzLWI5M2YtNmRhOTM2MTg2NTU0IiwidCI6IjhjZDI0OTg0LTBhYTMtNGZjNS1iMDliLTRkNmVjZmFhNThmYiIsImMiOjl9.
Presented in ICEDIG deliverable report D2.2.
ICEDIG project deliverable D2.2, doi: 10.5281/zenodo.2582995, based on survey results and other data provides information about the characteristics of private collections in Europe.
For readers less familiar with some of the issues covered in this section, the EC-funded ENVRIplus project deliverable D5.1 “A consistent characterisation of existing and planned RIs” is a helpful resource; here: http://www.envriplus.eu/wp-content/uploads/2016/06/A-consistent-characterisation-of-RIs.pdf.
See the ‘Berlin presentation’ here: https://www.rd-alliance.org/rda-11th-plenary-joint-meeting-ig-data-fabric-wg-research-data-collections under agenda item ‘2.3 Digital Object Principles’ by Wittenburg/Strawn.
In systems engineering terms, a platform is a bed or core of stable hardware and software components upon which a variety of functions and services can be built to serve users. Evolvability to adjust to changing needs and demands is controlled to ensure protection of vital characteristics (see 2.3.1 and 4.1.2.3).
Being machine-actionable (i.e., knowing how to process the data in specific circumstances) is more than being machine-readable (i.e., understanding the data, its meaning and context).
See, for example https://osf.io/bsyac/ and doi: 10.3897/biss.3.37647 including its accompanying Powerpoint presentation, https://biss.pensoft.net/article/37647/download/media/372189/.
Exploratory work in the ICEDIG project has developed prototype software capable of reading Darwin Core Archives and transforming their content records into Digital Specimens. This software can be found at https://github.com/DiSSCo.
ICEDIG project deliverable D6.5, doi: 10.5281/zenodo.3465285.
See section 5.2.3 of the provisional data management plan for the DiSSCo infrastructure, [
Inherent security of Digital Specimens has been identified as a fundamental topic of the FAIR Digital Object Framework that will be addressed during 2020.
ICEDIG project deliverable D3.3, doi: https://doi.org/10.5281/zenodo.3364385.
ICEDIG project deliverable D3.5, doi: 10.5281/zenodo.3520667 describes state-of-the-art, potential technologies, and three experiments the ICEDIG project carried out to find new innovations in mass digitization of pinned insects.
ICEDIG project deliverable 3.6, doi: 10.5281/zenodo.3524263 examines the process of herbarium imaging and the preparatory steps needed. Some of the issues are common to any digitization project, but each step is important.
Described in ICEDIG project deliverable 3.6, doi: 10.5281/zenodo.3524263.
ICEDIG project deliverable D4.2, doi: 10.5281/zenodo.3364509.
For example, by using tools for data entry such as RightField, https://rightfield.org.uk/.
Reviewed in section 6 of ICEDIG project deliverable D4.2, doi: 10.5281/zenodo.3364509.
ICEDIG project deliverable D4.1, doi: https://doi.org/10.5281/zenodo.3364502.
ICEDIG project deliverable D3.7, doi: 10.5281/zenodo.3469531.
See, for example the work of oVert (an ADBC Thematic Collection Network) which is aiming to produce 3D images of more than 20,000 vertebrates: https://www.idigbio.org/wiki/index.php/OVert:_Open_Exploration_of_Vertebrate_Diversity_in_3D.
See section 5 of ICEDIG project deliverable D3.7, doi: 10.5281/zenodo.3469531.
The Community Standards for 3D Data Forum (CS3DP) has done some work in this area. See https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!forum/community-standards-for-3d-data-preservation-cs3dp. This group was created to aid in the organization of nationally shared resources for the preservation and management of 3D digital research outputs and the development of shared community driven standards. Also, some work done by the oVert Thematic Collection Network (footnote 72), especially on archival strategies and access.
Traditionally, these can be thought of as quality assurance and quality control respectively. Quality assurance is process oriented and focuses on defect prevention, while quality control is product oriented and focuses on defect identification.
For more detail refer to each of the specific workflow sections in the ICEDIG project deliverable D3.1, doi: 10.5281/zenodo.3469521.
ICEDIG project deliverable D3.1, doi: 10.5281/zenodo.3469521. Recommendations are organised by collection type i.e., for i) microscopy slides; ii) skins and vertebrate material; iii) liquid preserved specimens; iv) pinned insects; v) herbarium sheets; and vi) 3d digitization. Recommendations cover: quality criteria to be met, the digitization workflows, the quality assurance activities to be performed during digitization, the quality control activities that are performed after digitization, and the software and hardware tools that can be used to support these quality management activities.
ICEDIG project deliverable D3.8, doi: 10.5281/zenodo.3719101.
Small and Medium Enterprises.
ICEDIG deliverables D6.2, doi: 10.5281/zenodo.3364533; D6.3, doi: 10.5281/zenodo.3346782; and D6.4, doi: 10.5281/zenodo.3469490.
Although at least one proposal has been made, by Diazgranados and Funk (2013), doi: 10.3897/phytokeys.25.5175.
Contrast this with the USA, for example where unionisation prevents the use of volunteers by some institutions.
ICEDIG project deliverable D5.3, doi: 10.5281/zenodo.3364541.
ICEDIG project deliverable D5.2, doi: 10.5281/zenodo.3364519.
PPSR-Core is a set of global, transdisciplinary data and metadata standards describing contextualized details about Public Participation in Scientific Research (PPSR) projects i.e., citizen science projects. For further details, see here: https://github.com/CitSciAssoc/DMWG-PPSR-Core/wiki/About-the-PPSR-Core.
ICEDIG project deliverable D4.3, doi: 10.1093/database/baz129.
ICEDIG project deliverable D4.4, doi: 10.5281/zenodo.3361598.
International Image Interoperability Framework – “triple-eye eff”, https://iiif.io/.
Like DOIs for journal articles, this will be based on the Handle system, https://www.dona.net/handle-system.
Note that at the time of writing (March 2020) the proposal is that NSId should be used to identify Digital Specimens whereas DOI are being considered as the basis for identifying Digital Collections.
Of course, as well as initial costs, the costs of replacement, upgrade and decommissioning must also be accounted for.
ICEDIG project deliverable D8.2, doi: 10.5281/zenodo.3724224.
Purchasing Power Standard (PPS) is the technical term used by Eurostat for the common currency in which national accounts aggregates are expressed when adjusted for price level differences using Purchasing Power Parities (PPP).
ICEDIG Milestone Report MS48 contains detailed information on competencies and skills profiles.
Commencement of the DiSSCo Prepare Preparatory Phase project in January 2020.
Access via a ‘European Collection Objects Index’ (ECOI).
As in short DOIs, shortened URLs e.g., bit.ly, compact identifiers (doi: 10.1038/sdata.2018.29) or readable letter combinations like ‘sofa-nice-face’.
Alternatives for second-level prefixes must be further studied. Here are two possibilities: i) In many cases, the registrar organisation will be the collection-holding institution (e.g., Natural History Museum London). So, using the Global Registry of Scientific Collections (https://www.gbif.org/grscicoll), a second-level prefix could be “NHMUK”, yielding (for example) ‘nsid: 12.nhmuk/’; and ii) Rather than registrar institution, a further classification of the digital specimen type is reflected at the second-level of the prefix. For scientists, being able to quickly understand the (taxonomic) group of specimens e.g. 12.B/ (for algae, fungi and plants) or 12.Z/ (for zoology) can be helpful.
Careful design of the NSId prefix scheme might provide the opportunity to include the current syntax of IGSNs, retaining backwards compatibility whilst moving them into a new top-level handle prefix. We understand that today IGSNs syntax mandates a concatenation of the namespace of the allocating agent and a unique in that agent identifier e.g. for NHM that would be NHMXXXXXXX. Retaining that syntax we can then have a nsid: 12.igsn/NHMXXXXXXX. IGSN retains its identity but within a larger universe of collections/samples handles.
We note, however the DOI Foundation strategy discussion, commenced earlier this year to terminate resolution of non-DOI handles by the doi.org resolver. We think this would be a retrograde step.