Research Ideas and Outcomes :
Project Report
Corresponding author: Louise Isager Ahl (louise.ahl@snm.ku.dk)
Received: 21 Oct 2023 | Published: 02 Nov 2023
© 2023 Louise Ahl, Luca Bellucci, Philippa Brewer, Pierre-Yves Gagnier, Elspeth Haston, Laurence Livermore, Sofie De Smedt, Helen Hardy, Henrik Enghoff
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Ahl LI, Bellucci L, Brewer P, Gagnier P-Y, Haston EM, Livermore L, De Smedt S, Hardy HM, Enghoff H (2023) Digitisation of natural history collections: criteria for prioritisation. Research Ideas and Outcomes 9: e114548. https://doi.org/10.3897/rio.9.e114548
There are approximately 1.5 billion specimens kept in European Natural History Collections. The mission for the Distributed System of Scientific Collections (DiSSCo) is to unite all these specimens into a one-stop e-science infrastructure of digital specimens. This is a monumental digitisation task and criteria for how to prioritise this effort are, therefore, crucial for the success of the project. In this report, we have reviewed the literature and designed and conducted surveys of the digitisation plans and criteria used by DiSSCo Partners to understand the prioritisation criteria used in the digitisation of natural history collections. As an attempt to provide some guidance for the digitisation of specimens, we suggest that an organisation (e.g. DiSSCo or an individual institution) that is planning to digitise natural history collections considers four categories of prioritisation criteria: Relevance, Data quality, Cost and Feasibility.
DiSSCo Prepare, DiSSCo, natural history collections, natural science collections, digitisation, prioritisation, digitization, prioritization
A core mission of the Distributed System of Scientific Collections (dissco.eu) is to unite the ~ 1.5 billion specimens kept in European Natural History Collections into a one-stop e-science infrastructure containing as many of these specimens as possible in the form of digital specimens (
This issue has been addressed in several previous publications, notably in a report from a GBIF taskforce (
The general picture emerging from previous studies (
The four groups embrace all prioritisation criteria which have been previously proposed and are described in detail in this report.
Data quality is given particular attention since this aspect of digitisation has been somewhat neglected in previous works. We have split this criterion into two main components:
1) How much information is there in each digital specimen? (Information level). This component has been addressed through the development of the MIDS concept (Minimum Information about a Digital Specimen,
2) How reliable is that information? Reliability includes accuracy (the closeness of measured values, observations or estimates to the true value) and precision (e.g. of geographical information: latitude/longitude in degrees only, in degrees plus minutes or in degrees plus minutes plus seconds; or of taxonomic information: identification to genus, species or subspecies level).
The quality of data also includes the potential for quality assessment and improvement, as well as its completeness in terms of taxonomic, geographical or collection coverage.
Cost is obviously a major consideration in any digitisation project. We emphasise that cost estimates should include all costs associated with the project (pre-digitisation, digitisation sensu stricto and post-digitisation), as highlighted in two case studies in which we have analysed all costs associated with the digitisation of a herbarium and a collection of fossils. Cost in relation to prioritisation includes both affordability (can the project be achieved within the resources available and in relation to any funding opportunities?) and value for money (are the costs reasonable in relation to the intended benefit or impact?).
It has become obvious that there is no easy way to implement the multitude of criteria. The idea of an algorithm such as a “decision tree” seems unviable and we suggest that projects be evaluated/prioritised by a combination of a scoring method and a panel discussion, similar to what has been done in the series of SYNTHESYS projects*
We strongly recommend collaboration, for example, at DiSSCo level, in order to optimise resources and we want to underline that, irrespective of which criteria are considered, there is no one-size-fits-all solution. Flexibility is essential, depending on the intended use of the digital specimens to be generated; the resources available; and in order to respond to opportunities.
We provide a list of questions to be considered in connection with the drafting or evaluation of digitisation projects.
Finally, we stress that digital specimens can never replace the physical specimens that exist in collections and that ensuring the long-term preservation of the collections remains a top priority.
This project report was written as a formal Deliverable (D1.3) of the DiSSCo Prepare Project (
The following text is the formal task description (Task 1.3) from the DiSSCo Prepare project's Description of the Action (workplan):
"Based on the analysis of previous studies, relevant criteria will be identified and developed into a basic model for the prioritisation of digitisation of objects held in NSCs. Criteria to be considered include scientific relevance, user needs, socioeconomic impact, specialisation, technical feasibility and cost".
Natural history collections are treasure troves for scientists and, in order to safeguard and expand the use of these collections for the future, digitisation is pivotal. Attempts to digitise natural history collections throughout the world have already started. The Distributed System of Scientific Collections (DiSSCo) is a pan-European Research Infrastructure (RI) for natural science collections. The aim of this infrastructure initiative is to unify all European natural science assets under common access, curation, policies and practices. This approach and set-up will ensure that all the data is easily Findable, Accessible, Interoperable and Reusable (FAIR principles - see also
Digitisation in this context spans the spectrum from making basic information on a specimen (name, collecting locality etc.) digitally available, to including (or linking to) digital images (photographs, X-rays, scanning electron micrographs etc.), DNA sequences, chemical information and other data in the digitised information. These rich, linked specimen data have been referred to as the "Extended Specimen" (
Digitisation can be approached in different ways:
In Europe alone, there are an estimated 1.5 billion specimens stored in collections, representing nearly 80% of described species worldwide (
Within institutions, prioritisation may need to take into account all of the four categories above, in a ‘balanced portfolio’ approach that, for instance, ensures mass digitisation projects are balanced against user-led services and the need for innovation or more bespoke pilots or the need to make equipment available for business as usual. For DiSSCo, prioritisation of what to digitise is perhaps most critical in relation to the coordination of mass digitisation programmes and/or larger project-based digitisation, as these will primarily drive critical mass of content creation through the DiSSCo infrastructure. It is also likely that central coordination of on-demand approaches may be required; however, this is less a question of prioritisation - which, by definition, is user-led in these services - and more one of service design, funding etc. Mass or larger project digitisation activities are, therefore, the main (but not only) focus of this report. Technical approaches to digitisation are a related and overlapping subject, but this will not explicitly be dealt with here unless it is of direct relevance to the discussion.
The crucial question can briefly be framed as "Where to start?". Another crucial consideration is: "to what extent should decisions be made at a European or global level, rather than in individual collection-holding institutions?" A coordinated approach would allow us to focus more efficiently on solving specific problems that have a wide and significant impact on all of us, for example, by assembling a critical mass of relevant data to address key societal challenges; or by enabling the most efficient and effective workflows to be deployed widely with maximum impact. Here, DiSSCo offers a unique opportunity for coordinating prioritisation, though it should also be recognised that each institution will have its own drivers and stakeholder requirements that will impact the prioritisation process (not least in that different institutions hold different types of collections and objects, which they will naturally see as their priorities).
There are few descriptions and models available for prioritisation of digitisation targeting natural history collections. Many potential factors may influence the decision-making process regarding prioritisation and the present paper is to be seen as a help to “establish relevant criteria to identify a prioritisation model for digitisation” (DPP Description of Work). To obtain a better understanding of what has been done in the past and what is included in current digitisation programmes, we carried out the following:
Additionally, we obtained detailed information of all costs associated with two digitisation projects that have been carried out in recent years.
At the onset of this project, two core studies were available on the topic of digitisation. The most recent work was carried out in the ICEDIG project and reported in the final deliverable “Inventory of criteria for prioritisation of digitisation of natural history collections” (
For the 2021 survey, works deemed to be relevant were scored (1-3), based on relevance for the investigation with 1 being most relevant. The searches were carried out in Google Scholar with the following search parameters:
In comparison to the results presented by
Table 1. Results of the four search compilations undertaken in April 2021 and June 2022.
Search no. | April 2021: No. results | April 2021: No. relevant | June 2022: No. results | June 2022: No. relevant |
---|---|---|---|---|
1 | 143 | 4 | 223 | 6 |
2 | 775 | 4 | 1170 | 4 |
3 | 4460 | 2 | 4640 | 2 |
4 | 46 | 2 | 46 | 2 |
The 2022 survey was carried out under much broader criteria and resulted in a large number of publications (see Suppl. material
In addition to the literature study, two surveys were carried out amongst DiSSCo partners: one covering their digitisation strategy (if present) and one covering the prioritisation criteria they used for digitisation completed or in progress.
DiSSCo partners were asked to provide information, in free text and preferably no more than 2 A4 pages, on:
The following guiding questions were supplied to highlight relevant topics:
It was suggested that, in their answers, it could be useful to distinguish between:
This study was carried out in the autumn and early winter of 2021.
In Suppl. material
The multitude of thoughts, approaches and results described by respondents to the essay-based questionnaire makes interesting reading although, as expected, the format makes it difficult to quantify or even to describe the results in a few paragraphs or diagrams. Therefore, we subsequently developed a short multiple-choice questionnaire focused on the digitisation activity, using a Google Form. The short questionnaire, after being reviewed by the task partners, was sent to all DiSSCo National Nodes who shared it with their own institutions in order to collect information from as many institutions as possible involved in DiSSCo. To facilitate the dissemination, the questionnaire was translated into different languages (English, Danish, French, Italian and Dutch). An overview of the questions and answers can be found in Suppl. material
The structure of the questionnaire was as follows:
This study was carried out in spring of 2022.
In addition to the prior costbook work (
The most significant results obtained through the literature review were reports carried out by GBIF (2016) and within the DiSSCo-related project ICEDIG*
A task force was convened by GBIF “to help accelerate the discovery, digitisation and access to biocollections data”. One of the task force’s main objectives was to provide guidance on establishing priorities for digitising biocollections to serve institutional, national, and global needs and achieve the greatest economies of scale (
The most important priorities identified by the GBIF task force were reported to be:
However, these findings are only in part compatible with the most important criteria found by ICEDIG (see below).
ICEDIG was an EC-funded project under the Horizon 2020 Framework*
For the questions regarding prioritisation,
Based on the additional information added in free text, an extensive and revised list of criteria was assembled on six overarching topics:
We note that there is some overlap between all of these topics.
Due to the broad range of criteria that were identified to be of importance in the process of prioritising digitisation efforts, three possible methods to determine the strategy for a digitisation project were proposed: 1) Decision tree; 2) Scoring method and 3) Panel review.
Although relevant publications were identified through the additional literature survey (Suppl. material
Two surveys were carried out amongst DiSSCo partners on their digitisation strategy (if existing), as well as on which prioritisation criteria they employed for digitisation which had already been done or was in progress. The main findings have been summarised here and the complete responses can be found in Suppl. material
The natural history collections that replied to our questions are at different levels in their digitisation efforts. This means that the answers reflect whatever level they are at and are, therefore, hard to sum up in a coherent way as they varied from “all our collections have been digitised” to “we have no official document outlining our digitisation priorities”. However, most seem to adhere to the criteria put forward by
In terms of prioritisation criteria employed for digitisation efforts, many respondents had left this blank or indicated that internal work was in progress to define their approach. It is, therefore, not possible to extract general tendencies. Instead, we present, as a concrete example, the key criteria for digitisation efforts employed by the Natural History Museum of Denmark:
Of the 23 national nodes, only 10 answered, with a total of 79 answers. Most of the answers came from NH Museums or University Museums and Research Institutions. Thus, most respondents are curators, several are researchers or directors of the collections and a few are digital collection managers or similar (Suppl. material
In general, the size of the team is proportional to the size of the collection, with a few exceptions: five large or very large collections have a small team, six small collections have medium-sized teams and one very small collection has a large team (Suppl. material
Digitisation seems to be primarily driven by “Projects (e.g. E-Recolnat, national lists of flora or fauna etc.)” and “Opportunistic digitisation (e.g. moving the collection into a new site, out-going loans, new specimens entering the collection, exhibition and other contingent events)”. The “Digitisation on demand (i.e. ad hoc digitisation for specific research, as requested by external researchers, for example, through VA SYNTHESYS+)” is the third choice in the decision process described by
The short questionnaire highlighted that almost all the institutions share the same digitisation priorities as follows (see Suppl. material
Overall performance in respect to human resources and tools;
Overall performance in respect to financial resources;
Faster digitisation improving cost/volume rate.
Therefore, the “Scientific relevance” of a collection is the key element that drives digitisation, with taxonomic and geographic relevance the most important sub-criteria in this category. If the collection has institutional importance (perhaps for funding programmes), the priority for its digitisation is boosted.
A total of 70% of the respondents declared that their institution has a clear overview of the digitisation status (how many specimens are in the database, how many imaged, open access database etc.), but for most, the database is not in open access. The digitisation status is monitored by automated means in less than 20%, while the remaining 80% are divided between “no monitoring in place” or “monitoring by extracting the needed information through different databases or sources”. A single CMS is used by a small percentage (28%), whereas 50% do not have a CMS, but use traditional databases (e.g. Access, Excel files) (Suppl. material
Regarding information about digitised items (Suppl. material
The answers showed that the MIDS3 level has the lowest percentage for almost all the collections (n = 41), while MIDS2 is the best “compromise” since it provides considerable information without being too demanding. The expected decreasing trend from MIDS0 to MIDS3 was not clear in the replies, probably because some respondents did not follow the logic “MIDS0 ≥ MIDS1 ≥ MIDS2 ≥ MIDS3” suggested in the question; judging from the individual answers, they probably reported the values by subtracting the number of specimens digitised at one level from the total digitised. The percentage of imaged items and 3D models is low, probably due to a lack of specific tools/technologies and of larger data repositories.
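The two ways of reporting MIDS counts can be told apart mechanically. The following is a minimal Python sketch, assuming that the non-conforming replies counted each specimen only once, at the highest level it had reached; the function names and the example counts are hypothetical and are not taken from the survey data.

```python
def is_cumulative(counts):
    """Return True if counts follow MIDS0 >= MIDS1 >= MIDS2 >= MIDS3."""
    return all(a >= b for a, b in zip(counts, counts[1:]))


def exclusive_to_cumulative(counts):
    """Convert 'highest level reached' counts to 'at least this level' counts."""
    cumulative = []
    running = 0
    # Walk from MIDS3 down to MIDS0, accumulating the higher levels.
    for n in reversed(counts):
        running += n
        cumulative.append(running)
    return cumulative[::-1]


# Hypothetical reply, ordered MIDS0, MIDS1, MIDS2, MIDS3.
exclusive = [500, 2000, 6000, 1500]
cumulative = exclusive_to_cumulative(exclusive)

print(is_cumulative(exclusive))    # False
print(cumulative)                  # [10000, 9500, 7500, 1500]
print(is_cumulative(cumulative))   # True
```

A reply that fails the check is not necessarily wrong; it may simply use the other convention, which is why a conversion step is useful before replies are compared.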
Finally, the replies have highlighted how funding, particularly for employing dedicated staff, is crucial for planning a digitisation strategy.
The multiple-choice questionnaire can be found in Suppl. material
Cost is an important consideration in any digitisation project; it often constitutes a criterion overruling other considerations, either because projects are not considered to be affordable (they cannot be achieved within available resources) or, perhaps, because the value for money of pursuing them is not considered sufficient. We found that most of the published cost analyses of digitisation, including the in-depth analysis made in the context of the ICEDIG project (
This mass-digitisation project at the Natural History Museum of Denmark (NHMD) was initiated in 2019 and was completed in May 2023. The project was partly financed by a grant (2.2 million DKK ~ 295,000 euro) from the Aage V. Jensen Charity Foundation and NHMD invested considerable additional resources from its internal collection budget.
The aim of the project was to digitise the Greenlandic vascular plant herbarium, including transcription and georeferencing. The collection is significant as it is the largest collection of plants from Greenland and includes a significant proportion of historical material. The project is summarised in more detail by
Table
Expenses associated with the digitisation of the Greenland Herbarium at NHMD. Important: the cost for each item consists of cash costs plus time costs; conversion of time (hours) to cash (euro or other currency) has not been attempted. *71,879 out of 170,000 records had been transcribed, cleaned and imported into Specify as of August 2022; this required 128 hours. The figure in the table (303 hours) is the extrapolation to the full collection: 128 × 170,000/71,879.
Process | Cash Cost (EUR) | Duration (Hours) | Notes |
---|---|---|---|
Imaging of 147,500 sheets and 15,900 folders | 109,150 | Not recorded | done by external contractor, paid by grant |
Transcription of 170,000 labels* | 103,700 | Not recorded | done by external contractor, paid by grant |
Transport of specimens, materials and professional freezing services | 12,500 | Not recorded | done by external contractor, paid by grant |
Project management | Not recorded | 960 | 800 hours paid by grant, rest by NHMD |
Packing of collection | Not recorded | 160 | paid by grant |
Data management | Not recorded | 303* | small part paid by grant, rest by NHMD |
Collection management | Not recorded | 175 | paid by NHMD |
Student assistance (data cleaning etc) | Not recorded | 158 | partly paid by grant, rest by NHMD |
Total | 225,350 | 1581 hours | Total cost = cash (euro) plus time (hours) |
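The extrapolated data management figure and the cash total in the table above can be reproduced with a few lines of Python. This is a sketch only: the numbers come from the table and its caption, the variable names are illustrative, and the extrapolation assumes that the effort per record stays constant for the remaining records.

```python
# Figures taken from the table and its caption; names are illustrative.
records_done = 71_879      # records transcribed, cleaned and imported
hours_spent = 128          # data management hours for those records
records_total = 170_000    # labels to be transcribed in total

# Linear extrapolation: assumes constant effort per record.
hours_estimated = round(hours_spent * records_total / records_done)

# Cash costs (EUR) of the three items for which cash was recorded.
cash_costs_eur = {
    "imaging": 109_150,
    "transcription": 103_700,
    "transport_and_freezing": 12_500,
}
cash_total_eur = sum(cash_costs_eur.values())

print(hours_estimated)   # 303
print(cash_total_eur)    # 225350
```

Time and cash are kept as separate quantities here, as in the table, since no conversion between hours and euro has been attempted.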
Item | Time spent | Time upscaled to 170,000 specimens (rounded to hours) | Notes |
---|---|---|---|
cleaning collector names – clustering | 60 min | 42 hours | |
cleaning taxonomy – clustering | 15 min | 11 hours | |
cleaning author names | 10 min | 7 hours | |
cleaning infraspecific taxonomy – clustering | 10 min | 7 hours | |
cleaning locality – clustering | 90 min | 63 hours | variable, depends on original data quality |
uploading images | 1 min | 1 hour | usually scheduled to happen during night |
Total | 3 hours 6 min | 131 hours | |
This 3D digitisation was initiated in 2020 and finished in 2022 thanks to Tuscany Region Postdoc Grants in Cultural Heritage 2018 (“POR FSE 2014-2020 Asse A – Occupazione”). This project entitled “Virtual paleontology - a non-invasive approach for the fruition, diffusion and sharing of the paleontological heritage” (PalVirt) was carried out by Dr. Saverio Bartolini Lucenti and was the first example in Italy of the systematic and massive 3D digitisation of paleontological type-specimens, in particular 138 vertebrates (almost all) and 69 invertebrates and plants. Three partners were involved in the project: the Earth Science Dept. – Paleo[Fab]Lab, the Geology and Paleontology Museum and Tbnet Soluzioni3d srl (Arezzo). For further information, see
Item | Cash cost (€) | Time cost (hours) | Notes |
---|---|---|---|
3D models of 200 fossil specimens (acquisition and elaboration) | 56,000 | 792 | done by external contractor, paid by grant |
Project coordinator | Not recorded | 176 | paid by NHM UniFi |
Collection manager (Project Referent) | Not recorded | 352 | paid by NHM UniFi |
Collection managers | Not recorded | 176 | paid by NHM UniFi |
Total | 56,000 | 1496 hours | Total cost = cash (euro) plus time (hours) |
The results from both the essay-based and the multiple-choice questionnaires, like the results from the literature studies, highlighted the extreme complexity of prioritisation. Fulfilling the ambition of DiSSCo, to digitise millions of specimens in all possible shapes, sizes, origins, ages, states and values, is indeed a daunting task. The very high number of prioritisation criteria that have been suggested may appear as a barrier to progress for many institutions or may need to be balanced at an organisational level, for example, to meet strategic or funding opportunities, while also carrying out projects to develop new digitisation workflows or to meet the needs of particular users. An organisation planning a digitisation project needs to consider whether, for example, scientific relevance should be a guiding principle (and define what this means in their specific case) and/or what the funding opportunities are and/or what data quality can be obtained with the resources at hand and/or what the societal interest in the digital specimens to be created is.
With the aim to facilitate decisions about prioritisation of digitisation to be taken by DiSSCo or by individual institutions, we here offer a classification of the multitude of possible criteria into four main categories. Based on our literature study and the results of our surveys, we propose the following four categories:
All criteria that have been suggested previously fall into one (or more) of the four groups which are, thus, not new criteria, but are meant as an aid to reduce the multi-dimensionality of the “criterion space” during the first steps in the prioritisation process.
The categories of criteria are not completely mutually exclusive. For example, “Cost” may be seen as a component of “Feasibility” and indeed, cost considerations often overrule other criteria. In spite of the somewhat simplistic classification of prioritisation criteria presented above, prioritisation remains a very complex task. It is important to bear in mind that considering just one criterion or just one category of criteria in isolation will not result in a sound prioritisation. All categories need to be considered, as visualised in Fig.
Interrelation of the four main categories of criteria. Data quality and cost are represented on the horizontal and vertical axes (axis values are arbitrary). Relevance is represented by the size of the circles and feasibility by the intensity of their colour. Project A and B will both deliver data of high quality and high relevance. Although Project B data will be of slightly lower quality and slightly higher cost, this project may be chosen because of higher feasibility. Project C has little to recommend it, whereas Project D (low data quality, medium relevance and feasibility and low cost) might be prioritised depending on what the data will primarily be used for.
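A scoring method that considers all four categories together could be sketched as follows. This is a minimal Python illustration and not a method prescribed by this report: the 1 to 5 scales, the weights and the scores given to Projects A to D are hypothetical values chosen to mirror the figure loosely, and the resulting ranking is meant as input to a panel discussion rather than as a decision.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    relevance: int      # 1 (low) to 5 (high)
    data_quality: int   # 1 (low) to 5 (high)
    cost: int           # 1 (cheap) to 5 (expensive)
    feasibility: int    # 1 (low) to 5 (high)


# Hypothetical weights; a real panel would agree on its own values.
WEIGHTS = {"relevance": 0.4, "data_quality": 0.2, "cost": 0.2, "feasibility": 0.2}


def score(p: Project) -> float:
    """Weighted sum over the four categories; higher cost lowers the score."""
    affordability = 6 - p.cost  # invert the cost scale so that cheap scores high
    return (WEIGHTS["relevance"] * p.relevance
            + WEIGHTS["data_quality"] * p.data_quality
            + WEIGHTS["cost"] * affordability
            + WEIGHTS["feasibility"] * p.feasibility)


def rank(projects):
    """Return projects ordered from highest to lowest score."""
    return sorted(projects, key=score, reverse=True)


# Hypothetical scores, loosely following the four projects in the figure.
projects = [
    Project("A", relevance=5, data_quality=5, cost=3, feasibility=2),
    Project("B", relevance=5, data_quality=4, cost=4, feasibility=5),
    Project("C", relevance=1, data_quality=2, cost=5, feasibility=2),
    Project("D", relevance=3, data_quality=1, cost=1, feasibility=3),
]

for p in rank(projects):
    print(p.name, round(score(p), 2))
```

With these assumed values, Project B ranks above Project A because of its higher feasibility, but changing the weights changes the order, which is one reason why a score on its own should not settle the prioritisation.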
Relevance may be seen as the primary criterion for prioritising digitisation. If the digitised specimens to be generated are of low relevance, i.e., will lead to no benefit or have no impact, other types of criteria (data quality, cost, feasibility) become almost irrelevant.
Different kinds of users have different needs: what is seen as most relevant for one may not be most relevant for another. According to the comprehensive ICEDIG study (
Types of information to be included in digital biological specimens depending on intended use.
PRIMARY USE OF DIGITISED SPECIMENS | |||||
TYPES OF INFORMATION INCLUDED | Taxonomic research | Other types of fundamental research (e.g. biogeographical, ecological) | Applied research (e.g. medical) | Conservation/ land use | Outreach |
Taxonomy | + | + | + | + | + |
Georeference | + | + | + | ||
Images | + | + | |||
Habitat info | + | + | + | ||
Sequence data | + | + | + |
There are two further complexities in relation to using scientific relevance as a guide to prioritisation in DiSSCo. Firstly, it is likely that almost all collection objects where sufficient data are present have scientific relevance against one or more of the types of research mentioned above. Deciding which of these purposes are ‘most’ important or relevant is extremely challenging. Secondly, this relies on our current understanding of what is important, relevant and useful, but a key benefit sought through digitisation is to unlock new avenues and paradigms of research, for example, joining up collections data to other data sources in ways which have not previously been explored. Again, this makes judgements of scientific relevance based on today’s evidence inherently flawed, although still worthwhile as one of the criteria to inform prioritisation. Irrespective of how carefully relevance criteria are analysed, nothing is immutable. Like prioritisation in general, scientific relevance may change over time as institutions and researchers change their focus.
Much of the existing research prioritisation focuses on scientific research. The low prioritisation of 'social-relevant criteria' or social relevance (
As a thought experiment, consider two digitised collections: one with 100,000 digitised specimens and a second with 1,000,000 digitised specimens. At first glance, we might consider the latter more advanced in terms of quantity of digital specimens. However, what is the quality of the digital specimens in the two collections? When planning and assessing digitisation, data quality needs to be taken into consideration, although this aspect has received little attention in previous studies. See
There are two main dimensions of data quality:
A third essential aspect of data quality is potential for validation and improvement:
Discussion of data quality is also not independent of the relevance criteria discussed above - the reason data quality is important has to do with whether data are ‘research-ready’ and impactful. There may be areas of data quality, such as high quality geo-referencing, that are relevant to widespread fields of research; but other areas of detail which are critical for particular studies, but less valuable to widespread users. It is also often the case that a few key data fields from a large volume of specimens may be more valuable than deep and detailed data on just a handful of objects - again, it depends on the potential uses and users. Ultimately, however, it is reasonable to say that if data about specimens are clearly poor or lacking (e.g. labels are missing, damaged etc.), those specimens are unlikely to achieve much impact through digitisation. These points are explored further below.
A digitised specimen may be anything from a textual record with minimal information (e.g. species name) to an extended digital specimen represented by full collection information, illustrations in the form of photos and CT scans, morphometric data, DNA sequences, sound recordings, chemical profiles and with links to related data and resources.
In order to quantify the information level of digital specimens, a digitisation standard has been developed. The Minimum Information about a Digital Specimen (MIDS) standard (
Four levels of MIDS (Minimum Information about a Digital Specimen). From
MIDS level | Record extent | Purpose |
---|---|---|
1 | Basic | A basic record of specimen information. |
2 | Regular | Key information fields that have been agreed over time as essential for most scientific purposes. |
3 | Extended | Other data present or information known about the specimen, including links to third-party sources. |
0 (Note) | Bare | A bare or skeletal record making the association between an identifier of a physical specimen and its digital representation, allowing for unambiguous attachment of all other information. |
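The cumulative nature of the four levels in the table above can be illustrated with a short Python sketch. The level names follow the table, but the field names required at each level are hypothetical placeholders chosen for illustration; the MIDS specification defines its own information elements, which are not reproduced here.

```python
from enum import IntEnum


class MIDS(IntEnum):
    BARE = 0
    BASIC = 1
    REGULAR = 2
    EXTENDED = 3


# Hypothetical requirements per level, for illustration only;
# these are NOT the information elements of the MIDS specification.
REQUIRED_FIELDS = {
    MIDS.BARE: {"specimen_id"},
    MIDS.BASIC: {"name", "institution"},
    MIDS.REGULAR: {"locality", "collection_date"},
    MIDS.EXTENDED: {"image_url"},
}


def mids_level(record):
    """Return the highest level reached, or None if level 0 is not met.

    A level counts only if all lower levels are also satisfied.
    """
    present = {key for key, value in record.items() if value}
    reached = None
    for level in MIDS:  # iterates in order: BARE, BASIC, REGULAR, EXTENDED
        if REQUIRED_FIELDS[level] <= present:
            reached = level
        else:
            break
    return reached


# Hypothetical record with an image but no locality or collection date.
record = {
    "specimen_id": "X-0001",
    "name": "Rosa canina",
    "institution": "Example Museum",
    "image_url": "https://example.org/x-0001.jpg",
}
print(mids_level(record).name)   # BASIC
```

In this sketch, a record with an image but without locality and date stays at the basic level, reflecting the idea that each level builds on the one below it.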
The level of information required varies significantly depending on what the data are being used for. Planning and costing a digitisation programme potentially requires a low level of information; some ‘big data’ analyses, including species distributions, require an additional set of data; whilst taxonomic research may require all the data that are available on the specimen. Mass digitisation programmes are commonly taking a staged approach to capturing information, starting at the basic level (MIDS, Level 1) and using a range of options, including outsourcing and crowdsourcing, to transcribe additional data and reach a higher digitisation level. The extended record (MIDS, Level 3) equates to the DiSSCo open Digital Specimen specification (
An example of a digitised specimen with a very high information level can be considered a digital surrogate. This concept was described by
A CT scan of the millipede described by
However, while digitisation of type specimens to a high level of detail has many benefits, it does not enable 'big data' type analyses, such as species distributions which are critical to understanding environmental change - it is likely that a balance is required in prioritisation between detailed data on some specimens and lower levels of data on many specimens.
Reliability (data quality in the strict sense) was treated in detail by
For all these components of a data-point, but especially obvious for spatial data, their accuracy and precision need to be considered. Accuracy and precision are often confused: accuracy refers to the closeness of measured values, observations or estimates to the real or true value, whereas precision includes statistical precision (the closeness with which repeated observations conform to themselves) and numerical precision (the number of significant digits that, for example, decimal latitude/longitude is recorded in) (
The differences between accuracy and precision in a spatial context. The red spots show the true location, the black spots represent the locations as reported by a collector. Far left - High precision, low accuracy. Middle left - Low precision, low accuracy showing random error. Middle right - Low precision, high accuracy. Far right - High precision and high accuracy. From
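The distinction between accuracy and statistical precision can also be expressed numerically. The following Python sketch assumes, for simplicity, planar coordinates in arbitrary units (real latitude/longitude data would need a geodesic distance); the function names and the example points are hypothetical.

```python
import math
from statistics import mean


def centroid(points):
    """Mean x and mean y of a list of (x, y) points."""
    return (mean(x for x, _ in points), mean(y for _, y in points))


def accuracy_error(points, true_point):
    """Distance from the centroid of the reported points to the true location.

    A small value means high accuracy.
    """
    return math.dist(centroid(points), true_point)


def precision_spread(points):
    """Mean distance of the reported points from their own centroid.

    A small value means high precision.
    """
    c = centroid(points)
    return mean(math.dist(p, c) for p in points)


true_location = (0, 0)

# Tightly clustered but far from the truth: high precision, low accuracy.
clustered_off_target = [(10, 10), (11, 10), (9, 10), (10, 11), (10, 9)]

# Widely scattered but centred on the truth: low precision, high accuracy.
scattered_on_target = [(-3, 0), (3, 0), (0, -3), (0, 3)]

for pts in (clustered_off_target, scattered_on_target):
    print(round(accuracy_error(pts, true_location), 2),
          round(precision_spread(pts), 2))
```

The first set yields a large accuracy error with a small spread, the second the reverse, corresponding to the "high precision, low accuracy" and "low precision, high accuracy" cases described in the caption.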
Ideally, all data-points would have high accuracy and high precision. However, for some purposes, high precision is not necessary for the data to be “fit for use”. This is illustrated in Fig.
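The distinction between accuracy and precision can be made concrete with a small numerical sketch. It is illustrative only and not taken from the cited sources: the function names and sample points are hypothetical, and distances are computed on a flat plane rather than geodesically, which is an assumption made to keep the example short.

```python
import math
from statistics import mean


def accuracy_and_precision(true_point, reported_points):
    """Return (bias, spread) in the same units as the coordinates.

    bias:   distance from the true location to the centroid of the reports
            (a small bias means high accuracy).
    spread: mean distance of each report from that centroid
            (a small spread means high statistical precision).
    Planar distances are an assumption of this sketch.
    """
    cx = mean(p[0] for p in reported_points)
    cy = mean(p[1] for p in reported_points)
    bias = math.dist(true_point, (cx, cy))
    spread = mean(math.dist(p, (cx, cy)) for p in reported_points)
    return bias, spread


def decimal_places(value_text):
    """Numerical precision: digits after the decimal point, as written."""
    _, _, fraction = value_text.partition(".")
    return len(fraction)


# Hypothetical sample data: the true location is the origin.
true_location = (0.0, 0.0)
tight_but_offset = [(5.0, 5.1), (5.1, 5.0), (4.9, 5.0), (5.0, 4.9)]
scattered_but_centred = [(3.0, 0.0), (-3.0, 0.0), (0.0, 3.0), (0.0, -3.0)]

for label, points in [("tight but offset", tight_but_offset),
                      ("scattered but centred", scattered_but_centred)]:
    bias, spread = accuracy_and_precision(true_location, points)
    print(f"{label}: bias={bias:.2f}, spread={spread:.2f}")
```

The first sample has a small spread but a large bias (high precision, low accuracy); the second has the reverse. `decimal_places` addresses only numerical precision: a coordinate written with five decimals is not thereby more accurate.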
Irrespective of how carefully a dataset has been prepared, very few datasets – if any at all – are guaranteed error-free. Therefore, quality assessment and data cleaning are important aspects of digitisation.
For DiSSCo, four types of information are particularly relevant: 1) taxonomic and nomenclatural information, 2) spatial information (georeferencing), 3) collection date and 4) image quality. For fossils, 5) geological age is also essential. Concerning types 1–3, data cleaning was treated in detail by
Quality control should be done by experts with access to both the physical and digitised collections. When voucher specimens are kept in a collection, the accuracy and precision of the taxonomic/nomenclatural information can be checked by a specialist at any time, but this seldom applies to the accuracy and precision of data on location, date, collector, habitat etc. Hence a great responsibility for accuracy and precision in recording rests on the collectors themselves. An alternative approach is to use online tools, such as the data quality control checks within aggregators like the Global Biodiversity Information Facility (GBIF) and SpeciesLink, which include checks on geocoordinates, taxon names and date formats. GBIF also provides a list of tools, some of which support assessing and improving biodiversity data quality (https://www.gbif.org/resource/search?contentType=tool).*
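The kinds of automated checks mentioned above (geocoordinates, taxon names, date formats) can be sketched as follows. This is a minimal illustration, not the implementation used by GBIF or SpeciesLink: the function name and the issue labels are hypothetical, the record is assumed to be a plain dictionary keyed by Darwin Core-style field names, and the name check is a deliberately crude placeholder for a real taxon-name match.

```python
from datetime import date


def check_record(record, today=None):
    """Return a list of issue labels for one specimen record (a dict).

    The labels are illustrative and are not any aggregator's identifiers.
    """
    today = today or date.today()
    issues = []

    # Geocoordinates: present, within valid ranges, not the (0, 0) default.
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat is None or lon is None:
        issues.append("COORDINATES_MISSING")
    elif not (-90 <= lat <= 90 and -180 <= lon <= 180):
        issues.append("COORDINATES_OUT_OF_RANGE")
    elif lat == 0 and lon == 0:
        issues.append("COORDINATES_ZERO")

    # Taxon name: only a crude shape check; real tools match against
    # a taxonomic backbone.
    name = (record.get("scientificName") or "").strip()
    if not name:
        issues.append("NAME_MISSING")
    elif not name[0].isupper():
        issues.append("NAME_NOT_CAPITALISED")

    # Collection date: must parse as an ISO date and not lie in the future.
    raw_date = record.get("eventDate")
    if not raw_date:
        issues.append("DATE_MISSING")
    else:
        try:
            collected = date.fromisoformat(raw_date)
        except ValueError:
            issues.append("DATE_NOT_ISO")
        else:
            if collected > today:
                issues.append("DATE_IN_FUTURE")
    return issues


# Hypothetical records for illustration.
good = {"decimalLatitude": 55.7, "decimalLongitude": 12.6,
        "scientificName": "Ommatoiulus sabulosus", "eventDate": "1901-05-13"}
bad = {"decimalLatitude": 0, "decimalLongitude": 0,
       "scientificName": "", "eventDate": "13/05/1901"}
print(check_record(good))
print(check_record(bad))
```

Checks of this kind only flag records for attention; deciding whether a flagged value is actually wrong still requires an expert or the physical specimen.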
Manual data cleaning, for instance by taxonomic specialists or curators, will continue to be important. Identifying collectors’ itineraries, for example, makes it possible to detect likely errors, such as a collection date that does not fit the known pattern of that collector (
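An itinerary check of this kind could, in principle, be partly automated once an itinerary has been compiled. The sketch below is hypothetical throughout: the itinerary, the region names and the function name are invented for illustration, and a real itinerary would have to be reconstructed by a specialist from field notes or literature.

```python
from datetime import date

# Hypothetical itinerary for one collector: (start, end, region) legs.
ITINERARY = [
    (date(1901, 3, 1), date(1901, 6, 30), "Region A"),
    (date(1901, 7, 15), date(1901, 11, 30), "Region B"),
]


def fits_itinerary(collected_on, region, itinerary):
    """True if the date and region match at least one leg of the itinerary."""
    return any(
        start <= collected_on <= end and region == place
        for start, end, place in itinerary
    )


# A record dated within the Region A leg but labelled Region B is flagged
# for manual review rather than corrected automatically.
print(fits_itinerary(date(1901, 4, 10), "Region A", ITINERARY))
print(fits_itinerary(date(1901, 4, 10), "Region B", ITINERARY))
```

A mismatch does not show which field is wrong (date, locality or collector), so the result is best treated as a prompt for the manual cleaning described above.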
In the framework of the SYNTHESYS+ project,
Finally, as always, a balanced view is advisable. It is better to release imperfect data than to hold data back in the pursuit of (impossible?) perfection. Releasing (imperfect) digital data can help to improve data quality, for example, by opening the data up to remote comment from international experts.
Cost considerations, including funding opportunities and the affordability of projects within available resources, will have a major influence on what is prioritised in a digitisation project. The cost of digitisation has been the subject of many analyses – recent examples are
Another useful classification described by
In particular, the costs of preserving digitised data are often neglected or underestimated, although they may constitute a very significant part of digitisation costs. See, for example, the case studies of costs in the present report. While cost, including funding opportunities, is likely to be critical to any decision to undertake digitisation, focusing on cost alone is problematic, as it would lead DiSSCo to prioritise only those specimens that are cheapest to digitise. Cost needs to be taken into account alongside the other criteria and is perhaps better expressed and understood as ‘value for money’, that is, the most advantageous combination of cost and quality (or likely impact). In other words, the question is whether it is cost-effective to digitise certain things because there is a feasible workflow, scientific or other relevance that will make the data impactful, sufficient data available and funding to meet the expected costs. Cost data will be added to some of the workflows in DiSSCo’s digitisation guides website (https://dissco.github.io/) and to the “digit-key” (https://digit.naturalheritage.be/digit-key) being developed by the Royal Belgian Institute of Natural Sciences.
The feasibility of a digitisation project is, of course, dependent on available funds. In other words, cost might be seen as one aspect of feasibility. However, cost considerations aside, there are other factors that determine a project’s feasibility: Is the collection ready to be digitised? Are skilled staff available? Is the IT and other technical infrastructure geared to the task? Has a digitisation workflow been tested and established at a suitable scale?
“IT and other technical infrastructure” includes such things as cameras/scanners, conveyor belts etc., but also computing power, appropriate software, storage space and back-up options.
The human and other resources necessary for a successful project vary according to the type of specimen and the project scale. Digitisation of herbarium sheets, including at mass scale, is known to be relatively easy. For collections of dried insects (which in terms of sheer specimen numbers constitute a very large, if not the largest part of DiSSCo’s collections), methods are being developed for efficient mass digitisation of the specimens and the associated labels (
The human and other resources necessary for a successful project also vary according to the desired level of data quality, including information level (e.g. MIDS), accuracy and precision.
Many, especially smaller, institutions will have difficulties mustering the necessary resources to make a digitisation project feasible. Collaboration may ameliorate this situation. DiSSCo provides a unique opportunity, not only for sharing and learning from best practice workflows which can improve feasibility, but also for direct collaboration on digitisation. The efficiency and potential impact of the digitisation of natural history collections will be immensely higher if DiSSCo-wide agreements can be made. At the DiSSCo level, it may also be possible to apply for European funds to carry out large-scale digitisation projects. DiSSCo-wide digitisation targets could be of the following types (hypothetical examples):
Despite the complicated nature of the matter, the “academic” presentation of various types of criteria for prioritisation is relatively straightforward. In contrast, their practical implementation is anything but straightforward. All analyses show that there is no such thing as one primary criterion taking precedence over others.
Concerning the decision tree,
When the DiSSCo RI becomes fully operational, it is expected that prioritisation of digitisation will, at least in part, take place at DiSSCo level. Whereas it is beyond the scope of the present report to suggest which specimens to digitise first, the preceding sections provide a background for making optimal decisions.
When choosing what to digitise and how to do it, consider:
More specifically, consider:
To gather the information required for prioritisation, whether for evaluation or preparation of project proposals or for preparing an internal strategy, the following questions are recommended:
RELEVANCE:
COST:
QUALITY:
FEASIBILITY:
Finally, whereas prioritisation of digitisation is the subject of the present report, it is important to remember that the digital specimens that have been and will be created still need links to the physical specimens, since physical specimens will always be the ultimate (potential) validators (or 'vouchers') for digital data. Irrespective of the “digital revolution” in which DiSSCo takes part, physical collections will therefore need continued funding, including funding for skilled curators. This priority for digitisation of natural history collections is as high as any other.
We extend our thanks to all those persons who have provided information, either in the form of response to our surveys or in a more informal way. Particular thanks to Tim Robertson (ORCID: 0000-0001-6215-3617) from the GBIF secretariat and to Arthur Chapman (ORCID: 0000-0003-1700-6962) from the Australian Biodiversity Information Services for permission to use illustrations from GBIF reports.
Distributed System of Scientific Collections - Preparatory Phase Project (DiSSCo Prepare). Grant agreement ID: 871043.
Summary of relevant data from the ICEDIG project and the GBIF task report.
Analysis of previous studies, identification of relevant criteria and their development into a basic model for the prioritisation of digitisation of objects held in Natural Sciences Collections (NSCs).
The combined list of new and relevant studies found through two searches.
The EU-funded ICEDIG project – “Innovation and Consolidation for Large Scale Digitisation of Natural Heritage” - aimed to support the implementation phase of the new Research Infrastructure DiSSCo (“Distributed System of Scientific Collections”) by designing and addressing the technical, financial, policy and governance aspects necessary to operate such a large distributed initiative for natural sciences collections across Europe. The ICEDIG project ran just over two years (January 2018 to March 2020).
In
contributing to conservation (policy); underpinning importance of collections to stakeholders and public; contributing to appearance and profile of institution; contributing to solving societal challenges and issues (health, agriculture, climate); extending networking and cooperation beyond traditional domain; complying with legal rules and regulations.
SYNTHESYS (https://www.synthesys.info/about-synthesys.html) has run successfully from 2004 to 2023 and, as a core activity, has funded short transnational research visits to a considerable number of European collections. In the latest version of the project, SYNTHESYS+, a virtual access grant scheme to fund smaller digitisation projects of the collections, was included as well. Applications for transnational and virtual access in SYNTHESYS are prioritised and funded, based on a combination of scoring and panel review. Applications are submitted using a structured form and applications are evaluated and scored by a panel of experts. Importantly, prioritisation and funding are not decided on the basis of the panel scores alone, but are discussed at a panel meeting where aspects that cannot easily be assigned a numerical score can also be discussed and considered.
NB: Especially, but not exclusively, for mass digitisation, a pilot phase testing a new digitisation workflow and/or technology is advisable.
A query of GBIF on 01-09-2023 for occurrence records from the "Data network=Distributed System of Scientific Collections (DiSSCo)" and "Basis of record=Preserved specimen" returned the following summary report:
Total: 39,679,015
Licence: CC BY-NC 4.0
Year range: 1501–2023
With year: 58 %
With coordinates: 33 %
With taxon match: 98 %
This query has been saved:
These two subcategories had equal relevance.
NB: Economic relevance ranked as equally important as educational relevance.
These definitions of MIDS level differ from the more recent version of
Specimen label transcription included:
As of 2023-08-29, there were 112 tools listed, ranging from general tools (such as QGIS and R) to specific biodiversity data tools (such as a Georeferencing Calculator and GBIF's scientific name parser).