Research Ideas and Outcomes:
Project Report
Corresponding author: Helen Hardy (h.hardy@nhm.ac.uk)
Received: 27 Sep 2023 | Published: 02 Oct 2023
© 2023 Helen Hardy, Laurence Livermore, Paul Kersey, Ken Norris, Vincent Smith
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Hardy H, Livermore L, Kersey P, Norris K, Smith VS (2023) Understanding the users and uses of UK Natural History Collections. Research Ideas and Outcomes 9: e113378. https://doi.org/10.3897/rio.9.e113378
UK natural science collections hold over 137 million items, an unrivalled source of data about 4.56 billion years of planetary development and hundreds of years of biological change, including the differences made by humans — but the scientific, commercial, and societal benefits of these collections are constrained by the limits of physical access, and by highly fragmented digitisation efforts with less than 10% digitally available. Following work with Frontier Economics in 2021, which showed potential for £2 billion in benefits to the UK economy from digitising all UK natural science collections, in 2022–23 the Natural History Museum London worked, with analytical support from McKinsey and Company, to understand the impact of what has already been digitised and shared by UK natural science collections — what is the demand for these data, what are they used for, and how does this deliver efficient, effective and impactful research?
This study focuses on usage via the Global Biodiversity Information Facility, the largest source of relevant usage data, examining 7.6 million records from twelve UK institutions. While these UK collections data are just 0.3% of total GBIF occurrences, they are cited in 12% of peer reviewed publications citing GBIF data, showing the disproportionate impact of UK collections data and the historical, geographical, and taxonomic richness that they bring. Researchers have already benefited from more than £18 million of efficiency savings from digital UK specimen data. Data from natural science collections held in the UK are uniquely impactful resources, vital to a future in which people and planet thrive, and a step change in the pace of digitisation is needed to unlock their potential for researchers, policymakers, and society.
natural history collections, natural science collections, value of research, value of collections, collections users, economic benefits, digitisation, digitization, collections impact, conservation, biodiversity
The Distributed System of Scientific Collections UK (DiSSCo UK, https://www.dissco-uk.org) is a partnership of UK natural science collections, led by the Natural History Museum, London. It aims to unlock and harness the power of natural science collections data as vital infrastructure for research into the key challenges facing humanity and the planet.
Supported by the Arts & Humanities Research Council (AHRC — who are responsible for UK heritage collections as research infrastructures), DiSSCo UK surveyed institutions holding natural science collections in 2022 (
A key part of making the case to unlock the potential of UK natural science collection data is to understand the benefits in more detail. Following previous work (
Analysis was conducted between November 2022 and January 2023, combining quantitative and qualitative approaches to investigate the characteristics of uploaded data, users and uses of data, and the value created by that usage. This is a rapidly evolving area, with more data being released every week and new uses developed.
We looked for data sources that were openly accessible; in possession of consistent data; offering good metadata and/or access to data for analysis; and recognised as preeminent data sources within their domain. While we considered a range of sources, including for example institutional data portals, by far the largest-scale and most consistent source of FAIR (findable, accessible, interoperable and reusable) collections data with consistent quantitative evidence about users and usage is the Global Biodiversity Information Facility (GBIF, www.gbif.org), so GBIF data are the primary basis for this study. GBIF is a globally recognised resource, designated a global core biodata resource in December 2022 by the Global Biodata Coalition (which includes UK Research and Innovation) (
Recent research (
The key insights generated from this study are as follows:
A short, summary paper of the key insights and context around this study is also available (
We undertook both a quantitative and qualitative analysis, of which the quantitative was much more extensive. The queries used to generate the source data are provided in Jupyter notebooks (Suppl. material
To identify relevant UK collections data for this analysis, we needed to identify UK institutions that currently publish collections data. More than 90 institutions in the UK hold natural science collections; however, the majority of these do not yet publish specimen data to GBIF.
The NHM provided a list of relevant institutions who hold collections in the UK, including museums, botanic gardens, universities and specialist centres and societies. A subset of 22 institutions who have published data to GBIF were examined in more detail. GBIF holds occurrence data of two main types — observation data (such as a record of a human or sensor observation of a bird), and specimen data (i.e., data ‘vouchered’ by a link to a physical object such as the specimens in collections). This study is focused on specimen data from collections, so ten of the 22 institutions were excluded from the analysis because they publish a majority of observation data rather than specimen data.
The remaining twelve institutions considered in the analysis were as follows: they have published data at various times since 2015, with total uploads containing over 99% specimen data, and no institution on this list uploading less than 50% specimen data:
Having identified the relevant institutions, the core dataset for this study was the collections data that those twelve institutions had published to GBIF at the time of this work.
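The selection rule described above (keep institutions whose uploads are not majority observation data) can be sketched as a simple filter. This is a minimal illustration under stated assumptions, not the study's own code: the `Publisher` class, the `select_specimen_publishers` function and all record counts are hypothetical, and the 50% threshold is taken from the statement that no selected institution uploaded less than 50% specimen data.

```python
from dataclasses import dataclass


@dataclass
class Publisher:
    """Hypothetical summary of one institution's GBIF uploads."""
    name: str
    specimen_records: int
    observation_records: int

    @property
    def specimen_share(self) -> float:
        total = self.specimen_records + self.observation_records
        return self.specimen_records / total if total else 0.0


def select_specimen_publishers(publishers, min_share=0.5):
    """Keep publishers whose uploads are at least `min_share` specimen data."""
    return [p for p in publishers if p.specimen_share >= min_share]


# Hypothetical counts, for illustration only (not figures from the study).
candidates = [
    Publisher("Museum A", specimen_records=9_900, observation_records=100),
    Publisher("Recording scheme B", specimen_records=200, observation_records=9_800),
    Publisher("Herbarium C", specimen_records=600, observation_records=400),
]
selected = select_specimen_publishers(candidates)
print([p.name for p in selected])
```

In this sketch the recording scheme is excluded because its uploads are mostly observations, mirroring the exclusion of ten of the 22 institutions.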
To address our research questions about data availability and use, we needed to understand the volume of relevant data; its characteristics (e.g., specimen taxonomy and geography); and data use via downloads and citations.
To understand value and impact, we also needed to understand the topics of data use and citation, and how these translate to economic value; information about the users (e.g., their association with publications/citations and their institutional or geographic affiliation); and information about comparative costs, particularly the costs of physical visits.
GBIF data were used to determine:
1. Details of the specimen data uploaded by UK institutions to GBIF, in relation to the wider body of all the specimen data available on GBIF:
2. Usage of the data uploaded by the UK, in relation to the wider body of all the data on GBIF:
Additional sources were used to enhance our analysis (see Fig.
DiSSCo UK survey data (
Google Scholar is intended to provide a simple way to broadly search for scholarly literature. Google Scholar data were used to determine onward citations of the publications that cited UK institutions’ data uploaded to GBIF, using the Search Engine Results Page (SERP) API with query parameters described in "03_Google_scholar_data_extraction.ipynb" (Suppl. material
Altmetric aims to broaden and deepen understanding of the value of research by tracking online engagement from a range of sources. Altmetric data were used to determine the Altmetric Attention Scores for publications citing UK institutions’ data uploaded to GBIF (
In addition, the NHM supplied unpublished data from the SYNTHESYS programme (
The NHM also supplied an unpublished estimate of total spend on its core digitisation team between the start of the Digital Collections Programme and this piece of research — some £5.4 million not including the numerous other resources involved in delivering digitised collections, such as curatorial staff, data managers, data portal developers and others. This figure gave a very conservative baseline of benefit provided to researchers accessing UK collections data that are free at the point of use.
Frontier Economics’ analysis (
Case studies and qualitative insights drawn on in this study have been sourced from:
Interviews were conducted to get feedback from a small subset of end users (scientific researchers) on the impact of UK institutions’ digitised data for them (e.g., ability to conduct their research, quality of research possible, efficiencies created), and to test assumptions used in the translation to value methodology.
There was limited time and resource to arrange and conduct interviews within the scope of this study. Five interviewees were selected based on high frequency of data use; high impact of their research; and use of novel research techniques on the data. The majority of interviewees specialise in botanic material, reflecting the long-standing availability of workflows to digitise pressed plant specimens, which means that such specimens are strongly represented in the currently available UK collections dataset. While this is a limitation, we do not believe that it materially affects the points made about data value: a range of research topics and techniques were discussed, and the benefits and opportunities identified appear highly transferable to other types of research and specimens.
A repeatable data engineering process was followed:
Further details can be found in the supplementary materials (Suppl. materials
It should be noted that, owing to the volume of GBIF occurrences, raw data could not be downloaded at the occurrence level for this study. Analyses have been conducted on aggregated dataset information or extracting counts from GBIF API end points. Further considerations and constraints in relation to particular areas of our analysis are set out below.
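As an illustration of extracting counts rather than downloading occurrences, the sketch below builds a request to the public GBIF occurrence search endpoint with `limit=0`, so that the response carries a total `count` but no records. The queries actually used in the study are those in the supplementary notebooks; the helper names here are hypothetical, and the publisher key is a placeholder that must be replaced with a real publisher UUID before fetching.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def count_query_url(publishing_org_key: str, basis_of_record: str) -> str:
    """Build a search URL that asks for a count only (limit=0 returns no records)."""
    params = {
        "publishingOrg": publishing_org_key,
        "basisOfRecord": basis_of_record,
        "limit": 0,
    }
    return f"{GBIF_OCCURRENCE_SEARCH}?{urlencode(params)}"


def fetch_count(url: str) -> int:
    """Fetch the URL and read the `count` field of the JSON response."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)["count"]


# Placeholder key (an assumption for illustration): substitute a real
# publisher UUID, then call fetch_count(url) to retrieve the count.
url = count_query_url("PUBLISHER-UUID-PLACEHOLDER", "PRESERVED_SPECIMEN")
print(url)
```

Only the URL is built when the sketch runs; the network call is left to the reader so that the example stays runnable offline.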
GBIF provides data on publications citing GBIF datasets, used in this study to address our questions about how UK collections data are currently being used in research. Query parameters used for this study can be found in "01_Overall_GBIF_data_extraction.ipynb" and "00_GBIF_UK_data_extraction.ipynb" in Suppl. material
There is of course a time lag between data upload and citation in publications. Publications refers to all relevant forms of literature, including journal articles, books, conference proceedings, preprints, reports, and others. Where relevant, our findings specify whether we were looking at all publications or specifically at peer-reviewed publications. Peer-reviewed publication data were taken from the GBIF web portal by setting the “Peer-reviewed” filter to “Yes” and the “Literature type” filter to “Journal article”.
Publication citations are made at the dataset level, not to individual specimen occurrences. This means that GBIF data cannot be used to analyse citations of individual specimens, or to determine which, or how many, individual specimen records were cited. If a dataset contains both observations and specimens, publications cannot be split into those with citations to observations and those with citations to specimens; however, the aggregate data uploaded by the twelve UK institutions considered in this report are 99% specimen data, indicating that publications with citations to these UK institutions’ datasets are citing specimens.
Where publications cited multiple institutions or multiple datasets within the same institution collection, these publications have been de-duplicated as necessary to show accurate numbers of publications.
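The de-duplication step can be sketched as follows. This is an illustrative example only: the DOIs, institution names and dataset labels are hypothetical, and the study's own processing is documented in the supplementary notebooks.

```python
# Hypothetical citation rows: (publication DOI, institution, dataset).
citations = [
    ("10.1000/a", "Institution X", "dataset-1"),
    ("10.1000/a", "Institution X", "dataset-2"),  # same paper, second dataset
    ("10.1000/a", "Institution Y", "dataset-3"),  # same paper, second institution
    ("10.1000/b", "Institution Y", "dataset-3"),
]

# Per-institution counts: a paper counts once per institution it cites,
# however many of that institution's datasets it cites.
per_institution = {}
for doi, institution, _dataset in citations:
    per_institution.setdefault(institution, set()).add(doi)
counts = {inst: len(dois) for inst, dois in per_institution.items()}

# Overall count: a paper counts once however many institutions it cites.
unique_publications = {doi for doi, _inst, _dataset in citations}

print(counts)
print(len(unique_publications))
```

Four citation rows here reduce to two distinct publications, which is why per-institution counts cannot simply be summed to give the overall total.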
A publication can have multiple topic tags. Therefore, a single publication may contribute to the count of several different topics. 99% of publications on GBIF have at least one topic tag.
In addition to direct citations, we also looked at onward citations of papers that cite GBIF data, to consider their ongoing impact. Numbers of onwards citations were taken from Google Scholar on 09/01/2023 using publication DOIs taken from the GBIF web portal on the same date. 77% of publications taken from GBIF have affiliated Google Scholar data, and 74% of these have a number of onward citations greater than zero.
Publications and citations do not fully capture the use of the UK digitised dataset on GBIF: the first step in use is usually a download, and not all downloads result in publications. Download event numbers are therefore much higher than publication figures; for example, the number of download events is 254 times larger than the number of citations for Natural History Museum data. GBIF download data were therefore also analysed to consider the wider importance of UK collections data. Download data were taken from the GBIF API Endpoint on 12/12/2022.
Download events are by dataset. This is not representative of the number of occurrence records that have been downloaded, since any number of occurrence records can be downloaded in one download event. The Natural History Museum has one dataset in its collection on GBIF (all data uploaded are treated as a single set representing the NHM collection), therefore, the number of download events for this dataset is the same as the number of download events for the NHM collection as a whole. Other institutions have more than one dataset, which can result in higher numbers of download events linked to their collection.
We analysed researcher affiliation to understand the geography of collections data usage and to be able to understand and assign benefit to UK-affiliated researchers. GBIF and Crossref data were used for this.
Researcher geography data on GBIF were available for 87% of publications citing UK institutions. GBIF provides only the set of countries with which a publication’s researchers were affiliated; it does not provide the number of researchers affiliated with each country. Researcher geography data, where aggregated, have been de-duplicated such that publications with multiple researchers affiliated to countries in the same region are only counted once. For example, a publication with researchers from France and Spain will only be counted a single time in the number of publications with researchers affiliated to Europe. These data have not, however, been de-duplicated between regions. For example, a publication with researchers from France and Brazil will be counted once in the number of publications with researchers affiliated to Europe, and once for South America. This means that the sum of publications across regions is greater than the true number of publications (2,710 publications).
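The regional counting rule can be illustrated with the two examples given above. The country-to-region lookup and publication identifiers below are hypothetical, chosen only to reproduce the France/Spain and France/Brazil cases.

```python
from collections import Counter

# Hypothetical country-to-region lookup and publication affiliations.
REGION = {"France": "Europe", "Spain": "Europe", "Brazil": "South America"}
affiliations = {
    "pub-1": {"France", "Spain"},   # two countries, one region
    "pub-2": {"France", "Brazil"},  # two countries, two regions
}

region_counts = Counter()
for countries in affiliations.values():
    # Collapse countries to regions so each region counts once per paper.
    region_counts.update({REGION[c] for c in countries})

print(dict(region_counts))
print(sum(region_counts.values()), "regional counts from",
      len(affiliations), "publications")
```

The regional counts sum to three although there are only two publications, which is the over-count described in the text.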
Crossref provides the number of unique researchers affiliated with each country, for a given publication (for query details see "02_Crossref_data_extraction.ipynb" in Suppl. material
For analyses showing country of researcher affiliation, data were taken from Crossref on 12/12/2022 using publication DOIs taken from GBIF API Endpoint on the same date.
Simple arithmetic calculations were performed to arrive at key insights (for example, rates of citations per day), and estimated savings and benefits (Suppl. materials
The twelve institutions covered in this analysis have together uploaded a total of 7.6 million occurrences to GBIF. This equates to less than 6% of the 137 million specimens identified in the DiSSCo UK surveys (a greater percentage of these specimens have some form of digital record; however, these data are not published to GBIF and the majority are not available as FAIR data). Three institutions (The Natural History Museum, Royal Botanic Garden Edinburgh, and Royal Botanic Gardens, Kew) have contributed 95% of total UK specimens uploaded to date, with over five million specimens (66% of all UK specimen uploads to date) from the Natural History Museum alone.
By uploading specimens, UK institutions are increasing the diversity of GBIF. 88% of all GBIF uploads are observations (with the remaining 12% being specimens). More than 99% of the uploads from the twelve UK institutions are specimens. Occurrences uploaded to GBIF by UK institutions make up only 0.3% of all occurrences on GBIF, but 3% of all specimens uploaded to GBIF.
UK institutions upload specimens with a different and more diverse geographical makeup than GBIF’s occurrence uploads as a whole. Specimens have been collected from 7 continents, and 248 territories and countries (including regions like the Vatican City, Greenland, Antarctic regions and other small territories like Sint Maarten). UK institutions provide a significantly higher percentage of specimens from South America, Antarctica, Asia and Africa (11%, 23% and 16% respectively) than wider GBIF occurrence data (4%, 5% and 3% respectively) (Fig.
GBIF uploads by continent of origin showing that UK collections have a relatively high proportion of African, Antarctic, Asian and South American specimens compared to GBIF occurrence data overall. Note: Incertae sedis (Latin for 'of uncertain placement') is a taxonomic grouping used when a specimen’s broader relationship to another taxonomic group is unknown. See Suppl. materials
UK institutions currently upload a higher percentage of occurrences than GBIF as a whole in two main kingdoms: Plantae (24 percentage points more), and Chromista (2 percentage points more), adding to the taxonomic diversity of occurrences (Fig.
Both uploads to GBIF from institutions around the world, and the number of publications citing those data, have increased over time (Fig.
Number of publications citing GBIF data over time, and number of those citing UK collections data. Note: This chart shows the number of new publications each year, both all those available on GBIF and those with citations to UK institution data; it is not cumulative over time. See Suppl. materials
Despite a decline in the total number of publications citing GBIF data from 2020 to 2022, the number of publications citing data uploaded by UK institutions increased by 40% over this period. This may indicate that data uploaded by UK institutions provide additional value relative to the overall dataset available on GBIF.
The ratio of records uploaded to citations provides an approximate metric of “usefulness” of the data which has been uploaded to GBIF. It shows, on average, the number of additional specimens which had to be uploaded to gain each additional publication with a citation to that dataset. GBIF’s total collection has 7,411 uploads per publication, while the UK institutions have only 2,816 specimens per publication, some 2.63 times fewer. GBIF’s total collection has 17,248 records uploaded per peer reviewed journal article, while the UK institutions have only 3,943 specimens per peer reviewed journal article, over four times fewer. These proxy measures again suggest the high relative value of UK collections data.
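The comparison above is simple arithmetic and can be reproduced from the figures quoted in the text. In this sketch the function name is hypothetical; the final line is only a rough cross-check using the rounded headline figures (7.6 million records, 2,710 publications), so a small difference from the reported 2,816 is expected.

```python
def records_per_publication(records: int, publications: int) -> float:
    """Average number of uploaded records behind each citing publication."""
    return records / publications


# Ratios as reported in the text.
gbif_all, uk_all = 7_411, 2_816             # per publication of any type
gbif_articles, uk_articles = 17_248, 3_943  # per peer-reviewed journal article

print(round(gbif_all / uk_all, 2))            # 2.63
print(round(gbif_articles / uk_articles, 2))  # 4.37

# Cross-check from rounded headline figures; 7.6 million is itself
# rounded, so this will not match 2,816 exactly.
print(round(records_per_publication(7_600_000, 2_710)))  # 2804
```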
Looking at the number of citations for each of the UK institutions in this study, the order is the same as that for volume of uploads, strongly indicating that there is demand for UK specimen data such that as UK institutions upload more specimen data, they also see more publications citing those data (Fig.
Overall, the UK’s digitised specimens are in demand and are highly used by researchers around the world. There are 2,710 publications citing UK institutions’ data uploaded on GBIF, 1,932 of which are peer-reviewed journal articles. While the UK specimens examined in this study make up just 0.3% of total GBIF occurrences, they are cited in 12% of the peer reviewed publications that cite GBIF, indicating that UK specimen data punch some 40 times above their weight compared to wider occurrence records.
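The "40 times" figure follows from dividing the share of citing publications by the share of occurrence records. A minimal sketch, using the rounded percentages quoted in the text and a hypothetical function name:

```python
def relative_citation_weight(share_of_citing_publications: float,
                             share_of_records: float) -> float:
    """Ratio of citation share to record share (both in percent)."""
    return share_of_citing_publications / share_of_records


# 12% of peer-reviewed publications citing GBIF; 0.3% of GBIF occurrences.
weight = relative_citation_weight(12.0, 0.3)
print(round(weight))
```

Because both inputs are rounded percentages, the result is best read as an order-of-magnitude indicator rather than a precise multiplier.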
We also examined onward citations and Altmetrics to consider the onward impact of publications citing UK collections data; however, the time lag and continued growth of these metrics over time mean that a point-in-time analysis does not yield significant insights, particularly given that much of the growth in publication volumes themselves is comparatively recent. Of the 2,710 publications citing UK collections data, 57% had at least one onward citation that we were able to trace. 150 publications had gone on to receive over 30 onward citations, and 21 had received over 100 citations, suggesting that research citing UK collections is having onward impact that enhances and enables further research. Altmetric scores for these publications showed no significant differences to the average overall.
While publications and citations are the most reliable indicator of usage and research impact, they do not fully capture the use of the UK digitised dataset on GBIF: use typically starts with downloading data, which may or may not eventually lead to a publication. The number of download events is 254 times larger than the number of citations for the Natural History Museum (572k download events versus 2,253 citations). The number of download events again appears to track closely with the number of specimens uploaded by each institution, with the order of institutions by download events being the same as that by upload.
The number of download events for UK institutions data has been growing steadily since 2014. The Natural History Museum saw an especially high increase in the number of download events recently, from 92,000 in 2021 to 154,000 in 2022, a 66% increase and an average of one download every 3 minutes and 24 seconds.
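The download rate can be reproduced from the rounded annual totals quoted above. This is a sketch under the assumption of a 365-day year. With the rounded inputs the year-on-year growth comes out at about 67%, so the 66% reported in the text presumably rests on unrounded counts.

```python
from datetime import timedelta

# Rounded annual download events for the Natural History Museum (from the text).
downloads_2021, downloads_2022 = 92_000, 154_000

growth = downloads_2022 / downloads_2021 - 1
interval = timedelta(days=365) / downloads_2022  # average gap between downloads

minutes, seconds = divmod(int(interval.total_seconds()), 60)
print(f"growth: {growth:.1%}")
print(f"one download every {minutes} min {seconds} s")
```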
Specimen data uploaded by UK institutions contributes to publications across a variety of highly impactful and relevant research topics like ecology, conservation and climate change (Fig.
The number of publications by topic tag (broad research area) that cite UK institution GBIF datasets, showing high numbers of publications relevant to ecology, conservation, climate, evolution, and invasive species. Note: 99% of publications on GBIF have at least one topic tag. A publication can have multiple topic tags, so a single publication may contribute to the count of several different topics. See Suppl. material
1,549 publications with citations to UK institution data have received one or more onward citations by other publications. This accounts for 57% of the total 2,710 publications with citations to UK data. 150 publications have gone on to receive over 30 onwards citations and 21 publications have gone on to receive over 100 citations. This suggests that much of the research conducted using UK institution data has been used in multiple onwards citations and has likely enhanced or enabled further research.
UK institution data uploaded to GBIF is used to support research in the UK, on six continents and in 160 countries and territories around the world (Fig.
Extrapolating the percentage of UK researchers where affiliation can be determined across the total number of publications, it is estimated that 1,200 UK-affiliated researchers have been supported by UK collections data.
1,137 publications cite GBIF data from only one of the twelve UK uploading institutions; however, 1,573 publications (58%) have citations to GBIF data uploaded by more than one UK institution. This suggests that UK institution data is often used in combination for research and publications.
In estimating the value created by digitising UK institutions’ collections, we first considered the existing investment in digitisation by the Natural History Museum (it was not possible to source comparable figures for the other eleven UK institutions within the scope of this study). Second, we considered efficiencies created for researchers and institutions:
Finally, we considered the economic value of the research supported by digitised data; and broader potential benefits to society of the digitised data beyond research (e.g. for education and entertainment, as well as potential to unlock further innovation and value in as yet unknown use cases).
Institutions around the UK have invested to digitise their collections and make them free to access on GBIF. The Natural History Museum alone estimates that it has invested approximately £5.4 million in digitising the collection over the last nine years (between financial years 2014–15 and 2022–23). The majority of this is investment from core Museum funds in the central digitisation team, not including the work of many other staff that is key to enabling digitisation, including curators, data managers, and the teams managing the Natural History Museum data portal and other collection management systems. This also includes over £2 million which has been raised (e.g., through grants and philanthropy) and spent on smaller digitisation projects that explore a new technique or capture more data (such as 3D digitisation of fossil mammals collected by Darwin -
To date there has been no dedicated public investment at the national level in digitisation of UK natural science collections. The Royal Botanic Gardens, Kew, were recently awarded £10 million through DEFRA towards their herbarium digitisation programme, to be spent during 2022–23 and 2023–24 (
As set out above, researchers benefit from the investment in digitisation that creates openly available collections data — this value is not known across the twelve relevant UK institutions but is in excess of £5 million for the Natural History Museum alone.
Investment in digitisation by institutions is more cost-efficient than digitisation by individual researchers, due to best practice workflows, economies of scale, team capabilities and not needing to travel or to ‘pick and choose’ individual specimens. Moreover, the data are reusable by all scientists (and others) when institutions make them freely accessible via GBIF, rather than being individually prepared and privately held or uploaded to disparate repositories. For individual researchers to create the datasets that they have downloaded and cited from GBIF themselves would require significantly more funding, due to the scale and efficiency of institutions’ digitisation; the reusability of data; and the need in many cases to create data from multiple organisations.
In practice, some of the research citing UK collections via GBIF would not have been possible at all without these data being available — either it would not have been conceived (for instance, AI uses of collections images in
To produce a publication that concerns specimens, especially in large quantities, researchers need access to information about those specimens. The needs of each publication will vary considerably, but without digital access, physical visits are often required to collect such information. While some publications would not necessitate a physical visit, some would require multiple visits to collections in different locations — and, at the very least, most would require data collection by someone working with the relevant collection. Physical visits incur significant costs in researcher time, researcher costs (e.g., travel and subsistence), and costs in time and effort to the host institution.
We therefore assume, as a conservative approach, that each citation of the relevant UK collections datasets would require a single visit to a relevant institution if the data were not already available. Visit costs were estimated based on unpublished data from the Natural History Museum’s leadership of the EU SYNTHESYS programme (
Based on 2,710 publications each saving £6,500, this equates to savings of £17.6 million for researchers and hosts across all the publications citing UK collections. Looking just at savings to researchers (not hosts), extrapolating publications with a UK affiliated author (see Section 3.2.3), we estimate £1.4 million in savings to UK-affiliated researchers (Suppl. material
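The headline saving is the product of the two figures quoted above, one avoided visit per citing publication. A minimal sketch of that arithmetic (variable names are illustrative):

```python
publications = 2_710                 # publications citing UK collections data
saving_per_publication_gbp = 6_500   # estimated cost of one avoided visit

total_saving_gbp = publications * saving_per_publication_gbp
print(f"£{total_saving_gbp:,} (about £{total_saving_gbp / 1e6:.1f} million)")
```

This reproduces the £17.6 million figure; it does not include any download-based estimate of wider efficiency benefits.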
It is likely that looking only at citations underestimates the efficiency benefits created. In theory, anyone downloading the data would otherwise have had to visit or otherwise request its creation. At the point of this analysis, there had been 571,518 download events of NHM data. Taking an approximate researcher time/cost of £3,500 (excluding host costs) and applying Frontier Economics assumptions of benefits in the range of 5–12.5% (
Time and money saved for researchers can go towards further research, with an estimated 20–40% return on investment for society (
In total,
Other publications also contribute less directly to these themes, and/or to related areas, e.g., those on biodiversity science and ecology. Mineral exploration is an area that is not currently represented by analysis of GBIF biodiversity data. And citations of UK collections data via GBIF also cover a range of other topics as set out above (Fig.
Wider uses and benefits beyond research are outside the scope of this study, but it can easily be seen that there is scope for digital collections to play a role in education, engagement, the arts and innovation, with further benefits to the economy and society in the UK and beyond. The full possibilities and economic impact of digitised natural science collections cannot be anticipated.
Our interviews highlighted the breadth of benefits from digitised UK collections data to individual researchers, their areas of research, and wider society (see interview case studies 1 and 2). We spoke to a climate change researcher in Indonesia (who did not wish to be named), who told us that digital data directly inform the conservation and policy priorities for her work, as well as saving time and money.
Alexandre Antonelli is Director of Science at the Royal Botanic Gardens, Kew, and a biodiversity researcher. He uses GBIF data almost every week, both for his research and in answering policy and media questions about biodiversity. Much of his research would not be possible without digital data. He told us that digital integration of collections, and continued effort to include smaller but locally important collections around the world, are key to having a complete picture — and the future opportunities are extraordinary, particularly when factoring in AI tools to speed the processes of recording and extracting information, and using it e.g., for species identification.
Colin Khoury is the Senior Director of Science and Conservation at San Diego Botanic Garden, and a plant and conservation scientist who focuses on food crop diversity. He has used digitised data throughout his career, including to understand the distribution and conservation status of wild relatives of crops (e.g.,
Phillip Fenberg, Researcher at the University of Southampton and Science Associate at the Natural History Museum, uses collections data, combined with occurrence records from monitoring and other key datasets such as temperature, to ask questions such as how organisms respond to climate change. Natural science collections enable these questions to be studied over periods of many decades.
While Phillip’s original PhD research involved him visiting museum collections to gather specimen data in person (e.g., body size measurements, occurrence records), digital collections data have transformed the efficiency and scope of what is possible. The combination of digital collections images and new computer vision techniques for analysis is incredibly powerful, allowing for previous hypotheses to be tested at scale. For example, Phillip and his team used the NHM iCollections dataset of over 180,000 butterfly specimens (
Use of an innovative computer vision pipeline, ‘Mothra’ (https://github.com/machine-shop/mothra), showed that it was possible to accurately detect specimens in images, set the scale, measure wing features such as forewing length, and identify the sex. Moreover, a like-for-like comparison of forewing length measurements showed that Mothra could complete in a week (or less, if more than 10 analyses had been run in parallel on a computer cluster) work that would take a human some 3,000 hours, or around two years (assuming eight hours a day with no breaks, and only one measurement (forewing length) per specimen).
Phillip is looking forward to the expansion of digital collections image data, particularly the possibilities that will come with increased linkage between genetic and image datasets; the greater integration of AI into taxonomic work; and useful metadata such as information on what proportion of any particular collection set has been digitised.
Conservation scientist James Westrip (interviewed 6th January 2023) is a ‘superuser’: an author of some 117 papers citing UK collections data, owing to his work with the International Union for Conservation of Nature (IUCN), assessing species for the ‘Red List’ of species conservation status.
Red List assessment demands good data about species distribution over time — many species are data deficient and cannot be assessed, meaning that risks of biodiversity loss are greater than reported, and key conservation actions may be missed.
Since 2019, GBIF data have been transforming how James (and his colleagues) do their work. Geographical data are the most critical for them, ideally in the form of a fully geo-referenced latitude and longitude for specimen collection, although descriptions from labels can be sufficient. This enables species distribution and prevalence to be examined over time, based on different collecting events. Habitat data can also be helpful, and one of the benefits of collections data is their coverage of rarer species that are not otherwise often observed by humans.
These data are helping to make the Red List more comprehensive and in particular more representative of species diversity, covering for example more insects, plants and fungi as well as vertebrates which were traditionally well-represented.
And they make the work much more efficient — combining digital specimen data with mapping tools reduces the time taken for many species assessments from weeks to just a day or two. While data quality isn’t always perfect, James has processes to identify and remove outliers. Digital data have reduced the checks needed with collections staff. This work directly informs policy decisions, so the more data are available, and the more species covered, the more impactful it will be.
Returning to our research questions (the volume of data currently available from UK natural science collections, the uses of these data, and their value in terms of research efficiency, research effectiveness, and wider impact), the demand for and potential of UK natural science collections data are very clear.
We can see major usage, research impact, and benefits for research efficiency, the economy and society even from less than 6% of relevant collections’ data, which make up only 0.3% of total occurrences on GBIF. Even at this small percentage, UK collections data are contributing to the historic, geographic and taxonomic diversity of GBIF; being cited at a rate 40 times that of other GBIF occurrence data, in thousands of publications (2.2 per day on average for the Natural History Museum alone), across topics that reflect the key challenges facing humanity and the planet; and yielding more than £18 million of efficiency savings for researchers.
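The "40 times" citation rate can be reproduced from two figures reported in this study: UK collections data make up 0.3% of GBIF occurrences, and (as stated in the abstract) they are cited in 12% of peer reviewed publications citing GBIF data. The sketch below is one plausible reading of how the ratio arises, not the notebooks' actual calculation; the function name is hypothetical.

```python
def relative_citation_rate(citation_share_pct, occurrence_share_pct):
    """Ratio of a dataset's share of citing publications to its share of occurrences."""
    if occurrence_share_pct <= 0:
        raise ValueError("occurrence share must be positive")
    return citation_share_pct / occurrence_share_pct


# 12% (share of citing publications) and 0.3% (share of GBIF occurrences)
# are the figures reported in the study.
rate = relative_citation_rate(12.0, 0.3)
print(f"Cited at about {rate:.0f} times their share of occurrences")
```

A ratio of 1 would mean the data were cited exactly in proportion to their volume, so a value near 40 indicates heavily disproportionate use.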
It is estimated that $44 trillion of economic value generation (or over 50% of the world’s GDP) is moderately or highly dependent on nature, with biodiversity loss and ecosystem collapse among the key challenges that the planet faces (
Understanding what is in collections now, in the UK and globally, is also key to understanding what is needed as we collect for the future, to underpin policy and investment decisions in future centuries (
Digitisation also brings wider benefits than those examined in this study. Because UK institutions are custodians of collections from around the globe, digitisation of natural science collections held in the UK supports the involvement of communities of origin and the enrichment of collections through the knowledge and experience of these communities and of experts from the global network, including the opportunity to understand and address biases and the legacies of colonialism. The broader significance of these collections for education, the arts and humanities, and of course leisure and wellbeing can also only be enhanced by the availability of digital collections data for discovery and access.
While this study has yielded useful insights into the current use and impact of UK collections data, it does face limitations. GBIF data do not reflect the full breadth of UK collections in the geo-sciences, underestimating benefits in this important area. It has not yet been possible to consistently track the use of particular specimens (or groups of specimens), or to reach granular insights about the usefulness of particular data fields, although we know from qualitative discussions that, for example, geo-referencing is frequently of high value. Insights therefore cannot yet be used to inform detailed prioritisation of digitisation activities.
Stakeholders also expressed interest in better and wider metadata, for example to understand the percentage of a particular collection type that has been digitised (thus understanding not only the data available but the data gap, and what might become available in future). Usage of collections outside those published on GBIF is much harder to quantify, with high variety in availability, quantity and measurement — this also applies to physical use of specimens, where visits are recorded differently by different institutions or sections, and it can be hard to trace citations of physical material.
The Global Registry of Scientific Collections (https://www.gbif.org/grscicoll) and the nascent Latimer Core data standard (https://github.com/tdwg/ltc) for collections descriptions are promising community developments that aim to support the representation and discovery of natural science collections prior to full digitisation. They do this by structuring data about higher level groups of objects within those collections, allowing a line of sight for discovery and use that can start at the level of a whole collection and link through to subgroups and individual items when data about these become available (
It is also a welcome development that, while the SYNTHESYS access programmes for natural science collections across Europe are now complete, DiSSCo EU (https://www.dissco.eu/dissco/timeline/) and the UK AHRC Research infrastructure for conservation and heritage science (RICHeS) programme continue to develop avenues that can associate the digital discovery of specimens with targeted routes to physical access and enhanced analyses, associated with access to relevant facilities and labs (
UK Natural Science collections are joining forces through the Distributed System of Scientific Collections UK (DiSSCo UK) to set the vision and make the business case for investment in these collections as a distributed research infrastructure (
The UK has set itself the ambition to be a science and technology superpower (
Contribution types are drawn from the CRediT Contributor Roles Taxonomy.
The Natural History Museum would like to thank McKinsey and Company for analytical support. The Museum is responsible for the conclusions and recommendations of the research.
Thank you to our Natural History Museum colleagues who supported this work in a variety of ways including attending steering meetings, and signposting domain-specific data resources: Lisa French (https://orcid.org/0000-0001-7279-8582), Al Turner (https://orcid.org/0009-0009-4911-1696), Sarah Vincent (https://orcid.org/0000-0002-4012-0571), and Tara Wainwright (https://orcid.org/0000-0001-5816-2529).
We would like to thank our interviewees for their enthusiasm and insights into using digital collection data: Alexandre Antonelli (https://orcid.org/0000-0003-1842-9297), Phillip Fenberg (https://scholar.google.co.uk/citations?hl=en&user=-CSvSUQAAAAJ&view_op=list_works), Colin Khoury (https://orcid.org/0000-0001-7893-5744), and James Westrip.
We would like to thank our colleagues at the Global Biodiversity Information Facility for their support with extracting and interpreting some of the relevant data.
The Natural History Museum, London
Thirteen Jupyter notebooks (*.ipynb files) for the quantitative data extraction, cleaning, analysis and visualisation.
The code used is not production ready and should not be expected to follow best software engineering principles like modularity and unit testing. The code is meant only for re-running the analysis with underlying data.
Document describing the data extraction framework, data landscape for analysis, data pipeline architecture, and infrastructure setup (compute & storage, Conda environment, Jupyter and Kernel, libraries used).
Document with questions used for qualitative insights on researchers’ experience working with digitised specimens.
"We’re a not-for-profit membership organization that exists to make scholarly communications better. We rally the community; tag and share metadata; run an open infrastructure; play with technology; and make tools and services—all to help put research in context." -
Country of specimen origin data can be established for 66% of specimens uploaded to GBIF by the 12 UK institutions considered, while continent of origin data can be established for 95% of all occurrence uploads to GBIF.
As of 2023-08-09 the funding call for "Host facilities as part of our heritage science infrastructure" had a total fund of £15,700,000 and a maximum award of £1,000,000. The fund scope was to "[...] enable you to purchase or build equipment and upgrade facilities that complement your existing research strengths, and ongoing funding to recruit and retain staff to enable access to your research facilities and collections." and to "[enable] access to heritage science facilities, collections and expertise for a wide range of users, to catalyse new collaborative research projects and amplify the impact of heritage science research."