Research Ideas and Outcomes: Project Report
Corresponding author: Donald Hobern (dhobern@gbif.org)
Received: 08 Dec 2022 | Published: 12 Dec 2022
© 2022 Donald Hobern, Laurence Livermore, Sarah Vincent, Tim Robertson, Joseph Miller, Quentin Groom, Marie Grosjean
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Hobern D, Livermore L, Vincent S, Robertson T, Miller JT, Groom Q, Grosjean M (2022) Towards a Roadmap for Advancing the Catalogue of the World’s Natural History Collections. Research Ideas and Outcomes 8: e98593. https://doi.org/10.3897/rio.8.e98593
Natural history collections are the foundations upon which all knowledge of natural history is constructed. Biological specimens are the best documentation of variation within each species, increasingly serve as curated sources for reference DNA, and are frequently our only evidence for historical species distribution. Collections represent an enormous multigenerational investment in research infrastructure for the biological sciences, but despite this importance most of the holdings of these institutions remain invisible on the Internet, inaccessible to taxonomists from other countries and hidden from computational biodiversity research.
Although comprehensive digitisation of the complete holdings of each natural history collection is the long-term goal, this is an expensive and labour-intensive task and will not be completed in the near future for all collections. However, many benefits could quickly be achieved by publishing high-quality metadata on each collection to increase its visibility, provide the foundations for further digitisation and enable researchers to discover and communicate with collections of interest.
This paper summarises the results of a consultation carried out in 2020 as part of the SYNTHESYS+ (Synthesis of Systematic Resources) “Developing implementation roadmaps for priority infrastructure areas as part of cooperative RI for biodiversity” project. The consultation was primed through an ideas paper and introductory webinars, and was conducted as a facilitated two-week online multilingual discussion around 26 topics grouped under four broad headings (Users, Content, Technology and Governance). The results of these discussions are summarised here, along with the wider context of existing and planned initiatives.
biodiversity, natural history collections, data standards, data linking, taxonomy
The creation of a catalogue of the world's natural history collections is central to the shared vision and goals of a large number of institutions, projects and other stakeholders and initiatives within the natural history and wider science collections landscape. However, the number and diversity of interested parties bring with them key challenges around unification of approach, interoperability of already developed and widely used systems, and the differing requirements of such a wide range of user groups.
Even in the absence of data on the specimens held in these collections, information about each collection contributes to a map of the resources supporting taxonomy and biodiversity research and assists researchers in locating and contacting the holders of specimens. Collection records contribute to the development of a fully interlinked biodiversity knowledge graph, showcasing the existence and importance of museums and herbaria and supplying context to available data on specimens (
There is currently no definitive figure for the number of specimens held by collections globally, but estimates range between 1.2 and 2.1 billion (
The basis for this report is the ideas paper by
Category | Topic |
---|---|
Use | |
Use | |
Use | |
Use | |
Use | 1.5. Increased value for data on specimens, taxonomic publications, etc. |
Use | |
Use | |
Use | 1.8. Improvements to citation and visibility for collections |
Use | 1.9. Support for national and regional needs and applications |
Information | |
Information | |
Information | |
Information | |
Information | |
Information | |
Technology | |
Technology | |
Technology | |
Technology | |
Technology | |
Governance | |
Governance | |
Governance | |
Governance | |
Governance | |
Governance | |
Language | Adelantando el Catálogo de Colecciones de Historia Natural del Mundo |
Language | Progressant le Catalogue des Collections d’Histoire Naturelle du Monde |
Language | |
Process | |
This paper uses the outcomes of this consultation to identify common themes, priorities, areas of consensus, and areas of dispute. These will be used to propose a vision for how a global collections catalogue may be developed, covering use cases, information, maintenance, resourcing and sustainability.
The ideas paper outlines a range of potential use cases based on those collected by the TDWG Collection Description Interest Group*
Four broad headings are described by the ideas paper:
Uses for the catalogue
Information in the catalogue
Technology for the catalogue
Governance of the catalogue
The sections in this landscape overview are based on the contributed materials for the community consultation supplemented with additional research to give an overview of the key platforms and databases, collections management systems, data standards and other community activity. The aim has been to provide background information for readers rather than to cover the current landscape comprehensively.
A number of existing catalogues for institution, collection and specimen-level information are already in use or in development, led by several community initiatives and projects. There are other broader sources of information that could be integrated or used in a future platform. To prevent record duplication and minimise the level of resource required to create collection catalogue records, the scope, controlled vocabularies and preferred identification schema of the most relevant systems should be investigated and incorporated during development of the collection catalogue data architecture.
Atlas of Living Australia Natural History Collections - the ALA Natural History Collections page (formerly known as the “Collectory”) is an example of a national information resource on natural history collections. ALA has a high calibre informatics and software development team and receives strong institutional support and engagement on the national level. ALA collection records do not currently use a standard vocabulary and the repository is struggling to de-duplicate collection-level records contributed for different views of the same collection (
CETAF Collections Registry/CETAF passports - the Consortium of European Taxonomic Facilities (CETAF) provides a central source for information about its 63 European member organisations. ‘CETAF passports’ are contributed as a condition of membership and include high-level categorisation of collections including non-mandatory collection size metrics. CETAF is currently building on the functionality of CETAF passports with the development of the CETAF Collections Registry and has proposed assigning unique institutional acronyms to each member, which may cause some overlap/conflict with existing identifiers (
The Global Registry of Scientific Collections (GRSciColl, including GRBio) was initially developed as a global ‘clearing house’ of information for institutions and collections before being incorporated by GBIF in 2019. GRBio held information on biodiversity collections and was a subset of GRSciColl which is open to all categories of scientific collection. Although its content is currently incomplete, GRSciColl is considered a viable framework for expansion and is currently in a new phase of development. So far synchronisation has been established with Index Herbariorum (see below) and content from the iDigBio collection database has been integrated, with GRSciColl now powering the iDigBio collection portal. GBIF are now actively developing the codebase, where a role-based authentication model enables wider contributions. During the consultation, key priorities for 2021 were identified and these have been implemented. The draft GBIF work programme for 2023 includes a goal to “enrich GRSciColl through the integration of collection description information, compatible with the Latimer core, to support use cases such as priorities for data mobilization” (
iDigBio web portal - iDigBio is the United States national resource for digitised information about natural history collections. The iDigBio specimen portal makes available millions of records from neontological and paleontological specimens curated at museums and other institutions in the US. The data held in their repository follows the Darwin Core and Audubon Core data standards and iDigBio has contributed upwards of 1.5k collection-level records to GRSciColl to date (
Index Herbariorum – This is the most successful and established collections catalogue and covers the world’s botanical collections. Indeed, its use is recommended in the International Code of Nomenclature for algae, fungi, and plants (
Wikidata is already recognised as an identifier broker with potential to advance biodiversity knowledge graph development (
The Global Research Identifier Database (GRID) and the Research Organisation Registry (ROR) are existing databases of globally unique persistent identifiers and associated metadata for education and research-related organisations across all disciplines. Each service holds data on more than 100,000 organisations, and their identifiers are interoperable. GRID is a commercial product managed and owned by Digital Science. GRID provided the seed data for ROR, which is a community-led initiative.
These databases could potentially be used as a starting point for institutional identifiers.
Collections Management Systems (CMS) are databasing tools that are used to organise, control and manage information on behalf of natural history collections. They support many tasks that are important to operation within each collection, including inventory management, creation and publication of descriptive specimen and collection metadata, risk management, collection conservation and assessment, exhibition management, loans and research requests, and as stores of legal information regarding the acquisition and use of collections.
Collections management systems are likely to be one of the fundamental sources of natural history collections data, but there are several challenges in using them to contribute and maintain entries in a catalogue of collections. Many different systems are in use. A survey of European collections conducted by DiSSCo (
There are no studies evaluating these various systems as a source of standardised collections metadata. CMS interoperability has been studied at a limited scale with a focus on specimen/observation data.
The standards summarised in Table
Standard | Description | More Information |
---|---|---|
Darwin Core (DwC) | Darwin Core is the most widely used standard for sharing data on natural history specimens and biodiversity observations. It builds on existing metadata standards (like Dublin Core) and is supported by the majority of specimen-level data repositories and community tools/platforms. | |
ABCD | The Access to Biological Collections Data (ABCD) Schema is an alternative standard for specimen data. ABCD is a comprehensive, complex, structured standard for biodiversity data. | |
ABCDEFG | ABCDEFG (Access to Biological Collection Databases Extended for Geosciences) is an extension to ABCD developed to support palaeontological, mineralogical and geological digitized collection data. | |
TDWG Attribution project | A collaboration between TDWG and the Research Data Alliance to enhance existing and create new standards for giving attribution for the maintenance, curation, and digitization of physical and digital objects with a special emphasis on biodiversity collections. | |
Audubon Core | Audubon Core (AC) is a set of vocabularies designed to represent metadata for biodiversity multimedia resources and collections of such resources. The vocabularies address such concerns as management of media, descriptions of content, taxonomic, geographic, and temporal coverage, and appropriate ways to retrieve, attribute and reproduce them. | |
Natural Collections Descriptions (NCDs) | The NCD standard arose from an earlier TDWG attempt to define a collection-level data standard. NCDs are actively used by several platforms outlined in 2.1., but subsequent development efforts stalled and as a result this standard has not been more widely taken up. The TDWG CD model (see below) is acknowledged as the natural successor/continuation of the NCD standard. | |
TDWG Collection Descriptions (Latimer Core) | Building on earlier work in the NCD standard, the TDWG Latimer Core collection descriptions data standard will define a set of classes and properties that can be used to represent groups of collection objects and their associated information. These incorporate common characteristics used to describe, group and break down collections, metrics for quantifying those collections, and properties such as persistent identifiers for tracking collections and managing their digital counterparts. Coupled with flexible underlying data models, the CD standard is intended to support use cases from simple, high-level collections summaries to detailed quantitative collection breakdowns and assessments. | |
GRSciColl
The GBIF Secretariat coordinates updates to GRSciColl. Edits are performed by data managers from the GBIF Secretariat and iDigBio. Other changes are imported from Index Herbariorum or through contributions by staff from institutions and national nodes within the GBIF network. Any user can suggest changes for inclusion following review by the GBIF Secretariat and community. Data from some national surveys have been uploaded directly into GRSciColl.
This section presents the community’s priorities for a collection-level catalogue as a summary of notable areas of consensus and concerns that emerged during the consultation process. We have followed the four high level categories (Use, Information, Technology, Governance) and 25 subcategories used in the community consultation. In a few instances, we have referenced comments from other subcategories if these more naturally relate to the topic under discussion. Where applicable, we have provided links "(ref)" to the original discussions in the GBIF Community Forum which are also archived in Suppl. material
The Use category included nine topics.
By establishing natural history collections as a global scientific infrastructure we make it easier to foster new collaborations, resource research, fund opportunities and support sustainable data infrastructure. By standardising our institutional acronyms and the collections held within them, we improve collection discoverability and citability, making it easier to demonstrate impact and importance (ref). We can make use of existing persistent identifiers (PIDs) in GRID or ROR, so we are not establishing a set of new PIDs and benefit from integration and re-use (
There are many collections that are mostly invisible due to the predominantly specimen-based approach to digitisation. Specimen-level digitisation is often costly. Publication of collection-level data should be recognised as an important and cost-effective starting-point (ref). We recognise that understanding and serving the needs of different users will be important and that keeping the collections data up-to-date will be a challenge (ref).
A catalogue that provides summary information on the holdings of each collection would be a highly useful resource, if the summary information was relevant, reliable and could be kept up-to-date.
Previous initiatives relating to the creation and aggregation of collection-level catalogue records have increased use of and interest in items in the collections (ref). Summary collection information acts as a ‘signpost’ for end-users to help them narrow down which of the world's collections may hold items of interest and facilitates further investigation and communication with collection managers. Collection-level records would also help to document key networks and linkages between specimen data and existing related data platforms such as the International Nucleotide Sequence Database Collaboration (INSDC) databases (ref). This can be expected to increase the points of discovery and entry for underserved or non-traditional users.
The minimum level of information required for collection records to be a useful resource is likely to vary across disciplines, user groups and geopolitical contexts (ref). There is a general consensus that details on the institution holding the collection, taxonomic scope, and metrics on the size of the collection should all be mandatory fields (ref). These could be augmented with optional fields that allow additional data to be shared where available (ref).
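The split between mandatory and optional fields described above can be illustrated with a minimal sketch. The class name, field names and validation rules below are hypothetical and chosen only for illustration; they are not taken from any existing standard or from the consultation itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CollectionRecord:
    """Hypothetical minimal collection-level record (illustrative field names)."""
    holding_institution: str            # mandatory: institution holding the collection
    taxonomic_scope: List[str]          # mandatory: e.g. ["Lepidoptera"]
    size_estimate: Optional[int]        # mandatory: approximate number of objects
    optional: Dict[str, str] = field(default_factory=dict)  # extra details, if available


def missing_mandatory_fields(record: CollectionRecord) -> List[str]:
    """Return the names of mandatory fields that are empty or absent."""
    missing = []
    if not record.holding_institution.strip():
        missing.append("holding_institution")
    if not record.taxonomic_scope:
        missing.append("taxonomic_scope")
    if record.size_estimate is None or record.size_estimate < 0:
        missing.append("size_estimate")
    return missing


complete = CollectionRecord("Example Museum", ["Lepidoptera"], 120000,
                            {"preservation": "pinned"})
incomplete = CollectionRecord("", [], None)
print(missing_mandatory_fields(complete))
print(missing_mandatory_fields(incomplete))
```

Keeping the optional details in a free-form mapping is only one possible design; a real schema would more likely draw its optional terms from a controlled vocabulary.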
We need to provide guidance and support to the community, particularly to collections staff. This includes the need for good tools and tutorials for curating, updating and disambiguating collections records (ref). The community will need region-specific roadmaps and strategies as levels of support and motivations vary (ref). The current emphasis on publishing specimen records limits data sharing by less well-resourced collections, which are effectively excluded. The GBIF dataset classes (Resource, Checklist, Occurrence, Sampling-event)*
Publishing a metadata-only dataset (Resource) could be sufficient to advertise a collection and information about its holdings. The collection would become Findable, even if not digitally Accessible, Interoperable or Reusable. The Integrated Publishing Toolkit supports metadata-only datasets.
If a collection is then in a position to add a checklist dataset summarising species held - this was quite a common category of web page 15 years ago - the collection could be listed in simple ways on GBIF species pages, again further raising its profile for wider access and use. This adds some Interoperability. Databasing as DwC specimen data then takes things forward and allows for full "FAIRness" (
Estimates of collection size are already widely held and used by collection-holding institutions, but these metrics are decentralised and typically provide little information on the assessment methodology used.
High-level estimates of collection size would be useful to external stakeholders such as government agencies (ref). Collection size estimates can be used to represent the ‘value’ of collections on the national and global scale and would be invaluable in helping the community to ‘build funding cases, show current (often national) capacity, and highlight gaps’ (ref) (
To be useful, such estimates would need to be either developed under a shared methodology (e.g., the One World Collection project) (ref), or contain sufficient methodological information to allow users to assess the applicability of the record for comparing collections or aggregating metrics (ref). The former approach would make the catalogue easier to use, but the latter would facilitate data collection and re-use of existing information.
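The second approach, in which each estimate carries information about its methodology, could be handled along the following lines. This is a sketch under stated assumptions: the record structure, the methodology labels and all figures are invented for illustration, and estimates are only summed within a single methodology so that incomparable numbers are never combined.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class SizeEstimate(NamedTuple):
    collection: str    # hypothetical collection name
    objects: int       # estimated number of objects
    methodology: str   # label for how the estimate was produced; "unknown" if undocumented


def totals_by_methodology(estimates: Iterable[SizeEstimate]) -> Dict[str, int]:
    """Sum estimates separately for each methodology label."""
    totals: Dict[str, int] = defaultdict(int)
    for estimate in estimates:
        totals[estimate.methodology] += estimate.objects
    return dict(totals)


# Invented example data
estimates = [
    SizeEstimate("Herbarium A", 1000000, "shared-protocol"),
    SizeEstimate("Entomology B", 500000, "shared-protocol"),
    SizeEstimate("Fossils C", 40000, "unknown"),
]
print(totals_by_methodology(estimates))
```

A user comparing collections could then decide whether to trust totals for a given methodology, rather than being handed a single aggregate of unknown reliability.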
Standardised methodologies for valuing collections based on scale and scope are already in active use (ref), but there is risk attached to following a single model in this respect: the value of collections will ultimately depend on the requirements of those seeking to use them (ref).
We recognise that better linkage of collections metadata, including information on the main collectors who helped to build the collection, with other external identifiers and authorities like ORCID, Wikidata, and VIAF will improve discoverability both inside and outside our community (ref). If we are able to combine collector information with understanding of the taxa present in collections (e.g. at a checklist level as opposed to a specimen/occurrence level) we would have a better understanding of what makes a collection unique (ref). Detailed information on specimen preservation methods is important for collections users (ref).
A large amount of information about collections is already available on institutional websites, but effort is required to pull this together and maintain it over time. It would be helpful to provide a template or other pro-forma data collection mechanism to let collection managers update summary data quickly and easily (ref). Some institutions already record curatorial assessments for their collections; it would be beneficial to support these assessments, along with all supporting information, as part of a world collections catalogue (ref).
Providing reusable collections data and standardised institution and collection names would reduce the overhead on other specialised collection catalogues such as the Global Genome Biodiversity Network (GGBN) which currently maintains its own general collections registry and could instead focus more time and community effort on collections biobank metadata (ref).
There are recent discipline-based examples of assessing the state of collections (
A collections catalogue would make collections more findable and accessible (in the sense of the FAIR Data Principles,
Cooperation with other initiatives like the International Nucleotide Sequence Database Collaboration (INSDC) is crucial to allow linking sequences that lack references to collections to corresponding voucher specimens and samples. Building tools to help researchers submit better metadata is important (ref).
When considering new and enriched services, we should also be mindful of focus, delivery and utility. While new downstream use is important, we should consider focusing narrowly on what queries the catalogue can support best in the short to medium term and that correspond to a sufficiently important audience (e.g. large, high impact, well-resourced, etc.) (ref). We can look at other adjacent sectors for analogous data infrastructures and what makes their core services successful (
Research value is primarily measured in terms of visibility and impacts from published literature. To be recognised by such measures, the citation and attribution of natural history collections needs to be agreed and standardised across the community and made visible and useful to stakeholder groups such as publishers, funding bodies and data aggregators (
Understanding the community’s existing practices and data quality issues in this area is key to successfully developing the collection catalogue so that citation of collection-level records is sustainable, measurable and more fit-for-purpose than current practices (ref). Outcomes from this analysis, such as comprehensive lookup tables of identifiers used for particular collections or institutions (even when these are not unique within the broader collections community) (ref), could improve discoverability of collections from an end-user perspective, feed into current initiatives to unlock the historical scholarly record (ref) and aid in the discovery and embedding of linkages between related outputs (ref).
Previous initiatives around standardising citation and attribution have stalled due to lack of uptake (ref); a critical mass of adopters is required before stakeholders outside of the core community (e.g., publishers and aggregators/content banks) will change their working practices to incorporate a particular standard. Additional barriers to user uptake include a lack of guidance around attribution practices both for collection users and for collection-holding institutions and uncertainty around proper citing procedure for collection data from aggregators and other secondary sources (ref). It may be difficult to get authors to consistently use a standard abbreviation. It might be easier to simply link multiple abbreviations to a single, stable PID (ref).
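The idea of linking multiple abbreviations to a single, stable PID can be sketched as a simple lookup table. The abbreviations and `example.org` identifiers below are invented, and the sketch assumes that each abbreviation maps to exactly one collection; as noted above, real abbreviations are not always unique, so a production lookup would need to return several candidates or use additional context.

```python
from typing import Dict, Optional

# Hypothetical lookup table: several abbreviations pointing at the same PID.
ABBREVIATION_TO_PID: Dict[str, str] = {
    "EXM": "https://example.org/collection/0001",
    "EXMUS": "https://example.org/collection/0001",
    "Ex. Mus.": "https://example.org/collection/0001",
    "OTH": "https://example.org/collection/0002",
}


def normalise(abbreviation: str) -> str:
    """Upper-case the abbreviation and drop punctuation and spaces."""
    return "".join(ch for ch in abbreviation.upper() if ch.isalnum())


# Index keyed by normalised abbreviation, built once at import time.
_INDEX: Dict[str, str] = {normalise(k): v for k, v in ABBREVIATION_TO_PID.items()}


def resolve(abbreviation: str) -> Optional[str]:
    """Return the PID for an abbreviation as cited by an author, or None if unknown."""
    return _INDEX.get(normalise(abbreviation))


print(resolve("exm"))
print(resolve("Ex.Mus"))
print(resolve("unknown code"))
```

Under this design authors remain free to cite whichever abbreviation they are used to, while downstream systems work only with the resolved identifier.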
Engagement may be encouraged via links with other data repositories, especially those with established infrastructure and dataflows related to the identification and resolution of research citations. ROR, GBIF and Wikipedia, for example, already integrate with DataCite and Crossref (ref), both of which provide impact metrics that would incentivize both contribution to the catalogue and adherence to related standard citation practices (ref).
One of the biggest issues we face is demonstrating the role and value of collections (value is covered in more detail in section 3.1.4). This is often a national challenge because funding primarily operates at this scale, but on occasion it becomes a continental or global challenge (ref). A more integrated model of the natural world, founded on observations and collections, would provide evidence of where we are deficient in data and help to identify which organisations might coordinate to fill these gaps at a national or regional level (
Uniqueness of collections can help focus prioritisation for digitisation (and other activities) at a national and regional level. It can act as a starting point for understanding how to effectively collaborate and pool resources (ref).
In other sections it was noted that some countries have minimal online catalogues, or resources shared in languages that may make them less internationally discoverable (ref). National legislation can play an important role in motivating data sharing and coordinating national activities (e.g. the Registro nacional de Colecciones in Colombia) (ref). This may be an example other countries or research councils could adopt.
A collection catalogue should mandate a minimum set of standard fields, such as taxonomy, holding institution and collection scale metrics, which could be augmented with additional fields where available.
Strong guidance and support materials must be available to the community to support the catalogue.
There is a need for ongoing methodological standardization while maintaining flexibility for institutions and for national, regional and taxonomic networks.
Collection records should maintain linkages with other external identifiers and authorities.
The collections catalogue should be built in such a way that it can also support use as a national resource.
The Information category included six topics.
The definition of a natural history collection is broad and reflects the goals and uses of the collection as well as its contents. At its core, a natural history collection represents evidence of biological and geological diversity on Earth, but collections may include related objects such as extraterrestrial geological specimens and anthropological artifacts. Living collections, whether in an active or dormant state, can be included. Furthermore, the collection objects themselves are not necessarily items of biological and geological diversity but may include associated materials such as field notebooks, photographs and ethnobotanical objects. Collections can be eclectic or have a specific focus and raison d'être, such as a xylarium.
There is also a wide range of different usage-based goals for a collection. Some are used purely for taxonomic research, but others focus on education, history, material science, etc. Collections are not even all managed under the same set of regulations, management practices and ethical considerations. Living collections, human remains and objects of cultural significance have specific requirements that determine how the collection operates. One cannot even state that each collection must be controlled and hosted by a single institution, since we need to be able to refer to collections that no longer exist in this way, having been destroyed or divided up.
In some cases collections are defined as the complete set of materials held at the level of a single institution, as is true for most herbaria listed in Index Herbariorum. However, in many other collections the material forms several separately identified collections, divided perhaps by curatorial practices, by taxonomy or by the collection's origins.
Collections often map to organisational structure and to curatorial approaches rather than adhering to consistent definitions. This conflation of institutional structure with institutional collection(s) is too frequent to have occurred by chance; it seems reasonable to assume that operational concerns and priorities (e.g., naming/defining a collection to reflect acquisition or provenance events) play a key role in shaping the community notion of a ‘collection’.
Ultimately, the easiest way to define a collection in the Catalogue may be purely in terms of usage: collections are the entities to which we need to refer when organising information about the materials they hold. If we need to be able to refer to these collections in a reliable way, they each need an entry within the Catalogue. Consideration must also be given to the advantages that the collections and the collections-holding institutions may gain from being listed.
Differentiating “natural history” collections from associated collections is important, but we need the ability to reference and link to holdings that are regularly treated as adjunct collections (archives, field notebooks, registers, photographic collections) and born-digital collections (e.g. sound records, camera trap images). These may be considered and identified as discrete collections in their own right (see section 3.2.5).
The broad consensus was that the scope for the Catalogue should be broad and inclusive, including all collections that are useful for natural science, natural history or natural heritage. This includes xylaria as well as ethnobotanical, paleontological and anthropological collections. Some of these collections will have sensitivity and legal restrictions that need to be managed when sharing their descriptions.
Multiple collection identification schemes exist and are in active use. Collections are often identified in parallel in multiple schemes, a situation which reflects the flexible definition of a collection as discussed earlier (ref). A number of identifier schemes are provided by or derive from data platforms and services: GRSciColl, ROR, ALA Natural History Collections and the GBIF Registry (ref). Identifiers for an organisation or unit within an organisation have also been widely adopted as a shorthand to refer to the collections they hold, even if the original organisational entity no longer exists in an operational sense (ref).
It may be the case that only these more traditional collection identifiers (e.g. the identifier for a specific herbarium) need to be human-readable because of their historical use in previous and current registries (ref). We need to avoid conflating the purpose of and requirements for human and machine-readable identifiers: machine-readable identifiers need to be globally unique, persistent and resolvable. They should provide unambiguous identification of a collection — even if the contents or environment of the collection changes over time — and facilitate wider data linkages. Human-readable identifiers need to be succinct, descriptive, memorable and, if not unique and persistent, contextually flagged clearly enough to enable software systems to distinguish and accommodate this (ref).
One approach to prioritising existing identification schemes within the Catalogue would be to select those that most closely map to a discrete class of collections within the Catalogue (ref). It would also be prudent to prioritise identification schemes on their technical capacity, accessibility, underlying infrastructure and accompanying data services.
Usage of preferred identifiers could be promoted by the development of resources and activities focused on community engagement and increasing the wider awareness of the benefits and availability of the selected schemes (ref).
An important consideration is whether the Catalogue should represent the complex historical and contemporary relationships between collections and subcollections that may be important to different communities. The alternative would be a simple flat catalogue which treats all included entities as equivalent.
Hierarchical relationship structures would be useful for collections that have changed ownership or location over the course of their lifespan. For example, a subcollection record could be linked to a ‘parent’ collection record to reflect provenance and facilitate discovery (ref). Flexible parent-child relationships of this kind could go beyond fixed hierarchies and also represent alternative classifications of subcollections. Hierarchies are less suitable for use in scenarios where a single collection object falls under the scope of several different collections (ref). Such nested scenarios are common and, unless carefully handled, could lead to double-counting and inflation of collection size metrics.
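The double-counting risk can be illustrated with a minimal Python sketch. The subcollection names and specimen identifiers are hypothetical, and the sketch assumes that object-level identifiers are available for each subcollection, which will often not be the case for undigitised holdings.

```python
# Hypothetical subcollections of one institution; each maps to the set of
# specimen identifiers it covers. The same object may appear in several.
subcollections = {
    "type-specimens": {"s1", "s2", "s3"},
    "lepidoptera": {"s2", "s3", "s4", "s5"},
    "smith-bequest": {"s5", "s6"},
}


def naive_total(groups):
    """Sum the sizes of all subcollections (shared objects counted repeatedly)."""
    return sum(len(members) for members in groups.values())


def distinct_total(groups):
    """Count each object once, however many subcollections include it."""
    return len(set().union(*groups.values()))


print(naive_total(subcollections))     # 9
print(distinct_total(subcollections))  # 6
```

Where only summary counts are available, the same inflation occurs but cannot be corrected after the fact, so overlapping subcollections would need to be flagged explicitly in their records.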
A system that is not fully hierarchical could be a workable model capable of handling most use cases so long as a few primary classes of entity (e.g. institution, collection, subcollection, dataset) are properly defined, standardised and incorporated during its design (ref). Standardised relationships between instances of these few classes could maintain the simplicity of the Catalogue while allowing most situations to be represented appropriately through judicious mapping of real-world entities against the classes. The development of well-defined classes would also enable aggregators and other platforms to validate the integrity of the Catalogue, reason over relationships and logically constrain the operations that can be applied to different classes of catalogue record (ref). The nature and scope of each class of collection record needs to be communicable to end-users to allow for different search strategies based on their data requirements (ref).
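As a rough illustration of how a small set of well-defined classes could support integrity checks, the following Python sketch validates parent links against a table of permitted class pairs. The rule table, field names and record keys are assumptions made for this example. They are not part of the TDWG CD standard or of any existing catalogue.

```python
# Assumed, illustrative rule set: which class may be the parent of which.
ALLOWED_PARENTS = {
    "institution": set(),
    "collection": {"institution"},
    "subcollection": {"collection", "subcollection"},
    "dataset": {"collection", "subcollection"},
}


def invalid_links(records):
    """Return (child, parent) key pairs whose classes are not permitted."""
    by_key = {r["key"]: r for r in records}
    problems = []
    for r in records:
        parent_key = r.get("parent")
        if parent_key is None:
            continue  # top-level record, nothing to check
        parent = by_key.get(parent_key)
        if parent is None or parent["cls"] not in ALLOWED_PARENTS[r["cls"]]:
            problems.append((r["key"], parent_key))
    return problems


records = [
    {"key": "inst-1", "cls": "institution", "parent": None},
    {"key": "coll-1", "cls": "collection", "parent": "inst-1"},
    {"key": "sub-1", "cls": "subcollection", "parent": "coll-1"},
    {"key": "data-1", "cls": "dataset", "parent": "inst-1"},  # not permitted here
]

print(invalid_links(records))  # [('data-1', 'inst-1')]
```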
The community broadly supports use of the TDWG CD standard for collection descriptions (
The TDWG CD model centres on a small number of mandatory fields and a larger range of optional fields. This approach allows different classes of collection description records to be described using dimensions most appropriate for the discipline, while still controlling the quality and integrity of the data for core fields and allowing some level of class interoperability (ref). The flexibility to describe different collections using optional, discipline-specific fields is widely seen as essential to successful uptake and use of a collection-level data standard and accompanying discovery systems and catalogues (ref).
Controlled vocabularies should be identified or developed for as many fields as is feasible (ref). Fields most urgently in need of a controlled vocabulary could be identified via analysis of existing specimen-level records containing equivalent DwC fields.
Any consensus/community level collection data standard should not be considered complete until it has undergone adoption or testing in institutional data workflows and projects to ensure that it is fit for purpose (ref). Real-life testing and early adoption of the standard for a small set of use-cases and collection description classes would facilitate the identification and subsequent development of those fields most suited for machine access.
Fields that support a plurality of identifiers and links between the catalogue and external services will enable discovery and use by non-traditional users, e.g. visitors to a Wikipedia page following a citation link to the collection catalogue (ref). It will also improve the usability of the collection catalogue by allowing users to easily navigate to external, authoritative sources of information on topics associated with the specified collection (ref).
Fields selected for use in this manner need to be carefully evaluated and prioritised: creating and maintaining linkages between data silos is a non-trivial undertaking and the benefits to contributors, system providers and external data sources must be clearly defined (ref). There is general consensus that the following core fields should be explored: collector, species/taxa, specimen-level information, notable and/or primary collectors and associated publications (ref). Linkages should be bidirectional wherever feasible, taking into account each external data source’s sustainability and technical capacity in areas such as link resolution, identifier integrity and reporting (ref).
Fieldwork notes and images, type specimens, and taxonomic treatments were also mentioned as possible candidates for linkage (ref), but these fields may be more appropriately and usefully associated with specimen-level records (ref). External linkages with sources that provide usage and impact metrics could be valuable mechanisms for boosting engagement. Without support and clearly defined benefits for catalogue contributors, this may lag in existing areas of poor data-density such as south-west Asia (ref).
All of the information services proposed in the ideas paper were recognised as components that would enhance the value of the Catalogue:
Partnerships with existing digital repositories (e.g., CoL, GBIF, BHL) to deliver shared or complementary services would be beneficial for encouraging both development progress and collaboration within the existing ecosystem of research infrastructure services, tools and platforms (ref).
A collection catalogue should be broad and inclusive so that it can be used across the many disciplines that maintain collections.
The introduction of collection identifiers must be accompanied by community engagement.
Controlled vocabularies should be identified or developed for TDWG CD standard for collection descriptions (
Core fields should be used for linking to external data.
The Technology category included five topics.
Good software and infrastructure will be critical to building a global collections catalogue, and creating and maintaining these is likely to be one of the more significant costs associated with the Catalogue (ref). The proposed approach would be to maintain a single master record for each collection in GRSciColl and to use existing publishing mechanisms to keep them up-to-date (ref). Wikidata might serve as a broker between other identifier systems, although it should not itself be considered an authoritative source (ref). Wikidata could also allow many more members of the community to make enhancements to data about collections and would make the collections data more discoverable.
There are national platforms that could be integrated with a global collections catalogue (e.g. Colombia’s Registro Nacional de Colecciones and Argentina’s Sistema Nacional de Datos Biológicos) but a review is required of the update frequency and data richness of each such source when compared with direct information feeds from each individual collection (ref, ref).
There are several community catalogues that are established and widely used and that should retain their own identity. These catalogues (including Index Herbariorum and GGBN) could maintain the primary version of the collections data for their focus communities and then synchronise data with GRSciColl (ref). In some cases institutes themselves will maintain their own information on local systems, or get support for publishing these data at a national level (e.g. iDigBio, Atlas of Living Australia) (ref). This will require careful consideration of how to model and manage role-based access permissions for editing collection information and nominating which source(s) should be used as the primary copy. The data standards used across the community catalogues and the global catalogue should normally be the same, but where there are differences mapping will be required to ensure they are discoverable and interoperable (ref).
Where there are other community initiatives that are also building discipline-specific catalogues there should be discussions between these communities and GBIF to understand how they can contribute to or use GRSciColl functionality (ref).
While collection management systems hold the potential to be efficient data sources for a collection catalogue, maintaining a CMS should not be a requirement for participation: a significant proportion of organisations manage their collections data solely through spreadsheet tools (ref). The GBIF IPT goes some way to reducing participation barriers for spreadsheet data at the specimen level (ref) (
For organisations where the CMS plays a central role in all aspects of the collection data lifecycle, the ability to manage collection-level records in the same system would have significant benefits. Inclusion of collection-record management functionality would reduce double-entry of data, enable links between specimen and collection records, simplify high-level reporting, enable better tracking of digitisation progress, promote consistency between common fields and potentially drive workflows around automated enhancement of specimen level records (ref).
CMS systems could automate the creation and updating of collection-level records: both descriptive and quantitative collection metadata could be produced by aggregating specimen-level records over a limited set of dimensions (ref). Specify and Symbiota both already hold some capacity for interoperability with the IPT and EML: a similar approach incorporating fields from the TDWG CD standard may be a suitable mechanism for data exchange between a CMS and the collection catalogue (ref).
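The aggregation described above might be sketched as follows in Python. This is only an illustration: the record keys reuse Darwin Core term names, but the values, the choice of dimensions and the shape of the summary are assumptions and do not correspond to any particular CMS export or to the TDWG CD standard.

```python
from collections import Counter

# Hypothetical specimen-level records; keys follow Darwin Core term names,
# values are invented for illustration.
specimens = [
    {"collectionCode": "ENT", "order": "Lepidoptera", "country": "Kenya"},
    {"collectionCode": "ENT", "order": "Lepidoptera", "country": "Peru"},
    {"collectionCode": "ENT", "order": "Coleoptera", "country": "Kenya"},
    {"collectionCode": "BOT", "order": "Poales", "country": "Peru"},
]


def summarise(records, dimensions):
    """Build one summary per collection: total count plus counts per dimension."""
    summaries = {}
    for record in records:
        summary = summaries.setdefault(
            record["collectionCode"],
            {"total": 0, **{dim: Counter() for dim in dimensions}},
        )
        summary["total"] += 1
        for dim in dimensions:
            summary[dim][record[dim]] += 1
    return summaries


result = summarise(specimens, ["order", "country"])
print(result["ENT"]["total"])        # 3
print(dict(result["ENT"]["order"]))  # {'Lepidoptera': 2, 'Coleoptera': 1}
```

Summaries derived in this way describe only the digitised portion of a collection, so they would need to be presented alongside curator-supplied estimates of total holdings.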
Elements of this architecture are already operating in GRSciColl, including metrics derived from aggregated GBIF specimen records (ref). The MIDS (minimum information about a digital specimen) metadata standard (
A “one-size-fits-all” approach rarely works when attempting to integrate data from a variety of systems. Flexibility and agility will be important when designing the interfaces and underlying APIs (ref). The users of a global collections catalogue will have varying technical capabilities and we need to ensure participation for all, so we need to support spreadsheet uploads and web form editing. In terms of APIs and harvesting data, we need to take a gradual approach to connecting, partnering and building on established infrastructures wherever possible.
Interpreting and validating data will be critical when building the global collections catalogue. Lessons from Bionomia’s implementation of an OpenRefine reconciliation endpoint would be useful in designing services. Careful review, and potentially revision, of the collection model in Wikidata would allow us to use Wikidata more easily in our own reconciliation efforts and to share our data more effectively (ref). The content of collection records should be interpreted and validated as far as possible so that its utility and value as data can be maximised. Implementations must be designed to support and display both human- and machine-readable data and to underpin high-quality metadata management, standards compliance, reliable update mechanisms and clear provenance reporting (ref).
A single master record for each collection is required and existing publishing mechanisms should be used to keep them up-to-date.
The existing community catalogues should retain their own identity and be synchronised with the global system.
Link data from existing CMS to reflect digitisation status at the collection level.
The system should be accessible to both human users and machines.
The Governance category included six topics.
The starting assumption is that each institution should have responsibility and control for information on its own collections. Under some conditions, responsibility and access control may be delegated to a third party where local informatics resources are limited or non-existent (ref, ref).
Indigenous labels and worldviews should be included in collections descriptions where possible (ref).
Even when there are local resources we will need to encourage active maintenance through mixed approaches, such as training and educational outreach, how data are presented to users, and how editors are recognised and credited (ref, ref, ref). Formally incorporating the maintenance of collections information into organisational roles would be ideal, but this has been challenging in the past (ref).
Although it is assumed that institutions and, by implication, curators will provide and maintain collection information, there is an obvious concern that they may not engage with this international initiative to take ownership of their information. Training and incentives may help to change this. Without appropriate incentives, curators may not necessarily benefit directly from improvements in publicly accessible data for their collections.
For some communities, metadata on collections (or parts of collections) are already included in multiple collection catalogues owing to overlaps in scope (ref). We need to avoid duplication of effort wherever possible through integration and interoperability.
There are several examples of national organisations which may act as intermediaries, or already curate national collections data (e.g. NatSCA’s FENSCORE, iDigBio and Atlas of Living Australia). These could champion the global catalogue at a national level and help to broker data using established networks and infrastructure (ref, ref).
Publishers of scientific literature have a significant role in existing communities of practice: they are among the largest users of collection codes and could effectively promote their use and encourage linkage. They may also serve as a source for data on collections that may not be recorded elsewhere (e.g. private collections) (ref).
Further discussion is needed to identify the best ways to encourage, support and engage existing communities since these will be critical in encouraging and facilitating voluntary additions and updates to the catalogue. At some level, a federated architecture will be required to allow the global catalogue to be constructed as a mosaic of contributions from different communities and services, each with their own focus and strengths (ref).
There was limited discussion of this topic, which was covered in more detail in section 3.3 Technology.
This discussion was merged into section 3.4.2. Communities of Practice.
The catalogue can raise awareness of collections and act as free advertising, for example by displaying institutional branding and using rolling highlights on the home page. It would also be possible to develop functionality that generates metrics that may be of use when reporting to stakeholders, preparing funding requests, prioritising internal curatorial efforts, seeking to understand the value of collections or seeking potential collaborators. It should however be noted that metrics and metadata on collection activities are not universally considered to have a positive effect. Some stakeholders may have concerns that such information could be used to impose changes in performance management approaches or could lead to undesirable public recognition (ref). More consideration is necessary to understand how to establish metrics while mitigating these perceived risks.
While not an incentive, lowering the technical barriers for editors and contributors makes participation more likely (ref). This could be achieved through financial support for training courses or for projects to improve collections data. Free collection management software and technical support may also significantly increase engagement.
A sense of ownership is important for long-term engagement, and it is more sustainable to equip contributors to take control than to provide ongoing data support (ref).
Governance and technical infrastructure both require funding and support. This could be achieved by formally including responsibility for the catalogue in the mission of GBIF or another trusted infrastructure partner. National and regional consortia (e.g. CETAF) that would benefit from a collections catalogue have a vested interest in ensuring the long-term sustainability of the solution (ref). Even with such support, long-term funding will be challenging. Government agencies, including research councils, and large collections are also potential sources of funding and support (ref).
Some regions will be able to contribute staff time and potentially funding, but there are areas where economic or legal constraints will make contributions difficult. To ensure global participation and sustainability we must consider how we can support less-resourced regions (ref).
Stakeholders will require metrics and performance indicators to justify long-term support. Sustained growth, data quality and fitness-for-use are some of the potential metrics that will need to be monitored (ref).
Mechanisms for outreach and training are critical for success.
Governance should build on existing communities of practice.
Formal acknowledgment of the work of collections through metrics and metadata is critical, but this will not by itself be sufficient to secure success.
Metrics and performance indicators will be required not just for the individual collections but for the catalogue itself.
As recognised during the consultation, GRSciColl can serve as the central linkage point at the global scale for dispersed activity towards developing and maintaining the catalogue of the world's natural history collections. GBIF continues to work to develop services and improve the value of GRSciColl so it can serve as this foundational resource.
The GBIF Secretariat prepared a priority roadmap in 2021 (
The roadmap identified six key priorities to progress.
Reduce the number of duplicate records
Linking to Index Herbariorum and iDigBio enriched the catalogue, but also increased the number of duplicate entities requiring manual intervention. Future duplication of records to be addressed by:
Documenting guidelines on how a data manager can resolve duplicate issues [REG-316]. The guidelines will provide example scenarios, explain the recommended approach to defining codes and explain the implications on external systems (see master data management below).
Developing tools that help identify potential duplicate records and alert data managers [REG-191 - now implemented]
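One simple way to flag candidate duplicates for manual review is to group records whose code and normalised name coincide. The Python sketch below illustrates that idea only. It is not the approach implemented in GRSciColl, and the record structure and key values are hypothetical.

```python
import unicodedata
from collections import defaultdict


def normalise(text):
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in unaccented.lower())
    return " ".join(cleaned.split())


def candidate_duplicates(records):
    """Group record keys whose code and normalised name coincide."""
    groups = defaultdict(list)
    for record in records:
        key = (record["code"].upper(), normalise(record["name"]))
        groups[key].append(record["key"])
    return [keys for keys in groups.values() if len(keys) > 1]


# Hypothetical entries, e.g. one registered directly and one imported.
records = [
    {"key": "a1", "code": "MNHN", "name": "Muséum national d'Histoire naturelle"},
    {"key": "b2", "code": "mnhn", "name": "Museum National d'Histoire Naturelle"},
    {"key": "c3", "code": "NHMUK", "name": "Natural History Museum, London"},
]

print(candidate_duplicates(records))  # [['a1', 'b2']]
```

Exact matching on normalised values will miss abbreviated or translated names, so a production tool would probably need fuzzier comparison, with the final decision left to a data manager.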
Allow any user to propose changes
At present, the process for feedback and corrections is weak and does not allow proposed changes to be supplied in a structured form, to be addressed by:
Developing an interface allowing any user to propose a change to any/all fields and state whether they have authority to approve these changes. Changes are then to be reviewed and applied by the editorial team [REG-CONSOLE-376 - now implemented].
Improve documentation
Support materials will be improved in the following ways:
Documenting the technical aspects of the system, focusing on the data model [REG-317], authorization rules [REG-310] and the details around master data management (see below).
Documenting the guidelines for data editors including the decision process of merging entities and assigning IDs and codes [DP-3] [REG-316].
Grow the pool of editors
Curation tasks have in the past been handled by a small editorial team. Resources to be increased by:
Presenting the system at the GBIF global nodes meeting and inviting GBIF node managers to assign staff to assist with specific identified tasks (arranged as a to-do list and allowing contributions and community involvement to be measured).
Reviewing the authorization rules so that editors can be granted access to work on only those areas they are responsible for [REG-310 - now implemented]
Define and implement the master data management solution
Multiple metadata sources may exist for the same collection and require resolution. For example, information may be available from a metadata description associated with a specimen dataset, an existing GRSciColl entry and an Index Herbariorum record. This is a problem known as master data management (
Defining, implementing and documenting the approach taken by the catalogue for handling differing views of metadata [REG-319, now implemented]
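A common pattern for reconciling differing views of the same entity is field-level precedence: each field is taken from the highest-ranked source that supplies a value, and the supplying source is recorded. The Python sketch below illustrates this pattern under assumed source names, field names and precedence order. It does not describe the rules actually adopted for GRSciColl.

```python
# Hypothetical views of one collection from three sources. Field names
# and values are invented for illustration.
views = {
    "index_herbariorum": {"name": "Example Herbarium", "city": "Leiden", "email": None},
    "grscicoll": {"name": "Example Herbarium (L)", "city": "Leiden", "email": "info@example.org"},
    "dataset_metadata": {"name": "example herbarium", "homepage": "https://example.org"},
}

# Assumed precedence: the nominated master source first, then fallbacks.
precedence = ["index_herbariorum", "grscicoll", "dataset_metadata"]


def merge_views(views, precedence):
    """Take each field from the highest-ranked source with a non-empty value.

    Returns the merged record and, per field, the source that supplied it.
    """
    merged, provenance = {}, {}
    for source in precedence:
        for field, value in views.get(source, {}).items():
            if field not in merged and value not in (None, ""):
                merged[field] = value
                provenance[field] = source
    return merged, provenance


record, provenance = merge_views(views, precedence)
print(record["name"], provenance["email"])  # Example Herbarium grscicoll
```

Keeping the provenance map alongside the merged record makes it possible to show editors which source to correct when a value is wrong.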
Develop a richer user interface
Many improvements are possible to support users of GRSciColl, including:
Implementing a new user interface based on visual concepts including:
Institution and collection search and detail pages
Integration of specimen-related occurrences (search, maps, gallery, detail, clustering)
Capability for any user to “suggest a change” [now implemented]
Exploring citation tracking based on data mediated through GBIF for GRSciColl institutions and collections [REG-323]
Attending to branding with a call for institutions to review their data and clear instructions on how to suggest edits.
By August 2022, progress had been made against most of the priorities in the GRSciColl roadmap:
During 2022/23, GBIF will continue to work in the following areas:
Complete outstanding tasks to deploy an enriched GRSciColl providing search and access of collections, specimens and people.
Focus on content of GRSciColl: cleanup of existing entries and registration of new ones by promoting use and giving training and support to editors, and promoting consistent use of codes within data shared.
Seek to identify links between journal articles and collections based on the collection codes, within the framework of the EU-funded BiCIKL project.
Support user interface translations for GRSciColl.
Explore synchronization of content with the Consortium of European Taxonomic Facilities (CETAF) Registry (under development).
SYNTHESYS+ (submitted as SYNTHESYS PLUS), Grant agreement ID: 823827
Authors:
Donald Hobern: Conceptualization, Investigation, Writing – Original Draft, Writing – Review & Editing
Sarah Vincent: Investigation, Writing – Original Draft
Laurence Livermore: Investigation, Writing – Original Draft, Writing – Review & Editing
Tim Robertson: Conceptualization, Investigation, Writing – Original Draft, Writing – Review & Editing
Joseph T. Miller: Conceptualization, Investigation, Writing – Original Draft, Writing – Review & Editing
Quentin Groom: Writing – Original Draft, Writing – Review & Editing
Marie Grosjean: Writing – Review & Editing
Contribution types are drawn from CRediT - Contributor Roles Taxonomy.
PDF archive of materials shared as part of the consultation process on the GBIF Discourse site. All pages from this discussion are included, with separate threads for each discussion topic and additional comments in Spanish and Chinese. Daily summaries, information on the process itself and the presentation materials shared are also included.
GBIF website and associated communication channels (social media, mailing lists to all node managers, newsletter etc), the Alliance For Biodiversity mailing lists, SYNTHESYS+ mailing list and TDWG communication channels.