Research Ideas and Outcomes :
Case Study
Corresponding author: Sharif Islam (sharif.islam@naturalis.nl)
Received: 28 Feb 2024 | Published: 05 Apr 2024
© 2024 Sharif Islam, Dario Papale, Lucia Vaira, Ilaria Rosati, Johannes Peterseil, Christian Pichot
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Islam S, Papale D, Vaira L, Rosati I, Peterseil J, Pichot C (2024) Navigating taxonomic complexity: A use-case report on FAIR scientific name-matching service usage in ENVRI Research Infrastructures. Research Ideas and Outcomes 10: e121871. https://doi.org/10.3897/rio.10.e121871
This paper presents a use-case conducted within the ENVRI FAIR project, examining challenges and opportunities in deploying FAIR-aligned (ensuring Findability, Accessibility, Interoperability and Reusability) scientific name-matching services across Environmental Research Infrastructures (RIs). Six services were tested using several variations of a scientific name, revealing inconsistencies in match types, status reporting and the handling of canonical forms and typos. This diversity poses challenges for RI data pipelines and interoperability. The paper underscores the importance of standardised tools, enhanced communication, training, collaboration and shared resources. Addressing these needs can facilitate more effective FAIR implementation within the ENVRI community and biodiversity research. This, in turn, will empower RIs to seamlessly integrate and leverage scientific names, unlocking the full potential of their data for research and policy implementation.
scientific names, taxonomy, biodiversity, FAIR, ENVRI
Scientific names have served as the globally-accepted practice for identifying and describing species, bringing order to the discipline of systematics for the past 250 years (
In the world of scientific and FAIR data, Research Infrastructures (RIs) are crucial (
This paper outlines the approach and findings of a FAIR implementation use-case exercise within the ENVRI Scientific Cluster,*
Given that many RIs collect environmental samples and species information across diverse locations, various data pipelines are established for activities, such as recording type specimen details (
Currently, within the ENVRI cluster, the use of scientific names is fragmented, with participating RIs employing individual approaches without coordination or a shared agreement on information sources, tools and best practices. This decentralised handling poses risks, including the use of outdated data, missing scientific names and inconsistencies in synonym usage, leading to potential data interoperability issues (
Our vision is to empower diverse user-groups to seamlessly utilise, link and integrate scientific names in datasets distributed by RIs, alongside data from other sources, without introducing errors and risks. A shared understanding and adoption of best practices for scientific name usage across all RIs are crucial. Ideally, taxonomic services should be interoperable, facilitating the integration of scientific name-matching and resolution within different data pipelines. In the subsequent sections, we delve into the approach used to test services related to scientific names, providing examples of results that highlight two critical aspects - match type and name status. The paper concludes with recommendations and potential plans for future work.
Various services are available for users to conduct matching and resolution of scientific names. For our use-case, we selected six services and tested the name-matching feature: Global Names Resolver (GNR), Catalogue of Life (COL) name-match tool, Global Biodiversity Information Facility (GBIF), LifeWatch taxon match services (LW), National Center for Biotechnology Information (NCBI) taxonomy database and World Flora Online (WFO). While there are numerous aspects to measure and compare, we simplified our focus to the results, specifically the name-matching status (e.g. exact, fuzzy or canonical) and the name status (accepted or synonym). From this initial list, we narrowed our discussion and collaboration to two services commonly used by the involved RIs: "LifeWatch taxon match services" and "Catalogue of Life name-match tool". In this section, we present the results of the exercise.
The use-case provides insight into challenges faced by RIs in the ENVRI Science Cluster, revealing their dependence on external services, experts, resources and training for internal service creation. While using a single service or taxonomic backbone for all ENVRI RIs may not be practical due to diverse domains, a common tool within specific taxonomic domains (terrestrial plants, for instance) might be feasible. Global taxonomic expertise and biodiversity data service providers (e.g. LifeWatch, COL, GBIF) can offer guidelines to assist RIs in selecting appropriate tools. Once a common set of tools is adopted, ensuring consistency in species names reported to RI-specific central databases becomes a challenge. The continuous influx of data from scientists and technicians with varying expertise underscores the need for an approach ensuring inter-comparability amongst different reference systems and facilitating periodic updates in case of species reclassification.
The specific requirements from RIs thus include robust web services, tools and APIs that can be integrated automatically with local data pipelines for name-matching and resolution. Common matching options across different services, improved guidelines for structuring input data, standardised service responses, manual and bulk options, interoperable API responses and comprehensive documentation and training are all crucial for enhancing the efficiency and accuracy of scientific name validation within the ENVRI cluster.
In this section, we highlight a few important aspects emerging from the matching exercise.
A scientific name is a mandatory field in all the services used in the exercise. However, the input interfaces and matching options vary significantly. For instance, the COL cross-dataset search (Fig.
COL Checklistbank cross dataset search. Screenshot captured 10 Jan 2024.
COL name match tool for bulk input. Screenshot captured 10 Jan 2024.
The term "canonical name" here specifically refers to the Latinised elements. There is inconsistency in how services handle canonical versus non-canonical matches. In our test, Pinus mugo Smith was considered "canonical" by GNR (the API differentiates between "supplied_name_string": "Pinus mugo Smith" and "canonical_form": "Pinus mugo"), but "none" by COL. It is difficult for non-specialists to interpret this result. Additionally, it is not transparent how the different services handle canonical names and authorship together. Do all services ignore the authorship and parse each component separately or do they match them as a single entity?
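To make the distinction between the supplied name string and its canonical form concrete, the following minimal sketch separates the Latinised elements from the authorship. It is not how any of the tested services parse names; the regular expression and the function name split_name are assumptions made for illustration, and the pattern only copes with simple binomials (no hybrids, subgenera or infraspecific ranks).

```python
import re

# Naive pattern (an assumption for illustration): a capitalised genus,
# a lower-case epithet, and whatever follows is treated as authorship.
_BINOMIAL = re.compile(r"^\s*([A-Z][a-z]+)\s+([a-z-]+)\s*(.*?)\s*$")


def split_name(name_string):
    """Return (canonical_form, authorship), or (None, None) if not a binomial."""
    match = _BINOMIAL.match(name_string)
    if match is None:
        return None, None
    genus, epithet, authorship = match.groups()
    # An empty authorship string is reported as None.
    return f"{genus} {epithet}", authorship or None


for raw in ("Pinus mugo Turra", "Pinus mugo Smith", "Pinus mugo"):
    print(raw, "->", split_name(raw))
```

With the components separated, a pipeline could match the canonical form first and then compare the authorship as a second step, which is one way to report "canonical match, authorship differs" explicitly instead of returning no match.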
In large datasets, typos and spelling errors are expected. In our test, Pinus muco Turra was not flagged as a typo or error.
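A minimal sketch of how a local pipeline could flag a suspected spelling mistake before (or after) calling an external service is shown below. It uses approximate string matching from the Python standard library (difflib), which is a stand-in technique and not the method used by any of the tested services. The reference list, the 0.8 cutoff and the function name classify are hypothetical.

```python
import difflib

# Hypothetical reference list; a real pipeline would take canonical names
# from a checklist chosen by the RI.
REFERENCE_NAMES = ["Pinus mugo", "Pinus nigra", "Pinus sylvestris"]


def classify(canonical, reference=REFERENCE_NAMES, cutoff=0.8):
    """Return (match_type, matched_name) for a canonical name string."""
    if canonical in reference:
        return "exact", canonical
    # get_close_matches ranks candidates by similarity ratio and drops
    # anything scoring below the cutoff.
    close = difflib.get_close_matches(canonical, reference, n=1, cutoff=cutoff)
    if close:
        return "suspected spelling mistake", close[0]
    return "no match", None


for name in ("Pinus mugo", "Pinus muco", "Pinus basso"):
    print(name, "->", classify(name))
```

The cutoff is a trade-off: a lower value catches more typos but also starts to pair genuinely different epithets, so flagged names would still need review by someone with taxonomic expertise.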
As listed in Table
Table 1: Examples of results from different name-matching services. GNR = Global Names Resolver API; COL = Catalogue of Life Checklistbank name match web tool; GBIF = Global Biodiversity Information Facility Python library and API; LW = LifeWatch Taxon Match web service; NCBI = National Center for Biotechnology Information taxonomy Python library; WFO = World Flora Online R package. See the companion Jupyter notebook for details.
Scientific name | Input type | GNR response | COL response | GBIF response | LW response | NCBI response | WFO response |
Pinus mugo Turra | Accepted name with authorship. Response expected: Match type: exact match with authorship. Status: accepted | Match type: Exact, provides a score, includes multiple backbones and checklists, provides status of the request (“success” or “failure”). Status: no indication whether the name is accepted or not. | Match type: Exact match, no score, Status: accepted. | Match type: Exact match, no score, Status: accepted. | Match type: Exact match, no score, includes multiple backbone and checklists, Status: no indication whether the name is accepted or not. | Resolving failed. NCBI only matches based on the canonical name which is composed of only the Latinised elements of a scientific name. | No Match type. Status: accepted |
Pinus mugo | Accepted name excluding the authorship. Response expected: Match type: Exact match by canonical form, authorship missing. Status: accepted | Match type: Exact match by canonical form, provides a score, includes multiple databases and checklists, Status: no indication whether the name is accepted or not. | Match type: variant, no score, Status: accepted. | Match type: Multiple matches based on the full name and genus, no matching status or score provided, Status: accepted | Match type: No exact match found, more than one possibility, no score, includes multiple databases and checklists, no indication whether name is accepted or not. | Match type: Exact match, but includes subgenus and lineage, no score, no indication whether the name is accepted or not. | No Match type. Status: accepted |
Pinus mugo Smith | Incorrect author. Response expected: Match Type: Exact match by canonical form, incorrect authorship. Status: Not accepted | Match type: Exact match by canonical form, provides a score, includes multiple databases and checklists, no indication whether the name is accepted or not. | Match type: none, no indication whether the name is accepted or not. | Empty response | No match found | Empty response | Returns multiple results (includes Pinus mugo Turra and synonyms) |
Pinus muco Turra | Spelling mistake. Response expected: Match Type: Suspect spelling mistake. Status: Not Accepted | Match type: Fuzzy match by canonical form, spelling mistake not flagged | Match type: none | Empty response | No match found | Empty response | Returns multiple results (includes Pinus mugo Turra and Pinus pumilio (Turra) Franco) |
Pino mugo Turra | Incorrect genus. Response expected: Match Type: Wrong genus. Status: Not Accepted. | No result. API provides API status response. | Match type: none | Empty response | No match found | Empty response | Returns multiple results (includes Pinus mugo Turra) |
Pinus basso Turra | Non-existent species. Response expected: Match Type: No matching species. Status: Not Accepted | Match type: Could only match genus, provides a score, includes multiple databases and checklists, no indication whether the name is accepted or not. | Match type: none | Empty response | No match found | Empty response | Returns multiple results (includes Pinus mugo Turra) |
There is no common approach to structuring API and software library responses after a name string is provided. Documentation quality also varies across different providers. See Table
API Responses:
API Provider | Input String | JSON Response |
GBIF | Pino mugo Turra (incorrect genus) | { "confidence": 100, "matchType": "NONE", "synonym": false} |
GNR | Pino mugo Turra (incorrect genus) | { "id": "zbltk7bpnuh3", "url": "http://resolver.globalnames.org/name_resolvers/zbltk7bpnuh3.json", "data_sources": [], "data": [{ "supplied_name_string": "Pino mugo Turra", "is_known_name": false }] , "status": "success", "message": "Success", "parameters": { "with_context": false, "header_only": false, "with_canonical_ranks": false, "with_vernaculars": false, "best_match_only": false, "data_sources": [], "preferred_data_sources": [], "resolve_once": false }} |
COL | Pino mugo Turra (incorrect genus) | Large JSON response (see link) with class and family matches such as "Pinopsida" |
Python Libraries:
Library names | Input String | Response |
GBIF Python Client | Pino mugo Turra | >>> species.name_suggest(q='Pino mugo Turra') returns []; >>> species.name_lookup(q='Pino mugo Turra') returns {'offset': 0, 'limit': 100, 'endOfRecords': True, 'count': 0, 'results': [], 'facets': []} |
taxoniq (for NCBI query, provided by a third party) | Pino mugo Turra | KeyError: 'Pino mugo Turra' |
ncbi-taxonomist | Pino mugo Turra | {"empty response": {"queryid": "SetLe7VFSwGDd464ZGw4IA==", "action": "skip"}} |
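The responses listed above take at least three different shapes for the same unmatched input: a JSON object with a match-type field, a JSON object with a list of per-name entries, and a raised exception. The sketch below shows one way an RI pipeline might fold these shapes into a single record. The MatchResult record and the three adapter functions are hypothetical names introduced for illustration, not part of any of the services or libraries. The payloads are abbreviated from the table, and the "results" key checked in the GNR adapter is an assumption, since the table only shows the no-match case.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class MatchResult:
    """Hypothetical common record for a name-matching outcome."""
    service: str
    query: str
    matched: bool
    match_type: Optional[str] = None


def from_gbif(query: str, payload: dict) -> MatchResult:
    # The GBIF response in the table reports "matchType": "NONE" when
    # nothing was matched.
    match_type = payload.get("matchType")
    return MatchResult("GBIF", query, match_type not in (None, "NONE"), match_type)


def from_gnr(payload: dict) -> List[MatchResult]:
    # One entry per supplied name. Assumption: a non-empty "results" list
    # signals a match; the table only shows an entry without that key.
    return [
        MatchResult("GNR", item["supplied_name_string"], bool(item.get("results")))
        for item in payload.get("data", [])
    ]


def from_library_call(service: str, query: str,
                      lookup: Callable[[str], Any]) -> MatchResult:
    # Some libraries raise instead of returning an empty result (the
    # KeyError in the table); fold that into the same record.
    try:
        found = lookup(query)
    except KeyError:
        return MatchResult(service, query, False)
    return MatchResult(service, query, bool(found))


gbif = from_gbif("Pino mugo Turra",
                 {"confidence": 100, "matchType": "NONE", "synonym": False})
gnr = from_gnr({"data": [{"supplied_name_string": "Pino mugo Turra",
                          "is_known_name": False}]})
print(gbif)
print(gnr)
```

Keeping the adapters separate from the record means a service can change its response format without affecting the rest of the pipeline; only the corresponding adapter needs updating.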
Taxonomic data and scientific names play pivotal roles in multidisciplinary data-linking (
These adoptions will pave the way for a more effective FAIR implementation. Building a resilient and enduring collaboration amongst taxonomists, biodiversity and ecosystem researchers, taxonomic service providers and Research Infrastructures, with clearly identified roles and collaboration paths, is pivotal for success. In this context, for instance, taxonomists can concentrate on identification and classification tasks, while service providers incorporate these decisions into their offerings, ensuring seamless interoperability, linking and relationships across various tools and databases. This envisioned collaboration can empower operators and data managers working in the RIs with efficient tools to validate names against diverse databases, with service providers ensuring the necessary interlinkage with other existing databases (also see recent publications echoing similar sentiments
Funding for this work was provided by the European Union's Horizon 2020 Research and Innovation Programme under grant agreement No 824068.
Thanks to Christos Arvanitidis (LifeWatch-ERIC) and Leen Vandepitte (Flanders Marine Institute - VLIZ) for their invaluable editing and feedback. Leen Vandepitte's work is funded by Research Foundation - Flanders (FWO) as part of the Belgian contribution to LifeWatch. Additional thanks to the ENVRI FAIR WP11 team members who participated in testing and documentation.
ENVRI is a community of Environmental Research Infrastructures, projects, networks and other diverse stakeholders interested in environmental Research Infrastructure matters. The community also includes e-infrastructures supporting the Research Infrastructures in data solutions.
The ENVRI FAIR (2019-2023) project aimed to assist Research Infrastructures in developing a set of FAIR data services, enhancing efficiency, supporting innovation, enabling data- and knowledge-based decisions and connecting the ENVRI Cluster to the European Open Science Cloud (EOSC). The ENVRI cluster, a Science Cluster (SC) dedicated to Environmental Sciences, comprises European RIs developed under the ESFRI (European Strategy Forum on Research Infrastructures) for coordinating long-term initiatives in environmental monitoring and promoting data and resource accessibility at the European scale. In the ecology and biodiversity fields within the ENVRI SC, the following RIs participated in this use-case: AnaEE, eLTER, ICOS, LifeWatch, DiSSCo and DANUBIUS.
https://lifewatch.be/en/2022-news-WoRMS-15th-anniversary-story-7: "Taxonomy is described sometimes as a science and sometimes as an art, but really it's a battleground".