Navigating taxonomic complexity: A use-case report on FAIR scientific name-matching service usage in ENVRI Research Infrastructures

This paper presents a use-case conducted within the ENVRI FAIR project, examining challenges and opportunities in deploying FAIR-aligned (ensuring Findability, Accessibility, Interoperability and Reusability) scientific name-matching services across Environmental Research Infrastructures (RIs). Six services were tested using various name variations, revealing inconsistencies in match types, status reporting and handling of canonical forms and typos. This diversity poses challenges for RI data pipelines and interoperability. The paper underscores the importance of standardised tools, enhanced communication, training, collaboration and shared resources. Addressing these needs can facilitate more effective FAIR implementation within the ENVRI community and biodiversity research. This, in turn, will empower RIs to seamlessly integrate and leverage scientific names, unlocking the full potential of their data for research and policy implementation.

This paper outlines the approach and findings of a FAIR implementation use-case exercise within the ENVRI Scientific Cluster,* under the purview of the ENVRI FAIR* project. The primary focus was on the utilisation of scientific names, a common element identified by various RIs within the Biodiversity and Ecosystem subdomain. The use-case targeted certain tools, chosen on the basis of their maturity and community acceptance, and engaged in a name-matching exercise (one of the primary tasks in the data workflow) to understand current practices, tool availability and potential error scenarios. A crucial objective was to highlight implications for interoperability and linking between RIs and services.
Given that many RIs collect environmental samples and species information across diverse locations, various data pipelines are established for activities such as recording type specimen details (Sluys 2021), CO2 flux measurement, species occurrence data, biomass and Leaf Area Index (for more examples, see Peters et al. (2014); Pastorello et al. (2020);  2021)). Individuals engaged in these data pipelines, often lacking specialisation in taxonomy or biology, face challenges in navigating complexities during name-matching and resolution. To address these challenges, RIs must choose a suitable tool, deciding whether to utilise existing services or develop their own tools for specific use-cases. The use-case exercise was designed within this context. It is important to note that our exercise did not encompass an exhaustive evaluation of all tools, R packages, Python libraries, APIs and best practices related to taxonomic databases and scientific name harmonisation. For a comprehensive examination of these aspects, we refer readers to Grenié et al. (2022).
Currently, within the ENVRI cluster, the use of scientific names is fragmented, with participating RIs employing individual approaches without coordination or a shared agreement on information sources, tools and best practices. This decentralised handling poses risks, such as the use of outdated data, missing scientific names and inconsistencies in synonym usage, leading to potential data interoperability issues (Reyserhove et al. 2020). Notably, no universally accepted solution exists across diverse user communities due to variations in taxonomic coverage, spanning from terrestrial to microbial domains. While a singular solution may prove impractical, fostering a shared consensus on tools, documentation, best practices and training has the potential to mitigate the mentioned challenges and significantly contribute to implementing the FAIR framework within the ENVRI cluster.
Our vision is to empower diverse user-groups to seamlessly utilise, link and integrate scientific names in datasets distributed by RIs, alongside data from other sources, without introducing errors and risks. A shared understanding and adoption of best practices for scientific name usage across all RIs are crucial. Ideally, taxonomic services should be interoperable, facilitating the integration of scientific name-matching and resolution within different data pipelines. In the subsequent sections, we delve into the approach used to test services related to scientific names, providing examples of results that highlight two critical aspects: match type and name status. The paper concludes with recommendations and potential plans for future work.

Approach
Various services are available for users to conduct matching and resolution of scientific names. For our use-case, we selected six services and tested the name-matching feature: Global Names Resolver (GNR), Catalogue of Life (COL) name-match tool, Global Biodiversity Information Facility (GBIF), LifeWatch taxon match services (LW), National Center for Biotechnology Information (NCBI) taxonomy database and World Flora Online (WFO). While there are numerous aspects to measure and compare, we simplified our focus to the results, specifically the name-matching status (e.g. exact, fuzzy or canonical) and the name status (accepted or synonym). From this initial list, we narrowed our discussion and collaboration to two services commonly used by the involved RIs: "LifeWatch taxon match services" and "Catalogue of Life name-match tool". In this section, we present the results of the exercise.
The use-case provides insight into challenges faced by RIs in the ENVRI Science Cluster, revealing their dependence on external services, experts, resources and training for internal service creation. While using a single service or taxonomic backbone for all ENVRI RIs may not be practical due to diverse domains, a common tool within specific taxonomic domains (terrestrial plants, for instance) might be feasible. Global taxonomic expertise and biodiversity data service providers (e.g. LifeWatch, COL, GBIF) can offer guidelines to assist RIs in selecting appropriate tools. Once a common set of tools is adopted, ensuring consistency in species names reported to RI-specific central databases becomes a challenge. The continuous influx of data from scientists and technicians with varying expertise underscores the need for an approach ensuring inter-comparability amongst different reference systems and facilitating periodic updates in case of species reclassification.
The specific requirements from RIs thus include the need for robust web services, tools and APIs for automated integration with local data pipelines for name-matching and resolution. Emphasising common matching options across different services, improved input data structuring guidelines, standardised service responses, manual and bulk options, interoperable API responses and comprehensive documentation and training are crucial for enhancing the efficiency and accuracy of scientific name validation within the ENVRI cluster.

Discussion
In this section, we highlight a few important aspects emerging from the matching exercise.

Different options for submitting names
A scientific name is a mandatory field in all the services used in the exercise. However, the input interfaces and matching options vary significantly. For instance, the COL cross-dataset search (Fig. 1) offers "fuzzy", "exact" and "partial" options, along with choices to restrict the search to the scientific name only. It is unclear to a non-specialist how "fuzzy", "exact" and "partial" matches differ and which of these should be considered in the initial name matching phase. The COL bulk input option lacks these choices (Fig. 2). Similarly, LifeWatch (Fig. 3) provides a bulk input service with a list of different taxon services (Fig. 4) that can be selected, running as a submitted job, but without an option to choose search filters. Some services offer both Python and R packages, while others provide only one.
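To make the distinction between the three options more tangible, the following minimal sketch shows one plausible reading of "exact", "partial" and "fuzzy" matching. It does not reproduce how COL or any other service implements these options; the function name `match_name`, the three-name reference list and the 0.85 similarity cutoff are assumptions chosen purely for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical reference list; a real service would query a full checklist.
REFERENCE_NAMES = ["Pinus mugo", "Pinus nigra", "Pinus sylvestris"]


def match_name(query, mode="exact", cutoff=0.85):
    """Return the reference names that match `query` under the given mode."""
    q = query.strip().lower()
    if mode == "exact":
        # The whole string must be identical (ignoring case and outer spaces).
        return [n for n in REFERENCE_NAMES if n.lower() == q]
    if mode == "partial":
        # The query only needs to occur somewhere inside the reference name.
        return [n for n in REFERENCE_NAMES if q in n.lower()]
    if mode == "fuzzy":
        # Small spelling differences are tolerated via a similarity ratio.
        return [
            n for n in REFERENCE_NAMES
            if SequenceMatcher(None, q, n.lower()).ratio() >= cutoff
        ]
    raise ValueError(f"unknown mode: {mode}")
```

Under these assumptions, a misspelled string such as "Pinus muco" finds nothing in "exact" mode but is paired with "Pinus mugo" in "fuzzy" mode, while the bare genus "Pinus" returns every listed species in "partial" mode. This suggests why the choice of option matters in the initial matching phase.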

Canonical vs. non-canonical match
The term "canonical name" here specifically refers to the Latinised elements. There is inconsistency in how services handle canonical versus non-canonical matches. In our test, Pinus mugo Smith was considered "canonical" by GNR (the API differentiates between "supplied_name_string": "Pinus mugo Smith" and "canonical_form": "Pinus mugo"), but "none" by COL. It is difficult for non-specialists to interpret this result. Additionally, the approach of different services in handling canonical names and authorship together is not transparent. Do all services ignore the authorship and parse each component separately or do they match them as a single entity?
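As an illustration of what parsing each component separately could mean, the sketch below splits a name string into its Latinised elements and the remaining authorship. It is a deliberately naive heuristic and not the parser used by GNR, COL or any other service: the function name `split_name` is hypothetical, and the rule that epithets are lowercase words is an assumption that breaks for lowercase author particles (such as "ex" or "de"), rank markers and hybrid signs.

```python
import re

# Assumption: an epithet is a lowercase Latin word, optionally hyphenated.
_EPITHET = re.compile(r"^[a-z][a-z-]+$")


def split_name(name_string):
    """Split a name string into (canonical_form, authorship)."""
    tokens = name_string.split()
    if not tokens:
        return "", ""
    canonical = [tokens[0]]  # the genus (or uninomial) comes first
    rest = tokens[1:]
    # Consume lowercase epithets; whatever follows is treated as authorship.
    while rest and _EPITHET.match(rest[0]):
        canonical.append(rest.pop(0))
    return " ".join(canonical), " ".join(rest)
```

Applied to the test string above, `split_name("Pinus mugo Smith")` separates "Pinus mugo" from "Smith", mirroring the distinction GNR reports between the supplied name string and the canonical form. A service that instead compares the full string as a single entity would treat the same input differently, which may explain part of the inconsistency observed.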

Flag spelling or a possible typo
In large datasets, typos and spelling errors are expected. In our test, Pinus muco Turra was not flagged as a typo or error.
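A data pipeline could add its own flag for probable typos before or after calling a service. The sketch below is one possible approach using Python's standard `difflib` module. The reference list, the status labels ("known", "possible_typo", "unknown") and the 0.85 cutoff are assumptions for illustration, not values used by any of the tested services.

```python
from difflib import get_close_matches

# Hypothetical list of canonical names taken from a trusted checklist.
KNOWN_CANONICAL_NAMES = ["Pinus mugo", "Pinus nigra", "Pinus sylvestris"]


def flag_possible_typo(canonical_name, cutoff=0.85):
    """Report whether a canonical name is known, a likely typo or unknown."""
    if canonical_name in KNOWN_CANONICAL_NAMES:
        return {"input": canonical_name, "status": "known", "suggestion": None}
    # Look for the single closest known name above the similarity cutoff.
    close = get_close_matches(
        canonical_name, KNOWN_CANONICAL_NAMES, n=1, cutoff=cutoff
    )
    if close:
        return {
            "input": canonical_name,
            "status": "possible_typo",
            "suggestion": close[0],
        }
    return {"input": canonical_name, "status": "unknown", "suggestion": None}
```

With this list, the canonical part of the test string, "Pinus muco", is reported as a possible typo with "Pinus mugo" as the suggestion, leaving the final decision to a person with taxonomic expertise rather than silently correcting the record.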

Lack of common vocabulary for match types
As listed in Table 1 and the Jupyter notebook, there are different ways of describing the match type. For instance: "Exact match by canonical form" (GNR), "Fuzzy match by canonical form" (GNR), "fuzzy matches: more than one possibility" (LifeWatch).
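Until a shared vocabulary exists, an RI could maintain a local crosswalk from service-specific labels to its own terms. The sketch below uses only labels quoted in this paper as keys. The target terms ("exact_canonical", "fuzzy_ambiguous" and so on) and the function name `harmonise_match_type` are hypothetical and do not represent an agreed community vocabulary. In particular, reading the COL result "none" as "no match" is an interpretation.

```python
# Keys are (service, label) pairs quoted in the text; values are
# hypothetical shared terms invented for this illustration.
MATCH_TYPE_CROSSWALK = {
    ("GNR", "Exact match by canonical form"): "exact_canonical",
    ("GNR", "Fuzzy match by canonical form"): "fuzzy_canonical",
    ("LW", "fuzzy matches: more than one possibility"): "fuzzy_ambiguous",
    ("COL", "none"): "no_match",  # assumption: "none" means nothing matched
}


def harmonise_match_type(service, label):
    """Translate a service-specific match label into the local shared term."""
    key = (service, label.strip())
    # Unknown labels are surfaced explicitly instead of being guessed.
    return MATCH_TYPE_CROSSWALK.get(key, "unmapped")
```

Returning "unmapped" for labels that are not in the crosswalk keeps new or changed service vocabulary visible to data managers, so the table can be extended deliberately.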

API and library responses
There is no common approach to providing API and software library responses after a name string is provided. Documentation quality also varies across different providers.
See Table 2 for some examples.
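One way for an RI pipeline to cope with heterogeneous responses is a thin adapter layer that converts each payload into a single internal record. The sketch below is illustrative only. The GNR-like adapter relies solely on the two keys quoted earlier ("supplied_name_string" and "canonical_form"), whereas the second payload shape, with the keys "query", "matched_name" and "status", is invented and does not describe any real service. The `NameMatch` record and both function names are likewise assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NameMatch:
    """Internal record shared by all adapters (hypothetical structure)."""
    service: str
    supplied_name: str
    matched_canonical: Optional[str]
    name_status: Optional[str]  # e.g. "accepted" or "synonym", if reported


def from_gnr_like(payload):
    """Adapt a payload that carries the two GNR keys quoted in the text."""
    return NameMatch(
        service="GNR",
        supplied_name=payload["supplied_name_string"],
        matched_canonical=payload.get("canonical_form"),
        name_status=None,  # no status key is assumed for this payload
    )


def from_example_service(payload):
    """Adapt an invented payload shape; the keys here are not a real API."""
    return NameMatch(
        service="example",
        supplied_name=payload["query"],
        matched_canonical=payload.get("matched_name"),
        name_status=payload.get("status"),
    )
```

Downstream code then works against `NameMatch` alone, so a change in one provider's response format only affects the corresponding adapter.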

Figure 2. COL name match tool for bulk input. Screenshot captured 10 Jan 2024.

Table 1. Examples of results from different name matching services. GNR = Global Names Resolver API; COL = Catalogue of Life Checklistbank name match web tool; GBIF = Global Biodiversity Information Facility Python library and API; LW = LifeWatch Taxon Match web service; NCBI = National Center for Biotechnology Information taxonomy Python library; WFO = World Flora Online R package. See the companion Jupyter notebook for details.
Grenié et al. (2022) [...] amount of available data, RIs, stakeholders and service providers must establish broader and long-term collaboration. Initiating from this use-case group, a training session organised in November 2022 focused on the Catalogue of Life Checklist Bank, marking a step towards such collaboration. These sessions clarified issues and provided valuable feedback for COL development, ensuring better and more efficient usage of taxonomic data. As stakeholders like LifeWatch, COL and GBIF develop tools, we believe the ENVRI community should actively participate in this global conversation, echoing sentiments shared in discussions within the BiCIKL project (which runs until 2024; involving DiSSCo, LifeWatch and GBIF). The issues and suggestions presented here align with findings from Grenié et al. (2022) (examining taxonomic databases) and Feng et al. (2022) (exploring the heterogeneous landscape of biodiversity databases). To conclude, we offer the following suggestions and action items for both the ENVRI community and service providers:

Table 2. Different API responses from different taxonomic services.

These adoptions will pave the way for a more effective FAIR implementation. Building a resilient and enduring collaboration amongst taxonomists, biodiversity and ecosystem researchers, taxonomic service providers and Research Infrastructures is pivotal for success, where clear roles and collaboration paths can be identified. In this context, for instance, taxonomists can concentrate on identification and classification tasks, while service providers incorporate these decisions into their offerings, ensuring seamless interoperability, linking and relationships across various tools and databases. This envisioned collaboration can empower operators and data managers working in the RIs with efficient tools to validate names against diverse databases, with service providers ensuring the necessary interlinkage with other existing databases (see also recent publications echoing similar sentiments: Sandall et al. (2023) and Lien et al. (2023)). Such a framework simplifies the utilisation of scientific names within RIs and facilitates decision-making regarding various changes and updates originating from the reference taxonomic databases and services.

International Journal on Digital Libraries 1-11. https://doi.org/10.1007/s00799-020-00285-5
• Wheeler QD, Raven PH, Wilson EO (2004) Taxonomy: impediment or expedient? Science 303 (5656): 285-285. https://doi.org/10.1126/science.303.5656.285
• Wilkinson M, Dumontier M, Aalbersberg I, et al. (2016) The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3 (1): 1-9. https://doi.org/10.1038/sdata.2016.18

Endnotes
ENVRI is a community of Environmental Research Infrastructures, projects, networks and other diverse stakeholders interested in environmental Research Infrastructure matters. The community also includes e-infrastructures supporting the Research Infrastructures in data solutions.
The ENVRI FAIR (2019-2023) project aimed to assist Research Infrastructures in developing a set of FAIR data services, enhancing efficiency, supporting innovation, enabling data- and knowledge-based decisions and connecting the ENVRI Cluster to the European Open Science Cloud (EOSC). The ENVRI cluster, a Science Cluster (SC) dedicated to Environmental Sciences, comprises European RIs developed under the ESFRI (European Strategy Forum on Research Infrastructures) for coordinating long-term initiatives in environmental monitoring and promoting data and resource accessibility at the European scale. In the ecology and biodiversity fields within the ENVRI SC, the following RIs participated in this use-case: AnaEE, eLTER, ICOS, LifeWatch, DiSSCo and DANUBIUS.