A Test Collection for Dataset Retrieval in Biodiversity Research

Searching for scientific datasets is a prominent task in scholars' daily research practice. A variety of data publishers, archives and data portals offer search applications that allow the discovery of datasets. The evaluation of such dataset retrieval systems requires proper test collections, including questions that reflect real world information needs of scholars, a set of datasets and human judgements assessing the relevance of the datasets to the questions in the benchmark corpus. Unfortunately, only very few test collections exist for dataset search. In this paper, we introduce the BEF-China test collection, the very first test collection for dataset retrieval in biodiversity research, a research field with an increasing demand for data discovery services. The test collection consists of 14 questions, a corpus of 372 datasets from the BEF-China project and binary relevance judgements provided by a biodiversity expert.


Introduction
Dataset search and data reuse are becoming more important in scholars' research practice. Rather than recreating datasets by repeating experiments, and in order to compare new datasets with similar data collected under different conditions, scholars increasingly search for existing datasets. For example, GBIF's scientific report (GBIF Secretariat 2020) shows a growing number of peer-reviewed publications over the last decade reusing GBIF datasets. Hence, retrieval systems offered by various data publishers, archives and data portals are receiving increasing attention. Evaluations with test collections are required to determine whether a dataset retrieval system supports its users well in identifying relevant datasets. In Information Retrieval (IR), an evaluation setting consists of a corpus of documents, a number of questions or queries and human assessments that record which documents match which queries. Driven by the highly influential annual Information Retrieval challenge, TREC (https://trec.nist.gov/), a multitude of test collections are available for the retrieval of publications and websites in different application domains. However, appropriate test collections are missing for dataset retrieval. While longer textual resources, i.e. documents, constitute the information base in document retrieval, dataset retrieval is usually based on structured metadata accompanying each dataset (Khalsa et al. 2018). Test collections for dataset search need to include these metadata.
One research domain with an increasing demand for data discovery services is biodiversity research, a domain that examines the variety of species, their genetic diversity and ecological diversity. Scholars working in the fields of biodiversity research often need to search and combine several datasets from different experiments to answer a research question. Hence, proper data retrieval systems are needed to support these data discovery tasks. In this work, we introduce the first test collection for dataset retrieval in biodiversity research. We focus on an important sub-domain in biodiversity research, ecosystem functioning, that has been intensively studied in the BEF-China project (https://www.befchina.com). In this project, 372 datasets are publicly available with structured metadata files. Metadata are descriptive information about the measured or observed primary data and contain information such as author, collection time, title, abstract, keywords and parameters measured. Depending on the domain, metadata are provided in a specific structure or metadata schema. In the BEF-China project, all metadata files are provided in EML, the Ecological Metadata Language (KNB (ecoinformatics.org)). Providing relevance judgements is a very time-consuming task. Therefore, we only selected 14 questions collected in various biodiversity projects. They do not cover all search interests in biodiversity research, but reflect real world information needs of scholars. Binary human relevance assessments are provided by a biodiversity expert.
The paper is structured as follows: first, we present related work. Afterwards, we describe the creation steps of the BEF-China test collection, including data collection, question collection and human ratings. Finally, we conclude with a summary of our findings.

Related Work
A retrieval system consists of a collection of documents (a corpus) and a user's information needs that are described by a set of keywords (query). The main aim of the retrieval process is to return a ranked list of documents that match the user's query. Numerous evaluation measures have been developed to assess the effectiveness of retrieval systems in terms of relevance. For this purpose, a test collection is required that consists of three parts (Manning et al. 2008):
1. a corpus of documents,
2. representative information needs expressed as queries and
3. a set of relevance judgements provided by human judges containing assessments of the relevance of a document for given queries.
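The three parts listed above can be sketched as a small data structure, together with two common evaluation measures (precision and recall). This is a minimal illustration only: the class name `RetrievalCollection`, the function `precision_recall` and the toy data are assumptions made for this example and are not part of any published test collection.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class RetrievalCollection:
    """Hypothetical container for the three parts of a test collection."""
    corpus: Dict[int, str]                 # document id -> document text
    queries: Dict[int, str]                # query id -> query text
    judgements: Dict[int, Set[int]] = field(default_factory=dict)  # query id -> relevant ids


def precision_recall(collection: RetrievalCollection,
                     query_id: int,
                     retrieved: List[int]) -> Tuple[float, float]:
    """Compute precision and recall of a result list (assumed duplicate-free)."""
    relevant = collection.judgements.get(query_id, set())
    hits = len(relevant & set(retrieved))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data, invented for illustration.
toy = RetrievalCollection(
    corpus={1: "doc a", 2: "doc b", 3: "doc c", 4: "doc d"},
    queries={1: "example query"},
    judgements={1: {1, 2}},
)
p, r = precision_recall(toy, 1, [1, 3])
```

Here one of the two retrieved documents is relevant and one of the two relevant documents was retrieved, so both measures evaluate to 0.5. Recall is only meaningful when judgements cover the whole corpus.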
If judgements are available for the entire corpus, they serve as a baseline ("gold standard") and can be used to determine the fraction of relevant documents a search system finds for a specific query. (Tsatsaronis 2015) with a stronger focus on Question Answering (Unger et al. 2014). The competition comprises three parts, including entity extraction, the conversion of natural language questions into a semantic web format, such as RDF triples (https://www.w3.org/TR/rdf11-primer/) and the retrieval of the exact answer to a natural language query. Similar to the Genomics Track Challenge, the corpus consists of PubMed articles and the topics comprise biomedical entities such as diseases, genes, proteins, species and drugs.
The BioCADDIE Test Collection (Cohen et al. 2017) is a test collection for dataset search and provides a corpus of ~794,000 biomedical metadata files from various data repositories. Domain experts created 137 questions related to biomedicine, based on question templates considering entity types, such as data type, disease type, biological processes and organisms. The datasets were indexed in multiple search engines. For 15 selected questions, two runs were performed in each search engine and the results were merged across all systems. The final result list was evaluated by annotators with biomedical expertise who indicated for which question which dataset was relevant, partially relevant or not relevant.
To the best of our knowledge, there is no test collection available for dataset search in biodiversity research. Therefore, in the following, we introduce our test collection for dataset retrieval in biodiversity research.

The BEF-China Dataset Retrieval Test Collection
Biodiversity research nowadays is a very heterogeneous research field that goes beyond the exploration of species richness and taxon relations. Over the last few decades, research into the relationships between biodiversity and ecosystem functioning and the consequences of biodiversity change for ecosystems has become a key topic of interdisciplinary biodiversity research (Tilman et al. 2014). One example of such a diverse project is the BEF-China project, which aims at the exploration of Biodiversity-Ecosystem Functions (BEF) in a large and highly species-rich forest in the subtropics. In order to measure ecosystem functions, such as carbon and nitrogen storage, nutrient cycling and the prevention of soil erosion, measurements were made in natural forests in the Gutianshan National Nature Reserve in Zhejiang Province (comparative study plots, CSPs) and new forests varying in diversity levels were planted in 2008 at two sites (A and B) in Jiangxi Province, China (Bruelheide et al. 2011, Bruelheide et al. 2014). The project was divided into 12 sub-projects exploring different aspects of ecosystem functions, for example, primary production, plant growth and demography, woody decomposition and microbial biomass and activity.

Data Collection
The data collected in the BEF-China project are publicly provided in a corpus of 372 metadata files. Most datasets also provide open access to the primary data. The metadata are stored in XML files following the EML metadata schema (https://eml.ecoinformatics.org/). A data manager supported the scientists in providing proper data descriptions to ensure FAIR data and metadata (Wilkinson et al. 2016). An excerpt of an example metadata file is provided in Fig. 1.
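To give an impression of how such XML metadata could be prepared for indexing, the following minimal sketch extracts a few descriptive fields with Python's standard library. The sample document is invented for illustration, and the element names (`dataset`, `title`, `abstract`, `keywordSet`, `keyword`) are assumed from the general EML schema; the actual BEF-China files may be structured differently.

```python
import xml.etree.ElementTree as ET

# Invented sample, NOT an actual BEF-China metadata file.
SAMPLE = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
                     packageId="example.1.1" system="example">
  <dataset>
    <title>Hypothetical tree growth measurements</title>
    <abstract><para>Made-up abstract text for illustration.</para></abstract>
    <keywordSet>
      <keyword>tree growth</keyword>
      <keyword>biodiversity</keyword>
    </keywordSet>
  </dataset>
</eml:eml>"""


def extract_fields(xml_text):
    """Return title, abstract and keywords of one metadata document."""
    root = ET.fromstring(xml_text)
    dataset = root.find("dataset")  # assumed: un-namespaced child of the root
    if dataset is None:
        return {}
    title = dataset.findtext("title", default="").strip()
    abstract_el = dataset.find("abstract")
    # Join all nested text (e.g. <para> elements) and normalise whitespace.
    abstract = ""
    if abstract_el is not None:
        abstract = " ".join("".join(abstract_el.itertext()).split())
    keywords = [k.text.strip() for k in dataset.iter("keyword") if k.text]
    return {"title": title, "abstract": abstract, "keywords": keywords}


fields = extract_fields(SAMPLE)
```

The resulting dictionary could then be passed to a search engine of choice; which fields are worth indexing is a design decision that this sketch does not settle.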

Question Collection
The development of the test collection is driven by two requirements: we aim at providing a test collection that reflects real world information needs of biodiversity scholars. At the same time, we need to ensure that at least a fraction of the datasets in the corpus is relevant to the information needs expressed in the queries. Therefore, we selected six questions related to the BEF-China datasets from a question corpus collected in our previous research (Löffler et al. 2021). In addition, we analysed the question structure of this question corpus and grouped the noun entities into various categories such as Organism, Environment or Process. Based on the categories occurring in the questions, we established question templates such as <ORGANISM> in <ENVIRONMENT>, <DATA PARAMETER> measured for (<ORGANISM> OR <ENVIRONMENT>) and <PROCESS> influences (<ENVIRONMENT> OR <ORGANISM>). Following these templates, we created a further eight questions related to biodiversity research and ecosystem functioning. The final question corpus used for the benchmark is presented in Table 1.

Figure 1: Excerpt of a BEF-China metadata file.
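How a question template with category slots expands into a concrete question can be illustrated with a short sketch. The slot values below are hypothetical and are not taken from Table 1, and each OR alternative in a template is resolved to a single branch for brevity; the sketch shows only the mechanics of slot filling, not necessarily how the eight additional questions were produced.

```python
import re

# Templates from the text, with each OR alternative reduced to one branch.
TEMPLATES = [
    "<ORGANISM> in <ENVIRONMENT>",
    "<DATA PARAMETER> measured for <ORGANISM>",
    "<PROCESS> influences <ENVIRONMENT>",
]

# Hypothetical slot values, invented for illustration.
SLOT_VALUES = {
    "ORGANISM": "trees",
    "ENVIRONMENT": "subtropical forest",
    "DATA PARAMETER": "leaf area",
    "PROCESS": "soil erosion",
}


def fill(template, values):
    """Replace every <SLOT> placeholder with its value."""
    return re.sub(r"<([A-Z ]+)>", lambda m: values[m.group(1)], template)


questions = [fill(t, SLOT_VALUES) for t in TEMPLATES]
```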

Human Assessments
Human assessments are required to determine which dataset is relevant to which question. The assessments were provided by one of the co-authors, who was the data manager of the BEF-China project at the time and had acquired a comprehensive overview of the entire corpus of datasets. For each of the 14 questions, he judged every dataset as either relevant or not relevant. He was asked to judge a dataset as 'relevant' even if it only partially comprised relevant data. As the corpus also contains presentations, plot descriptions and theses produced in the scope of the BEF-China project, not every dataset is relevant to at least one of the 14 questions. Nevertheless, the biodiversity expert took the time to go through all 372 datasets for each question. Hence, 5208 relevance judgements (14 questions x 372 datasets) had to be made. Of these, 239 were judged as relevant or partially relevant. The relevance judgements are provided in a txt file complying with the TREC benchmark data format. An entry in the txt file looks as follows:
1::161::1::1424380312
The first number denotes the question number, the second the dataset number, the third the relevance judgement (1 = relevant) and the last the timestamp of the creation of the entry. All datasets of the BEF-China corpus that are not listed as relevant for a question are deemed not relevant. Hence, the txt file only contains the relevant datasets per question.
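The entry format described above can be read with a few lines of code. The sketch below assumes exactly four fields separated by `::`, as in the example entry; the function names `parse_judgements` and `is_relevant` are hypothetical. Datasets without an entry are treated as not relevant, as stated above.

```python
from collections import defaultdict


def parse_judgements(lines):
    """Map each question number to the set of relevant dataset numbers."""
    relevant = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        question, dataset, judgement, _timestamp = line.split("::")
        if int(judgement) == 1:
            relevant[int(question)].add(int(dataset))
    return relevant


def is_relevant(relevant, question, dataset):
    """Datasets that are not listed for a question count as not relevant."""
    return dataset in relevant.get(question, set())


# The single example entry quoted in the text.
judgements = parse_judgements(["1::161::1::1424380312"])
```

In practice, the lines would be read from the txt file instead of a list literal, for example by passing an open file object to `parse_judgements`.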

Conclusion
In this work, we presented the first test collection for dataset search in biodiversity research. The test collection is publicly available. In our future work, we would like to use the presented test collection for evaluating dataset retrieval systems in the biodiversity domain, such as the one presented in Löffler et al. (2017).