Research Ideas and Outcomes
Forum Paper
Corresponding author: Alex Borisenko (aborisenko@biodetics.net)
Academic editor: Editorial Secretary
Received: 07 Feb 2024 | Accepted: 28 Feb 2024 | Published: 28 Mar 2024
© 2024 Alex Borisenko, Robert Young, Robert Hanner
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Borisenko A, Young R, Hanner R (2024) A lab-centric, workflow-based data management system for environmental DNA research. Research Ideas and Outcomes 10: e120483. https://doi.org/10.3897/rio.10.e120483
The adoption of environmental DNA approaches as a standard tool for biodiversity monitoring is increasing the number of eDNA-based species occurrence records; however, considerable disparity remains in the nature and quality of associated information, much of it unpublished and/or poorly parametrised. A robust system for tracking biological materials from their point of origin through laboratory analyses is required to connect inferred taxon occurrences with analytical history and provenance data. The bulk of eDNA research is currently driven by small-scale operations, where the tasks of digitising, organising and cross-referencing field records against laboratory analytical data and biomaterial sample locations are often performed manually and in a disconnected fashion.
We present an integrative, full-stack data management solution that provides a structured ontological concept, a minimalist data schema for eDNA research and a software application prototype designed to facilitate real-time digitisation, parsing, annotation and archival of eDNA data. The system tracks the provenance and analytical history of biological samples through a structured hierarchy of events, linked with associated digital file attachment archives, such as images and raw sequence files, and with inferred taxonomic occurrence records. The data entry process is compartmentalised and incorporated into the corresponding stages of standard operations used in fieldwork, biological collection management and laboratory analysis. Resulting data records can be integrated into various output formats required for large-scale analytics, publication and/or submission to global data aggregators. The prototype is implemented on the Microsoft 365 platform as a relational database (Access) linked to cloud-based data tables (SharePoint) and a set of associated data conversion spreadsheets (Excel). The system is designed primarily around the data management needs of small research labs; however, it is scalable to larger institutions and inter-institutional academic networks.
Keywords: eDNA, database, Microsoft 365, Access, SharePoint, Excel, digitisation, fieldwork, collection management, LIMS
Environmental DNA (eDNA) approaches have gained considerable traction in biodiversity research and monitoring (
While much effort has been devoted to workflow automation for processing already deposited biodiversity data (e.g.
Data provenance has long been a critical consideration in computer science, regarded as a potential major source of ambiguity and error in downstream analysis (
Several large institutions and research networks are developing centralised field survey data management platforms (
While advocating for better resourcing of data management efforts deployed by smaller-scale eDNA research operations, we posit that increasing their efficiency as providers of accurate and standardised genomic biodiversity data requires overcoming several operational challenges outlined below.
Challenges to efficient data collection stem from the inherently complicated nature of biodiversity informatics (
Despite the multitude of biological data management systems developed to date, most of them are not readily deployable in small labs or lack the intuitive structure that makes them accessible for a particular application (
When the outcomes of eDNA research are communicated through scientific publications or technical reports, associated raw data archives may remain in proprietary custody. If published, they may be structured according to a multitude of disparate publisher or client requirements. Publication data standards for biodiversity and ecology advocate the use of non-relational (“flat-file”) spreadsheets for data submission (
Such publication datasets are often manually collated by researchers at the end of their study or even later. Data may be sourced from disparate, disconnected and sometimes poorly-validated records made by different people during different stages of the project. The validation of researcher data against a publisher’s standards usually happens during the data submission process (e.g.
Due to the complicated nature of eDNA field sampling techniques and molecular analytical pathways, information pertaining to sourcing, managing, processing and analysing eDNA samples may comprise hundreds of data fields, many of which can be specific to particular sampling or analytical methodologies. A single research lab often hosts several projects simultaneously, each with its unique research design and methods, which may change over time. Parsing and transferring this diverse information while keeping track of the different projects is a daunting undertaking, especially when manual data manipulations are required to transcribe personal records and notes. Consolidating disparate and unstructured field/lab records retrospectively into a single dataset can also be time-consuming and mentally taxing.
Finally, when the nested relational hierarchy of research stages and material transformations is “flattened” into a single non-relational spreadsheet during integration (
The above challenges particularly affect small-scale research operations, which constitute a major part of the eDNA research establishment. As a result, a large proportion of generated genetic and survey data, even if technically published, remains practically unusable for large-scale parametrised meta-analysis. Although this is a universal problem plaguing biodiversity datasets at large (
For example, a recent comprehensive review of eDNA metabarcoding in the assessment of aquatic ecosystems (
To overcome or, at least, to alleviate these shortfalls, more attention needs to be paid to structured data digitisation. In particular, efforts should concentrate on facilitating the data capture and management needs of eDNA research operations that perform these tasks. An important step in their adherence to current standards and best practices would be the development of data management tools that are intuitive, user-friendly, locally deployable and customisable for small-scale operations, while providing downstream integration with data aggregators. Such tools should facilitate efficient tracking of biological samples and real-time data entry while reflecting the logic of each lab’s operational workflows and supporting connectivity between different stages — particularly between fieldwork and laboratory experiments. Finally, these tools should be seamlessly integrated within each eDNA research operation into a single coherent data management system built on a commonly used software platform that does not require specialised technical background or IT staff to deploy and maintain. A working prototype for such a system is described herein.
We propose using a single relational laboratory-wide database with compartmentalised, staged data entry protocols that map the operational complexity of eDNA projects. Real-time data recording and validation is facilitated by breaking it down into manageable partitions, corresponding in sequential order and content to the individual stages of the research workflow. This makes it easier for different researchers and staff members to relay information between projects and research phases using a common data standard. Under this scenario, publication datasets and summary reports can be generated using automated data queries, with moderate added effort and minimal data loss. Real-time and unambiguous linking of data records with biological materials facilitates efficient access to them when additional analyses are required.
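To make the query-based flattening concrete, here is a minimal sketch in Python (sqlite3), assuming a simplified three-table layout; the table and field names are illustrative placeholders, not the prototype's actual Access schema. A single relational join produces Darwin Core-style occurrence rows ready for a publication spreadsheet:

```python
import sqlite3

# Simplified, illustrative layout: one collecting Event, a Sample derived
# from it and the taxa inferred from that Sample.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events  (event_id TEXT PRIMARY KEY, locality TEXT,
                      event_date TEXT, method TEXT);
CREATE TABLE samples (sample_id TEXT PRIMARY KEY,
                      event_id TEXT REFERENCES events(event_id));
CREATE TABLE taxa    (sample_id TEXT REFERENCES samples(sample_id),
                      scientific_name TEXT);

INSERT INTO events  VALUES ('EV-2024-0001', 'Speed River', '2024-06-12', 'eDNA filter');
INSERT INTO samples VALUES ('SA-2024-0001', 'EV-2024-0001');
INSERT INTO taxa    VALUES ('SA-2024-0001', 'Salvelinus fontinalis');
""")

# One join query "flattens" the Event -> Sample -> Taxon hierarchy into
# non-relational occurrence rows for publication or aggregator submission.
rows = con.execute("""
    SELECT t.scientific_name AS scientificName,
           e.locality        AS locality,
           e.event_date      AS eventDate,
           e.method          AS samplingProtocol,
           s.sample_id       AS materialSampleID
    FROM taxa t
    JOIN samples s ON s.sample_id = t.sample_id
    JOIN events  e ON e.event_id  = s.event_id
""").fetchall()

for row in rows:
    print(row)  # ('Salvelinus fontinalis', 'Speed River', '2024-06-12', ...)
```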
To develop the relational data architecture that would facilitate structured data entry, it is important to conceptualise its general operational framework (what core objects or entities we are dealing with) and ontological framework (broad categories of data that are being recorded). Below, we outline these conceptual considerations in more detail. We further provide an overview of a minimalist data schema and present examples of implementing it as a standard for practical application in a small laboratory context.
Conventional zoological and botanical collecting activities usually preserve target organisms as representative biological individuals (in unitary organisms), clones (in modular organisms) or as fragments thereof. Such preserved organisms, conventionally referred to as voucher specimens (e.g.
Recent syn-ecological advances, aided by rapidly developing DNA technologies, are expanding the perception of an organism beyond its core taxonomic identity. Instead, organisms are increasingly recognised as hosts to diverse microbiomes (
Much of ecological genomic research deals with field-collected aggregations of multiple uncounted, sometimes undiscernible organisms of different, often unknown taxonomic identities. Such aggregations are commonly referred to as “bulk samples” (
The term “sample” has been widely used to describe organismal parts or pieces of tissue destined for laboratory analysis (e.g.
The term “lot” is commonly used in biological collection management to categorise a batch of multiple organisms derived from a single collecting event. It is sometimes restricted to characterise taxonomically sorted aggregations of specimens and juxtaposed to unsorted “bulk samples”, such as trap contents sourced from the field (
We posit that, despite the fundamental biological difference between lots, individual organisms and environmental DNA, the logistics of field sourcing, processing and analysing biological materials of different nature are fundamentally similar. For example, molecular analytical protocols applied in environmental DNA research can also be used for DNA-based biodiversity analysis of aggregate specimen collections, such as arthropod traps or plankton tows. Analyses of such lots can be done by picking out and sequencing individual specimens (e.g.
Data records hosted by biodiversity data aggregators, such as GBIF, are centred around “species occurrences” or “observations” (sensu
Once a MaterialSample is collected in the field (Lot), it may be processed/subdivided (Sample) and transformed, for example, through DNA extraction (Aliquot, see below). It may further be transferred between agents, research teams, labs, institutions etc. during different phases of the analytical process. During each of these stages, associated data must “pass through” the data management system of the next processing facility efficiently and without information loss. A laboratory should be able to use the same data management system to track eDNA research, to facilitate metagenomic analyses of lots (e.g. invertebrate trap contents) and to contribute reference DNA sequences derived from taxonomically curated voucher specimens. In a “simple” eDNA research scenario, the same water filter with field-collected organic slough may be registered as a Lot or as a Sample, depending on its processing stage. The proposed data management framework provides sufficient flexibility required to accommodate the various collection processing pathways for eDNA research and other emerging fields of enquiry. At the same time, it conforms to the logic model of conventional collection-based biodiversity research, which reduces potential connectivity issues during future “crosswalking” with data schemas used in natural history collection databases (
The second, operationally critical part of an occurrence record is the Event, broadly defined within Darwin Core as an “action that occurs at some location during some time” — https://dwc.tdwg.org/terms/#event (
The taxonomic identity of the MaterialSample constitutes the central piece of information contained in an occurrence record; however, it is of tangential importance to the logistics of an eDNA research project. The detection of certain taxa in a sample depends on the sampling methodology used (e.g. study site choice, filtration technique, preservation parameters) and is derived from a certain procedural outcome (e.g. targeted PCR detection, Sanger sequencing or metabarcoding). Each sample can be subdivided and processed using different analytical and/or bioinformatic pipelines or as several replicates using the same pipeline. As the limit of detection for different taxa may vary between methods and/or analytical parameters used, these analyses may yield varying taxonomic outcomes. Thus, although taxonomic occurrence records are the end-point of many eDNA research projects, they are best treated as context-dependent annotations of a MaterialSample and only meaningful if underpinned by a robust and adequately parametrised “Event—MaterialSample” data dyad. This data management approach is congruent with the emerging Collecting Event Core concept (
In eDNA research, as with other taxonomic inferences derived from collected and analysed biological objects, it is practical to shift the emphasis of the data model from Occurrence Core to Event Core. Under the Event Core logic model (
From a pragmatic laboratory data management point of view, it is important to acknowledge that the Darwin Core schema employed by GBIF was designed to facilitate biodiversity data publication (
To maintain semantic distinction between field collecting and laboratory analyses, we will refer to the former as Events and to the latter as Analyses, each characterised by a defined methodology and localised in space and time. Operationally, this allows breaking the data entry process into stages corresponding to phases of field collecting, post-field processing and laboratory analyses. Keeping track of unsuccessful Events and Analyses (“negative results”) further parametrises the methodological context for the sought taxonomic occurrence outcomes. For example, it may be useful to know that the detection of a certain taxon in a certain locality is linked to several unsuccessful attempts to recover its sequence using alternative collecting protocols or analytical parameters. Darwin Core does not accommodate this relational complexity (
From a broad philosophical perspective, contemporary field-based biological disciplines, including eDNA research, span two classical domains of enquiry: Natural History, which aims to accrue empirical knowledge about the natural world, and Natural Philosophy, which aims to infer abstract universal patterns (
It is important to contextualise our ontological framework by providing semantic clarification on our use of the terms “data” and “metadata”. We apply the original and currently predominant definition of the term “metadata” as “data about data” (
Several recent works have stretched the scope of the term “metadata”, using it to denote sampling and provenance information (e.g.
Within the context of eDNA data ontologies and within the scope of data associated with natural objects or observations, we can define three major categories characterised by the nature of data (see the table below).
Practical application and examples of three broad ontological categories of eDNA data (history, provenance and attributes), as they relate to the two operational entities (sampling Events and MaterialSamples).
Provenance: Where? When? How?
- Event (Activity) — Applies to: spatiotemporal and circumstantial properties of the field sampling effort. Examples: sampling locality, GPS coordinates, sampling date/time, sampling method, habitat classification; molecular analytical methodology.
- MaterialSample (Biological Object) — Applies to: relationship to the sampling effort; record of material transactions, processing and analysis. Examples: associations between lots (field samples), laboratory samples, sub-samples, aliquots etc.

Attributes: What?
- Event (Activity) — Applies to: qualitative or metric data pertaining to the sampling effort. Examples: sampling depth, water temperature, turbidity, weather conditions, volume of water sampled, sampling duration.
- MaterialSample (Biological Object) — Applies to: intrinsic or relational properties of the biological materials (objects) collected. Examples: taxonomic position or biological condition of the specimen from which the sample was obtained, aliquot volume or DNA concentration.

History: Why? Who?
- Event (Activity) — Applies to: agent(s) and organisation(s) undertaking collecting/sampling activities and associated data collection. Examples: institution executing the expedition; field crew members.
- MaterialSample (Biological Object) — Applies to: agent(s) and organisation(s) taking custody of materials and performing processing/analytical procedures. Examples: collection repository, collectors, analytical laboratory, sample processing technicians.
Provenance circumscribes the spatiotemporal and circumstantial properties of the collecting or analytical events. This is the core part of the biodiversity ontology, providing details on the origin and transformations of biological objects and inferred taxonomic occurrences. Provenance data can be grouped into three broad categories of properties that describe the collecting event’s localisation in space (“where?”), time (“when?”) and the method used (“how?”). This information should be recorded at the time when the collecting or analytical event occurs and applies by extension to all biological objects (MaterialSamples) that are collected or produced as a result: lots, specimens, samples, aliquots and their derivatives.
Attributes characterise intrinsic (e.g. organismal) or relational (e.g. ecological) properties of the MaterialSample (“what?”) or related circumstantial properties of their origin. Unlike provenance information, which applies to an entire event and all derived materials, attributes may characterise a collection lot as a whole or may be restricted to individual biological objects or their derivatives (e.g. size of an organism or form of sample preservation). Data acquired during subsequent analysis, such as DNA concentration, sequence quality and interpretation of analytical results (e.g. presence/absence of target taxa) will fall into this category as well. Relevant information may be recorded at the time of collecting or during subsequent processing and analysis and may be stored in the form of structured data fields or file attachments. In the context of eDNA research, field-collected data may include a description and/or images of the filter containing the water sample.
Once a biological object is removed from nature and is transferred into human custody, it also becomes a cultural object. Historic context provides an account of agents (persons) and organisations behind the events, for example, staff undertaking the sampling activities and performing subsequent processing/analyses of MaterialSample. Thus, “historic” properties record and contextualise human interactions with biological objects, rather than their natural origin or intrinsic properties. This information provides background on the purpose of the events and overall experimental design (“why?”), the actors involved (“who?”), a record of transactions (e.g. change of ownership), processing status, storage conditions and physical location(s) of materials. It should be stressed that any information about the biological object constitutes an integral part of its research value to the scientific enterprise and, thus, by extension, of its cultural value to society at large.
Certain data types may fall into a “grey area”. For example, photos taken at the collection site can be used to parametrise provenance data; however, they also depict attributes of the collecting station and/or collecting event (see below). Likewise, a scanned page from a field journal may depict provenance information, attributes of the materials collected and historic context of the collecting process.
We present a prototype data management system aligned with the operational and ontological frameworks described above; it implements the data architecture for environmental research design, integrates with standard field and laboratory workflows and is deployable in a typical eDNA research setting. This system addresses the following operational needs:
Below is a more detailed account of the prototype database.
To address the operational needs outlined above, a data management system for eDNA research operations should meet the following functionality requirements:
Below is an outline of specific technical solutions that we have developed to address these requirements.
Our proposed data architecture is based upon minimum data requirements currently established for biodiversity research, with emphasis on eDNA and other genomic-derived data (
This conceptual data framework has been implemented as a prototype eDNA Laboratory Operations Tracking Database. Its MS Access front-end graphical user interface consists of Forms, Reports and Queries linked to data contained in back-end Tables, which may be stored locally on the workstation running the database or, preferably, hosted as SharePoint Lists on a corporate Microsoft 365 SharePoint site. User access to data contained in these tables through the database front-end or through the SharePoint website is managed by site administrators. This set-up is easily deployable across organisations with Microsoft 365 for Business, but could also work, with proper adjustments, on a locally accessible network or on a compatible cloud server. It allows real-time multi-user collaboration without the need for file versioning or manual backups. Furthermore, it requires no additional hosting and maintenance overhead or dedicated IT infrastructure or staff to manage access, permissions and security.
At its core, eDNA research is the process of inferring digital genomic data from analogue biological samples. Therefore, the referential integrity of the entire research project hinges on the researcher’s ability to discern individual samples and to track their derivatives through all stages of processing and analyses. Each material entity must be unambiguously associated with corresponding data records over the project’s entire life cycle. Thus, establishing a proper numbering convention at the source is essential. The database prototype addresses this critical step by requiring the users to devise a robust and intuitive schema of unique, human-readable identifying codes (“IDs” or Primary Keys) for all physical and ontological entities at the inception of each project and/or experiment, a step that is often neglected with “convenience” sampling (
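As a sketch of how such a convention could be enforced at data entry, the snippet below validates candidate identifiers against a hypothetical pattern (project code, entity prefix, year, four-digit serial); the syntax is invented for illustration and each lab would substitute its own scheme:

```python
import re

# Hypothetical ID convention: <project>-<entity>-<year>-<serial>,
# e.g. "EDNA-LOT-2024-0042". The pattern is illustrative only.
ID_PATTERN = re.compile(r"[A-Z]{2,6}-(LOT|SAM|ALQ)-\d{4}-\d{4}")

def validate_id(candidate: str) -> bool:
    """Check a manually assigned identifier against the agreed convention."""
    return ID_PATTERN.fullmatch(candidate) is not None

assert validate_id("EDNA-LOT-2024-0042")   # conforms to the convention
assert not validate_id("lot42")            # rejected: free-form label
```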
Many database designers (e.g.
Firstly, to avoid mixing up lots or samples in the field and/or lab, each biological object must be assigned a unique ID code (e.g. “Field ID”, Sample ID”, “Catalogue Number”). This is often accompanied by affixing a label with the pre-printed ID code and almost invariably pre-dates the moment when the database record is generated; hence, the surrogate Primary Key is generally unavailable when the collection object needs to be labelled. As a result, keeping accurate track of the manually-assigned ID number — and not the random surrogate key — becomes critical to ensuring data integrity.
Secondly, most machine-generated Primary Keys represent long integer numbers incrementing from 1 to infinity and are, thus, prone to overlap (not globally unique). They are only meaningful within the context of the database table where they have been generated. When migrating data between tables and/or data management systems and especially when integrating data from multiple sources into large data aggregators, such as GBIF, new surrogate keys are generated by the system, whereas original surrogate primary keys cannot be used to identify such collated records unambiguously within the new context.
By contrast, using operator-generated, or natural, Primary Keys in biological databases, while not without its challenges (
While not intending to revisit the discussions regarding the feasibility of using persistent Globally Unique Identifiers in biological databases (e.g.
A simplified proposed schema of key operational entities of an eDNA data management system is provided in Fig.
The following two data entities provide the geographic reference for the Collecting Event Core. Although field adoption of GIS-based data capture in wildlife census has been proposed early on (
Sites
A Site is a medium-high level of geographic localisation of project activities. Using Fisheries and Oceans Canada (DFO) standard terminology (
The Site registry in the current database prototype can be cross-referenced against automatically downloadable official gazetteers of geographic localities for Canada and the United States:
the Canadian Geographical Names Database (CGNDB) provided by the Geographical Names Board of Canada – https://natural-resources.canada.ca/earth-sciences/geography/download-geographical-names-data/9245
and the USGS Geographic Names Information System (GNIS) provided by the U.S. Board on Geographic Names – https://prd-tnm.s3.amazonaws.com/index.html?prefix=StagedProducts/GeographicNames/DomesticNames/
Stations
A Station identifies an exact geolocated spot where field samples are taken. As per DFO terminology (
The following database tables contain information directly related to the Darwin Core Event data class. Note that only one of them (Events proper) is directly linked to MaterialSamples, whereas the remainder are used to provide further parametrised context (Readings and Observations) and to help structure this information into the experimental logic model (Activities).
Activities
An Activity represents a series of collecting events, measurements or observations undertaken as part of a project within a specified Site, usually over a restricted timespan (e.g. one to several days); for example, a field trip or short-term expedition. Each Activity is unambiguously linked to a single Project (through the reference Project ID) and to a single Site (through the reference Site ID).
[Collecting] Events
The Events table characterises the specific targeted field collecting effort that results in the acquisition of biological materials (Lot; see below) at a particular Station over a specified time interval. As the name implies, it is the key element of the event-based data management schema; it is also the key point of reference linking biological materials with their provenance information. Each Event is linked to a single parent Activity and Station. Within the eDNA research context, a typical example would be the collection of aquatic DNA on to a water filter. Each sampling replicate, repeat or replication, as per DFO definition (
Readings (Instrumental Reads)
Many ecological sampling activities involve recording chemical, physical or other parameters of the environment (water, soil, air) at the locality where sampling occurs. These measurements are usually taken with specialised equipment, using a set of standards established as part of the study design. An example would be water quality measurements taken with a digital probe. The Readings table is designed to accommodate this information. Although often considered part of sample “metadata”, this information does not fit the strict metadata definition (see discussion above). It is not necessarily linked to any particular sampling Event, but may be indirectly associated with one or several Events through the corresponding Station and collection date.
Observations
Although sometimes used as an alternative name for the occurrence record (
The following tables characterise operational relationships within the MaterialSample Core data class, operationally separated into three categories: Lots (field-derived MaterialSamples), Samples (resulting from concentrating, subdividing or otherwise processing Lots at the research facility) and Aliquots (laboratory derivatives of samples destined for analysis).
Lots
The Lots table houses a registry of field-sourced biological materials (Lots) originating from a field collecting Event. As mentioned previously, the term has been co-opted from natural history collection management practice where it is used to define a set of specimens and/or samples from one or multiple organisms originating from the same collecting event that are catalogued and stored together as a single unit (
*Specimens
For research operations focused on building genomic reference collections linked to preserved voucher specimens (as discussed in the Operational Framework section), it may be optimal to designate a separate Specimens table within the proposed data architecture. However, for the purpose of most field-based eDNA research, the Lots table can accommodate essential provenance information on voucher specimens (e.g. opportunistically collected organisms), without the need to designate a separate data entity. The table is, therefore, not implemented in the prototype data schema.
Samples
The Samples table stores information about field- or laboratory-derived MaterialSamples prepared and preserved for archival storage and/or partitioned for laboratory analysis. In cases when DNA is filtered from the Lot preservation medium, the Sample would constitute a portion of that parent Lot; however, under many eDNA research scenarios, it may represent the same physical object as the entire Lot (e.g. DNA filter). In some cases, Samples may originate from external collaborators and not directly from the field. Samples are often grouped together into Containers (see below) for processing or storage efficiency; however, the latter should not be conflated with Lots. The table accommodates subdividing each sample into subsamples by creating new records linked to the parent Sample ID via the designated foreign key field.
Aliquots
Aliquots are laboratory derivatives of Samples: DNA extracts, PCR products etc. In many cases, these are transient substances that are used up during analyses. The purpose of an Aliquot is to identify the portion of each sample that is destined for a specific analytical pipeline (e.g. to sequence a particular gene region). Aliquots are arranged into Arrays (see below) for streamlined batch processing. Similar to Lots and Samples, sub-aliquots can be accounted for by creating new records linked to the parent Aliquot ID through a foreign key field.
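A minimal sketch (Python/sqlite3, with invented table and field names rather than the prototype's actual Access/SharePoint schema) of how this Lot → Sample → Aliquot chain could be expressed relationally, using human-readable natural primary keys and self-referencing foreign keys for subsamples and sub-aliquots:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
con.executescript("""
CREATE TABLE events   (event_id   TEXT PRIMARY KEY, station_id TEXT, event_date TEXT);
CREATE TABLE lots     (lot_id     TEXT PRIMARY KEY,
                       event_id   TEXT NOT NULL REFERENCES events(event_id));
CREATE TABLE samples  (sample_id  TEXT PRIMARY KEY,
                       lot_id     TEXT REFERENCES lots(lot_id),
                       parent_sample_id TEXT REFERENCES samples(sample_id));    -- subsampling
CREATE TABLE aliquots (aliquot_id TEXT PRIMARY KEY,
                       sample_id  TEXT NOT NULL REFERENCES samples(sample_id),
                       parent_aliquot_id TEXT REFERENCES aliquots(aliquot_id)); -- sub-aliquots
""")
```

Because every key is a natural, human-readable identifier, the same values can label physical filters and tubes, and lineage can be traced in either direction with ordinary joins.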
The following two data entities are used to organise MaterialSample units for storage, batch processing and/or analysis. Unlike MaterialSample units proper, these entities are provenance-agnostic, allowing us to aggregate materials from multiple collecting Events, Activities, Sites etc., provided that each associated MaterialSample and, if applicable, its position (e.g. processing order) within the batch are unambiguously tracked.
Containers
Containers are physical objects used to store biological materials, which can be archived, relocated or processed as a single unit. Each container can be used to house one or many biological collection items (e.g. whole Lots, Samples, Aliquots or portions thereof). Containers are designed to facilitate organisation of samples together within a processing or storage batch, their localisation within the research facility and their transfer within or outside the lab. Examples include tube racks, boxes, trays, Tupperware, removable drawers etc.
Arrays
Arrays, or processing batches, are operational (logistical) counterparts of Containers that are used to organise Aliquots or their virtual derivatives in sequential order for laboratory analyses. As such, they may be somewhat “ephemeral” as physical objects, for example, PCR plates that are used up and discarded after DNA sequencing. They may also be purely virtual, for example, batches of raw DNA sequence files run through an informatics pipeline. Additional built-in functionality in the prototype database allows the user to map aliquots within an array (processing batch) and display these maps in several common formats, for example, 12 × 8 wells in a microplate or 10 × 10 sample tubes in a rack.
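The plate-mapping functionality can be illustrated with a small helper that converts a 0-based processing order into a 96-well coordinate; it assumes row-major fill order (A1, A2 … A12, B1 …), which a given lab's protocol may or may not follow:

```python
import string

def well_label(position: int, rows: int = 8, cols: int = 12) -> str:
    """Map a 0-based processing order to a microplate well label (row-major)."""
    if not 0 <= position < rows * cols:
        raise ValueError("position outside plate")
    r, c = divmod(position, cols)
    return f"{string.ascii_uppercase[r]}{c + 1}"

assert well_label(0)  == "A1"
assert well_label(12) == "B1"    # first well of the second row
assert well_label(95) == "H12"   # last well of a 12 x 8 plate
```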
The following two data entities do not conform to any existing Darwin Core data classes; however, they are operationally essential for laboratory information management. As discussed earlier, we use the term “Analytical Event” to emphasise that they represent a separate category of “events”, which corresponds to the “identification process” category recognised within the Biological Collections Ontology (
Analyses
The Analyses table provides a registry of analytical procedures and stages used in laboratory analyses, for example, DNA extraction, PCR reactions, sequencing runs etc. The prototype data schema provides for many-to-many relationships between Analyses registry and associated Arrays, thereby allowing flexibility in tracking the processing of a single Array through multiple analytical stages or assembling multiple Arrays for a single analytical procedure, for example, multiplexing several PCR plates for the same sequencing run. Analyses table fields are further parametrised by ancillary registries for target Markers, PCR Primer combinations and a Multiplexing schema to map the Aliquots used in Next-Generation sequencing runs.
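In relational terms, such many-to-many links are typically expressed through a junction (bridge) table; the sketch below (same illustrative Python/sqlite3 style, invented names) records each Analysis–Array pairing exactly once:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE analyses (analysis_id TEXT PRIMARY KEY, method TEXT);
CREATE TABLE arrays   (array_id    TEXT PRIMARY KEY, format TEXT);
CREATE TABLE analysis_arrays (                      -- junction table
    analysis_id TEXT NOT NULL REFERENCES analyses(analysis_id),
    array_id    TEXT NOT NULL REFERENCES arrays(array_id),
    PRIMARY KEY (analysis_id, array_id)             -- each pairing once
);
""")
```

This lets one Array pass through several analytical stages and, conversely, several Arrays (e.g. multiplexed PCR plates) feed a single sequencing run.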
Experiments
Experiments represent sets of Analyses aimed at a particular research goal; for example, grouping together sets of Analyses that use the same protocols. As such, they represent abstract entities used to facilitate operational logistics and may be placed in the category below.
The following tables serve to facilitate overall operational logistics of research projects, thereby parametrising the historic component of the ontological framework. Portions of the information contained in these tables fall into the metadata category, as related to Event and MaterialSample Core components.
Projects
The Project table houses records pertaining to the logistics of administration and/or management of research and survey activities; therefore, it is not subordinate to any other database module. Projects are registered before the beginning of any related field activities or laboratory experiments. All activities, Collecting Event Core and MaterialSample Core tables are associated with the respective Project by linking each of them to the corresponding Project ID. However, Projects may have a many-to-many relationship with laboratory Analyses and Experiments, depending on experimental design and laboratory management logistics. As such, the Project may be considered as the logistical counterpart of the Experiment.
Agents
The Agents table hosts names, institutional affiliations and contact details of persons recorded in other database tables (collectors, data recorders, processing staff, collaborators, project managers, expedition leads etc.). Information from this table is linked to agent drop-down menus available in other tables.
Organizations
The Organizations table holds information about institutions, laboratories, companies and other organisations affiliated with or responsible for different projects, experiments, corresponding activities and analytical stages.
Accessions
The Accessions table is adopted from biological collection management practices (
*Loans
Although not currently implemented in the prototype database, loans (batches of biological materials dispatched to external users) constitute an important component of collection management logistics (
[Storage] Units
Storage Units are items of furniture and/or equipment used for material storage (freezers, refrigerators, cabinets, shelving units etc.). Typically, they have a fixed location in a specific building, floor, room etc. within an organisation. Each Storage Unit is linked to multiple Storage Locators (see below).
[Storage] Locators
Storage Locators are fixed compartments within Storage Units housing various physical or biological objects, specifically, collection items (Containers with Lots, Samples or Aliquots). Examples include fixed drawers, shelves or slots within freezers, shelving units or storage cabinets. Locators are important in ensuring that biological materials housed and processed by lab members can be easily found and tracked within the laboratory or collection facility. Locators have a one-to-many relationship with storage Containers and, by extension, with all associated Lots, Samples and/or Aliquots.
Equipment
Most research activities use specialised equipment, which may impact field collecting and analytical outcomes. An Equipment inventory helps to control for biases that may be introduced by using generic equipment types (e.g. technical specifications of different brands of eDNA samplers) or particular equipment items (e.g. working condition or calibration). Individual Events, Instrumental Reads and Analyses could be linked to utilised Equipment items through dedicated foreign key fields. Depending on laboratory setting, this module could be further parametrised by adding separate registries of calibration, maintenance or sign-out for use by laboratory staff and/or external collaborators.
*Supplies
Basic information on standard Supplies used in particular Events (e.g. eDNA filters) and Analyses (e.g. PCR reagents) is incorporated within the respective Events, Analyses and other data tables. For larger-scale operations, it may be useful to establish a separate registry of supplies and/or reagents that would allow evaluating the relative performance of separate supply batches or reagent stocks over time. Although not implemented in the prototype database, this data module could be added and further parametrised by logging accrued stock and its use for field or laboratory work, linked to Activities, Events, Experiments or Analyses. It could also be integrated with other enterprise resource planning modules, such as a registry of purchase orders. Such modules could be custom-built or adopted from existing off-the-shelf enterprise resource planning solutions.
The following tables provide annotations for files associated with existing data records that are either unstructured (e.g. raster images) or cannot be adequately parsed and incorporated into existing data fields without significant information loss (e.g. original Excel tables). Each data entry includes a reference (foreign key) linking it to the “parent” record (e.g. collecting Event ID) and an absolute URL to the online resource where the file is hosted. By default, attachment files are named in a self-explanatory way (i.e. by incorporating the foreign key into the file name) and are hosted in a designated folder on SharePoint or other cloud server that ensures reliable data hosting for the project’s life cycle. This allows effective retrieval of external files associated with each database record, as well as direct browsing through data folders on the cloud server and, as necessary, batch processing, backup or migration of these files. Built-in database functionality allows the user to perform batch renaming of files and automated generation of links, based on a set of standard algorithms.
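A sketch of what such batch renaming and link generation might look like; the folder layout, file type and hosting URL are assumptions for the example and do not reproduce the prototype's built-in algorithms:

```python
from pathlib import Path

BASE_URL = "https://example.sharepoint.com/sites/ednalab/attachments"  # hypothetical host

def rename_and_link(folder: Path, parent_id: str) -> list[tuple[Path, str]]:
    """Embed the parent record ID in each file name and return
    (new path, absolute URL) pairs for the attachments registry."""
    links = []
    for i, f in enumerate(sorted(folder.glob("*.jpg")), start=1):
        new_path = f.rename(f.with_name(f"{parent_id}_{i:03d}{f.suffix}"))
        links.append((new_path, f"{BASE_URL}/{new_path.name}"))
    return links

# e.g. rename_and_link(Path("field_photos"), "EV-2024-0001")
```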
Attachments
The Attachments table provides annotation for generic file attachments, such as documents, digital images or collaborator-provided Excel spreadsheets. Attachments may be linked to any record in any of the core tables within the prototype data schema using their primary ID as a foreign key. If the same primary ID is used to identify records in two or more different tables (e.g. if the syntax of the collecting Event ID is identical to the derived Lot ID), then the same attachment file (e.g. photo of the collected water filter) is linked to all corresponding database records. Hosting these files in dedicated folders on the database SharePoint server allows direct batch viewing and download through the online SharePoint interface or using OneDrive file manager applications.
Sequences
The Sequences table is used specifically to annotate DNA sequence file attachments (e.g. FASTA and FASTQ files) linked to records of individual Aliquots from which they have been generated. In addition to providing links to the Aliquot ID and file URL, this table includes fields that provide additional parametrisation, in line with the Darwin Core’s AssociatedSequences (https://dwc.tdwg.org/list/#dwc_associatedSequences) and related fields. While the database prototype does not offer built-in functionality for analysing stored sequence data files, it facilitates their direct download and processing using external software applications.
Protocols
The Protocols table provides a registry of Analytical Protocols and SOPs used in the organisation’s research operations linked to Collecting and Analytical Event Core modules. This module currently provides only basic annotation functionality; however, it offers potential for future parametrisation of research outcomes by adding custom tables with project- or laboratory-specific qualitative or quantitative metrics that vary, according to the collection or analytical protocols selected.
Several additional tables and fields within the prototype data schema allow basic taxonomic annotation for the MaterialSample (Lot, Sample or Aliquot), including modules that validate the taxonomy used against existing taxonomic references (currently, GBIF and NCBI taxonomy). The NGS_Taxonomy table provides a detailed breakdown of taxonomic occurrence records inferred from analysing Aliquot-derived raw sequence data using different informatics pipelines. By extension, these results are linked to field-sourced Lots with associated Collecting Events and other provenance information. They are also linked to laboratory-assembled arrays and associated analytical protocol parameters (Analytical Events), allowing us to backtrace the field provenance and/or methodological and procedural origin of each taxonomic occurrence record.
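As an external illustration of such validation (not the prototype's built-in module), a name can be checked against GBIF's public species-match web service using only the Python standard library:

```python
import json
import urllib.parse
import urllib.request

def gbif_match(name: str) -> dict:
    """Query the public GBIF species-match endpoint for a scientific name."""
    url = ("https://api.gbif.org/v1/species/match?"
           + urllib.parse.urlencode({"name": name}))
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

result = gbif_match("Salvelinus fontinalis")
print(result.get("matchType"), result.get("scientificName"), result.get("confidence"))
```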
The overarching goal of the data management system that supports eDNA research and other work based on analysing biological materials is to ensure that each analogue MaterialSample is unambiguously linked to its corresponding Event digital data record and that all information pertaining to its provenance, attributes and history is accurately captured and parsed in real time and in adequate detail. To meet these requirements, the process of data capture should be integrated with research operations in a way that minimises additional databasing effort and provides immediate incentives to the person(s) recording the data. This could be achieved through workflow optimisation (e.g. sequential structuring of operational and data entry phases), automation (batch file renaming/linking, direct instrumental input) or procedural guidance (integration of pop-up SOPs and checklists into the user interface).
Some of this functionality has been implemented as a suite of data management tools and modules in the prototype database; however, the feasibility of their practical deployment will depend on the specifics of user organisations, their infrastructure, workforce and research settings. Below, we outline some basic principles of how the proposed data architecture could be used to address the data management needs during different research phases and suggest best practices for streamlining the process and increasing data quality.
Table: Main workflow stages involved in MaterialSample-based research and their relationship to the sample, associated data and corresponding database tables in the data management framework. Asterisks (*) mark optional tables of potential use for collection repositories that were not implemented in the prototype database.
Field Collecting
- MaterialSample: field sourcing (collecting) and labelling of biological materials.
- Associated Data: assignment of unique Lot/Specimen identifiers; field capture of provenance data (geospatial information, observations, instrumental readings and metadata).
- Relevant Data Entities: Sites, Stations, Activities, Collecting Events, Instrumental Reads, Observations, Lots, Specimens*.

Pre-lab Processing and Preparation
- MaterialSample: preservation, sorting and labelling of biological materials; subsampling and/or preparation of (sub)samples for analysis.
- Associated Data: recording associations between Lots, Specimens, Samples and Aliquots and aggregating them into corresponding Container and/or Array records.
- Relevant Data Entities: Lots, Specimens*, Samples, Aliquots, Containers, Arrays.

Laboratory Analyses
- MaterialSample: analytical procedures to detect target DNA signatures and reconstruct taxonomic position/taxonomic lists.
- Associated Data: tracking and digitisation of laboratory analytical procedures (lab books, LIMS etc.).
- Relevant Data Entities: Aliquots, Arrays, Experiments, Analyses, Protocols.

Post-laboratory Informatics Analysis
- MaterialSample: not applicable.
- Associated Data: informatics analysis of qPCR and/or DNA sequencing data, including quality scoring, demultiplexing, NGS pipelines, taxonomic queries.
- Relevant Data Entities: Aliquots, Arrays, Experiments, Analyses, Protocols, Taxonomy, Sequences.

Transfer/Acquisition
- MaterialSample: movement of biological materials between organisations and/or agents.
- Associated Data: data migration between management systems, material transfer agreements, accessioning by the recipient.
- Relevant Data Entities: Lots, Specimens*, Samples, Aliquots, Containers, Arrays, Accessions, Loans*.

Archival/Deposition
- MaterialSample: long-term preservation of materials for potential future re-examination and/or analysis.
- Associated Data: data upload/archival in collection database.
- Relevant Data Entities: Locators, Storage Units.

Data Publication
- MaterialSample: not applicable.
- Associated Data: batch data query and conversion into data packages and/or data submission spreadsheets formatted to the requirements of the publisher or data repository.
- Relevant Data Entities: all tables (potentially).
A good practice with respect to ensuring the uniqueness of the identifiers used as primary keys (e.g. Lot ID or Container ID numbers) is to generate them in advance of a field trip or experiment using a dedicated module of the data management system. This will ensure both uniqueness and accuracy of the syntax used for any given activity and will also “preoccupy” this syntax pattern and not allow it to be registered accidentally by another user or field crew.
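A minimal sketch of such pre-generation, reusing the hypothetical ID convention from the earlier validation example:

```python
def pregenerate_ids(project: str, entity: str, year: int,
                    start: int, count: int) -> list[str]:
    """Reserve a consecutive block of identifiers ahead of a field trip."""
    return [f"{project}-{entity}-{year}-{n:04d}"
            for n in range(start, start + count)]

# e.g. 50 Lot IDs for the 2024 season, printed on labels before departure
labels = pregenerate_ids("EDNA", "LOT", 2024, start=1, count=50)
print(labels[0], "...", labels[-1])   # EDNA-LOT-2024-0001 ... EDNA-LOT-2024-0050
```

Registering the reserved block in the database at generation time prevents another user or field crew from accidentally claiming the same syntax pattern.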
We have developed an MS Excel template that could be pre-filled and printed in a 4.625” x 7” weatherproof 5-inch binder format where core blocks of data and data fields are structured similarly to the database, allowing for subsequent manual database entry from hand-filled templates. The template is organised as a set of predefined forms, rather than as a non-relational spreadsheet. Each form mirrors the operating procedure performed in the field: arriving on site, confirming the location, identifying and characterising sampling stations, performing water quality tests, sample collection and recording ancillary observations.
The database prototype offers a suite of pre-defined MS Excel templates to assist users with standardised field data capture, batch data conversions and validation. The tools are being constantly updated to address emerging user needs. Several modules currently available or under development are listed below:
Under an ideal scenario, most metric data should be digitised in the field and in the lab through direct instrumental input, by feeding the digital output from measuring devices (e.g. water quality probes) and analytical instruments (e.g. DNA sequencers) into the corresponding data tables. In practice, this is not always logistically feasible and very rarely implemented, especially in remote field settings. The database prototype has built-in functionality that allows some basic data manipulations; however, it presently offers limited support for direct instrumental input. For example, it can capture the geolocation of the device running the database using its Wi-Fi, cellular connection or built-in GPS receiver. One of the logistical bottlenecks identified by beta-users of the database prototype is batch renaming and annotated archival of images associated with field collecting Stations, Events, Lots and Samples. This requirement has been addressed for database installations run on MS Windows tablets using the Microsoft Camera app. Identifying other priority areas of development, based on user input, is key to improving the database’s operational utility for small lab applications.
Data management systems benefit from active engagement of users in the process of their development (
Continued curation and management are essential for maintaining a database’s utility over time (
Sustained efforts should be devoted towards building robust, standardised, logically consistent and intuitively comprehensible naming conventions for natural Primary Keys used throughout the data management system, especially when digital records refer to analogue biological objects that are being collected, stored and analysed.
Curation of unstructured and/or analogue data (e.g. images, hand-written field notes) requires digital capture of representative data files (e.g. photos or scans), which are then appended to core database records as annotated attachments. Associated metadata, if available, could be used to parametrise such files. As mentioned earlier, the database prototype allows storing and annotating diverse types of file attachments; however, detailed user input and continued curation are required to ensure that archived files remain properly organised, referenced and readily accessible through individual database records or directly from the hosting server.
To date, robust standards have been developed for biodiversity data (
The proposed data management system aims to address the basic, yet specialised needs of eDNA data tracking that have been identified through extensive consultations with our colleagues engaged in this research. As eDNA is an actively developing field with emerging methodological standards, there is a need for structural flexibility of the data schema that could accommodate data management to support academic research and development. At the same time, eDNA’s potential for planning and regulatory applications (
From a technical standpoint, the format of any software applications/databases used for data management and archival should be non-proprietary and the data schema should be intuitive enough to allow migrating datasets in their entirety from one system to another, for example, as may be necessitated by database software becoming obsolete or cloud storage providers going out of service. This is particularly important for image and raw data archives (e.g. FASTQ files) associated with database records, which must remain directly accessible for batch download or transfer, while retaining their association with the corresponding data and metadata records, for example, through robust and transparent file-naming conventions.
We should emphasise that the publication of aggregated eDNA-derived taxonomic observations, however important, cannot be regarded as an adequate substitute for the proper archival of complete, properly referenced and parametrised datasets by the organisations that generated them. When possible, such comprehensive data archives should be backed by properly stored and curated biological samples from which the eDNA originated. The quality of the data and samples thus archived requires an initial investment in relevant staffing and infrastructure and further depends on a continued commitment to maintaining the accuracy and accessibility of biological materials and data records. This may be particularly hard to achieve for small-scale research operations and time-restricted surveys or monitoring projects. Their specific challenges and essential role in human understanding of planetary health across time should be more broadly acknowledged and addressed by relevant administrators, regulators and funders.
Finally, we hope that this paper will help to draw the attention of researchers to the importance of further harmonising data strategies for eDNA research with those established for more “traditional” approaches to surveying and monitoring biodiversity.
The data management system for eDNA research presented here was developed based on the needs identified by, and with support from, varied large-scale research partnerships including Agnico Eagle Gold Corp, the Canadian Food Inspection Agency (via the Federal Assistance Partnership), GEN-FISH (funded by Genome Canada and the Ontario Genomics Institute; OGI-184), the Great Lakes Fisheries Commission (via the “Field-ready environmental DNA (eDNA) protocols and tools for sea lamprey assessment” project) and the Nuclear Waste Management Organisation. We also thank our colleagues for their detailed input on critical data elements that need to be captured during different stages of eDNA research and for testing the application prototypes developed during this study.
The following members of the University of Guelph research team and associates offered helpful insights into their research workflows and provided user feedback on the initial database prototype: Kate Lindsay, Erika Myler, Cameron Brown, Kathleen Nolan, Nathan Zeinstra, Kayley Head, Morgan Humphrey, Liam Lalonde, Abinaya Yogasekaram, Ian Murphy, Danielle Bourque, Tzitziki Loeza Quintana, Yoamel Milian Garcia, Sarah Adamowicz.
Further input on eDNA data management requirements and the usability of the database prototype was provided by the GEN-FISH team and associates during the deployment and testing of the database at the Great Lakes Institute for Environmental Research, University of Windsor: Matthew Yates, Joe Branget, Jonathon Leblanc, Alex Van Nynatten, Paige Breault, Keta Patel, Mohammed Zain.
Finally, we thank Jarrett Phillips, Morgan Humphrey and Kayley Head for their constructive comments on earlier drafts of the paper, Daniel Mietchen for editorial remarks and Rolando Blanco for reviewing the submitted manuscript.
Publication of the article was supported by the first author's affiliation as Departmental Associate of the Royal Ontario Museum.
Hosting institution: University of Guelph
Outline of main tables used in the prototype eDNA Laboratory Database with a list and definitions of data fields.