Research Ideas and Outcomes :
Forum Paper
Corresponding author: Desalegn Chala (desdchala@gmail.com)
Academic editor: Volker Grimm
Received: 11 Apr 2024 | Accepted: 24 May 2024 | Published: 11 Jun 2024
© 2024 Desalegn Chala, Erik Kusch, Claus Weiland, Carrie Andrew, Jonas Grieb, Tuomas Rossi, Tomas Martinovic, Dag Endresen
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Chala D, Kusch E, Weiland C, Andrew C, Grieb J, Rossi T, Martinovic T, Endresen D (2024) Prototype biodiversity digital twin: crop wild relatives genetic resources for food security. Research Ideas and Outcomes 10: e125192. https://doi.org/10.3897/rio.10.e125192
Amidst population growth and climate-driven crop stresses such as drought, extreme weather, fungal and insect pests, as well as various crop diseases, ensuring food security demands innovative strategies. Crop wild relatives (CWR), wild plants in the same genus as the crop as well as wild populations belonging to the same species as the crop, offer novel genetic resources crucial for enhancing crop resilience against these stress factors. Here, we introduce a prototype digital twin (pDT) to aid in searching and utilising CWR genetic resources. Using the MoDGP (Modelling the Germplasm of Interest) tool, the pDT enables mapping geographic areas where stress-tolerant CWR populations can be found. With its graphical user interface, it offers flexibility in selecting genetic resources from CWR tailored to enhance resilience of various crops against diverse stress factors.
crop wild relatives, biodiversity digital twin, MoDGP, Destination Earth, Sustainable Development Goals
Population growth and climate change are two of the major factors that are challenging food security. The human population has increased from one to eight billion over the past 200 years and is expected to reach 11 billion by the end of this century (
CWR are wild plant species closely related to cultivated crops. Broadly, they encompass all wild plants within the same genus as the crop (
Currently, two prominent challenges hinder the utilisation of CWR in breeding programmes. Firstly, plant breeders often depend on their established breeding lines, and the potential contributions of CWR are not well investigated. Secondly, there is a notable absence of user-friendly tools for their effective utilisation.
Plant breeders typically depend on the vast collections of plant genetic resources gathered (
To address this challenge, the FIGS ("Focused Identification of Germplasm Strategy") tool was introduced, building upon earlier work by Michael Mackay (
For CWR, both collections and field evaluation data are scarce. To address this challenge, we are introducing the MoDGP (“Modelling the Germplasm of Interest”) tool in the CWR pDT. MoDGP leverages species distribution modelling and relies on occurrence data of CWR to produce habitat suitability maps. It establishes mathematical correlations between environmental factors and adaptive traits, such as tolerance to drought and pathogens, and facilitates mapping of the geographic areas where populations possessing genetic resources for resilience against various biotic and abiotic stresses are potentially growing.
The main objective of the CWR pDT is to streamline the identification and utilisation of novel genetic resources from CWR through automating data flow, automated modelling runs, uncertainty analysis and timely alerts on potential genetic resources of interest for plant breeders, policy-makers and conservation scientists. Our objective includes the creation of habitat suitability maps for all CWR with sufficient occurrence data, accessible via an intuitive graphical user interface implemented with the R Shiny framework. Our model is designed to be adaptable across different crop species and traits, empowering users to address key research questions in pre-breeding, such as identifying geographic areas where populations of CWR harbouring beneficial genetic resources for enhancing crop resilience to environmental stresses are potentially growing. Additionally, in the pDT, we are developing ecogeographic land characterisation (ELC) maps to identify ELC classes that are under-represented in ex-situ seed collections. This will help to assess gaps in current collection or ex-situ conservation efforts, aiding in the strategic planning of future genetic resource collections.
The workflow of the CWR pDT includes automated access of occurrence and environmental data, automated model runs to generate habitat suitability maps for CWR via an ensemble modelling technique to predict and map stress-tolerant populations of CWR for use in breeding programmes (Fig.
Simplified workflow of the crop wild relatives prototype digital twin. CWR - crop wild relatives; GBIF - Global Biodiversity Information Facility; Genesys - Global Information System on Plant Genetic Resources; ICARDA - International Center for Agricultural Research in the Dry Areas; MoDGP - modelling the distribution of germplasms of interest.
MoDGP relies on two types of data as input. Firstly, occurrence data from GBIF (
Data and data sources for the crop wild relatives prototype digital twin.
Data type | Source | Webpage | Remarks |
Species occurrence/trait data | Global Biodiversity Information Facility (GBIF) | https://www.gbif.org | A global species occurrence data portal (> 2.6 billion records; March 2024). |
Species occurrence/trait data | Genesys PGR | Genesys PGR (genesys-pgr.org) | Genesys is an online platform providing information about Plant Genetic Resources for Food and Agriculture (PGRFA) conserved in gene-banks worldwide. |
Species occurrence/trait data | International Center for Agricultural Research in the Dry Areas (ICARDA) | https://www.icarda.org/ | Usually shares data with Genesys on an annual basis. |
Species occurrence/trait data | RAINBIO database | https://gdauby.github.io/rainbio/index.html | Contains georeferenced occurrences of vascular plants from sub-Saharan tropical Africa. |
Species occurrence/trait data | EURISCO crop specimens | https://eurisco.ipk-gatersleben.de/ | PGRFA data portal for European gene-banks. |
Species occurrence/trait data | Global Crop Wild Relative atlas | https://www.cwrdiversity.org/ | Global catalogue of crop wild relatives. |
Species occurrence/trait data | Plant trait database (TRY) | TRY Plant Trait Database (try-db.org) | TRY focuses on plant traits. CWR with short generation times, such as herbs, are particularly suitable for breeding, which makes the database especially important for the CWR pDT. |
Species occurrence/trait data | NordGen Nordic catalogue | https://nordic-baltic-genebanks.org/gringlobal/search.aspx | Nordic gene-bank PGRFA data portal. |
Species occurrence/trait data | NordGen Nordic CWR checklist | https://doi.org/10.15468/itkype | Nordic checklist of crop wild relative species. |
Climate | ERA5 | https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-pressure-levels | ERA5-Land is a global climate re-analysis dataset produced by the European Centre for Medium-Range Weather Forecasts. It simulates climate data using hourly weather information, providing dynamic data unlike many other climate data sources. This allows users to recompute climate data by incorporating the most recent weather updates. |
Edaphic | SoilGrids | https://soilgrids.org/ | SoilGrids is a dataset that provides global maps of soil properties at different depths (0-5 cm, 5-15 cm, 15-30 cm and 30-60 cm) with a spatial resolution of 250 m. These properties include organic carbon, pH, sand, silt and clay fractions, amongst others. The dataset is built using machine-learning techniques and is based on a compilation of soil samples from various sources. |
Topographic | SRTM DEM | USGS EROS Archive - Digital Elevation - Shuttle Radar Topography Mission (SRTM) 1 Arc-Second Global (U.S. Geological Survey) | The SRTM DEM is available at 90 m resolution globally. |
MoDGP uses different high performing species distribution modelling algorithms such as generalised additive modelling (GAM;
We aim to run models for all CWR with more than 40 unique occurrence records. To represent absence data, we identify 10,000 points where other species of the same genus are present, but the model target is absent or not recorded. These points are chosen within a buffer area of 15 km from known presence points.
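The selection of absence points described above can be illustrated with a minimal Python sketch. This is not the pDT's own code (which is R-based); the function names, the tuple representation of coordinates, the use of great-circle distance and the reading of "absent or not recorded" as "no target record at exactly the same coordinates" are all assumptions made for illustration.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used for great-circle distances


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def sample_pseudo_absences(target_presences, congener_presences,
                           buffer_km=15.0, n_points=10_000, seed=0):
    """Pick congener records that lie within buffer_km of a target presence.

    Hypothetical helper: a congener record counts as an absence candidate
    when the target species has no record at the same coordinates.
    """
    occupied = set(target_presences)
    candidates = [
        p for p in congener_presences
        if p not in occupied
        and any(haversine_km(p, q) <= buffer_km for q in target_presences)
    ]
    # Subsample only when more candidates exist than requested.
    if len(candidates) > n_points:
        candidates = random.Random(seed).sample(candidates, n_points)
    return candidates


# Illustrative coordinates only (lat, lon); not real occurrence data.
target = [(60.0, 10.0)]
congeners = [(60.05, 10.0), (61.0, 10.0), (60.0, 10.0)]
absences = sample_pseudo_absences(target, congeners)
```

In this toy example only the first congener record qualifies: the second lies more than 100 km away and the third coincides with a target presence.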
To mitigate multicollinearity, we stack all predictor variables and extract their values at both the presence and absence points. We then compute pairwise Pearson correlations and, from each set of variables with an absolute correlation coefficient exceeding 0.8, select only the one with the lowest variance inflation factor for the model runs. Each model is replicated twice using two methods: bootstrapping and substitution of 75% of the data. In each replication, 75% of the data are randomly allocated for training, with the remainder used for evaluation. Consequently, we generate 12 habitat suitability maps for each species: three algorithms, each replicated twice under each of the two replication methods (3 × 2 × 2 = 12).
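The collinearity filter can be sketched as follows. This is one possible reading of the rule, not the authors' implementation: the greedy ordering (visit variables from lowest to highest inflation factor and keep each one only if it is not strongly correlated with a variable already kept) is an assumption, as are all function names. Inflation factors are taken from the diagonal of the inverse correlation matrix, a standard identity.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (matrix must be non-singular)."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def select_predictors(columns, threshold=0.8):
    """Return (kept variable names, inflation factor per variable)."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    # The inflation factor of variable i is the i-th diagonal entry
    # of the inverse correlation matrix.
    vif = {name: inverse[i][i] for i, name in enumerate(names)}
    kept = []
    for name in sorted(names, key=vif.get):  # lowest inflation factor first
        i = names.index(name)
        if all(abs(corr[i][names.index(k)]) <= threshold for k in kept):
            kept.append(name)
    return kept, vif


# Illustrative values only: "temp" and "temp_like" are nearly collinear.
predictors = {
    "temp": [1, 2, 3, 4, 5, 6],
    "temp_like": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
    "rain": [5, 1, 4, 2, 6, 3],
}
kept, vif = select_predictors(predictors)
```

With these toy values, exactly one of the two temperature-like variables is retained alongside "rain".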
Results from all algorithms are evaluated against test data using the area under the ROC curve (AUC) and the True Skill Statistic (TSS). Maps from poorly performing models, i.e. those with AUC < 0.7 and/or TSS < 0.6, are dropped and only maps from high-performing algorithms and model settings are kept.
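The evaluation filter can be illustrated with a short Python sketch. The 0.7 and 0.6 cut-offs come from the text; everything else is assumed for illustration, including the function names and the choice to report TSS at the threshold that maximises it (the text does not state at which threshold TSS is computed).

```python
def auc(scores, labels):
    """AUC as the probability that a presence outranks an absence (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def max_tss(scores, labels):
    """Largest TSS (sensitivity + specificity - 1) over all observed thresholds."""
    n_pos, n_neg = labels.count(1), labels.count(0)
    best = -1.0
    for t in sorted(set(scores)):
        sens = sum(1 for s, l in zip(scores, labels) if l == 1 and s >= t) / n_pos
        spec = sum(1 for s, l in zip(scores, labels) if l == 0 and s < t) / n_neg
        best = max(best, sens + spec - 1.0)
    return best


def keep_model(scores, labels, min_auc=0.7, min_tss=0.6):
    """A map is kept only when it passes both cut-offs."""
    return auc(scores, labels) >= min_auc and max_tss(scores, labels) >= min_tss


# Illustrative test data: 1 = presence, 0 = absence.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
good_scores = [0.9, 0.8, 0.7, 0.4, 0.5, 0.3, 0.2, 0.1]
poor_scores = [0.6, 0.2, 0.7, 0.3, 0.5, 0.4, 0.8, 0.1]
```

Here `good_scores` passes both cut-offs, whereas `poor_scores` performs no better than chance and would be dropped.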
The selected maps are combined through an ensemble approach and binary maps are produced using the threshold that maximises the sum of sensitivity and specificity to distinguish between suitable and non-suitable pixels. Values of abiotic stresses are extracted from suitable pixels and the range of tolerance to these stress factors is presented as response curves. The CWR of a given crop are ranked based on their range of tolerance to stress factors. For model targets with high tolerance to these factors, geographic areas where plants with the desired genotypes are potentially growing will be mapped and provided.
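The ensemble and thresholding steps can be sketched in Python under explicit assumptions: the ensemble is taken here as an unweighted mean of the retained maps (the text does not specify the combination rule), maps are flattened to lists of pixel values, and all names and numbers are illustrative.

```python
def ensemble_mean(maps):
    """Pixel-wise unweighted mean of several suitability maps (assumed rule)."""
    return [sum(values) / len(values) for values in zip(*maps)]


def max_sss_threshold(scores, labels):
    """Threshold that maximises sensitivity + specificity on evaluation data."""
    n_pos, n_neg = labels.count(1), labels.count(0)
    best_t, best_sum = None, -1.0
    for t in sorted(set(scores)):
        sens = sum(1 for s, l in zip(scores, labels) if l == 1 and s >= t) / n_pos
        spec = sum(1 for s, l in zip(scores, labels) if l == 0 and s < t) / n_neg
        if sens + spec > best_sum:
            best_t, best_sum = t, sens + spec
    return best_t


def tolerance_range(binary_map, stress_values):
    """Minimum and maximum of a stress variable across suitable pixels."""
    suitable = [v for b, v in zip(binary_map, stress_values) if b]
    return min(suitable), max(suitable)


# Six illustrative pixels; labels mark evaluation presences (1) and absences (0).
map_a = [0.875, 0.75, 0.5, 0.25, 0.125, 0.0]
map_b = [0.625, 0.75, 0.75, 0.5, 0.25, 0.25]
labels = [1, 1, 1, 0, 0, 0]
stress = [31.0, 35.5, 28.0, 22.0, 40.0, 19.0]  # e.g. a heat or drought index

suitability = ensemble_mean([map_a, map_b])
threshold = max_sss_threshold(suitability, labels)
binary_map = [s >= threshold for s in suitability]
low, high = tolerance_range(binary_map, stress)
```

In this toy case the first three pixels are classified as suitable, so the tolerance range is read from their stress values only.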
We will comprehensively document the entire workflow, spanning from the initial input data through each processing step and modelling, culminating in the generated output. We will ensure that the occurrence data utilised for modelling is referenced using persistent identifiers whenever feasible. Additionally, references to climate, soil and topographic data will be provided. All data employed in the models will be made publicly accessible and free for sharing and usage, with appropriate acknowledgement. The outputs from pDT and the modelling tools utilised to generate these outputs will also be openly available to the public as FAIR Digital Objects (FDOs;
FDOs integrate persistent identifiers and structured metadata to enable cross-domain interoperability, crucial for platforms like the European Open Science Cloud (EOSC*
Actual outline of data model employing the RO-Crate approach for workflow preservation and aggregation (
All developed model codes and scripts will be published as open source in the BioDT repository on GitHub (https://github.com/BioDT).
The CWR pDT aims to run models for tens of thousands of CWR species using different algorithms and model replications. Because the individual model runs are independent, the workload is highly suitable for parallel processing. In preparation for parallel execution, the R environment has been containerised with Docker; the container image can be pulled and executed on the CPU partition of the LUMI supercomputer (LUMI-C) through Apptainer/Singularity and on a cloud through Docker. Initial tests have been run on LUMI-C with this setup, but the parallelisation scheme is not fully implemented yet. The large parallel computing capacity of LUMI-C is expected to be advantageous for the intended large-scale model processing. For smaller workloads, the containerised solution can also be executed directly in cloud environments.
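Because the runs are independent, the job list is simply the Cartesian product of species, algorithms, replication methods and replicates. The Python sketch below illustrates that idea only; it does not represent the parallelisation scheme planned for LUMI. The algorithm labels other than GAM are placeholders, `run_model` is a hypothetical stand-in for one containerised model run, and a thread pool is used merely to keep the example self-contained (a real deployment would more likely use separate processes or scheduler job arrays).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# "GAM" is named in the text; the other two labels are placeholders.
ALGORITHMS = ("GAM", "algorithm_2", "algorithm_3")
REPLICATION_METHODS = ("bootstrap", "substitution_75")
REPLICATES = (1, 2)


def build_jobs(species_list):
    """One independent job per species x algorithm x method x replicate."""
    return list(product(species_list, ALGORITHMS, REPLICATION_METHODS, REPLICATES))


def run_model(job):
    """Hypothetical stand-in for a single model run; returns a job label."""
    species, algorithm, method, replicate = job
    return f"{species}|{algorithm}|{method}|{replicate}"


def run_all(jobs, max_workers=4):
    """Run all jobs concurrently; results come back in job order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_model, jobs))


jobs = build_jobs(["species_a", "species_b"])
results = run_all(jobs)
```

Each species contributes 12 jobs, matching the 12 habitat suitability maps per species described in the modelling procedure.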
To provide the best experience of interaction with the pDT for multiple end-user groups, such as pre-breeders, researchers, conservation scientists and academics, we are developing a web interface based on R Shiny (https://rstudio.github.io/shiny/authors.html). The interface will feature dropdown menus for crops and their corresponding:
This will allow users to effectively map the optimal overlap between environmental stress factors and habitat suitability to identify geographic areas where populations resilient to stresses can potentially thrive.
End users can collect samples from mapped areas of interest and test the performance of the genotypes. The user interface also enables users to constrain or relax the tolerance thresholds and decide on the geographical areas from which the germplasm of interest can be obtained. It can also enable them to prioritise the populations to be tested, based on quality and/or access. Distribution models capture potentially suitable habitats and, thus, may help in discovering new populations and identifying gaps in collection efforts or ex-situ conservation. As online occurrence data improve, the validity and robustness of the models can also improve over time. The modelling tools will also be published in open access journals and made available to users.
To ensure the long-term availability and accessibility of the pDT CWR, a pilot for the integration into the Big Data processing services of the Destination Earth Data Lake (DEDL;
A major objective of the pilot study is the implementation of data pipelines between DEDL as a data aggregator, processing platform and provider of earth observation data and the pDT CWR which will serve as a blueprint to facilitate the integration of more Digital Twins into DestinE’s core infrastructures. Comprehensive mappings between BioDT’s core semantic artefacts, such as schema.org/Bioschemas (fundamental for RO-Crate) and specifications used in DEDL such as SpatioTemporal Asset Catalogues (STAC*
While plant breeders often rely on their breeding lines and landraces, CWR offer not only vast diversity, but have also undergone several (and ongoing) selection pressures and, thus, encompass novel genetic resources. Representing approximately 21% of the plant kingdom (
The suitability maps produced by the pDT serve diverse purposes, including in-situ conservation, restoration, ex-situ conservation and seed collection gap analysis. As the pDT is envisioned to re-run automatically on an annual basis, its results will be regularly updated, offering up-to-date outputs. These outputs are available at global scale and can be tailored to match different geographic scales, from country to continental levels.
In general, applications and impacts of the pDT can fall into two categories:
Crop wild relatives play a critical role in ensuring food security and agricultural resilience in the face of environmental challenges. However, just like other organisms, CWR are facing threats from climate change (
To enhance the conservation and utilisation of CWR genetic resources, it is imperative to strengthen data management and collaboration amongst relevant stakeholders. Drawing from the recommendations by
Moreover, in-situ conservation efforts for CWR should be supported through coordinated actions at the local, national and regional levels. Taking existing efforts, such as the Nordic CWR policy report and regional approach advocate (
This study has received funding from the European Union's Horizon Europe Research and Innovation Programme under grant agreement No. 101057437 (BioDT project, https://doi.org/10.3030/101057437). Views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them.
We acknowledge the EuroHPC Joint Undertaking and CSC – IT Center for Science, Finland for awarding this project access to the EuroHPC supercomputer LUMI, hosted by CSC – IT Center for Science and the LUMI consortium, through Development Access calls.
We also thank Taimur Khan, Ingolf Kuhn, Jan Dick and one anonymous reviewer for reviewing and providing constructive comments, which have significantly improved the paper.