Research Ideas and Outcomes :
Research Article
Corresponding author: Maria Auxiliadora Mora-Cross (mariamoracross@gmail.com)
Academic editor: Editorial Secretary
Received: 30 Apr 2022 | Accepted: 02 Aug 2022 | Published: 23 Aug 2022
© 2022 Maria Mora-Cross, Adriana Morales-Carmiol, Te Chen-Huang, María Barquero-Pérez
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Mora-Cross MA, Morales-Carmiol A, Chen-Huang T, Barquero-Pérez MJ (2022) Essential Biodiversity Variables: extracting plant phenological data from specimen labels using machine learning. Research Ideas and Outcomes 8: e86012. https://doi.org/10.3897/rio.8.e86012
Essential Biodiversity Variables (EBVs) make it possible to evaluate and monitor the state of biodiversity over time at different spatial scales. Their development is led by the Group on Earth Observations Biodiversity Observation Network (GEO BON) to harmonize, consolidate and standardize biodiversity data from varied sources. This document presents a mechanism to obtain baseline data to feed the Species Traits Variable Phenology or other biodiversity indicators by extracting species characters and structure names from morphological descriptions of specimens and classifying such descriptions using machine learning (ML).
A workflow that performs Named Entity Recognition (NER) and Classification of morphological descriptions using ML algorithms was evaluated. It was implemented using Python, PyTorch, scikit-learn, Pomegranate, python-crfsuite, and other libraries, and applied to 106,804 herbarium records from the National Biodiversity Institute of Costa Rica (INBio). Text classification achieved F1 scores between 96% and 99% using three traditional ML methods: Multinomial Naive Bayes (NB), Linear Support Vector Classification (SVC), and Logistic Regression (LR). Furthermore, the extraction of names of species morphological structures (e.g., leaves, trichomes, flowers, petals, sepals) and character names (e.g., length, width, pigmentation patterns, and smell) with NER algorithms achieved F1 scores between 95% and 98% using Hidden Markov Models (HMM), Conditional Random Fields (CRFs), and Bidirectional Long Short Term Memory Networks with CRF (BI-LSTM-CRF).
Essential Biodiversity Variables, plant phenology, Natural Language Processing, machine learning, Text Classification, Named Entity Recognition.
Biological diversity is a fundamental pillar of life on Earth. Therefore, the governments of the world committed themselves through the United Nations Convention on Biological Diversity (CBD) to reduce the loss of biodiversity, with the aim of meeting the Aichi Biodiversity Targets.
Essential Biodiversity Variables (EBVs) are recommended as a global biodiversity monitoring and reporting system to assess the state of biodiversity over time. They provide the basis for generating biodiversity indicators that allow repeated assessments of progress against national and global conservation goals (e.g., the Sustainable Development Goals and the Aichi Biodiversity Targets).
Species traits include any measurable morphological, phenological, physiological, reproductive, or behavioral characteristics of individual organisms; nevertheless, they can also be generalized at the taxon and population levels. Recently, increasing efforts to integrate species traits have made a significant amount of data available.
Species traits have been suggested as indicator variables for monitoring the response of organisms to changes in the environment; for instance, phenological trait information related to changes in the timing of plant leafing, flowering, and fruiting can be used as an indicator of climate change impacts.
On the other hand, the transformation of texts from taxonomic literature into structured data remains a key challenge in Biodiversity Informatics (BI).
Additionally, some ML techniques, such as NER and Classification, have been successfully applied to bioinformatics and biomedicine and, more recently, to BI. Text Classification and NER are classic research topics in the Natural Language Processing (NLP) field. Text Classification is a fundamental NLP technique for assigning unstructured text to predefined labels or tags (widely used in sentiment analysis). The Allerdictor tool is an example of an application in bioinformatics that models sequences as text documents and uses Multinomial Naïve Bayes (NB) or Support Vector Machine (SVM) for allergen classification.
NER is the first step in many NLP tasks. It seeks to locate entity names in free text and classify them into categories. The traditional NER task has expanded beyond identifying people, locations and organizations to identifying dates, email addresses, book titles, protein names and numbers, amongst other entities. Additionally, there has been a strong interest in using NER for extracting product attributes from online data due to the rapid growth of E-Commerce.
The main objective of this project was to obtain baseline data to feed the Species Traits Variable Phenology and other biodiversity indicators by extracting species characters and structure names from morphological descriptions of specimens and classifying the descriptions using machine learning (ML). To achieve this goal, an ML workflow was tested to classify specimen descriptions, determining whether the plant had flowers and/or fruits at the time of collection, and to extract the species characters and structure names mentioned in the descriptions. A database with 106,804 records from the Herbarium of the National Biodiversity Institute of Costa Rica (INBio) was used to illustrate the proposed approach.
The remainder of the paper is structured as follows: section "Materials and methods" describes the materials and the proposed workflow in detail, section "Results" presents the evaluation metrics and results, and section "Discussion" analyzes the results. Finally, conclusions and future work are discussed in "Conclusions".
This research work presents an effort to extract species morphological characters and structure names using NER algorithms and to classify specimen morphological descriptions to determine if a given plant had flowers and/or fruits at the time of collection.
Successfully applying ML algorithms to NLP problems requires defining a workflow that includes phases such as data selection and pre-processing, model training and testing, and model deployment. Fig.
The proposed general workflow includes two phases: A) Data Selection and Preprocessing using the Atta database (INBio). First, the data were cleaned by removing duplicate records, records written in English and null morphological descriptions, amongst other processes. Then, two datasets were selected for the next phase, one for Classification and one for NER. Those datasets were used for training and test activities. B) During the Models Training and Test phase, models were generated using algorithms such as: Multinomial Naive Bayes (NB), Linear Support Vector Classification (SVC) and Logistic Regression (LR) for Classification and Hidden Markov Model (HMM), Conditional Random Fields (CRF), and Bidirectional Long Short Term Memory Networks with CRF (BI-LSTM-CRF) for NER. Metrics like accuracy, precision, recall, and F1 score were used to test them.
A. Data Selection and Processing Phase
A.1. Atta Dataset: Atta is an information system developed by INBio to manage data of specimens of different biological groups, such as plants, arthropods, fungi, and nematodes.
The database contains 350,007 records from the kingdom Plantae. Data related to taxonomy (i.e., scientific name and higher taxonomy); plant specimens (i.e., morphological description, date collected, locality, collectors, and sampling protocol, amongst other data); and geospatial data (i.e., locality and geographic coordinates) were obtained from Atta. All the selected specimens were collected in Costa Rica.
Fig.
A.2. Cleaning and Random Selection of Data: Of the 350,007 Plantae records in Atta, 106,804 were used in this project. Herbarium rules and regulations require duplicate specimens to be sent to the National Museum of Costa Rica and the Missouri Botanical Garden, so 64% of the 350,007 records are duplicates. After removing duplicate records, records without a morphological description, discarded specimens, and descriptions written in English, about 93% of the remaining records (i.e., 106,804 records) were tagged (i.e., each was labelled for the classification target classes has_flowers and has_fruits).
A.3. Tagging Data for Multi-label Classification: The texts used in the experiments correspond to the morphological description of 106,804 specimens. Morphological descriptions contain statements that detail morphological aspects (i.e., shape and structure) of specimens, which are useful to study and identify them. Statements may describe structures, substructures, characters, states, and relationships between structures (e.g., leaves, apex, flowers, flower buds, or fruits). The characters are, for instance, length, width, pigmentation patterns, smell, or architecture. An example of a description is the statement “Arbolito de 7-9 m x 10 cm dap. Corteza lisa, amarillo-cafezuzco, exfoliante. Brotes vegetativos verde-tenue con pubescencia blanca, conspicua, caulifloro. Frutos inmaduros, esferoides, verde-tenue”. (Small tree 7-9 m x 10 cm DBH. Smooth, yellow-brown, exfoliating bark. Faint-green vegetative shoots with white, conspicuous, cauliflorous pubescence. Immature, spheroid, faint-green fruits).
Morphological descriptions of plant specimens use a semi-structured language characterised by
Supervised machine-learning algorithms were used to classify descriptions. Training supervised models involves adjusting their parameters using examples that allow models to map an input to the desired output, in this case, the target classes. Examples were built from the specimens' morphological descriptions by manually assigning each description to one of the classes (i.e., has_flowers and has_fruits). For example, the morphological description "Creciendo en tronco seco. Flores naranjas. Muestra conservada en alcohol" ("Growing on the dry trunk. Orange flowers. Sample preserved in alcohol", in English) was assigned to the has_flowers class, and the description "Arbusto de 35 m. en el sotobosque. Frutos de color verde y rojo a púrpura oscuro cuando están maduros. Escaso" ("35 m shrub in the understorey. Green and red fruits to dark purple when ripe. Scarce", in English) was assigned to the has_fruits class. Descriptions were standardised by changing their contents to lowercase, removing special characters, and tokenising each description (i.e., breaking descriptions into words, symbols, or other elements called tokens).
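The standardisation steps described above (lowercasing, removing special characters, tokenising) can be sketched as follows. This is a minimal illustration, not the project's code: the function name `standardise` is hypothetical, and treating every non-alphanumeric character (including hyphens) as a separator is an assumption.

```python
import re

def standardise(description):
    """Lowercase a description, strip special characters, and split it into tokens."""
    text = description.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    # \w is Unicode-aware in Python 3, so accented Spanish letters are kept.
    text = re.sub(r"[^\w\s]", " ", text)
    # \w also matches the underscore, so remove it separately.
    text = text.replace("_", " ")
    return text.split()

tokens = standardise(
    "Creciendo en tronco seco. Flores naranjas. Muestra conservada en alcohol"
)
print(tokens)
```

Under this assumption a compound such as "verde-tenue" becomes two tokens; keeping hyphenated colour terms together would require a different character filter.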
Two classes were used to classify specimen morphological descriptions and determine if a plant had flowers and/or fruits at the time of collection: has_flowers and has_fruits, respectively. The 106,804 records from INBio’s database (i.e., Atta) were tagged. Fig.
A.4. Tagging Data for NER: A small part of the clean records used in the classification process was randomly selected for extracting species characters and structure names using supervised ML algorithms. Eight thousand specimen records were chosen for this purpose.
To prepare examples, different standard approaches to sequence tagging
The following activities were carried out for the tagging process:
B. Models Training and Evaluation Phase
B.1. Classification: Train Models using NB, SVC, and LR: The classification of morphological descriptions involved 106,804 specimen records, which were used to train and test the models. The experiments were carried out using Python version 3
The classification objective was to determine whether each specimen's morphological description mentioned the presence of flowers or fruits, that is, to assign each description to the has_flowers and/or has_fruits classes. Each sample could be assigned to zero, one, or both classes; therefore, the classification problem is a multi-label classification task. The algorithms Multinomial Naive Bayes (NB)
The input to the models was a one-dimensional vector (x1, x2, ..., xn) of morphological descriptions. Features were extracted by converting this vector into a matrix of values, using either TF-IDF (Term Frequency-Inverse Document Frequency) weights or the raw frequency of words occurring in the descriptions, with an n-gram range of (1, 3), so that unigrams, bigrams, and trigrams were extracted.
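To make the (1, 3) range concrete, the sketch below lists the word n-grams that such a setting produces for one short tokenised description. The helper `word_ngrams` is hypothetical and written in plain Python for illustration; a vectoriser would then turn these n-gram counts into raw-frequency or TF-IDF columns.

```python
from collections import Counter

def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["flores", "naranjas", "frutos", "verdes"]
grams = word_ngrams(tokens)
counts = Counter(grams)  # raw frequency of each n-gram in this description
for gram in grams:
    print(gram)
```

Four tokens yield four unigrams, three bigrams, and two trigrams, so the feature space grows quickly with the upper bound of the range.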
To estimate the skill of the models on new data, ten-fold cross-validation was used with the function cross_val_score (Scikit Learn) in combination with the NB, SVC, and LR algorithms
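A minimal sketch of this evaluation loop with scikit-learn is shown below. It is not the project's code. The toy descriptions and labels are invented, one binary classifier per label is an assumption about how the multi-label task was handled, and `cv=3` is used only because the toy set is tiny (the paper reports ten folds).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, made-up descriptions; 1 = flowers mentioned, 0 = not mentioned.
descriptions = [
    "flores naranjas", "flores blancas", "corola blanca flores amarillas",
    "flores moradas", "flores rojas abundantes", "flores amarillas",
    "frutos verdes", "hojas nuevas rojizas", "frutos inmaduros verdes",
    "corteza lisa", "frutos maduros rojos", "hojas lustrosas",
]
has_flowers = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

models = {
    "NB": MultinomialNB(),
    "SVC": LinearSVC(),
    "LR": LogisticRegression(max_iter=1000),
}

scores = {}
for name, classifier in models.items():
    # The vectoriser sits inside the pipeline so it is refit on each training fold.
    pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), classifier)
    fold_scores = cross_val_score(
        pipeline, descriptions, has_flowers, cv=3, scoring="f1"
    )
    scores[name] = fold_scores.mean()
    print(f"{name}: mean F1 = {scores[name]:.3f}")
```

The same loop would be repeated with a has_fruits label vector. Keeping the vectoriser inside the pipeline avoids leaking vocabulary statistics from the held-out fold into training.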
B.2. NER: Train Models using HMM, CRFs, and BI-LSTM-CRF: Out of the 106,804 specimen records, 8,000 were randomly selected, where 80% of the records were used for training, while the remaining 20% were for testing the models. The training and testing of the models were done using Python version 3
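The 80/20 split can be sketched as below. The function name, the fixed seed, and the use of integer identifiers as stand-ins for records are assumptions made for illustration only.

```python
import random

def split_records(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and split it into training and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

record_ids = list(range(8000))  # stand-ins for the 8,000 selected records
train, test = split_records(record_ids)
print(len(train), len(test))
```

Fixing the seed makes the partition reproducible, which matters when several NER models are compared on the same held-out records.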
The aim of applying NER tagging to the data was to extract characters and structure names from morphological descriptions (e.g., flowers, trunk, color, height) where every token of a description was assigned a B, I or O tag. With this purpose in mind, the algorithms CRFs
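As an illustration of the B/I/O scheme, the sketch below tags a tokenised description given the token spans of its entities: B marks the first token of an entity, I marks its continuation, and O marks everything else. The function `bio_tags` and the span representation are hypothetical; the spans mirror one of the tagged examples reported in the results.

```python
def bio_tags(tokens, entity_spans):
    """Assign B/I/O tags given (start, end) token spans, with end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"              # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining tokens of the same entity
    return tags

tokens = ["Brotes", "vegetativos", "cafe-rojizo", ".",
          "Caliz", "verde", ",", "corola", "blanca", "."]
# "Brotes vegetativos" is a two-token entity; "Caliz" and "corola" are single tokens.
spans = [(0, 2), (4, 5), (7, 8)]
tags = bio_tags(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```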
In order to train the HMM model, bigram, sequence starting, and sequence ending counts were used to estimate the probability distribution and generate every state and transition that the model would use for its predictions.
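A count-based estimate of this kind can be sketched in plain Python as below. This does not use the Pomegranate API and is not the project's code. It assumes that "bigram" refers to tag bigrams, applies no smoothing, and uses hypothetical `<s>` and `</s>` markers for sequence start and end.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"

def estimate_hmm(tagged_sentences):
    """Estimate transition and emission probabilities by relative frequency."""
    transition_counts = defaultdict(Counter)
    emission_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        previous = START
        for word, tag in sentence:
            transition_counts[previous][tag] += 1  # tag bigram (or sequence start)
            emission_counts[tag][word.lower()] += 1
            previous = tag
        transition_counts[previous][END] += 1      # sequence-ending count

    def normalise(table):
        return {
            state: {key: n / sum(counter.values()) for key, n in counter.items()}
            for state, counter in table.items()
        }

    return normalise(transition_counts), normalise(emission_counts)

training = [
    [("Flores", "B"), ("blancas", "O"), (".", "O")],
    [("Brotes", "B"), ("vegetativos", "I"), ("verdes", "O"), (".", "O")],
]
transitions, emissions = estimate_hmm(training)
print(transitions["B"])
```

Decoding a new description would then combine these two tables, typically with the Viterbi algorithm, to find the most probable tag sequence.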
To train the CRFs model, each token in the training data was converted into a set of features that was later fed to the model. The characteristics considered for every word were the word itself, its last three letters, whether it was a punctuation mark, whether it was a digit, its POS tag, and the first two letters of the POS tag. Each token's features were combined with those of the next and previous words in the sentence (if applicable). Afterwards, the model was trained with the hyperparameters established in Table
Hyperparameters | Values |
---|---|
Coefficient for L1 penalty | 0.1 |
Coefficient for L2 penalty | 0.1 |
Maximum Iterations | 40 |
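The token features described for the CRFs model could look like the sketch below. The feature names, the Universal POS tags in the example, and the `BOS`/`EOS` markers are assumptions; the hyperparameter keys `c1`, `c2`, and `max_iterations` follow python-crfsuite's parameter names, with the values reported above.

```python
import string

def token_features(sentence, i):
    """Build the feature dict for token i of a sentence of (word, pos_tag) pairs."""
    word, pos = sentence[i]
    features = {
        "word": word.lower(),
        "suffix3": word[-3:],
        "is_punct": bool(word) and all(ch in string.punctuation for ch in word),
        "is_digit": word.isdigit(),
        "pos": pos,
        "pos2": pos[:2],
    }
    if i > 0:  # context from the previous token
        prev_word, prev_pos = sentence[i - 1]
        features.update({"-1:word": prev_word.lower(), "-1:pos": prev_pos,
                         "-1:pos2": prev_pos[:2]})
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:  # context from the next token
        next_word, next_pos = sentence[i + 1]
        features.update({"+1:word": next_word.lower(), "+1:pos": next_pos,
                         "+1:pos2": next_pos[:2]})
    else:
        features["EOS"] = True  # end of sentence
    return features

def sentence_features(sentence):
    return [token_features(sentence, i) for i in range(len(sentence))]

# L1 and L2 penalty coefficients and the iteration cap from the table above.
crf_params = {"c1": 0.1, "c2": 0.1, "max_iterations": 40}

sentence = [("Flores", "NOUN"), ("blancas", "ADJ"), (".", "PUNCT")]
feats = sentence_features(sentence)
print(feats[1])
```

With python-crfsuite, each sentence's feature list and tag list would be appended to a trainer, and a dictionary such as `crf_params` would be passed to the trainer before training.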
To train the BI-LSTM-CRF model, every word in the dataset was added to a dictionary that was later passed to the model; this had to be done for all records. The model handled each sentence not as a string of words, but as a tensor of the words' indexes in the dictionary. Once the data were prepared, the model ran a forward pass with the negative log-likelihood cost function, then computed the loss and gradients, and updated the model parameters. This process was repeated for every sentence in the training set in every epoch. The model was trained with the hyperparameters established in Table
Hyperparameters | Values |
---|---|
Hidden dimension | 4 |
Embedding dimension | 5 |
Learning rate | 0.01 |
Weight decay | 1e-4 |
Epochs | 20 |
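The data preparation step for the BI-LSTM-CRF model (a word dictionary plus index sequences) can be sketched in plain Python as follows. The helper names are hypothetical and the network itself is omitted. In PyTorch, the index list would typically be wrapped with `torch.tensor(encoded, dtype=torch.long)` before being passed to an embedding layer.

```python
def build_word_index(sentences):
    """Map every distinct word in the dataset to an integer index."""
    word_to_ix = {}
    for sentence in sentences:
        for word in sentence:
            if word not in word_to_ix:
                word_to_ix[word] = len(word_to_ix)
    return word_to_ix

def encode(sentence, word_to_ix):
    """Represent a sentence as the list of its word indexes."""
    return [word_to_ix[word] for word in sentence]

tag_to_ix = {"B": 0, "I": 1, "O": 2}
sentences = [["flores", "blancas", "."],
             ["brotes", "vegetativos", "verdes", "."]]
word_to_ix = build_word_index(sentences)
encoded = encode(sentences[1], word_to_ix)
print(encoded)  # [3, 4, 5, 2]
```

Because `encode` fails on a word that is missing from the dictionary, the dictionary has to cover every record, which matches the requirement stated above; an explicit unknown-word index would be one way to relax it.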
B.3. Models Evaluation (Accuracy, Precision, Recall, and F1 score): The metrics generally used in classification and NER problems to evaluate the results are precision and recall
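For reference, per-class precision, recall, and F1 score can be computed from gold and predicted tags as in the sketch below. The function name and the toy tag sequences are made up for illustration; F1 is the harmonic mean of precision and recall.

```python
def per_class_scores(gold, predicted, label):
    """Return (precision, recall, f1) for one label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)  # true positives
    fp = sum(1 for g, p in pairs if g != label and p == label)  # false positives
    fn = sum(1 for g, p in pairs if g == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold      = ["B", "I", "O", "O", "B", "O", "O", "B"]
predicted = ["B", "O", "O", "O", "B", "O", "B", "B"]
for label in ("B", "I", "O"):
    p, r, f1 = per_class_scores(gold, predicted, label)
    print(f"{label}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

In this toy case the single I token is missed, so the I class scores zero even though the B and O classes look strong, which is why per-class figures are worth reporting alongside averages.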
This section presents a report of the experimental results for both classification and NER tests.
Classification of morphological descriptions of specimens. Performance of the NB, SVC and LR algorithms: Fig.
Examples of types of morphological descriptions used in these experiments.
Specimen Morphological Description | English Translation | Data |
---|---|---|
"Epífita colgante. Brácteas y cáliz morado. Corola morado y blanco, estilo y estambres verde-morado, pedicelo blanco-morado. Orillas del sendero." | Hanging epiphyte. Bracts and calyx purple. Purple and white corolla, purple-green style and stamens, purple-white pedicel. Path shores. | Scientific name: Cavendishia atroviolacea Classes: has_flowers = Yes has_fruits = No |
"Arbusto juvenil, 1.2 m; Hojas nuevas rojizas, las viejas coriáceas. Común en barrancos al lado de la carretera. Voucher para estudio filogenético/adn- k. sytsma." | Juvenile shrub, 1.2 m; New leaves reddish, old leathery. Common in ravines next to the road. Voucher for phylogenetic/k-dna study. sytsma. | Scientific name: Alzatea verticillata Classes: has_flowers = No has_fruits = No |
"Hierba de 4-5 m. Pecíolos ca. 1.5-2.5 m, lámina foliar de 2-4 m. Inflorescencia péndulas,bracteas circinadas disticas, rojas, 1/4 basal rojo-amarillo. Flores amarillas, escondidas entre las brácteas. Frutos violeta, inmadura. Bajo dosel, escaso." | Grass of 4-5 m. Petioles ca. 1.5-2.5 m, leaf blade 2-4 m. Pendulous inflorescence, circinate distichous bracts, red, basal 1/4 red-yellow. Yellow flowers, hidden amongst the bracts. Fruits violet, immature. low canopy, scarce. | Scientific name: Heliconia pogonantha Classes: has_flowers = Yes has_fruits = Yes |
"Arbolito de 6 m x 8 cm dap. Follaje de haz verde-intenso y envés verde-tenue. Brotes vegetativos y ramitas café-tenue. Frutos anaranjado-tenue con semillas blanco-verdoso, recubierta de arillo rojo-intenso, brillante." | Small tree of 6 m x 8 cm dbh. Foliage with an intense-green upper surface and a faint green underside. Vegetative buds and twigs light brown. Faint-orange fruits with greenish-white seeds, covered with bright, intense red arils. | Scientific name: Trichilia quadrijuga Classes: has_flowers = No has_fruits = Yes |
Amount of specimen morphological descriptions distributed by class, average length in characters, and standard deviation.
has_flowers | has_fruits | Amount of records | Min-Max Length (number of characters) | Average Length | Standard Deviation of Length |
---|---|---|---|---|---|
No | No | 23,254 | 4-575 | 59.88 | 40.18 |
No | Yes | 26,900 | 7-952 | 66.15 | 34.64 |
Yes | No | 42,949 | 11-708 | 69.55 | 38.44 |
Yes | Yes | 13,701 | 27-895 | 93.88 | 51.13 |
Model skill was estimated using ten-fold cross-validation to guard against overfitting and reduce bias in the estimates. After running the ten train-and-test rounds for each model, metrics such as accuracy, precision, recall, and F1 score were computed by algorithm and class, and the results were averaged. Table
Average precision (P), recall (R), accuracy, and F1-score (F1) computed using ten-fold cross-validation for each algorithm and class.
Algorithm | Class | Accuracy | Precision | Recall | F1-score |
---|---|---|---|---|---|
Multinomial Naive Bayes (NB) | has_flowers | 0.9626 | 0.9462 | 0.9855 | 0.9655 |
Multinomial Naive Bayes (NB) | has_fruits | 0.9759 | 0.9851 | 0.9510 | 0.9677 |
Multinomial Naive Bayes (NB) | Average | 0.9693 | 0.9657 | 0.9682 | 0.9666 |
Logistic Regression (LR) | has_flowers | 0.9888 | 0.9979 | 0.9810 | 0.9894 |
Logistic Regression (LR) | has_fruits | 0.9904 | 0.9998 | 0.9749 | 0.9872 |
Logistic Regression (LR) | Average | 0.9896 | 0.9989 | 0.9780 | 0.9883 |
Linear Support Vector Classification (SVC) | has_flowers | 0.9946 | 0.9996 | 0.9903 | 0.9949 |
Linear Support Vector Classification (SVC) | has_fruits | 0.9958 | 0.9999 | 0.9891 | 0.9944 |
Linear Support Vector Classification (SVC) | Average | 0.9952 | 0.9997 | 0.9897 | 0.9947 |
To measure the impact of different collectors' writing styles on the results, in a second experiment, training and test data were partitioned using the number of specimens gathered per collector. The test was carried out to verify whether the resulting models had merely learned to parse the writing of the most prolific collectors. Specimen descriptions written by collectors with different numbers of gatherings were selected for testing the models; the rest of the samples were used to train them. Fig.
Results of applying the algorithms to text written by collectors with between one and 500 collected samples. The test was carried out to measure the impact of different collectors' writing styles on the results and to verify whether the resulting models had merely learned to parse the writing of the most prolific collectors. Training and test data were partitioned using the number of specimens gathered per collector. Specimen descriptions written by collectors with different numbers of gatherings were selected for testing the models; the rest of the samples were used to train them.
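One way to build such a partition is sketched below: records whose collector falls within a chosen range of gathering counts are held out for testing, and all others are used for training. The record fields `collector` and `description` and the function name are hypothetical, and the toy records are made up.

```python
from collections import Counter

def split_by_collector_volume(records, min_count, max_count):
    """Hold out records whose collector gathered between min_count and max_count specimens."""
    per_collector = Counter(record["collector"] for record in records)
    train, test = [], []
    for record in records:
        n = per_collector[record["collector"]]
        if min_count <= n <= max_count:
            test.append(record)
        else:
            train.append(record)
    return train, test

records = (
    [{"collector": "A", "description": "flores blancas"}] * 5
    + [{"collector": "B", "description": "frutos verdes"}] * 2
    + [{"collector": "C", "description": "hojas lustrosas"}]
)
# Test on collectors with one or two gatherings; train on the prolific collector.
train, test = split_by_collector_volume(records, 1, 2)
print(len(train), len(test))
```

Because no collector appears on both sides of the split, a drop in test scores would point to models that depend on individual writing styles.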
NER tagging of morphological descriptions. Performance of the CRFs, BI-LSTM-CRF and HMM algorithms: Records, such as those shown in Table
Examples of types of morphological descriptions used in NER experiments.
Specimen Morphological Description | English Translation | Tagged Data |
---|---|---|
"Epifita. Flores con corola rojo rosado de bordes blancos, tubo floral externo rojo rosado con pubescencia blanca, filamentos blancos, anteras y caliz verde tenue." | Epiphyte. Flowers with corolla pink red with white borders, external floral tube pink red with white pubescence, white filaments, dim green anthers and calyx. | Epifita. Flores[B] con corola[B] rojo rosado de bordes blancos, tubo[B] floral[I] externo[I] rojo rosado con pubescencia[B] blanca, filamentos[B] blancos, anteras[B] y caliz[B] verde tenue. |
"Liana trepadora, colgante. Brotes vegetativos cafe-rojizo. Caliz verde, corola blanca. Frutos inmaduros verdes, maduros rosado brillante." | Hanging climbing liana. Vegetative buds reddish-brown. Green calyx, white corolla. Immature fruits green, mature bright pink. | Liana trepadora, colgante. Brotes[B] vegetativos[I] cafe-rojizo. Caliz[B] verde, corola[B] blanca. Frutos[B] inmaduros[I] verdes, maduros rosado brillante. |
"Arbol 15 m x 25 m dap; nervios secundarios casi invisibles; vena principal hundida en el haz; hojas muy suaves; el peciolo carece de savia lechosa. Nombre comun: ninguno." | Tree 15 m x 25 m dbh; secondary nerves almost invisible; main vein sunken in the adaxis; very smooth leaves; the petiole lacks milky sap. Common name: none. | Arbol 15 m x 25 m dap[B]; nervios[B] secundarios[I] casi invisibles; vena[B] principal[I] hundida en el haz[B]; hojas[B] muy suaves; el peciolo[B] carece de savia[B] lechosa. Nombre comun: ninguno. |
"Arbol de 13 m x 25 cm dap. Flores blancas con un exquisito olor a dulce de caramelo. Floracion abundante. Tronco derecho, corteza escamosa pardo clara. Hojas lustrosas en ambas caras." | Tree of 13 m x 25 cm dbh. White flowers with an exquisite smell of sweet caramel. Abundant flowering. Straight trunk, light brown scaly bark. Glossy leaves on both sides. | Arbol de 13 m x 25 cm dap[B]. Flores[B] blancas con un exquisito olor[B] a dulce de caramelo. Floracion[B] abundante. Tronco[B] derecho, corteza[B] escamosa pardo clara. Hojas[B] lustrosas en ambas caras[B]. |
As seen in the examples, the aim was to tag the entities that appeared in the specimen’s description. With this purpose in mind, CRFs, HMM, and BI-LSTM-CRF were used.
The Sklearn
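Algorithm | Class | Accuracy | Precision | Recall | F1-score |
---|---|---|---|---|---|
Conditional Random Fields (CRFs) | B | 0.9739 | 0.9799 | 0.9739 | 0.9769 |
Conditional Random Fields (CRFs) | I | 0.8908 | 0.9480 | 0.8908 | 0.9185 |
Conditional Random Fields (CRFs) | O | 0.9953 | 0.9933 | 0.9954 | 0.9943 |
Conditional Random Fields (CRFs) | Average | 0.9533 | 0.9737 | 0.9534 | 0.9633 |
Conditional Random Fields (CRFs) | Weighted Average | 0.9905 | 0.9906 | 0.9906 | 0.9906 |
Bi-LSTM Conditional Random Field (BI-LSTM-CRF) | B | 0.9781 | 0.9573 | 0.9782 | 0.9676 |
Bi-LSTM Conditional Random Field (BI-LSTM-CRF) | I | 0.8821 | 0.8037 | 0.8822 | 0.8411 |
Bi-LSTM Conditional Random Field (BI-LSTM-CRF) | O | 0.9887 | 0.9944 | 0.9887 | 0.9916 |
Bi-LSTM Conditional Random Field (BI-LSTM-CRF) | Average | 0.9494 | 0.9495 | 0.9536 | 0.9515 |
Bi-LSTM Conditional Random Field (BI-LSTM-CRF) | Weighted Average | 0.9856 | 0.9880 | 0.9880 | 0.9880 |
Hidden Markov Model (HMM) | B | 0.9823 | 0.9776 | 0.9824 | 0.9800 |
Hidden Markov Model (HMM) | I | 0.9712 | 0.8346 | 0.9713 | 0.8977 |
Hidden Markov Model (HMM) | O | 0.9927 | 0.9962 | 0.9927 | 0.9945 |
Hidden Markov Model (HMM) | Average | 0.9820 | 0.9361 | 0.9821 | 0.9574 |
Hidden Markov Model (HMM) | Weighted Average | 0.9908 | 0.9912 | 0.9908 | 0.9909 |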
A successful workflow was tested with the current project to extract phenological data from morphological descriptions of botanical specimens. Some elements of the project to highlight are:
Phenological traits data, such as the timing of plant leafing, flowering, and fruiting, have been suggested as indicators to measure how organisms respond to disturbances and changes in environmental conditions. This document has proposed a workflow that uses ML and NLP algorithms to integrate phenological data extracted from morphological descriptions in text format with other structured data available in specimen records (such as geographic coordinates, taxonomy and collection date). The integrated data, combined with abiotic records (e.g., temperature, precipitation, and humidity), could enable users (e.g., decision-makers, researchers, biodiversity institutes) to answer questions related to the possible effects of environmental changes that occur in time and space on particular species.
As far as we know, this work is the first to apply ML algorithms to specimen morphological descriptions to extract phenological data on flowering and fruiting. Results showed that it is possible to classify specimen morphological descriptions with more than 99% success (F1-score) using a multi-label approach (with classes like has_flowers and has_fruits) and to extract the characters and structure names from descriptions with more than 98% success (F1-score) using NER.
Although models, like the one proposed in this project, achieve excellent results, it is crucial to consider that, even though there are records of the planet’s biodiversity that have been systematically collected over hundreds of years, the available data are strongly unbalanced regarding taxa, locality, time, and the number of individuals.
The results of this project can be used to generate baseline data to feed the Phenology EBV from morphological descriptions of specimens written in any language, amongst other applications. Although data about the event duration as proposed by the USA-National Ecological Observatory Network (NEON)
The proposed workflow can be applied to the morphological descriptions of specimens of different biological groups, and there are no restrictions on the language used. For biodiversity networks that integrate data from multiple sources using different languages, it is also vital to evaluate cross-lingual algorithms, which reduce the need to manually tag descriptions in a target language by leveraging tagged descriptions from other languages. For more complex texts, more robust algorithms, such as Recurrent Neural Networks - LSTM and Transformers, can be applied.
Data from the National Biodiversity Institute of Costa Rica is used in this paper. The full dataset and documentation can be downloaded from https://www.gbif.org/dataset/3717f916-d983-4a81-bb13-5f91200871a6. Code for data cleaning and analysis is provided as part of the replication package. It is available at https://github.com/colibri-itcr.