Corresponding author: Maria Auxiliadora Mora-Cross
Academic editor: Editorial Secretary
Essential Biodiversity Variables (EBVs) make it possible to evaluate and monitor the state of biodiversity over time at different spatial scales. Their development is led by the Group on Earth Observations Biodiversity Observation Network (GEO BON) to harmonize, consolidate, and standardize biodiversity data from varied sources. This document presents a mechanism to obtain baseline data to feed the Species Traits Variable Phenology, or other biodiversity indicators, by extracting species characters and structure names from morphological descriptions of specimens and classifying those descriptions using machine learning (ML).
A workflow that performs Named Entity Recognition (NER) and classification of morphological descriptions using ML algorithms was evaluated with excellent results. It was implemented using Python, PyTorch, Scikit-learn, Pomegranate, python-crfsuite, and other libraries, applied to 106,804 herbarium records from the National Biodiversity Institute of Costa Rica (INBio). The text classification results were excellent (F1 score between 96% and 99%) using three traditional ML methods: Multinomial Naive Bayes (NB), Linear Support Vector Classification (SVC), and Logistic Regression (LR). Furthermore, the results of extracting the names of species morphological structures (e.g., leaves, trichomes, flowers, petals, sepals) and character names (e.g., length, width, pigmentation patterns, and smell) using NER algorithms were competitive (F1 score between 95% and 98%) using Hidden Markov Models (HMM), Conditional Random Fields (CRFs), and Bidirectional Long Short-Term Memory networks with CRF (BI-LSTM-CRF).
Biological diversity is a fundamental pillar of life on Earth. Therefore, the governments of the world committed themselves, through the United Nations Convention on Biological Diversity (CBD), to reducing the loss of biodiversity by meeting the Aichi Biodiversity Targets
Essential Biodiversity Variables (EBVs) are recommended as a global biodiversity monitoring and reporting system to assess the state of biodiversity over time. They provide the basis for generating biodiversity indicators that allow repeated assessments of progress against national and global conservation goals (e.g., the Sustainable Development Goals and the Aichi Biodiversity Targets)
Species traits include any measurable morphological, phenological, physiological, reproductive, or behavioral characteristics of individual organisms; nevertheless, they can also be generalized at the taxa and population levels. Recently, increasing efforts to integrate species traits have resulted in a significant amount of data available
Species traits have been suggested as indicator variables for monitoring the response of organisms to changes in the environment; for instance, phenological trait information related to changes in the timing of plant leafing, flowering, and fruiting can be used as an indicator of climate change impacts
On the other hand, the transformation of texts from taxonomic literature into structured data remains a key challenge in Biodiversity Informatics
Additionally, some ML techniques, such as NER and text classification, have been successfully applied to bioinformatics and biomedicine and, more recently, to BI. Text Classification and Named Entity Recognition (NER) are classic research topics in the NLP field. Text classification is a fundamental NLP technique to categorize unstructured text data into predefined labels or tags (widely used in sentiment analysis). The Allerdictor tool is an example of an application in bioinformatics that models sequences as text documents and uses Multinomial Naive Bayes (NB) or Support Vector Machines (SVM) for allergen classification
NER is the first step in many NLP tasks. It seeks to locate and classify named entities in free text into categories. The traditional NER task has expanded beyond identifying people, locations, and organizations to identifying dates, email addresses, book titles, protein names, and numbers, amongst other applications. Additionally, there has been a strong interest in using NER to extract product attributes from online data due to the rapid growth of e-commerce
The main objective of this project was to obtain baseline data to feed the Species Traits Variable Phenology and other biodiversity indicators by extracting species characters and structure names from morphological descriptions of specimens and classifying the descriptions using machine learning (ML). To achieve this goal, an ML workflow was tested to classify specimen descriptions to determine whether the plant had flowers and/or fruits at the time of collection and to extract species characters and structure names mentioned in the descriptions. A database with 106,804 records from the Herbarium of the National Biodiversity Institute of Costa Rica (INBio) was used to illustrate the proposed approach.
The remainder of the paper is structured as follows: the "Materials and methods" section provides the detailed workflow of the proposed approach, "Results" presents the evaluation metrics and results, and "Discussion" analyzes the results. Finally, conclusions and future work are discussed in "Conclusions".
This research work presents an effort to extract species morphological characters and structure names using NER algorithms and to classify specimen morphological descriptions to determine whether a given plant had flowers and/or fruits at the time of collection.
Successfully applying ML algorithms to NLP problems requires defining a workflow that includes phases like data selection and pre-processing, model training and testing, and model deployment. Fig.
The database contains 350,007 records from the kingdom
Fig.
Morphological descriptions of plant specimens use a semi-structured language characterised by
- They use many abbreviations and omit functional words and verbs, making sentences become telegraphic phrases to save space in scientific publications;
- Texts are written in a very technical language because the formal terminology is based on Latin;
- They contain primarily nouns, adjectives, numbers (measures) and, to a lesser extent, adverbs. Verbs are seldom used;
- The vocabulary used is repetitive;
- They are short because they are included on the specimen label, and sometimes the text is shortened to fit on the label (Fig.);
- They use highly standardised syntax even though they are written in natural language.
Supervised machine-learning algorithms were used to classify descriptions. Training supervised models involves adjusting their parameters using examples that allow models to map an input to the desired output, in this case, the target classes. Examples were built from the specimens' morphological descriptions by manually assigning each description to one of the classes (i.e., has_flowers and has_fruits). For example, the morphological description "
Two classes were used to classify specimen morphological descriptions and determine whether a plant had flowers and/or fruits at the time of collection: has_flowers and has_fruits, respectively. The 106,804 records from INBio’s database (i.e., Atta) were tagged. Fig.
To prepare examples, different standard approaches to sequence tagging
The following activities were carried out for the tagging process:
In addition to the has_flowers and has_fruits classes, the 106,804 specimens were associated with other classes, such as has_leaves and has_stems (has_root was not used because very few descriptions mentioned roots). These classes were used to randomly select two thousand records of each class to balance the presence of structures belonging to all classes. In total, eight thousand records were selected, covering the classes has_flowers, has_fruits, has_leaves, and has_stems. The FreeLing v.4.2 morphological analyzers and taggers were used to process the selected descriptions. Using the POS tags generated by FreeLing, each token was assigned a B, I, or O tag, depending on its role in the sentence. Each of the four team members manually reviewed the labels of two thousand records randomly selected from the eight thousand.
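The B, I, O assignment step can be sketched as follows. This is a minimal illustration with hypothetical tokens and entity spans, not the exact FreeLing-based pipeline used in the project:

```python
# Minimal sketch of BIO tagging: entities are given as (start, end) token
# spans, and each token receives a B (begin), I (inside) or O (outside) tag.
def bio_tags(tokens, entity_spans):
    """Return one B/I/O tag per token, given (start, end) entity spans."""
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"                # first token of the entity
        for i in range(start + 1, end):  # remaining tokens of the entity
            tags[i] = "I"
    return tags

tokens = ["frutos", "maduros", "rojos", ",", "flores", "blancas"]
# Hypothetical spans: "frutos maduros" and "flores" marked as entities.
print(list(zip(tokens, bio_tags(tokens, [(0, 2), (4, 5)]))))
```

In the project, the spans came from the POS tags produced by FreeLing and were then reviewed manually.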
The classification objective was to determine if each of the morphological descriptions of the specimens mentioned or not the presence of flowers or fruits, that is, to assign each description to the has_flowers and/or has_fruits classes. Each sample could be assigned to zero, one, or both classes; therefore, the classification problem corresponds to a multi-label classification task. The algorithms Multinomial Naive Bayes (NB)
The input to the models was a one-dimensional vector (x1, x2, ..., xn) containing the morphological descriptions. This 1D vector was converted to a feature matrix using TF-IDF (Term Frequency-Inverse Document Frequency) over the words occurring in the descriptions, with an n-gram range of (1, 3) (i.e., unigrams, bigrams, and trigrams).
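The feature-extraction step can be sketched with scikit-learn's `TfidfVectorizer`; the two example descriptions below are illustrative, not records from the INBio dataset:

```python
# Sketch of TF-IDF feature extraction over word n-grams of sizes 1 to 3.
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "Arbusto de 2 m. Flores blancas.",
    "Arbol de 8 m. Frutos verdes, inmaduros.",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 3))  # unigrams, bigrams, trigrams
X = vectorizer.fit_transform(descriptions)        # sparse matrix: docs x n-grams
print(X.shape)
```

Each row of `X` is then the feature vector for one morphological description.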
To estimate the skill of the models on new data, ten-fold cross-validation was used with the function cross_val_score (Scikit Learn) in combination with the NB, SVC, and LR algorithms
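A minimal sketch of this evaluation setup is shown below, treating each label (e.g., has_flowers) as a binary target and scoring each algorithm with ten-fold cross-validation. The tiny synthetic corpus is only for illustration; the study used 106,804 real descriptions:

```python
# Hedged sketch: TF-IDF features + NB/SVC/LR, scored with cross_val_score.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# Tiny synthetic corpus: 20 descriptions, alternating flower/fruit wording.
docs = ["flores blancas abundantes", "frutos verdes inmaduros"] * 10
y_flowers = [1, 0] * 10  # binary target for the has_flowers label

for name, clf in [("NB", MultinomialNB()),
                  ("SVC", LinearSVC()),
                  ("LR", LogisticRegression())]:
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), clf)
    scores = cross_val_score(pipe, docs, y_flowers, cv=10, scoring="f1")
    print(name, scores.mean())
```

In the multi-label setting of the paper, the same procedure is repeated for the has_fruits label.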
The aim of applying NER tagging to the data was to extract characters and structure names from morphological descriptions (e.g., flowers, trunk, color, height) where every token of a description was assigned a B, I or O tag. With this purpose in mind, the algorithms CRFs
In order to train the HMM model, bigram, sequence starting, and sequence ending counts were used to estimate the probability distribution and generate every state and transition that the model would use for its predictions.
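The counting step can be sketched in plain Python (the project itself used the Pomegranate library); the two tagged sequences below are hypothetical:

```python
# Sketch of maximum-likelihood estimation of HMM transition probabilities
# from bigram counts over BIO tag sequences.
from collections import Counter

tagged = [["B", "I", "O", "O"], ["O", "B", "O"]]  # hypothetical tag sequences

starts = Counter(seq[0] for seq in tagged)                          # sequence starts
bigrams = Counter((a, b) for seq in tagged for a, b in zip(seq, seq[1:]))
totals = Counter(t for seq in tagged for t in seq[:-1])             # outgoing counts

# P(next tag | current tag) estimated from the bigram counts.
trans = {(a, b): n / totals[a] for (a, b), n in bigrams.items()}
print(trans[("B", "I")])  # 0.5: "B" is followed by "I" in one of its two occurrences
```

Emission probabilities (word given tag) are estimated analogously from word/tag counts.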
To train the CRFs model, each token in the training data was converted into a set of features that was later fed to the model. The characteristics considered for every word were the word itself, its last three letters, whether it was a punctuation mark or a digit, its POS tag, and the first two letters of the POS tag. Each token's features were combined with those of the next and previous words in the sentence (if applicable). Afterwards, the model was trained with the hyperparameters established in Table
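The per-token feature extraction described above can be sketched as follows. Feature names follow the common python-crfsuite convention and are illustrative; the example POS tags are hypothetical FreeLing-style codes:

```python
# Sketch of CRF feature extraction: word identity, 3-letter suffix,
# punctuation/digit flags, POS tag, POS prefix, plus neighbour context.
import string

def word2features(sent, i):
    """sent is a list of (word, pos_tag) pairs; returns features for token i."""
    word, pos = sent[i]
    feats = {
        "word": word.lower(),
        "suffix3": word[-3:],                     # last three letters
        "is_punct": word in string.punctuation,
        "is_digit": word.isdigit(),
        "pos": pos,
        "pos2": pos[:2],                          # first two letters of POS tag
    }
    if i > 0:                                     # previous word, if any
        prev, ppos = sent[i - 1]
        feats.update({"-1:word": prev.lower(), "-1:pos": ppos})
    else:
        feats["BOS"] = True                       # beginning of sentence
    if i < len(sent) - 1:                         # next word, if any
        nxt, npos = sent[i + 1]
        feats.update({"+1:word": nxt.lower(), "+1:pos": npos})
    else:
        feats["EOS"] = True                       # end of sentence
    return feats

sent = [("Flores", "NCFP000"), ("blancas", "AQ0FP0")]
print(word2features(sent, 0)["suffix3"])  # 'res'
```

One such feature dictionary per token, grouped by sentence, is what a CRF trainer like python-crfsuite consumes.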
To train the BI-LSTM-CRF model, every word in the dataset was put into a dictionary that was later passed to the model; this had to be done for all records. The model worked with every sentence not as a string of words but as a tensor of the words' respective indexes in the dictionary. Once the data were ready, the model ran a forward pass with the negative log-likelihood cost function, then computed the loss and gradients and updated the model parameters. This process was repeated for every sentence in the training set in every epoch. The model was trained with the hyperparameters established in Table
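The vocabulary-indexing step can be sketched as below. Plain lists are used here for brevity; in the project these index sequences were wrapped in PyTorch tensors before being fed to the network:

```python
# Sketch of building the word dictionary and encoding sentences as index
# sequences (hypothetical sentences, not records from the dataset).
sentences = [["Flores", "blancas"], ["Frutos", "maduros", "rojos"]]

word_to_ix = {}
for sent in sentences:
    for word in sent:
        if word not in word_to_ix:          # assign the next free index
            word_to_ix[word] = len(word_to_ix)

encoded = [[word_to_ix[w] for w in sent] for sent in sentences]
print(encoded)  # [[0, 1], [2, 3, 4]]
```

With PyTorch, each `encoded` sentence would become `torch.tensor(seq, dtype=torch.long)` before the forward pass.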
This section presents a report of the experimental results for both classification and NER tests.
Models' skills were estimated using ten-fold cross-validation to prevent overfitting and reduce bias. After executing the ten training sequences and tests of different models, metrics such as accuracy, precision, recall, and F1 score by algorithm and class were computed, and the average of the results was calculated. Table
To measure the impact of different collectors' writing styles on the results, in a second experiment, training and test data were partitioned using the number of specimens gathered per collector. The test was carried out to verify whether the resulting models were merely trained to parse the writing of the most prolific collectors. Specimen descriptions written by collectors with different numbers of gatherings were selected for testing the models; the rest of the samples were used to train them. Fig.
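A related way to partition by collector, sketched below with scikit-learn's `GroupShuffleSplit`, guarantees that no collector's descriptions appear in both the training and test sets. The data and collector labels are illustrative, and this random group split is a simplification of the paper's selection by number of gatherings:

```python
# Hedged sketch: train/test split grouped by collector, so each collector's
# descriptions land entirely in one side of the split.
from sklearn.model_selection import GroupShuffleSplit

descriptions = ["d1", "d2", "d3", "d4", "d5", "d6"]   # placeholder texts
collectors   = ["A",  "A",  "B",  "B",  "C",  "C"]    # collector per specimen

gss = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=0)
train_idx, test_idx = next(gss.split(descriptions, groups=collectors))

train_groups = {collectors[i] for i in train_idx}
test_groups = {collectors[i] for i in test_idx}
assert train_groups.isdisjoint(test_groups)  # collectors never overlap
print(sorted(test_groups))
```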
As seen in the examples, the aim was to tag the entities that appeared in the specimen’s description. With this purpose in mind, CRFs, HMM, and BI-LSTM-CRF were used.
The Sklearn
A successful workflow was tested in the current project to extract phenological data from morphological descriptions of botanical specimens. Some elements of the project to highlight are:
- The results achieved in the classification experiments showed that it is feasible, and generalisable to other biological groups, to use specimen morphological descriptions to automatically obtain phenological data, which most of the time is only available in text format. The SVC models surpassed the NB and LR models, with an average F1 score higher than 0.995 (Table).
- The NER experiment results showed that the HMM and CRFs models performed better than the BI-LSTM-CRF model (Table).
- Certain words in the Spanish vocabulary received mistaken POS tags: FreeLing would often confuse nouns with similar-sounding verbs.
- The NER models had problems differentiating between an entity composed of an entity name plus an adjective (i.e., "frutos[B] maduros[I] rojos" - "red ripe fruits") and the same adjective merely describing the entity (i.e., "frutos[B] maduros" - "ripe fruits").
- The characteristics of the descriptions may explain why the FreeLing tools were not as effective at tagging nouns that are key elements for NER. This made the manual review of the tagged text more time-consuming.
- Although the classes were highly unbalanced in all experiments and the description length ranged from 4 to 952 characters, the models' performance was not affected. This was mainly due to the large amount of data used during the training phase and the characteristics of the descriptions.
- The data used were collected by INBio throughout the country, over a long period, and by more than 400 botanists and technicians, which gives an idea of how variable the descriptions were. Figures 4 and 5 present these data in detail.
- Most of the time, specimen morphological descriptions are not shared in global networks that integrate biodiversity data, such as the Global Biodiversity Information Facility (GBIF); sharing them could make it easier to carry out experiments integrating multiple sources and multiple languages.
Phenological traits data, such as the timing of plant leafing, flowering, and fruiting, have been suggested as indicators to measure how organisms respond to disturbances and changes in environmental conditions. This document has proposed a workflow that uses ML and NLP algorithms to integrate phenological data extracted from morphological descriptions in text format with other structured data available in specimen records (such as geographic coordinates, taxonomy and collection date). The integrated data, combined with abiotic records (e.g., temperature, precipitation, and humidity), could enable users (e.g., decision-makers, researchers, biodiversity institutes) to answer questions related to the possible effects of environmental changes that occur in time and space on particular species.
As far as we know, this work is the first to apply ML algorithms to specimen morphological descriptions to extract phenological data on flowering and fruiting. Results showed that it is possible to classify specimen morphological descriptions with more than 99% success (F1-score) using a multi-label approach (with classes like has_flowers and has_fruits) and to extract the characters and structure names from descriptions with more than 98% success (F1-score) using NER.
Although models, like the one proposed in this project, achieve excellent results, it is crucial to consider that, even though there are records of the planet’s biodiversity that have been systematically collected over hundreds of years, the available data are strongly unbalanced regarding taxa, locality, time, and the number of individuals.
The results of this project can be used to generate baseline data to feed the Phenology EBV from morphological descriptions of specimens written in any language, amongst other applications. Although data about the event duration as proposed by the USA-National Ecological Observatory Network (NEON)
The proposed workflow can be applied to the morphological descriptions of specimens of different biological groups, and there are no restrictions on the language used. For biodiversity networks that integrate data from multiple sources using different languages, it is also vital to evaluate cross-lingual algorithms to alleviate the need to manually tag descriptions in a target language by leveraging tagged descriptions from other languages. For more complex texts, more robust algorithms, such as Recurrent Neural Networks (LSTM) and Transformers, can be applied.
Data from the National Biodiversity Institute of Costa Rica is used in this paper. The full dataset and documentation can be downloaded from
The proposed general workflow includes two phases: A) Data Selection and Preprocessing using the Atta database (INBio). First, the data were cleaned by removing duplicate records, records written in English and null morphological descriptions, amongst other processes. Then, two datasets were selected for the next phase, one for Classification and one for NER. Those datasets were used for training and test activities. B) During the Models Training and Test phase, models were generated using algorithms such as: Multinomial Naive Bayes (NB), Linear Support Vector Classification (SVC) and Logistic Regression (LR) for Classification and Hidden Markov Model (HMM), Conditional Random Fields (CRF), and Bidirectional Long Short Term Memory Networks with CRF (BI-LSTM-CRF) for NER. Metrics like accuracy, precision, recall, and F1 score were used to test them.
A specimen from INBio’s collection showing the morphological description of a holotype of Stemmadenia abbreviata J. F. Morales, Novon 9(2): 236. 1999. TYPE. Costa Rica. Heredia: La Selva, OTS Field Station on the Río Peje, April 1982, B. Hammel 11680 (holotype, INB)
Collection sites of INBio’s herbarium specimens currently available at the data portal of the
Histogram of records by year of collection. Years with few records, from 1892 to 1981, were excluded in the graph (i.e., 110 specimen records were not taken into consideration).
Histogram of the number of characters, including blanks, in specimen morphological descriptions from the INBio Herbarium.
The number of morphological descriptions assigned to zero, one, or two classes (i.e., has_flowers and has_fruits).
Number of words in the specimen morphological descriptions with the B, I, O labels assigned in the selected samples.
Results of applying the algorithms to text written by collectors with from one up to 500 collected samples. The test was carried out to measure the impact of different collectors' writing styles on the results and to verify whether the resulting models were merely trained to parse the writing of the most prolific collectors. Training and test data were partitioned using the number of specimens gathered per collector. Specimen descriptions written by collectors with different numbers of gatherings were selected for testing the models; the rest of the samples were used to train them.
Hyperparameters used to train the CRFs model.
| Hyperparameter | Value |
| --- | --- |
| Coefficient for L1 penalty | 0.1 |
| Coefficient for L2 penalty | 0.1 |
| Maximum iterations | 40 |
Hyperparameters used to train the BI-LSTM-CRF model.
| Hyperparameter | Value |
| --- | --- |
| Hidden dimension | 4 |
| Embedding dimension | 5 |
| Learning rate | 0.01 |
| Weight decay | 1e-4 |
| Epochs | 20 |
Examples of types of morphological descriptions used in these experiments.
English translations of the original Spanish descriptions:

- "Hanging epiphyte. Bracts and calyx purple. Purple and white corolla, purple-green style and stamens, purple-white pedicel. Path shores."
- "Juvenile shrub, 1.2 m; new leaves reddish, old leathery. Common in ravines next to the road. Voucher for phylogenetic/k-dna study. sytsma."
- "Grass of 4-5 m. Petioles ca. 1.5-2.5 m, leaf blade 2-4 m. Pendulous inflorescence, circinate distichous bracts, red, basal 1/4 red-yellow. Yellow flowers, hidden amongst the bracts. Fruits violet, immature. Low canopy, scarce."
- "Small tree of 6 m x 8 cm dbh. Foliage with an intense-green upper surface and a faint green underside. Vegetative buds and twigs light brown. Faint-orange fruits with greenish-white seeds, covered with bright, intense red arils."
Number of specimen morphological descriptions by class, with average description length in characters and standard deviation.
| has_flowers | has_fruits | Number of records | Min-Max length (characters) | Average length | Standard deviation |
| --- | --- | --- | --- | --- | --- |
| No | No | 23,254 | 4-575 | 59.88 | 40.18 |
| No | Yes | 26,900 | 7-952 | 66.15 | 34.64 |
| Yes | No | 42,949 | 11-708 | 69.55 | 38.44 |
| Yes | Yes | 13,701 | 27-895 | 93.88 | 51.13 |
Average precision (P), recall (R), accuracy, and F1 score (F1) computed using ten-fold cross-validation for each algorithm and class.
| Algorithm | Class | Accuracy | P | R | F1 |
| --- | --- | --- | --- | --- | --- |
| Multinomial Naive Bayes (NB) | has_flowers | 0.9626 | 0.9462 | 0.9855 | 0.9655 |
| | has_fruits | 0.9759 | 0.9851 | 0.9510 | 0.9677 |
| | Average | 0.9693 | 0.9657 | 0.9682 | 0.9666 |
| Logistic Regression (LR) | has_flowers | 0.9888 | 0.9979 | 0.9810 | 0.9894 |
| | has_fruits | 0.9904 | 0.9998 | 0.9749 | 0.9872 |
| | Average | 0.9896 | 0.9989 | 0.9780 | 0.9883 |
| Linear Support Vector Classification (SVC) | has_flowers | | | | |
| | has_fruits | | | | |
| | Average | | | | |
Examples of types of morphological descriptions used in NER experiments.
| English translation | Original description (Spanish) with NER tags |
| --- | --- |
| Epiphyte. Flowers with corolla pink red with white borders, external floral tube pink red with white pubescence, white filaments, dim green anthers and corolla. | Epifita. Flores[ |
| Hanging climbing liana. Vegetative buds reddish-brown. Green calyx, white corolla. Immature fruits green, mature bright pink. | Liana trepadora, colgante. Brotes[ |
| Tree 15 m x 25 m dbh; secondary nerves almost invisible; main vein sunken in the adaxis; very smooth leaves; the petiole lacks milky sap. Common name: none. | Arbol 15 m x 25 m dap[ |
| Tree of 13 m x 25 cm dbh. White flowers with an exquisite smell of sweet caramel. Abundant flowering. Straight trunk, light brown scaly bark. Glossy leaves on both sides. | Arbol de 13 m x 25 cm dap[ |
Average precision (P), recall (R), accuracy, and F1 score (F1) for each NER algorithm and tag.
| Algorithm | Tag | P | R | Accuracy | F1 |
| --- | --- | --- | --- | --- | --- |
| Conditional Random Fields (CRFs) | B | 0.9739 | | 0.9739 | 0.9769 |
| | I | 0.8908 | | 0.8908 | |
| | O | | 0.9933 | | 0.9943 |
| | Average | 0.9533 | | 0.9534 | |
| | Weighted average | 0.9905 | 0.9906 | 0.9906 | 0.9906 |
| BI-LSTM Conditional Random Field (BI-LSTM-CRF) | B | 0.9781 | 0.9573 | 0.9782 | 0.9676 |
| | I | 0.8821 | 0.8037 | 0.8822 | 0.8411 |
| | O | 0.9887 | 0.9944 | 0.9887 | 0.9916 |
| | Average | 0.9494 | 0.9495 | 0.9536 | 0.9515 |
| | Weighted average | 0.9856 | 0.9880 | 0.9880 | 0.9880 |
| Hidden Markov Model (HMM) | B | | 0.9776 | | |
| | I | | 0.8346 | | 0.8977 |
| | O | 0.9927 | | 0.9927 | |
| | Average | | 0.9361 | | 0.9574 |
| | Weighted average | | | | |