Corresponding author: Quentin Groom (
Academic editor:
We describe an effective approach to automated text digitisation with respect to natural history specimen labels. These labels contain much useful data about the specimen including its collector, country of origin, and collection date. Our approach to automatically extracting these data takes the form of a pipeline. Recommendations are made for the pipeline's component parts based on state-of-the-art technologies.
Optical Character Recognition (OCR) can be used to digitise text on images of specimens. However, recognising text quickly and accurately from these images can be a challenge for OCR. We show that OCR performance can be improved by prior segmentation of specimen images into their component parts. This ensures that only text-bearing labels are submitted for OCR processing as opposed to whole specimen images, which inevitably contain non-textual information that may lead to false positive readings. In our testing Tesseract OCR version 4.0.0 offers promising text recognition accuracy with segmented images.
Not all the text on specimen labels is printed. Handwritten text varies much more and does not conform to standard shapes and sizes of individual characters, which poses an additional challenge for OCR. Recently, deep learning has allowed for significant advances in this area. Google's Cloud Vision, which is based on deep learning, is trained on large-scale datasets, and is shown to be quite adept at this task. This may take us some way towards negating the need for humans to routinely transcribe handwritten text.
Determining the countries and collectors of specimens has been the goal of previous automated text digitisation research activities. Our approach also focuses on these two pieces of information. An area of Natural Language Processing (NLP) known as Named Entity Recognition (NER) has matured enough to semi-automate this task. Our experiments demonstrated that existing approaches can accurately recognise location and person names within the text extracted from segmented images via Tesseract version 4.0.0.
We have highlighted the main recommendations for potential pipeline components. The paper also provides guidance on selecting appropriate software solutions. These include automatic language identification, terminology extraction, and integrating all pipeline components into a scientific workflow to automate the overall digitisation process.
We do not know how many specimens are held in the world's museums and herbaria. However, estimates of three billion seem reasonable (
Perhaps the method most widely used today to extract these data from labels is for expert technicians to type the specimen details into a dedicated collection management system. They might, at the same time, georeference specimens where coordinates are not already provided on the specimen label. Volunteers have often been recruited to help with this process and, in some cases, transcription has been outsourced to companies specialising in document transcription (
Nevertheless, human transcription of labels is slow and requires both skill to read the handwritten labels and knowledge of the names of places, people, and organisms. These labels are written in many languages often in the same collection and sometimes on the same label. Furthermore, abbreviations are frequently used and there is little standardisation on where each datum can be found on the label.
Full or partial automation of this process is desirable to improve the speed and accuracy of data extraction and to reduce the associated costs. Automating even the simplest tasks such as triaging the labels by language or writing method (typed versus handwritten) stands to improve the overall efficiency of the human-in-the-loop approach. Optical Character Recognition (OCR) and Natural Language Processing (NLP) are two technologies that may support automation. OCR aims to convert images of text into a machine-readable format (
OCR and NLP proved effective for extracting data from biodiversity literature (
This paper examines the state of the art in automated text digitisation with respect to specimen images. The recommendations within are designed to enhance the digitisation and transcription pipelines that exist at partner institutions. They are also intended to provide guidance towards a proposed centralised specimen enrichment pipeline that could be created under a pan-European Research Infrastructure for biodiversity collections (
In this paper, we focus mainly on herbarium specimens, even though similar data extraction problems exist for pinned insects, liquid collections, and animal skins. Herbarium specimens are among the most difficult targets and we know from recent successful pilot studies for large-scale digitisation such as Herbadrop (
We now outline a potential digitisation workflow, which is designed to process specimens and extract targeted data from them (Fig.
To make these text documents searchable by the type of information that they contain, another layer of information (metadata) is required on top of the original text. This step requires deeper analysis of the textual content, which is performed using NLP including language identification, Named Entity Recognition (NER), and terminology extraction. The role of language identification here is twofold. If the labels are to be transcribed manually, then language identification can help us direct transcription tasks to the transcribers with suitable language skills. Similarly, if the labels were to be processed automatically, then the choice of tools will also depend on the given language.
NER will support further structuring of the text by interpreting relevant portions of the text, such as those referring to people and locations. In addition to the extracted data and the associated metadata, the digitised collection should also incorporate a terminology that facilitates the interpretation of the scientific content described in the specimens. Many specimen labels contain either obscure or outdated terminology. Therefore, standard terminologies need to be supplemented by terminology extracted from the specimens.
Finally, the performance of both OCR and NLP can be improved by restricting their view to only the labels on the specimen. This can be achieved by segmenting images prior to processing by identifying the areas of the image that relate to individual labels. However, there are trade-offs between the time it takes to segment images compared to the improved performance of OCR and NLP. In a production environment processing time is limited because of the need to ingest images into storage from a production line through a pipeline that includes quality control, the creation of image derivatives, and image processing.
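To make the segmentation step concrete, the sketch below finds connected dark regions in a binarised specimen image and returns their bounding boxes, which could then be cropped and submitted separately to OCR. This is an illustrative assumption, not the segmentation method used by the partner institutions; a production system would operate on real image data via an image-processing library, whereas here the image is represented as a 2D list of 0/1 values.

```python
from collections import deque

def bounding_boxes(binary, min_area=4):
    """Find bounding boxes of connected foreground (1) regions.

    binary: 2D list of 0/1 values (1 = ink / dark pixel).
    Returns a list of (top, left, bottom, right) boxes, inclusive.
    """
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] == 1 and not seen[r][c]:
                # Breadth-first flood fill over 4-connected neighbours.
                queue = deque([(r, c)])
                seen[r][c] = True
                top, left, bottom, right, area = r, c, r, c, 0
                while queue:
                    y, x = queue.popleft()
                    area += 1
                    top, bottom = min(top, y), max(bottom, y)
                    left, right = min(left, x), max(right, x)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols \
                                and binary[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                if area >= min_area:  # drop specks smaller than min_area pixels
                    boxes.append((top, left, bottom, right))
    return boxes
```

Each returned box would correspond to one candidate segment (label, barcode, ruler, or colour chart), to be filtered or classified before OCR.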
To help determine the subsequent steps in the pipeline it may be necessary to establish the language of the text recognised in the OCR step. This next step may be the deployment of language-specific NLP tools to identify useful information in the target specimen. Or it may be the channelling of the text for manual transcription. A number of software solutions exist for performing language identification and are explored in
An approach to automatic identification of data from OCR recognised text might include NER. This is an NLP task that identifies categories of information such as people and places. This approach may be suitable for finding a specimen's collector and collection country from text.
This project report was written as a formal Deliverable (D4.1) of the
As noted above there is a large body of digitised herbarium specimens available for experimentation. A herbarium is a collection of pressed plant specimens and associated data (Fig.
Each partner herbarium contributed 200 images containing a geographical and temporal cross-section of nomenclatural type and non-type herbarium specimens (Fig.
A total of nine herbaria, described in Table
To illustrate the textual content of these images and to better understand the challenges posed to the OCR, Fig.
The above list is not exhaustive; a collector or determiner may record more or less information than this.
The properties of the textual content of each herbarium were extrapolated from a random sample of 10 specimens per institution (Table
A subset of 250 images with labels written in English has been selected to test the performance of image segmentation and its effects on OCR and NER. For the purposes of these tests these images were manually divided into a total of 1,837 label segments, which were then processed separately.
The segments effectively separate labels, barcodes, and colour charts. Examples can be seen in Fig.
The role of OCR is to convert image text into searchable text. To make this text searchable by the type of information it contains, another layer of information (metadata) is required on top of the original text. We can differentiate between three types of metadata (
While metadata can take many forms, it is important to comply with a common standard to improve accessibility to the data. Darwin Core (
The problem of populating a predefined template such as the one defined by Darwin Core with information found in free text is an area of NLP known as Information Extraction (IE) (
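A minimal sketch of such template filling is shown below: hand-written regular-expression rules map fragments of OCR text onto Darwin Core fields. The field names (`recordedBy`, `eventDate`, `country`) are genuine Darwin Core terms, but the patterns and the sample label are invented for illustration; a real IE system would rely on trained NER models rather than hand-crafted rules.

```python
import re

# Hand-written patterns standing in for a trained IE model (illustrative only).
RULES = {
    "recordedBy": re.compile(r"(?:Leg|Coll)\.?\s*:?\s*(?P<v>[A-Z][A-Za-z. ]+)"),
    "eventDate": re.compile(r"(?P<v>\d{1,2}[./]\d{1,2}[./]\d{2,4})"),
    "country": re.compile(r"Country\s*:?\s*(?P<v>[A-Z][A-Za-z ]+)"),
}

def extract_darwin_core(ocr_text):
    """Populate a (partial) Darwin Core record from free OCR text."""
    record = {}
    for field, pattern in RULES.items():
        match = pattern.search(ocr_text)
        if match:
            record[field] = match.group("v").strip()
    return record
```

Applied to a hypothetical label such as `"Country: Tanzania\nLeg.: J. Smith 3/8/59"`, this yields a record with `country`, `recordedBy`, and `eventDate` filled, leaving the remaining Darwin Core fields empty.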
This section describes a selection of software tools that can be used to automate the steps of the digitisation workflow shown in Fig.
OCR is a technology that allows the automatic recognition of characters through an optical mechanism or computer software (
We tested three off-the-shelf OCR software tools, described in Table
Microsoft's OneNote is a note-taking and management application for collecting, organising, and sharing digital information (
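Of the three tools, Tesseract is the one most readily scripted from the command line. The helper below builds such an invocation for a single label segment; the file paths are hypothetical, while `-l` (language model) and `--psm` (page segmentation mode) are genuine Tesseract 4 options.

```python
import subprocess

def build_tesseract_cmd(image_path, output_base, lang="eng", psm=3):
    """Build a Tesseract 4 command line.

    -l    selects the language model (e.g. "eng");
    --psm selects the page segmentation mode (3 = fully automatic).
    Recognised text is written to <output_base>.txt.
    """
    return ["tesseract", str(image_path), str(output_base),
            "-l", lang, "--psm", str(psm)]

def ocr_segment(image_path, output_base):
    """Run Tesseract on one label segment (requires tesseract installed)."""
    subprocess.run(build_tesseract_cmd(image_path, output_base), check=True)
```

Iterating `ocr_segment` over the cropped segments of a specimen, rather than the whole sheet image, reflects the segmented-input strategy evaluated in this section.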
To evaluate the OCR performance of the aforementioned software tools, we ran two sets of experiments, one against the whole digital images of specimens and the other against the segmented images with an expectation that the latter would result in shorter processing time and higher accuracy. Indeed, the results shown in Table
The accuracy of OCR was measured in terms of line correctness, as described by
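One plausible implementation of such a line-correctness measure is sketched below: the percentage of manually transcribed gold-standard lines that are reproduced exactly (after whitespace normalisation) somewhere in the OCR output. The cited definition may differ in details such as the treatment of partial matches.

```python
def line_correctness(gold_lines, ocr_text):
    """Percentage of gold-standard lines reproduced exactly in the OCR output.

    A gold line counts as correct if, after whitespace normalisation,
    it appears as a line of the OCR output.
    """
    def norm(line):
        return " ".join(line.split())

    ocr_lines = {norm(line) for line in ocr_text.splitlines() if line.strip()}
    gold = [norm(line) for line in gold_lines if line.strip()]
    if not gold:
        return 0.0
    correct = sum(1 for line in gold if line in ocr_lines)
    return 100.0 * correct / len(gold)
```

For example, if two of three gold lines are found verbatim in the OCR text, the score is 66.7%.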
Bearing in mind the time and effort involved in creating the gold standard, only a subset of the dataset (250 specimen images and their segments) available for testing was used to evaluate the correctness of the OCR. Five herbarium sheet images, their segments and manual transcriptions, and OCR text used in these experiments can be found in Section 2 of Suppl. material
Apart from ABBYY FineReader Engine all other tools recorded an accuracy around 70%, with Tesseract 4.0.0 proving to be the most robust with respect to image segmentation. Its performance could be improved by further experiments focusing on its configuration parameters.
As mentioned in
ABBYY FineReader Engine 12.0 and Google Cloud Vision OCR v1 (
We performed an experiment to measure the HTR performance of both ABBYY FineReader Engine and Google Cloud Vision with respect to handwritten specimen labels. The five specimen whole images used in
The HTR results from ABBYY FineReader Engine and Google Cloud Vision were compared against the gold standard for each specimen image using Levenshtein distance (
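Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another, so a distance of 0 means the HTR output matches the gold-standard field exactly. A standard dynamic-programming implementation (a generic sketch, not the specific code used in these experiments) is:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]
```

For instance, `levenshtein("kitten", "sitting")` is 3, the textbook example of one substitution at each end and one insertion.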
One must be cautious when comparing HTR output against interpreted gold standard data. For example, where the catalogue number is "BM000521570", Google Cloud Vision finds "000521570 (BM)". Technically, Google Cloud Vision has found the correct string, but because the gold standard contains an interpreted value, it appears to be incorrect. Another example concerns gold standard fields that use abbreviations, such as country codes: "Australia" and its country code "AU" will be scored as different even though the reading is correct.
Specific fields were identified for HTR analysis: catalogNumber, genus, specificEpithet, country, recordedBy, typeStatus, verbatimLocality, and verbatimRecordedBy. Verbatim coordinates are likely too complex, or too often open to interpretation, to be compared reliably in this analysis. Similarly, verbatimEventDate was excluded because it is not strictly verbatim: a date may be written "3/8/59" on a specimen label but recorded as "1959-08-03" in a specimen database (
Note that typeStatus is not always present in a specimen image. It is therefore often inferred based on other data that is present. It was nevertheless included in the analysis because of its importance in biodiversity taxonomy.
Fig.
Examining the results in Fig.
In conclusion, this comparative test indicates that the results from Google Cloud Vision are of higher quality than those from ABBYY FineReader Engine, and higher still when the lowest-scoring categories are excluded. These results demonstrate that HTR can already retrieve a considerable volume of high-quality data; it should no longer be dismissed as ineffective.
Language identification is the task of determining the natural language that a document is written in. It is a key step in automatic processing of real-world data where a multitude of languages exist (
A number of off-the-shelf software tools can be used to perform language identification, examples of which can be seen in Table
Table
The program language-detection (
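The tools above are what we would recommend in practice, but the character n-gram principle most of them share can be illustrated with a toy profile-based classifier in the style of Cavnar and Trenkle's out-of-place measure. This sketch is an illustration only, trained here on two invented sample sentences, and is no substitute for the evaluated tools.

```python
from collections import Counter

def trigram_profile(text, top=300):
    """Rank character trigrams by frequency (a classic n-gram profile)."""
    text = " ".join(text.lower().split())
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [g for g, _ in grams.most_common(top)]

def identify(text, profiles):
    """Return the language whose profile is closest in trigram rank order."""
    doc = trigram_profile(text)

    def distance(profile):
        penalty = len(profile)  # out-of-place penalty for missing trigrams
        return sum(abs(i - profile.index(g)) if g in profile else penalty
                   for i, g in enumerate(doc))

    return min(profiles, key=lambda lang: distance(profiles[lang]))
```

Real tools such as langid.py train such profiles (and more sophisticated models) on large multilingual corpora and return a probability estimate alongside the language code.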
NER is commonly used in information extraction to identify text segments that refer to entities from predefined categories (
As mentioned in
According to
- Exact match: both the boundaries of a named entity and its type match. For example, the segment “Ilkka Kukkonen” in Fig.
- Partial match: two text segments overlap partially and match on the type.
Either way, the NER results are usually evaluated using the three most commonly used measures in NLP: precision, recall, and F1 score. In the context of NER, precision is the fraction of automatically recognised entities that are correct, whereas recall is the fraction of manually annotated named entities that were successfully recognised by the NER system. F1 score is a measure that combines precision and recall - it is the harmonic mean of the two.
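These three measures follow directly from the counts of true positives (TP), false positives (FP), and false negatives (FN):

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Standard NER evaluation measures.

    precision = TP / (TP + FP): fraction of recognised entities that are correct;
    recall    = TP / (TP + FN): fraction of gold entities that were recognised;
    F1        = harmonic mean of precision and recall.
    """
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, a system that recognises 10 entities of which 8 are correct, against a gold standard of 16 entities, scores a precision of 0.8, a recall of 0.5, and an F1 of about 0.62.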
Table
To evaluate the performance of NER on our dataset, we selected a subset of five herbarium sheet images and their segments, which are to be found in Section 3 of Suppl. material
Table
An improvement across all measures can be observed when using OCR text from segmented images. This is consistent with the increased line correctness described in
To improve the accessibility of a specimen collection, its content needs to be not only digitised but also organised in alphabetical or some other systematic order. This is naturally expected to be done by species name. The problem with old specimens is that the content of their labels is not likely to comply with today's standards. Therefore, matching them against existing taxonomies will fail to recognise non-standard terminology. To automatically extract species names together with other relevant terminology, we propose an unsupervised data-driven approach to terminology extraction. FlexiTerm is a method developed in-house at Cardiff University. It has been designed to automatically extract multi-word terms from a domain-specific corpus of text documents (
OCR text extracted from specimens in a given herbarium fits a description of a domain-specific corpus; therefore FlexiTerm can exploit linguistic and statistical patterns of language use within a specific herbarium to automatically extract relevant terminology. Section 4 of Suppl. material
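FlexiTerm combines linguistic filters with statistical ranking and the normalisation of orthographic variants. The simplified sketch below captures only the statistical core of such an approach, ranking multi-word candidates by corpus frequency after normalising case and punctuation; it is an illustration of the idea, not FlexiTerm's actual algorithm, and its stopword list is a placeholder.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "et"}

def extract_terms(corpus, n=2, min_freq=2):
    """Rank candidate n-word terms by corpus frequency.

    Candidates are runs of n adjacent words (n=2 by default) with stopwords
    excluded; lower-casing and stripping punctuation collapses minor
    orthographic variation between labels.
    """
    counts = Counter()
    for document in corpus:
        words = re.findall(r"[a-z]+", document.lower())
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if not any(w in STOPWORDS for w in gram):
                counts[" ".join(gram)] += 1
    return [(term, freq) for term, freq in counts.most_common()
            if freq >= min_freq]
```

Run over the OCR text of one herbarium, recurring binomials and habitat phrases would rise to the top of such a ranking, including non-standard or outdated names absent from today's taxonomies.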
Many scientific disciplines are increasingly data driven and new scientific knowledge is often gained by scientists putting together data analysis and knowledge discovery “pipelines” (
A scientific workflow consists of a series of analytical steps. These can involve data discovery and access, data analysis, modelling and simulation, and data mining. Steps can be computationally intensive and therefore are often carried out on high‐performance computing clusters. Herbadrop, a pilot study of specimen digitisation using OCR, demonstrated successful use of high performance digital workflows (
The tools that allow scientists to compose and execute scientific workflows are generally known as workflow management systems, of which
Apache Taverna is open-source and domain-independent (
Taverna was successfully deployed within the domain of biodiversity via BioVeL - a virtual laboratory for data analysis and modelling in biodiversity (
Taverna supported BioVeL users by allowing them to create workflows via a visual interface as opposed to writing code. Users were presented with a selection of processing steps and could “drag and drop” them to create a workflow. They could then test the workflow by running it on their desktop machine before deploying it to more powerful computing resources.
Kepler is a scientific workflow application also designed for creating, executing and sharing analyses across a broad range of scientific disciplines (
Like Taverna, Kepler provides a graphical user interface to aid in the selection of analytical components to form scientific workflows (
Tools like Apache Taverna and Kepler can be used for creating workflows for OCR, NER, and IE, like that depicted in Fig.
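At their simplest, such workflows are ordered compositions of processing steps. The sketch below chains hypothetical OCR, language-identification, and NER stages into one callable pipeline, standing in for what Taverna or Kepler would express graphically; each stand-in stage would in reality wrap an external tool.

```python
from functools import reduce

def pipeline(*steps):
    """Compose processing steps into a single workflow function."""
    def run(data):
        return reduce(lambda value, step: step(value), steps, data)
    return run

# Hypothetical stages; in a real workflow each wraps an external tool.
def ocr(image):
    """Stand-in for OCR: return the label text of a specimen image."""
    return image["label_text"]

def identify_language(text):
    """Stand-in for language identification."""
    return {"text": text, "lang": "en"}

def ner(doc):
    """Stand-in for NER: treat title-cased words as candidate entities."""
    return {**doc, "entities": [w for w in doc["text"].split() if w.istitle()]}

workflow = pipeline(ocr, identify_language, ner)
```

A workflow management system adds to this bare composition the provenance tracking, error handling, and distributed execution needed at collection scale.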
We designed a modular approach for automated text digitisation with respect to specimen labels (Fig.
For the sake of brevity the appendices can be found in the supplementary document "
- OCR Software Settings
- OCR Line Correctness Analysis Data
- NER Analysis Data
- Non-standard Terminology Extraction Analysis Data
Contribution types are drawn from CRediT -
A range of specimens that demonstrate the wide taxonomic range of specimens encountered in collections. They also demonstrate the diversity of label types, which include handwritten, typed, and printed labels. Note the presence of various barcodes, rulers, and a colour chart in addition to labels describing the origin of the specimen and its identity.
Herbarium specimen (
Pinned insect specimen (
Microscope slide (
Fossilised animal skin (
Liquid preserved specimen (
A possible semi-automatic digitisation workflow to extract data from the labels of collection specimens.
The criteria used by each contributing institution to select a test set of 200 herbarium specimens. We did not attempt global coverage but instead aimed at a representative sample from BR=Brazil, CN=China, ID=Indonesia, AU=Australasia, US=United States of America, and TZ=Tanzania.
An example of specimen labels. 1=Title, 2=Barcode, 3=Species name, 4=Determined by and date, 5=Locality, 6=Habitat and altitude, 7=Notes, 8=Collector name, specimen number, and collection date.
An impression of the different challenges presented by specimen image segments. 1=Label with both printed and handwritten text, 2=Printed label oriented vertically, 3=Barcode composed of irrelevant characters, 4=Colour chart containing no text, 5=Ruler containing no useful text.
An example of an instantiated Darwin Core record.
Measuring OCR accuracy.
Specimen source: NHM Data Portal (
Comparison of Levenshtein distance scores for ABBYY FineReader Engine and Google Cloud Vision for selected fields, Levyear>0 excluded.
A summary of the Levenshtein distance scores for different label elements from handwritten text recognition using ABBYY FineReader Engine. HTR results are compared to label data interpreted by humans.
A summary of the Levenshtein distance scores for different label elements from handwritten text recognition using Google Cloud Vision. HTR results are compared to label data interpreted by humans.
The distribution of languages across the specimen and herbaria. EN=English, FR=French, LA=Latin, ET=Estonian, DE=German, NL=Dutch, PT=Portuguese, ES=Spanish, SV=Swedish, RU=Russian, FI=Finnish, IT=Italian, ZZ=Unknown. The codes for the contributing herbaria are listed in Table
An example of a specimen label used in named entity recognition. The output of the process is presented in Fig.
Gold standard versus NER output of the label in Fig.
Contributing institutions and their codes from
Institution | Index Herbariorum Code | ICEDIG Partner |
---|---|---|
Naturalis Biodiversity Center, Leiden, Netherlands | L | Yes |
Meise Botanic Garden, Meise, Belgium | BR | Yes |
University of Tartu, Tartu, Estonia | TU | Yes |
The Natural History Museum, London, United Kingdom | BM | Yes |
Muséum national d'Histoire naturelle (MNHN), Paris, France | P | Yes |
Royal Botanic Gardens, Kew (RBGK), Richmond, United Kingdom | K | Yes |
Finnish Museum of Natural History, Helsinki, Finland | H | Yes |
Botanic Garden and Botanical Museum, Berlin, Germany | B | No |
Royal Botanic Garden Edinburgh, United Kingdom | E | No |
A summary of specimen properties. The Names and Index Herbariorum codes for the contributing herbaria are listed in Table
BR | 47 | 49.0% |
H | 77 | 21.3% |
P | 45 | 42.3% |
L | 64 | 22.0% |
BM | 59 | 32.8% |
B | 61 | 50.1% |
E | 54 | 68.0% |
K | 79 | 17.8% |
TU | 26 | 62.2% |
Comparison of selected OCR software tools.
Software | First released | Version | Licence | Windows | macOS | Linux |
Tesseract | 1985 | 4.0.0 | Apache | Windows 10 | Mac OS X | Ubuntu 18.04, 18.10 |
ABBYY FineReader Engine | 1989 | 12.0 | Proprietary | Windows 10, 8.1, 8, 7-SP1 | Mac OS X 10.12.x, 10.13.x | Ubuntu 17.10, 16.04.1, 14.04.5 |
Microsoft OneNote | 2012 | 17.10325.20049 | Proprietary | Windows 10, 8.1 | Mac OS X, 10.12 or later | Ubuntu 18.04, 18.10 |
Processing times for OCR programs using whole images and segments.
Whole images (h:mm:ss) | Segments (h:mm:ss) | Difference | Change |
01:06:05 | 00:45:02 | -00:21:03 | -31.9% |
00:50:02 | 00:23:17 | -00:26:45 | -53.5% |
01:18:15 | 00:29:24 | -00:48:51 | -62.4% |
Line correctness for OCR using whole images and their segments.
Whole images (%) | Segments (%) | Difference |
72.8 | 75.2 | +2.4 |
44.1 | 63.7 | +19.6 |
61.0 | 77.3 | +16.3 |
78.9 | 65.5 | -13.4 |
Language identification software tools and their properties.
Tool | Licence | Developer |
langid.py | Open Source | University of Melbourne |
langdetect | Apache License Version 2.0 | N/A |
language-detection | Apache License Version 2.0 | Cybozu Labs, Inc. |
Example of langid.py usage with fragments of OCR text. Output lines denote the language identified in the input text and the probability estimate for the language.
Confusion matrix for predicted and actual labels.
| Predicted negative | Predicted positive |
Actual negative | True Negative | False Positive |
Actual positive | False Negative | True Positive |
NER performance on OCR text retrieved from whole images.
Precision | 0.81 | 0.38 | 0.69 |
Recall | 0.71 | 0.21 | 0.53 |
F1 score | 0.76 | 0.27 | 0.60 |
NER performance on OCR text retrieved from image segments.
Precision | 0.85 | 0.43 | 0.74 |
Recall | 0.74 | 0.50 | 0.69 |
F1 score | 0.79 | 0.46 | 0.71 |
A summary of recommendations.
Task | Recommendation | Rationale |
Optical Character Recognition | Tesseract 4.0.0 | Robust with respect to image segmentation |
Handwritten Text Recognition | Google Cloud Vision | Supports 56 languages |
Language identification | langid.py | Supports 97 languages |
Named Entity Recognition | Stanford NER | A wide variety of entities recognised including location, organisation, date, time, and person |
Terminology extraction | FlexiTerm | Robust with respect to orthographic variations (such as those introduced by OCR) |
Appendices
text, images
For the sake of brevity the Appendices can be found in this supplementary document. The document contains the following principal information concerning the Digitisation Experiments:
- OCR Software Settings
- OCR Line Correctness Analysis Data
- NER Analysis Data
- Non-standard Terminology Extraction Analysis Data
File: oo_425114.pdf