Technical capacities of digitisation centres within ICEDIG participating institutions

DiSSCo, the Distributed System of Scientific Collections, is seeking to centralise certain infrastructure and activities relating to the digitisation of natural science collections. Deciding what activities to distribute, what to centralise, and what geographic level of aggregation (e.g. regional, national or pan-European) is most appropriate for each task was one of the challenges set out within the EC-funded ICEDIG project. In this paper we present the results of a survey of several European collections to establish current digitisation capacity, strengths and skills associated with existing digitisation infrastructure. Our results indicate that most of the institutions surveyed are engaged in large-scale digitisation of collections and that this is usually being undertaken by dedicated teams of digitisers within each institution. Some cross-institutional collaboration is happening, but this is still the exception for a variety of funding-related and practical reasons. These results inform future work that establishes a set of principles to determine how digitisation infrastructure might be most efficiently organised across European organisations in order to maximise progress on the digitisation of the estimated 1.5 billion specimens held within European natural science collections.


Introduction
This report summarises the technical capacities of digitisation centres within collection-holding institutions in the ICEDIG project. The longer term aim is to combine these data with previously completed work on policy to produce underlying best practice policy for digitisation for the Distributed System of Scientific Collections (DiSSCo) Research Infrastructure (RI) for natural science collections.
The survey is intended to provide information that will help to demonstrate:
• Specific areas of excellence within each institution.
• Which collections have been prioritised for digitisation and why.
• The resources that are available for digitisation of various collections across ICEDIG partner institutions.
A previous report (MS43) provided a summary of common policy elements within ICEDIG partner institutions, creating a dashboard of results that is available on the ICEDIG website (ICEDIG Project 2020). It highlighted a number of needs:
• Proposals on how to introduce and streamline relevant policy towards a common research agenda.
• Establishing a knowledge base on how policies are organised within collection-holding institutions.
• Establishing where the responsibility lies in ensuring the correct policies are both in place and adhered to.
• Establishing who has authority over enabling policy change within collection-holding institutions.
These points will need to be considered later when combining work on the technical capacity of digitisation centres, with establishing best practice policy for digitisation.
This report aims to identify strengths and capacity for digitisation within collection-holding institutions in the ICEDIG Project. We debated which methods were most appropriate to gather this information, as survey fatigue is potentially an issue across the ICEDIG partners. However, a survey tends to be a quick and potentially simple way of gathering information from institutions across Europe. For this reason, we decided to create a survey whilst bearing in mind the potential disadvantages of this technique: for example, questions and answers can be misinterpreted, and survey answers may lack detail or clarity.
All links referenced in this report were archived using the Internet Archive's Wayback Machine save page service on 02-07-2020.

Project Context
This project report was written as a formal Milestone (MS44) as part of Task 7.2 of the ICEDIG Project. It was previously made available to project partners and submitted to the European Commission as a report on 31 March 2019. While the differences between these versions are minor, the authors consider this the definitive version of the report.
The original data were collected at the beginning of 2019 and do not necessarily reflect the current capacity of any institute included in the report.

Methodology
Due to the low number of institutions taking part in this task we combined the survey technique of data collection with an interview technique. We sent out a survey to six institutions:
1. Botanic Garden Meise (APM)
2. The Finnish Museum of Natural History (LUOMUS)
3. The Natural History Museum, London (NHM)
We refer to each institute using the abbreviated name given in brackets.
In addition, there have been reports and surveys completed previously with some information on technical capacity. Where possible, we used this existing information to make a first attempt to complete relevant parts of the survey. The initial approach taken for this analysis was the extraction of digital components from institutional strategy documents. Information on institutional priorities and digitisation capacity was extremely limited from these sources. The next focus was searching for relevant information in other recent EC project documents such as the SYNTHESYS+ proposal (Smith et al. 2019). However, this information was not provided specifically for this survey, and so was not suitably structured to answer many of our questions. To fill in the remaining gaps we emailed the survey to each institution, and provided the option of having a meeting to either go through the survey together or answer any remaining questions.

Survey Design
Interpretation can be a challenging topic when discussing digitisation and what is meant by fully digitising an object. For the purpose of this report and to create clarity, we decided to use the DiSSCo survey definitions for digitisation.

Results
Each of the seven collection-holding institutions amongst the ICEDIG partners completed the survey, provided via a link to the appropriate Google Sheet. While a meeting or phone call was preferred to support the process, this was not always possible due to a combination of time constraints and the logistics of gathering the relevant data. We recognise that some of the information available within institutions may not have been included, because members of staff were absent or the information could not be found within the time allotted for completing the survey. All institutions were extremely helpful, replying in a timely manner within the deadline and completing as much of the survey as possible.

Current Status
The 'Current Status' section of the survey was pre-populated as much as possible with data from the DiSSCo survey. Each institution was asked to update the current status of digitised collections, if new data was available. This section was divided into three parts (along with the percentages for total collections):
• Number of specimens catalogued (i.e. records exist on in-house collection management system).
• Number of specimens digitised (i.e. specimen information in collection management system with partly or fully transcribed labels).
• Number of specimens fully digitised (i.e. specimen information in collection management system, with fully transcribed labels and images).
As previously mentioned, we used the DiSSCo definitions for all of the above. This remains a challenging subject, as interpretation of the terms ('catalogued', 'digitised' and 'fully digitised') can vary drastically and this was raised by most institutions.
Current status is summarised in Table 1.
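The three status definitions used in the survey can be read as a nested classification rule. The following Python sketch is illustrative only: the `SpecimenRecord` fields, the `level` and `summarise` functions, and the assumption that the counts are cumulative (a fully digitised specimen also counts as digitised and catalogued) are hypothetical and are not taken from the DiSSCo survey or from any institution's collection management system.

```python
from dataclasses import dataclass


@dataclass
class SpecimenRecord:
    """Hypothetical minimal view of one specimen in a collection management system (CMS)."""
    in_cms: bool        # a record exists in the in-house CMS
    transcription: str  # "none", "partial" or "full" label transcription
    has_image: bool     # at least one image is attached to the record


def level(rec: SpecimenRecord) -> int:
    """Return 0 (no record), 1 (catalogued), 2 (digitised) or 3 (fully digitised)."""
    if not rec.in_cms:
        return 0
    if rec.transcription == "full" and rec.has_image:
        return 3
    if rec.transcription in ("partial", "full"):
        return 2
    return 1


def summarise(records, collection_size: int) -> dict:
    """Cumulative count and percentage of the total collection for each status."""
    records = list(records)  # allow generators to be passed in
    names = {1: "catalogued", 2: "digitised", 3: "fully digitised"}
    summary = {}
    for threshold, name in names.items():
        count = sum(1 for rec in records if level(rec) >= threshold)
        summary[name] = (count, 100 * count / collection_size)
    return summary
```

Writing the rule down in this way makes the points of disagreement explicit: an institution that interprets 'digitised' as requiring an image, for example, would change only the condition in `level`, which helps to explain why reported figures are difficult to compare.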

Workflow
In order to understand each institution's digitisation process, we examined workflows to identify the different work streams and tasks involved in digitising collections. We wanted to see how each institution streamlines the process in order to manage the complexities of digitisation.

Where does your digitisation take place?
All responding institutions currently digitise in-house, although APM also employs external contractors to digitise specimens on-site, noting that this makes it easier to accurately track progress in the project and minimise the risk of damage in transporting specimens. Several institutions (Naturalis, MNHN, NHM) have in the past outsourced off-site digitisation of herbarium sheets to external contractors, the former two at scale and the latter as a smaller pilot project.
Table 1 footnotes: * records exist in in-house collection management system. † specimen information in collection management system with partly or fully transcribed labels. ‡ specimen information in collection management system with fully transcribed labels, and images.

What collections are you currently able to digitise?
Each institution has a variety of different collections which in turn creates a variety of expertise when it comes to digitisation. All institutions can digitise herbarium sheets, showing considerable expertise in this area across ICEDIG partner institutions, but also emphasising the relative simplicity of the workflow. Similarly, six out of seven institutions have the facility to digitise microscope slides. MNHN selected all collection types listed in the survey, suggesting capacity to digitise in some form a wide variety of collections. RBGK also included additional collections that they are able to digitise that weren't explicitly separated in the list: 'mycology', 'seed', 'DNA and Tissue Bank' and 'in vitro'. LUOMUS added in the 'other' option the ability to digitise galls with stacking equipment.

What specimen handling techniques are used?
Two institutions use either automated or assisted positioning of specimens. APM use a template that guides technicians on where to position the specimen before imaging. LUOMUS uses automated conveyor belt techniques as well as manual. RBGK stated that it mainly uses manual handling techniques; however, the microscope slide workflows are slightly more automated with the use of a Zeiss Axio Scan. MNHN, NHM and UTARTU all use manual handling techniques.

Do you currently have any high throughput digitisation activities going on?
Five out of seven institutions (APM, Naturalis, RBGK, LUOMUS and NHM) reported that they are currently running high throughput digitisation activities. Naturalis are working on digitising bagged butterfly collections, RBGK are currently digitising 220 herbarium specimens per day and the NHM are currently working on high throughput projects for pinned insects, herbarium sheets and microscope slides. MNHN have finished their latest high throughput project, and have completed digitising their herbarium sheets.
Table 2.
Institutional digitisation capabilities by collection/specimen type.
What formal processes are used to prioritise digitisation? For example, is there a scoring system in place?
In the past, some institutions surveyed had designed formal prioritisation frameworks for collections digitisation. However, in some cases (NHM, APM) those institutions have not persisted with using them for various reasons. Naturalis have adopted a collection evaluation system that was developed by the Smithsonian which provides structured data to underpin the prioritisation process. A variant of this system ('Join the Dots') has been rolled out at the NHM, and is beginning to be incorporated into the prioritisation process.
Institutions which do not currently use a formal scoring process still use various criteria on a less formal basis to help them prioritise which collections to digitise (Table 3). APM are currently working on African and Belgian collections, as their scientists are working on these collections. The Belgian collections were also chosen due to being native, and the African collection due to their colonial past. RBGK align with their science strategy and/or their collections strategy. MNHN prioritise type specimens. In UTARTU it is the curators who make decisions on what to digitise, however they are currently in the process of formalising this procedure. At the NHM projects are assessed on a case by case basis, looking at a set of criteria, with external funding being the most important factor. Research benefits were a consideration for all seven institutions when prioritising collections to digitise, with funding also being a factor for six out of seven institutions. Public engagement was the least popular criterion with only Naturalis taking this into consideration. This was followed by technical innovation (3 of 7 institutions) and curatorial benefit (4 of 7 institutions). RBGK added an additional criterion: the time required for specimen selection, due to how their collections are arranged. They noted that it is prohibitively slow to select material by collector or even country, so taxonomic grouping or large geographical regions are preferred.
Table 3.
Prioritisation criteria for digitisation projects.
What information resources are available to support digitisation activities?
Five out of seven institutions (APM, RBGK, Naturalis, LUOMUS and NHM) provide written guidelines for handling specimens, while only two out of seven (APM and Naturalis) have guidelines on mounting/rehousing specimens. In the latter case, LUOMUS noted that their guidelines were not written down, while NHM commented that mounting and rehousing is commonly carried out by curators rather than digitisers (and where not, digitisers receive specialist training prior to the project). MNHN do not have written guidelines, but note that for both activities any manipulation is carried out by specialised technicians.
Six out of seven institutions have other relevant guidelines for digitisation processes. APM have guidelines for transcription of label information and for in-house imaging. Naturalis have health and safety guidelines. RBGK have data entry and also imaging guidelines that are adapted for each project. LUOMUS have frequently asked questions, and guides to georeferencing and transcription. The NHM have guidelines on recording information and semi-automated ingestion of data into the collection management system.

Do you have mobile digitisation stations?
Only one institution, RBGK, reported having a fully mobile digitisation station. This includes an adjustable desk, Mac, camera with adjustable column and a one sided open box with LED lighting at the top. This mobile station can be moved close to wherever the collections are being digitised. Some institutions have digitisation equipment that could technically be moved if needed, however it would not be an easy process and so remains static.
How do you track movement of specimens to ensure they are returned to the correct location?
A variety of answers were given for this question. At APM, the external company they hire to digitise collections have their own tracking system that uses QR codes. In the APM herbarium storage facility, the cupboards are divided into pigeon holes (64 per cupboard). When a pile of specimens is taken out of a pigeon hole and placed onto a trolley, a sheet is added to the pile with a QR code. An identical sheet with the same QR code is left in the empty pigeon hole.
Naturalis have coded their storage locations. All digitised specimens have this code with their standard location in the registration system, so they are always returned to the correct location after digitisation.
At RBGK, when collecting individual specimens for digitisation, tags are placed in the cupboard where the specimen is stored in place of the specimen so that it can then be placed back in the correct location. There are also collection Excel spreadsheets that are completed by the digitiser. Here the name, region and folder information are recorded. If entire cupboards are being moved, the boxes of specimens are numbered for order. When there are new digitisers working, more qualified staff check that the specimens have been put back in the correct location.
LUOMUS have a drawer number tracking system. Curators at UTARTU, similar to RBGK, place markers on specimens to show where they should be returned. MNHN have a tracking system (details of this system were not provided). The NHM were the only institution not to have a tracking system; however, specimens taken from storage are returned on a regular basis and most collections are indexed based on their taxonomic name or have location labels.
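The paired-marker approach described for APM and RBGK (one marker travels with the specimens, an identical one stays in the empty storage location) amounts to a simple lookup from marker code to location. The sketch below is a minimal illustration under our own assumptions: the `PileTracker` class, its method names and the use of random codes are hypothetical and do not describe the system used by any institution or contractor.

```python
import uuid


class PileTracker:
    """Remember which storage location each checked-out pile of specimens came from."""

    def __init__(self, holes_per_cupboard: int = 64):
        self.holes_per_cupboard = holes_per_cupboard
        self._out = {}  # marker code -> (cupboard, pigeon hole)

    def check_out(self, cupboard: str, pigeon_hole: int) -> str:
        """Register a pile leaving storage and return the code printed on both sheets."""
        if not 1 <= pigeon_hole <= self.holes_per_cupboard:
            raise ValueError(f"pigeon hole {pigeon_hole} does not exist")
        location = (cupboard, pigeon_hole)
        if location in self._out.values():
            raise ValueError(f"{location} is already checked out")
        code = uuid.uuid4().hex  # stands in for the content of the QR code
        self._out[code] = location
        return code

    def check_in(self, code: str):
        """Return the location a pile belongs to and mark it as returned."""
        return self._out.pop(code)  # raises KeyError for an unknown code

    def outstanding(self):
        """List all locations that are still waiting for their specimens."""
        return sorted(self._out.values())
```

In use, scanning the sheet on a returning pile would call `check_in` and display the cupboard and pigeon hole, while `outstanding` gives an overview of material that is still away from storage.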

Which protocols for barcoding are used in specimen digitisation?
Six out of seven institutions used barcodes in some way, with UTARTU being the exception. At APM every specimen (even if there is more than one) on each herbarium sheet gets a barcode, and similarly at Naturalis each digitised object is given a barcode.
At RBGK, most specimens are barcoded. Microscope slides are manually barcoded before digitisation, then the software used during the digitisation process automatically reads the barcode and creates filenames containing these barcodes for images produced. Economic botany and spirits are not barcoded but numbered. Not all new herbarium accessions are barcoded; however, all new fungarium accessions are databased and software is currently being developed to print barcodes on fungarium labels.
LUOMUS use encoded CETAF stable identifiers. MNHN have protocols for barcoding botany, QR codes for entomology and marine invertebrates. NHM have protocols for barcodes including generating and purchasing labels, attaching them to specimens and reading/recording them into the CMS.
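Several of the workflows above use the barcode value to name the resulting image files. As a minimal sketch of that step, assuming the barcode has already been decoded to a string by the imaging software, a filename could be derived as follows. The `image_filename` function, the permitted character set and the `_01` sequence suffix are illustrative assumptions, not the convention of any surveyed institution.

```python
import re

# Characters assumed to be safe in both barcodes and filenames (an assumption).
_SAFE_BARCODE = re.compile(r"^[A-Za-z0-9_-]+$")


def image_filename(barcode: str, sequence: int = 1, ext: str = ".tif") -> str:
    """Build an image filename that contains the specimen barcode.

    `sequence` distinguishes several images of the same specimen.
    """
    barcode = barcode.strip()
    if not _SAFE_BARCODE.match(barcode):
        raise ValueError(f"unexpected characters in barcode: {barcode!r}")
    return f"{barcode}_{sequence:02d}{ext}"
```

Rejecting unexpected characters at this point, rather than silently cleaning them, means that a misread barcode is flagged during imaging instead of producing a file that cannot later be matched to a record.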

Quality Assurance policy/standards in place?
Four out of seven institutions reported that quality assurance (QA) policy or standards were in place on some level. APM were the only institution that did not report any conditions for QA being implemented. Circumstances that affected the degree of QA included the particular demands of the project (Naturalis) and resources available (RBGK), for both images and transcribed data.
RBGK also added that QA guidelines are defined on a project by project basis, and tailored to the size of each project. Checks by digitisers and peer reviews are carried out with an aim to check about 5% of work completed for a project. However, constraints of project demands and funding can again limit the amount of QA activities.
NHM conducts QA on images at a basic level, mostly looking at file names along with some random checks. For transcribed data, controlled lists are generated where possible along with manual checks of the data.
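A spot check of roughly 5% of completed work, as RBGK describe, can be drawn as a simple random sample. The sketch below shows one way of doing this; the `qa_sample` function, its parameters and the choice of simple random sampling without replacement are our assumptions, and the survey did not record how records are actually selected for checking.

```python
import random


def qa_sample(record_ids, fraction: float = 0.05, seed=None) -> list:
    """Pick about `fraction` of the records, at random and without repeats, for QA.

    At least one record is returned for any non-empty input. Passing a `seed`
    makes the selection reproducible, so that a check can be repeated or audited.
    """
    ids = list(record_ids)
    if not ids:
        return []
    k = min(len(ids), max(1, round(len(ids) * fraction)))
    return random.Random(seed).sample(ids, k)
```

For a project of 1,000 records this selects 50 records for review. The fraction could be lowered for very large projects or raised for new digitisers, in line with the project-by-project tailoring described above.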

Image elements included when digitising
Three out of seven institutions include all of the five elements (colour calibration chart, scale bar, labels, barcodes, institute name) when digitising specimens.

How do you track time and costs of digitisation workflows?
Five out of seven institutions reported using some kind of method to track time and cost of collection digitisation. Naturalis have team leaders for each digitisation team, whose responsibility it is to monitor and report results and capacity. At RBGK, digitisers fill out progress reports and flexi timesheets which record each day's activities along with timestamps on data entry, where possible. RBGK are currently using Toggl to categorise activities to see how long each section of the workflow takes. At LUOMUS they track time and cost on mass digitisation lines using logbooks. NHM also manually enter time and cost in spreadsheets, and MNHN complete a monthly evaluation of production by unit which is used in their global annual report.

How can digitised collections be accessed by researchers or the general public?
All institutions have made their digitised collections easily accessible to researchers and the general public. The main institutional or national online collection is listed first, followed by the institutional contributions to other large aggregator sites:
APM
• Institutional Collections: www.botanicalcollections.be
• GBIF: https://www.gbif.org/publisher/061b4f20-f241-11da-a328-b8a03c50a862
• Genesys: https://www.genesys-pgr.org/partners/39331cc7-91d0-4af7-8bb1-46824864c1c8
MNHN have a crowdsourcing programme called Les herbonautes which has been operational since 2013. It was developed for the transcription of herbarium labels and is now also used for paleontology.

Do you support Digitisation on Demand requests?
Six out of seven institutions support Digitisation on Demand requests. At APM, if there is a request for a physical loan and the specimens have not yet been digitised, they are first digitised so that the researcher can select from them for the physical loan. At the RBGK they have a maximum of 10 images per request, and specimens requested are logged through a loan management system. LUOMUS supports Digitisation on Demand but currently has no formal procedures in place. At MNHN Digitisation on Demand requests are processed on a web demand management tool (MNHN collections - Requests) and validation is completed by the collection manager. NHM do not currently support Digitisation on Demand, but are actively working on it.

Do you have one or more specialised digitisation teams and what number of dedicated digitisers do you currently have?
Six out of seven institutions report having a specialist digitisation team. Naturalis have two team leaders who are dedicated digitisers, and NHM has a specialised digitisation team consisting of five personnel. UTARTU do not have a specialised team or dedicated digitisers.
APM have their technicians complete two hours a week (ca. 150 images) of digitisation. They also have 13 dedicated digitisers, including 10 technicians, and three volunteers who each spend roughly three hours imaging per week. MNHN have specialised digitisation teams consisting of a systematic collection management team, and specialised platforms for 3D and CT scans. Within this team there are four personnel linked to 3D platforms and CT scans and 80 technicians in the collection management team.

What are staff trained in?
Staff working on digitising collections are a valuable resource. We wanted to briefly look at the main skill set staff working within digitisation have (see Table 5). The skills of staff will obviously be tailored to what collections institutions hold. The area most covered by training was photography with six out of seven institutions having staff trained in this area. This was followed by scanning with five out of seven institutions having staff trained in this area. Only two out of seven institutions had members of staff trained in machine learning. RBGK added another area not listed in the survey, which was training on R and SQL queries along with how to test newly introduced products and tools, for example a new platform such as Transkribus.
(Hudson et al. 2015). RBGK reported that they have developed all of their collections management systems in-house, but are now looking to combine these and will probably not take the bespoke development approach for this.
Table 7.
Software used as part of digitisation workflows.

Discussion
This survey provides an overview of the digitisation capacities of the participating institutions, and potentially reveals some interesting trends in addition to confirming information that might be well-known anecdotally.
The results reveal that most of the institutions are actively engaged in digitisation at scale, and that there is a strong current tendency towards on-site digitisation, even when employing the services of an external contractor. This suggests that all centres have some suitable space on-site for mass collection digitisation, but further conversations would be needed to assess how much those activities could be scaled up in those locations. It was also notable that outsourcing, both currently and historically, centred almost entirely on the digitisation of herbarium sheets (Le Bras et al. 2017, Anonymous 2016). As one would expect, herbaria are more focused on a smaller number of collection workflows (herbarium sheets and slides), with expertise and equipment more relevant to 2D imaging of flat objects. Although a number of institutions reported the capability of digitising most of the different kinds of collections, any existing and previous high-throughput digitisation projects reported were related to herbarium sheets, microscope slides or pinned (or bagged) insects.
The survey also indicates that centralised, specialist digitisation teams have become the norm, rather than relying purely on ad hoc databasing and imaging by curatorial and research staff. In some cases, this was a fixed, core team, and in others the number of digitisers varied depending on current projects. Barcoding specimens has also become standard practice across the surveyed institutions, at least for centralised, large scale digitisation projects. Tracking locations by electronic methods appears to be lagging behind somewhat, but there are a number of good systems employed within various institutions to act as references for others.
There was in general a reported lack of formal, structured methodologies for prioritising digitisation across the surveyed institutions. Some had attempted these in the past but not managed to embed them into standard practice. However, on an informal basis, many of the same criteria are being used in practice to prioritise projects. This suggests that a common framework, if suitably flexible and pragmatic, could be developed to assist institutions and digitisation centres in this work. Similarly, it's a significant task to write the various documents and guidelines to support digitisation workflows, and the gaps displayed in this data confirm that initiatives to start sharing these resources more widely are well-founded. This extends to better documentation on the software used in digitisation workflows, including the underlying business decisions on whether to develop in-house software or use off-the-shelf or open source solutions. It would be useful for other organisations to have an up-to-date list of recommended software used in digitisation workflows and how it is being used.
The survey data also suggested that quality assurance (QA) policies and standards are commonly lacking, minimal or ad hoc. This is an element of digitisation workflows which appears to be commonly under-resourced and lacking in expert input. Some of these issues are discussed in more detail by  and .
As has been the case for many previous surveys and conversations related to collections digitisation, one of the challenges has been to provide clear definitions for terms such as 'catalogued', 'digitised', and 'mass' or 'high-throughput' digitisation. While we aimed to be consistent as much as possible, there is still some room for interpretation which may have been reflected in some of the responses. However, we expect that wider feedback on the survey results should help to resolve some of these inconsistencies.