Research Ideas and Outcomes :
Research Article
Corresponding author: Karen M Thompson (karen.thompson@unimelb.edu.au)
Academic editor: Editorial Secretary
Received: 24 May 2023 | Accepted: 18 Jul 2023 | Published: 18 Aug 2023
© 2023 Karen Thompson, Joanne Birch
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Thompson KM, Birch JL (2023) Mapping the Digitisation Workflow in a University Herbarium. Research Ideas and Outcomes 9: e106883. https://doi.org/10.3897/rio.9.e106883
Specimens or objects in natural history collections hold substantial research and cultural value that is enhanced where these items are made digitally available. Benefits of digitisation include increasing open access to collection-based biodiversity data, increasing productivity of scientific research, enabling novel research applications of digitally accessible data, reducing preservation requirements through reduced object handling, and expanding potential for “remote curation” in collections. However, the time available for object and data digitisation is limited for most collections. Well documented digitisation workflows can ensure that curation time is efficiently applied to achieve digitisation outputs, and that digitisation standards are consistently applied within and among projects.
While this case study focused on the generation of digitisation workflows in a medium-sized Australian university-based herbarium, the findings of this study are relevant to collections globally. The curation workflows comprise a set of modular steps required for the digitisation of herbarium specimen data and images. Steps are clearly identified as requiring human-mediation versus those that can be automated, those that require on-site versus remote-access, and those that require transfer or transformation of data or files. This clarity enables consideration of the opportunities and challenges for increasing efficiencies for collection-based digitisation, data and file management. The maps provide a contextual framework for herbarium-based digitisation pathways for those who work with specimen-derived biodiversity data, and an insight into these tools for those who are not familiar with herbarium protocols.
Keywords: collection management, curation, digital extended specimen, digital imaging, digitisation, herbarium, workflow
The key arguments for effective digitisation of herbarium specimen sheets are the same as those for all natural history and cultural material collections – specimens or objects can provide greater research and social value, while their physical integrity is better protected for future applications, if they are readily available in digital formats (e.g.,
Digital resources increase and enable open access to biodiversity data as per the FAIR principles (ensuring data are Findable, Accessible, Interoperable, and Reusable;
While the benefits of digitisation are widely recognised, the costs of digitisation in terms of labour and resources are considerable. In almost all collections resource availability for digitisation, and therefore digitisation effort, can ebb and flow; priorities follow funds, staffing levels can be variable, and momentum for digitisation projects may be intermittent. Digitisation standards must remain high and be consistently applied within and among projects. This requires that protocols are well documented, and that staff, despite turnover, are well-trained and consistently apply established curation and digitisation protocols. Small or medium size collections are often heavily reliant on a volunteer workforce and may integrate both in-house and outsourced digitisation initiatives, necessitating data and imaging transfer and file format compatibility across software. Digitisation workflows must be flexible and adaptable (without compromising quality), for those workflows are regularly revised and further optimised as obstacles arise and are mitigated, and as best practices evolve. These apparently conflicting requirements are more effectively achieved when digitisation workflows are well documented, contextualised, and understood.
In this paper we share the output of mapping the digitisation workflows at the University of Melbourne Herbarium (IH herbarium code: MELU). Of particular interest to us were identifying impediments to workflow efficiencies, situating this workflow in relation to other workflow descriptions in the literature, and understanding the extent to which image and data collection relies on the physical involvement of a human. For MELU, mapping the curation workflow for digitisation was done in part to streamline the digitisation workflow, identify bottlenecks, and identify risk points in the data management pipeline for future attention and mitigation. Our intent in sharing this workflow is to contribute the real-life experience of a medium-sized collection to the literature, so other small and medium-sized herbaria may use this as a reference for reviewing or designing their own workflows. Such maps also act as a communication tool for securing resources to enable digitisation work. As
In the Background section we explore the last decade of workflows in the literature. We then introduce the MELU case study, describe the methodology used to build the workflow maps, and present and discuss the workflow diagrams. In the Discussion section we identify and discuss the similarities of the MELU workflow to others in the literature and the contributions of these streams to accurately representing the complexity of specimen digitisation. Finally, we consider the resources and technologies that are required to meet the increasing bioinformatic challenges associated with curation of specimen-associated digital objects and data.
A large proportion of specimens held in herbaria are dried and pressed plant samples, secured to archival card with labels attached, and stamps or handwriting present – the whole object will be referred to here as a ‘specimen sheet’. The majority of specimens are sufficiently two-dimensional that they can be photographed at a single focal depth. A smaller number of specimens include large, three-dimensional structures, e.g., storage roots, succulent stems or leaves, infructescences, or fruit that are not rendered two-dimensional during pressing. Digitisation of these three-dimensional structures, either attached to specimen sheets or held in separate carpological collections, requires the production of multiple images across a range of focal depths that are then combined to generate a single digital image (examples from MELU in Fig.
High-resolution images of a Banksia canei specimen (MELUD121102a) from the University of Melbourne Herbarium (MELU), (online.herbarium.unimelb.edu.au/collectionobject/MELUD121102a). © University of Melbourne, 2023.
The definition of ‘digitisation’ has shifted slightly over time. For clarity, here it is used to refer to:
The collection of digital representations of the physical specimen may be referred to as a "digital specimen" (
A workflow can be thought of as a chain of “atomised and executable components with the relationships between them to clearly define a control flow and a data flow” (
Digitization workflows span across human mediated processes through data and computationally intensive automation where software tools and services are the actors and intersect field collection techniques, institutional accession policy, differences in curatorial practice among domains, and involvement of the general public in crowd-sourced methods. (
In their report for the Australian Museum,
Two years later, in their conference paper
In the same year,
By 2012, digitisation within collections was sufficiently established that the ALA published the Digitisation of Heritage Materials guidance (
A special edition of ZooKeys, with twelve papers, was published in July 2012 (eds. Vladimir Blagoderov and Vincent Smith): ‘No specimen left behind: mass digitization of natural history collections.’
ordered into three workflows (Fig.
In the 2013 paper by
The
The authors also note that “the vast majority of institutions included a full data capture step within their digitisation workflows” (p. 13), and that “the majority of institutions are still capturing full specimen metadata prior to the imaging step in their main digitisation workflows” (p. 18).
The authors caution that “broad disparities in digitization starting points, institutional infrastructure, curatorial practices, and precise digitization tasks among and within these groups focused on different taxa make the development of a single, consensus object-to-digitized-content workflow impractical” (p. 2).
Around this time the literature appears to shift from the workflows – settling on the object-data-image (Fig.
While the excellent paper of
With a few notable exceptions, most digitisation workflows available in the literature are generalised, and understandably so, for this facilitates their uptake and adaptation. Surveys or applications of workflows tend to focus on large institutions or conglomerates; and large-scale processes for flat sheet herbarium specimens appear to have converged on conveyor belt systems with manual transcription, such as that used by the National Herbarium of New South Wales in Australia which incorporates Picturae (digitisation) and Alembo (transcription) (
In this paper we present the detailed workflow paths for digitisation of the MELU collection, as a real-life case study and contribution to the literature for medium-sized institutions. Though, echoing
There is no best approach for digitizing herbaria; there are multiple effective approaches. The needs and resources of large research herbaria with multiple type specimens and collections from many countries and multiple centuries differ from those of small herbaria serving a forest district or a teaching institution. … Adopting theoretically suboptimal procedures for digitization may be the best procedure if the resources needed for adopting a better procedure are not available.
Established in 1926, the University of Melbourne Herbarium (MELU) is the largest university herbarium in Australia, with an estimated 150,000 specimens. Taxonomic diversity spans plants and fungi, as well as historic botanical objects and artwork. MELU is a research and teaching collection, and the collection’s strengths reflect University of Melbourne academic expertise and teaching activities. Digitisation efforts at MELU commenced in 2003 with the establishment of a FileMakerPro (claris.com/filemaker) database that was accessible online (N. Middleton, pers. comm.) and ramped up significantly from 2014 with the transfer of these data into the Specify collection management system (Specify Collections Consortium, Lawrence, KS; specifysoftware.org) and subsequent digitisation efforts (G. Brown, pers. comm.). In 2012, the equipment and software for the generation of high-resolution specimen images and standard protocols for image production were provided to MELU through the JSTOR Global Plants Initiative, which enabled the generation of high-resolution digital images. In 2020, MELU transitioned from a local networked collection management system (CMS) accessible on-site in the Herbarium on the Parkville campus to a CMS hosted on a virtual machine which enabled access on-site or remotely.
Digitisation rates at Australian herbaria are high, partially as a result of digitisation efforts concentrating on Australian specimens during the 2000s to support the development of what is now the AVH. AVH was created in 2001 (
The University of Melbourne Herbarium Collection Online (online.herbarium.unimelb.edu.au) was created in 2018, recognising the previously untapped potential for increased access to and engagement with high-resolution specimen images, including to enable data reuse. Specimen data can be searched or browsed, georeferenced specimens are mapped, and plant features or the collector’s handwriting are visible in the high-resolution images. The Collection Online links directly to the Specify CMS to provide real-time access to MELU data and to enable viewing and downloading of the full-size high-resolution images (ca. 250 MB per image). The Collection Online has been pivotal for expanding access to the collection, with user statistics documenting consistent national and global use of this resource. MELU also provides all digitised material (data and specimen images) to the ALA – “a collaborative, digital, open infrastructure that pulls together Australian biodiversity data from multiple sources, making it accessible and reusable” (
The digitisation protocols employed at MELU have evolved over the 20+ year history of the endeavour. For data transcription, protocols follow the standards developed by Biodiversity Information Standards (TDWG; tdwg.org) and Darwin Core (DwC; dwc.tdwg.org). For production of high-resolution images, MELU images (refer to Fig.
MELU specimen sheets include the unique catalogue number of the format "MELU" followed by a letter, seven digits, and a single letter, e.g., MELUD121102a. In line with the teaching remit of the University, MELU has a volunteer program that provides training in curation protocols and management of research associated with biodiversity specimens to approximately 25 volunteers annually. Student volunteers are significant contributors to MELU digitisation efforts, which means that delegated processes must be carefully documented and detailed to ensure consistency in execution.
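Because catalogue numbers are transcribed by many volunteers, a simple format check at data entry may help catch typing errors early. The following is a minimal sketch only, not part of the MELU protocols. The pattern is an assumption built from the description above: the text gives seven digits while the example MELUD121102a shows six, so the sketch accepts either, and it assumes an uppercase first letter and a lowercase final letter as in the example. The names `CATALOGUE_PATTERN` and `is_valid_catalogue_number` are hypothetical.

```python
import re

# Assumed format: "MELU", one uppercase letter, six or seven digits,
# then one lowercase letter (e.g. MELUD121102a). The digit count and the
# letter case are assumptions, not a documented MELU rule.
CATALOGUE_PATTERN = re.compile(r"MELU[A-Z]\d{6,7}[a-z]")


def is_valid_catalogue_number(value: str) -> bool:
    """Return True if the whole string matches the assumed catalogue format."""
    # Surrounding whitespace is ignored, since it is a common entry artefact.
    return CATALOGUE_PATTERN.fullmatch(value.strip()) is not None
```

A check of this kind could be run over a transcription spreadsheet before upload to the CMS, flagging rows for human review rather than rejecting them outright.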
The workflow maps described in this paper were developed as an element of a collaborative project between MELU and research data specialists from the Melbourne Data Analytics Platform (MDAP) at the University of Melbourne. The initial intent for the mapping was to enable the MDAP team to understand the ecosystem within which a specific investigation (into possible methods for machine-reading specimen sheet label data) was situated. Understanding the connections to other elements is critical, especially when focussing on a singular ‘module’ of a digitisation workflow. Taking time to consider the broader context early in the process encourages forward-thinking, avoids developing the work in a direction that may limit future usefulness, and facilitates identification of potential extensions or reuses of components.
The suite of workflow maps detailed in later sections was the outcome of many conversations, over some months, between the MELU curator and a data specialist with limited herbarium domain knowledge. This was an exercise in trans-disciplinary collaboration, and the utility of the workflow depended on allowing time to develop a shared vocabulary. A key value of a non-botanist taking responsibility for drawing the workflow was that they asked questions to elucidate knowledge that could easily be presumed or remain within the mind of the expert.
The workflows were constructed initially as one large comprehensive map of the multiple curation pathways. It was built ‘naively’ – that is, no predetermined workflow was used as scaffolding or framing, but instead, the tasks undertaken within the digitisation process at MELU were discussed one-by-one and connections made between them. These tasks were then bundled into ‘modules’, based on natural break points, when the process could be paused without detriment. In this way, the ‘outline’ map was created. This mirrors the ‘grounded theory research methodology’ employed by
Verbal information about herbarium processes was translated into a diagram by the data specialist, and as that diagram iterated over many conversations it became a tool of discovery and mutual communication. As the team were working remotely, the communication tools and diagrams necessarily took a digital format. In retrospect, this work also incorporated ‘visual thinking’ methodology, which “rests on the intertwined relation between visual perception and cognition” (
Creating the detailed workflow maps for MELU met the original intent of situating a specific task into the broader herbarium landscape. It also led to other positive outcomes, including:
MELU digitisation practices currently follow three streams:
In drawing out the maps of each of the above pathways, MELU Digitisation Workflows are represented in several ways:
Legend of shapes used in MELU workflow detailed maps. Green elements (hexagons and upside-down wedge shapes) require human physical actions; yellow (the four shapes in the bottom line of the figure) are technology-driven elements; other colours and shapes allow for easy identification in the workflows.
These workflow maps may give the impression they are set in stone, but of course they are representations of evolving processes. Nor should they give the sense that digitisation is a one-off task – e.g., any time the nomenclature of a taxon is updated, if the identity of the specimen changes, or if the specimen requires conservation after damage, then the digital data record needs to be updated (with QA) and/or new images taken (processed, uploaded with QA, and archived). In maintaining accurate collection data, it is essential to maintain version control records to prevent divergence of data on the physical specimens and in the CMS.
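One way to picture the version control records mentioned above is an append-only revision history kept alongside each data record. The sketch below is illustrative only and does not describe how Specify or MELU store changes. The class names `Revision` and `SpecimenRecord`, their fields, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Revision:
    """One recorded change to a single field of a specimen data record."""
    changed_on: date
    field_name: str
    old_value: str
    new_value: str
    reason: str


@dataclass
class SpecimenRecord:
    """Hypothetical data record that keeps its own revision history."""
    catalogue_number: str
    data: dict[str, str]
    history: list[Revision] = field(default_factory=list)

    def update(self, field_name: str, new_value: str,
               reason: str, changed_on: date) -> None:
        """Change a field and log the old and new values with a reason."""
        old_value = self.data.get(field_name, "")
        if old_value == new_value:
            return  # nothing changed, so no revision is recorded
        self.history.append(
            Revision(changed_on, field_name, old_value, new_value, reason)
        )
        self.data[field_name] = new_value


# Example with placeholder values (not real specimen data).
record = SpecimenRecord("EXAMPLE0001", {"scientificName": "Example name A"})
record.update("scientificName", "Example name B",
              reason="redetermination label added", changed_on=date(2023, 1, 1))
```

Keeping the reason with each revision links the digital change back to the physical event on the sheet, such as an added annotation label.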
The MELU digitisation workflow outline (Fig.
The details for Stream 1 digitisation workflow are included below (Fig.
Next, (1B; Fig.
There may be a gap in time between the data collection in Stream 1 and taking the high-resolution image/s (section H; Fig.
The initial task (2A) in Stream 2 (Fig.
Next, (2B; Fig.
The next steps are to set up for and engage in the manual transcription of the data from the specimen images (modules 2D and 2E, Fig.
The penultimate task in Stream 2 is 2F (Fig.
At the end of Stream 2, all collection data are uploaded into Specify CMS. The lower-resolution image/s are retained in collection records but are not uploaded into the CMS and are not publicly shared. In this way, this stream does not always immediately complete the entire digitisation workflow, as high-resolution images may not have yet been generated. The final decision regarding generation of high-resolution images suitable for online sharing is made based on collection curation priorities and staff/volunteer availability.
Increasingly, data is entering herbarium databases soon after collection via digital records kept by the collector. Stream 3 (Fig.
Many digitisation workflow diagrams observed in the literature do not explicitly distinguish manual or human-mediated versus automated or scripted workflow steps. The MELU outline (and subsequent detailed maps) makes explicit the human-mediated steps in digitisation workflow, particularly the regular and iterative handling of the physical botanical specimens. In the outline map (Fig.
Specimen handling events in the digitisation workflow include tasks such as selection and collation of specimens for digitisation, generation of lower- or high-resolution images, affixing labels, and refiling specimens. The placement of these human-mediated steps, which are inferred but rarely annotated as such in workflow landscapes, has significant implications for efficiency in the digitisation workflows. Specimen handling tasks are typically labour-intensive steps and many, such as specimen selection and refiling specimens into the collection, cannot be eliminated. However, reducing the number of times the specimens are handled during the digitisation curation workflow, for example by reducing the requirement for (re-)sorting or (re-)filing specimens, introduces efficiencies and time savings to the overall workflow. Good examples of the time savings offered by reductions in specimen handling are the transition from a data-to-image to an image-to-data workflow and reliance on lower-resolution images, rather than high-resolution images, as a source for specimen transcription. The time requirement for generation of lower-resolution images is significantly less than that required for the generation of high-resolution images. Generation of digital images, albeit lower-resolution images, early in the digitisation workflow enables sorting and searching of digital images, which can result in a significant time saving over sorting and searching physical specimens. Generation of high-resolution specimen images is typically still a desired component of the digitisation workflow. Where lower-resolution specimen images are used for data curation, decisions regarding the allocation of resources and time to the generation of high-resolution specimen images can follow collection imaging priorities rather than data capture priorities.
Where curation and digitisation resources are finite, as is the case in all collections, such efficiencies in the workflow can release staff time for other essential curation tasks.
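The effect of annotating human-mediated steps can be sketched by treating a workflow as a list of tasks, each flagged for whether a person physically handles the specimen, and then counting the handling events. The two task lists below are hypothetical simplifications written for this illustration. They are not the MELU workflow maps, and the flags assigned to each task are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A workflow step, annotated with handling and automation flags."""
    name: str
    handles_specimen: bool      # a person physically handles the sheet
    automatable: bool = False   # the step could plausibly be scripted


def handling_events(workflow: list[Task]) -> int:
    """Count the steps in which the physical specimen is handled."""
    return sum(1 for task in workflow if task.handles_specimen)


def automatable_tasks(workflow: list[Task]) -> list[str]:
    """List the names of steps flagged as candidates for automation."""
    return [task.name for task in workflow if task.automatable]


# Hypothetical data-to-image ordering: transcribe from the sheet, image later.
DATA_TO_IMAGE = [
    Task("select and collate specimens", True),
    Task("transcribe data from sheet", True),
    Task("refile specimens", True),
    Task("retrieve specimens for imaging", True),
    Task("generate high-resolution image", True),
    Task("refile specimens", True),
    Task("upload images to CMS", False, automatable=True),
]

# Hypothetical image-to-data ordering: image first, transcribe from the image.
IMAGE_TO_DATA = [
    Task("select and collate specimens", True),
    Task("generate lower-resolution image", True),
    Task("refile specimens", True),
    Task("transcribe data from image", False),
    Task("upload data to CMS", False, automatable=True),
]

print(handling_events(DATA_TO_IMAGE), handling_events(IMAGE_TO_DATA))
```

In this illustrative pair the image-first ordering involves fewer handling events, which mirrors the argument above. Real counts depend on the tasks and flags recorded for a given collection.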
The coronavirus (COVID) pandemic, during 2020 and 2021, provided the impetus for a pivot to 'remote' curation and taxonomic work (e.g.,
Any diagram, by its nature, is an abstraction of reality and may appear to imply that work simply flows from one task to the next. Representations necessarily omit detail and seem to suggest that connections are seamless. But it can be these very transitions, between modules and between tasks within modules, that can be the most difficult part of a digitisation project, for they often involve data format transformations, transfers between storage locations or software, which are time-consuming and may be points of highest risk for data loss. Additionally, workflows that require multiple format conversions between input and output data files are often not very resilient to workflow adjustments, which can limit the ease of maintenance and evolution of these workflows over time (
Detailed workflow maps permit the inspection of the technology required for each digitisation task and, subsequently, requirements for data transfer among software and storage or archival location/s. For example, the complexity of the CMS infrastructure, the software components, and the resulting curation steps involved in the MELU workflow around tasks H1 and H2 are detailed in Fig.
What is evident from these landscape maps is the complexity of data handling requirements for all digitisation workflows. Comprehensively mapped workflows, as provided here, clearly illustrate the complexity and labour-intensity of managing not only the collection objects and their primary data, but also any derived objects and metadata, while also maintaining the links among these entities (e.g., “digital-extended specimens” (
While this case-study has focused on digitisation of specimens and their primary data, well-documented digitisation landscapes such as those presented here can provide the necessary framework for subsequent mapping of workflow/s for digitisation of derived specimen objects and for in-house curation of specimen-derived research-associated data that are typically provided by researchers to third-party global repositories (e.g., GenBank, MorphoBank).
Data management represents an increasingly labour-intensive task for curators, which is a challenge for all collections, and in particular for small and medium size collections with limited curation staff. Mapping what currently remains a predominantly manual workflow enabled identification of the steps that hold potential for automation (e.g., H1; the scripted upload of images into the CMS and into the image storage archive). The increased availability of openly available workflows and software architectures with standardised interfaces that meet the information technology and archive requirements of natural history collections and that are customisable to meet the diverse needs of collections (e.g., Kurator,
For the time being MELU still relies on manual transcription of data from specimen sheets. As has already been noted, even the largest organisations (with arguably more funding) also appear to continue to invest in manual transcription. The ongoing engagement of citizen scientists for ‘remote’ elements of Stream 2 has been important for expanding opportunities for ongoing digitisation outputs at MELU. This approach is by no means intended as a replacement for mass-digitisation pathways seen in large institutions, but it is a ‘lightweight’ approach to transcribing from an image and making progress toward digitisation goals. It is suitable for small and medium collections because of this simplicity, and it is cost-effective to apply with minimal tool or technology changes. MELU is currently exploring what efficiencies may be introduced via machine-learning and -reading, believing that “information extraction from specimen labels [is] among the digitization workflow activities which can benefit from greater automation” (
Integration of the archival requirements of the vast amounts of data and digital files that are the result of digitisation efforts is also necessary to ensure these resources are curated and accessible across the research data life cycle. While current efforts continue to focus on the generation of the first set of digital files associated with physical specimens, ongoing study of those physical specimens may require the addition of an annotation label, for example, to denote a change in the taxonomic identity of the specimen or the sampling and removal of material from the specimen. Version control of images and data becomes even more complex when both specimen data and images are shared with global repositories. New functionality will be required to enable curation and version control of digital extended specimens given the dispersion of objects and their data into multiple databases and repositories and ensure that current and consistent versions of those objects are accessible for curation and use globally (
Time for object and data curation is a precious commodity in all natural history and cultural collections. These digitisation workflows have contributed to ensuring the efficient use of curation time to achieve digitisation outputs and that digitisation standards are consistently applied within and among projects. The time taken to create these workflow maps was substantial, and admittedly more than anticipated at the outset, in part because visualising the pathways was more complex than was initially appreciated. However, the time invested has been worthwhile; the maps have already contributed significant value to MELU collections. Curation pathways have been optimised as a result of the work required to construct and visualise the documented workflows. Workflow construction provided opportunities for comparison of specimen curation steps among digitisation pathways, which facilitated recognition of the similarities and resulting modularity of these workflows. Significantly, these pathways no longer exist only in the mind of one or two experts and are instead visually available for reference, consideration, and improvement by curation team members. Finally, these workflows have been, and will continue to be, effective tools for communication with stakeholders outside the herbarium. They have illustrated the contextual framework of curation workflows and tasks necessary for collaborations with research data specialists and computer programmers working on tool development based on MELU collection-based digital resources, including scripted access for the extraction, analysis, and provision of MELU digital resources and data. Additional infrastructure is required, particularly for small- to medium-size collections, to meet the increasing demands for high-quality collection-associated biodiversity data. We hope these workflows are useful for other herbaria, for comparison, or to serve as a launching point for further workflow optimisation or development.
The authors acknowledge Melbourne Data Analytics Platform (MDAP) colleagues also involved in the MELU-MDAP collaboration project: Emily Fitzgerald, Robert Turnbull, Simon Mutch, Noel Faux, Bobbie Shaban; School of BioSciences colleagues: Heroen Verbruggen and Andrew Drinnan; Royal Botanic Gardens colleague Niels Klazenga; and MELU staff member Aiden Webb. The authors acknowledge the University of Melbourne Botany Foundation and the Russell and Mab Grimwade Miegunyah Fund for their financial support for digitisation in the University of Melbourne Herbarium.