Corresponding authors: Alex Hardisty (
Academic editor:
There has been little work to compare and understand the operating costs of digitisation using a standardised approach. This paper discusses a first attempt at gathering digitisation cost information from multiple institutions and analysing the data. This paper has been written: for other digitisation managers who want to break down and compare project costs; as a potential baseline for future digitisation projects; and as a starting point for prioritising research and development to reduce digitisation costs.
This report focuses on analysing the operating costs of digitisation and developing a standardised method for gathering cost information from partners within the Distributed System of Scientific Collections Project (DiSSCo
Each institution was asked to break down their digitisation costs into three categories: capital costs (equipment, cameras, workstations, etc.), fixed costs (space charges, depreciation, fixed-cost staff) and variable costs (labour costs based on time and throughput, consumables). Institutions were also asked to report on the number of staff, their throughput (the number of specimens digitised per month) and the time spent digitising a specimen. Costbooks were grouped according to the type of collection, which included herbarium, fungarium, palaeontological, spirit material, etc. However, some collections, such as vertebrates, only had one reported case while six costbooks were returned for herbarium collections. Thus, costings are reliable for some collection types while other collection types will require further research and confirmation.
Digitisation costs varied according to several different factors. The most dramatic difference was between the cost of digitising different types of collections. Vertebrates and marine invertebrates were shown to be significantly more costly to digitise than herbarium and pinned insects. This may be due to differences in speed and efficiency gains that can be achieved with 2D or flat objects versus 3D objects, but is also indicative of the higher priority given to these collection types and the subsequent improved workflows that have developed over time compared to those collections that are being digitised in smaller numbers.
Cost variances were also reported within the same collection types. Multiple cases were returned for herbarium, pinned insects, microscope slides, palaeontological and fungarium collections but with wide variances in cost in some cases. One institution reported €3.89 PPS (Purchasing Power Standard) per palaeontology item versus another that reported €28.28 PPS. Further data collection for collection types with a wide cost range may result in more normalised data. While the range was not quite as wide for collections that had a larger sample size, some institutions still reported double the cost per item than others.
The major contributor to these cost differences was staffing and labour which proved to be the largest cost component in all cases. However, no distinct correlation was found between the number of staff and the total annual throughput of specimens. An increase in staff numbers did not predict an increase in throughput. The throughput for a staff of one for herbarium and pinned insect collections ranged from approximately 20,000 to 130,000 specimens per year, indicating that the greatest efficiency gains are achieved through improvements to workflow rather than an increase in staff. However, more research is required on why such a wide range in throughput was reported and the specific differences in equipment and workflow that contributed to it.
Considering the complexities of the digitisation process, and its variability among institutions and between different types of collections, we conclude that time spent (and the associated labour costs) is an essential variable that informs cost. While this report should not be considered a forecasting tool for predicting anticipated costs, it does offer insight into which costs should be accounted for and where attention should be focussed to increase throughput and reduce costs.
This is the first attempt to gather and analyse the costs of constructing and operating the digitisation infrastructure of the DiSSCo project as a distributed infrastructure for digitisation. This deliverable report focuses on the operating costs of digitisation and standardising the gathering of cost information from DiSSCo partners.
In this report, we have incorporated the costbook methodology along with the completed institutional costbooks from the collection holding institutes within the ICEDIG Project. We have made a preliminary analysis of the completed costbooks that leads to some observations and recommendations. By harmonising approaches to gather costbook information and reporting gathered costs in terms of the European Union-wide ‘Purchasing Power Standard’ (
This project report was written as a formal Deliverable (D8.2) of the
The following text is the formal task description (Task 8.4) from the ICEDIG project's Description of the Action (workplan):
A variation (narrowing) of the scope of the task description was agreed with the project Coordinator (January 2019), focusing only on the costs of approaches to mass digitisation as practised across multiple museums and avoiding unnecessary overlaps with the work to be done in the DiSSCo Prepare project. This aligns with the objectives of ICEDIG to concentrate on looking at innovations/efficiencies of digitisation, whilst the broader costs of building/operating DiSSCo are better dealt with in the DiSSCo Prepare project; where there is a whole work package (WP4) on financial readiness, including costing of construction and operation. The present task must contribute what DiSSCo Prepare needs for its work on achieving financial readiness.
Following basic cost accounting principles, we identify several components of costs:
While outright purchase of equipment and space is most common, it is sometimes possible to lease assets for a period. The terms of any lease – in particular, whether there is an option to acquire the asset e.g., at the end of the lease – affect whether the cost is treated as capital or as an operating cost.
It can be helpful to consider the
When an additional specimen can be digitised for less than the average cost of all previous digitisations of specimens, economies of scale are being achieved. The aim of introducing automation, for example, is to force the marginal cost below the long-run average cost, so that the latter eventually falls. Conversely, there may be approaches to digitisation – for example, dealing with special requests – where marginal cost is higher than average cost. In this case, a consequence of handling increasing numbers of special requests is potentially higher average costs overall.
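The relationship between marginal and average cost can be illustrated with a small numerical sketch. The figures below are purely hypothetical and are not taken from any costbook, and the function names are our own; the sketch only shows that an item digitised below the current average pulls the average down, while an item digitised above it (such as a special request) pushes the average up.

```python
def average_cost(total_cost: float, items: int) -> float:
    """Average cost per item over everything digitised so far."""
    return total_cost / items


def updated_average(total_cost: float, items: int, marginal_cost: float) -> float:
    """Average cost per item after digitising one more item at `marginal_cost`."""
    return (total_cost + marginal_cost) / (items + 1)


# Hypothetical figures: 10,000 specimens digitised so far for a total of 25,000.
before = average_cost(25_000.0, 10_000)            # current average per item
cheaper = updated_average(25_000.0, 10_000, 1.0)   # marginal cost below the average
dearer = updated_average(25_000.0, 10_000, 12.0)   # marginal cost above the average

print(before, cheaper < before, dearer > before)
```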
Costs of digitisation divide naturally into: i)
Nevertheless, different scenarios of digitisation, largely determined by whether digitisation is carried out in-house or outsourced, and at small versus large scale, lead to different costs.
Currently, most known digitisation initiatives fall into the in-house category, incurring capital costs for establishment and operating costs for running the facility. Some digitisation projects are undertaken on an outsourced/contract basis where a per item or total negotiated price is paid to cover the variable costs of digitisation, recoupment of contractor’s capital and fixed costs and provide a profit margin.
For the purposes of the present task we are mainly interested in the costs of establishing and operating in-house facilities but where possible to collect, it is also interesting to gather costs of outsourcing.
Establishing a digitisation facility largely consists of capital costs, although it can also include other associated costs. Establishing a facility may often be treated as a capital project with a definite beginning and end and can include planning and specifying what is needed, tendering and procurement of equipment and/or services, readying the physical space where the facility is to be located, installation and testing of equipment, and finally, acceptance of the facility. If the intended facility is small, it may be treated as a small non-capital project e.g., the purchase of a single computer and camera as a digitisation workstation. A digitisation facility can be semi-permanent i.e., needed for a substantial time (e.g., several years) as part of a large digitisation programme; or it can be temporary for a specific digitisation project, such as when a specialist company contracts to digitise a specific collection(s) over a short period (e.g., weeks or months).
In many instances, capital and other establishment costs can support more than one digitisation workflow or operation. For instance, a computer, scanner or camera can be used with a variety of different collections. Reaching costs per workflow or per item therefore requires an apportionment by (approximate or actual) time spent using the equipment in different workflows. Any reasonable apportionment that avoids double counting of costs or excessive loading of capital costs in a way that distorts per item costs in a single workflow should be acceptable.
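One possible apportionment, assuming that hours of equipment use per workflow are known or can be estimated, is a simple pro-rata split. The sketch below is illustrative only: the function name, the workflow names and the figures are hypothetical. Because the shares always sum to the original cost, no cost is counted twice.

```python
def apportion_by_time(shared_cost: float, hours_by_workflow: dict[str, float]) -> dict[str, float]:
    """Split a shared cost across workflows in proportion to hours of use."""
    total_hours = sum(hours_by_workflow.values())
    if total_hours <= 0:
        raise ValueError("at least one workflow must record time on the equipment")
    return {
        name: shared_cost * hours / total_hours
        for name, hours in hours_by_workflow.items()
    }


# Hypothetical example: one camera workstation shared by three workflows.
shares = apportion_by_time(
    6_000.0,
    {"herbarium": 900.0, "pinned insects": 450.0, "microscope slides": 150.0},
)
print(shares)
```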
The costs of digitising specimens and collections are operating costs. They must be considered as the result of a sequence of continuous or repetitive operations in a digitisation process that is performed to obtain digital object representations (i.e. digital specimens, labels, and/or collections of specimens like whole drawers, vials or palaeontological slabs) from physical objects, and the metadata that describes the digitisation process. We consider a digital object representation to potentially include transcribed data, analytical data (e.g., chemical, molecular) and data linked from other sources like literature. Cost units, which include components of both fixed costs (including depreciation of capital assets) and variable costs, must be averaged over the number of digital objects produced during the period needed to digitise.
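As a minimal sketch of this averaging, the cost per item for a period can be taken as the sum of fixed costs (including depreciation) and variable costs, divided by the number of digital objects produced in that period. The function name and the figures below are hypothetical and do not reproduce the costbook template's exact calculation.

```python
def cost_per_item(fixed_costs: float, variable_costs: float, objects_digitised: int) -> float:
    """Average cost per digital object over one period (e.g., one year)."""
    if objects_digitised <= 0:
        raise ValueError("objects_digitised must be positive")
    return (fixed_costs + variable_costs) / objects_digitised


# Hypothetical year: 30,000 fixed (incl. depreciation), 45,000 variable, 50,000 objects.
example = cost_per_item(30_000.0, 45_000.0, 50_000)
print(example)
```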
It is clear there cannot be a single, common cost for digitisation. The fundamental differences of approach between digitisation-on-demand, project-driven digitisation and mass digitisation lead to quite different cost models. For a sense of this, just consider the different ways that just-in-time supply chains, cottage industries and automated factories operate. Costs can also vary depending on the level of digitisation desired (i.e., the sophistication: a bare level, a basic level, a regular level, or an extended level digitisation – as suggested by the proposed standard for Minimum Information about a Digital Specimen (MIDS)*
Digitisation occurs in different forms – by single specimen, by sub-part of a collection (e.g., tray of insects) – requiring different handling procedures and different digitisation approaches, according to the type of specimen. Herbarium sheets, which are almost two-dimensional and stored as sheets in folders and boxes, are easily amenable to a high-speed approach involving a flat-bed conveyor and overhead camera. Pinned insects, on the other hand, require more time-consuming mounting procedures and camera shots from multiple angles that are not just overhead. Spirit jars may need to be opened and emptied into a transparent tray and photographed from below, as well as above, before being re-filled and sealed again. Retrieving a specimen from its storage, preparing/mounting it for digitisation, moving it through the process, repacking/preserving, and replacing it in cabinet/storage (i.e., physically accessing and handling the specimen) accounts for almost all the cost of digitisation. Making the image(s) and databasing label information, even with the associated procedures of image processing, transcription and quality control, is often not a substantial time-consuming element of the process and thus not the largest part of the cost. Sometimes, opportunity is taken during digitisation to perform new conservation/preservation measures, such as re-mounting and re-labelling herbarium specimens. Such additional costs can complicate the picture, especially when the procedures are not applied for every specimen.
Digitisation processes can be separated into many discrete tasks performed. This has been shown by the analysis work of
Our five main activities of digitisation for cost gathering purposes are:
Digitising specimens has fixed costs and a variable cost component related to throughput.
Throughput is the amount of digitisation achieved (i.e., the number of specimens or collections digitised) in a given amount of time. It is determined by the maximum capacity (or bandwidth) of a digitisation line and the rate at which digitisation successfully proceeds. When digitisation is proceeding at a rate that exactly matches the bandwidth of the facility, then maximum throughput is achieved. In practice, facilities are seldom fully utilised, and rates of successful digitisation are often lower than the theoretical maximum. This can be due to many factors, including specimens not arriving at the facility fast enough, manual handling difficulties, faulty digitisation requiring rework, insufficient/non-availability of staff, inadequate training, the need for frequent recalibration, and equipment faults and breakdowns.
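The gap between achieved throughput and the capacity of a line can be expressed as a utilisation ratio. The following sketch uses hypothetical monthly figures and a function name of our own; it is not a measure used in the costbook template.

```python
def utilisation(actual_throughput: float, max_capacity: float) -> float:
    """Fraction of a digitisation line's maximum capacity that is actually achieved."""
    if max_capacity <= 0:
        raise ValueError("max_capacity must be positive")
    return actual_throughput / max_capacity


# Hypothetical line: capacity of 5,000 specimens per month, 3,200 actually digitised.
ratio = utilisation(3_200, 5_000)
print(f"{ratio:.0%} of capacity used")
```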
Optimising a digitisation facility to achieve maximum throughput in line with defined objectives for quality, time and cost is both a science and an art, requiring attention to continuous improvement of processes and to the prevention of defects. This is an extensive topic that DiSSCo must engage with to accelerate mass digitisation at acceptable cost.
The data preservation and access costs, which again have fixed and variable operating costs components, mainly arise after digitisation: What to do with the image taken? Which kind of archiving/storage option should be taken, knowing that the cost will depend on the size of data sets and the speed of mobilising them? Trying to view this from perspective of the user/customer, with the following example (user story): "I want to have access to all images of gastropods from Wales"; the two extremes of possible solutions to this are:
The images are stored on disk/tape in different institutions. Needed actions are look-up in the DiSSCo catalogue, retrieving the images from various institutions, and manually building up the set of images. This will take a few days labour (and that costs some money), but data infrastructure is simple and comparatively cheap to build/maintain.
A coordinated, interoperable data infrastructure with petabytes of storage and petaflops of calculations and gigabytes broadband network. The request will take a few seconds/minutes and will perhaps be fulfilled by distributed query and aggregation. It will be simple to use but complex in operation and cost more to build and maintain.
DiSSCo should sit somewhere on this spectrum from largely manual to fully automated, considering the needs to be FAIR (
Again, costs for data preservation and access have capital, fixed non-recurring and recurring and variable components.
As noted, different types of collections have different requirements in terms of handling procedures and technical approaches to digitisation.
Initially, we considered adopting the storage classification proposed by
To complement work carried out on present technical capacities of digitisation centres within ICEDIG participating institutions (
A template for gathering information has been designed (Suppl. material
Gathered costs are adjusted to take account of the different purchasing power of money in different economies and represented for the EU as a whole. This adjustment is done using the Eurostat
Several approaches to implementing and maintaining the costbook have been considered, including:
Use of Excel spreadsheets; Another tool, like
In the first instance, gathering of costs has been carried out with a small number of collection-holding institutions that are beneficiaries in the ICEDIG project using an Excel spreadsheet template as first designed (Suppl. material
Alternative approaches such as Airtable can be adopted when either a larger number of institutions are asked to provide costs, and/or for budgeting purposes. To test this premise, a pilot workspace was set up in Airtable. The flat Excel template was partially normalised into a relational data structure, and calculated fields added to mirror the calculations in the Excel costbook. A small set of test data was entered into the Airtable tables, and results checked against the Excel template to confirm that calculations had been accurately replicated.
Data were originally received in the form of 22 completed template worksheets (Suppl. material
A manual process was also used to create a set of descriptive field names for the 82 data fields in the template and to map each field to the row and column of the relevant cell in the template. For future reference, allocating named ranges to the cells when creating the original template would have negated the requirement for this manual step. This is a modification that we propose should be made before the templates are used again.
A short Visual Basic for Applications (VBA) procedure was written and executed to extract the data (Suppl. material
The data were manually transposed into a standard table format, with one column per data field. A pivot table was created using the flattened table as the data source to provide some support for dynamic analysis and visualisations.
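The transposition and pivot steps could also be scripted. The sketch below is not the procedure used in this study (which relied on VBA and Excel); it assumes, hypothetically, that extracted values arrive as (costbook, field, value) triples, and the field names and figures are invented for illustration. It uses only the Python standard library.

```python
from collections import defaultdict
from statistics import median


def to_wide(long_rows):
    """Turn (costbook, field, value) triples into one record per costbook."""
    table = defaultdict(dict)
    for costbook, field, value in long_rows:
        table[costbook][field] = value
    return dict(table)


def pivot_median(wide, group_field, value_field):
    """Median of `value_field` for each distinct value of `group_field`."""
    groups = defaultdict(list)
    for record in wide.values():
        groups[record[group_field]].append(record[value_field])
    return {key: median(values) for key, values in groups.items()}


# Hypothetical extracted data: three costbooks, two fields each.
rows = [
    ("cb1", "collection_type", "herbarium"), ("cb1", "cost_per_item", 2.0),
    ("cb2", "collection_type", "herbarium"), ("cb2", "cost_per_item", 3.0),
    ("cb3", "collection_type", "pinned insects"), ("cb3", "cost_per_item", 1.0),
]
wide = to_wide(rows)
summary = pivot_median(wide, "collection_type", "cost_per_item")
print(summary)
```

Keeping the extracted data in a long format first makes it straightforward to add or rename fields before the table is flattened.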
The
Of the seven institutes surveyed, six (APM, RBGK, LUOMUS, MNHN, UTARTU and NHMUK) returned at least one completed costbook. Of these seven institutes, two are herbaria and five are general natural history museums. A total of 35 costbooks were returned (Suppl. materials
Of the eight collection types, two have widely established, mature workflows with costs: herbarium sheets and pinned insects. Herbarium sheets have long been ahead of the other preservation/collection types in terms of established methodologies and protocols with international projects such as the JSTOR Global Plants Initiative (
As can be seen from Fig.
A recent ICEDIG study of state-of-the-art approaches to mass imaging of liquid samples, which covers spirit material, concluded that mass digitisation for these collections is currently unfeasible hence the lack of mature workflows (
Microscope slide digitisation was also the subject of an ICEDIG report. While mass imaging approaches have been developed and shared (
The remaining collection types (Anthropological, Palaeontological, Mineralogical and non-insect Invertebrates) were not included in the scope of ICEDIG digitisation research. While non-insect invertebrates are a major collection type, they were accidentally omitted from the scope of
As illustrated in Table
Establishment costs are highly variable as is their effect in overall annual digitisation costs. Detailed breakdowns and descriptions of equipment purchased were not given for most of the costbooks, whereas in several cases some additional information was given indicating that costs also included computers, printers and other ancillary equipment. This makes it hard to understand what the costs really cover and the variations between institutions. Because of this the numbers mask differences in the kinds of equipment purchased so comparisons can be made only cautiously.
In the case of herbarium digitisation, the gathered costs mainly relate to equipping a single workstation; yet in one case it is known that an automated conveyor system was included, and in another case, it is known that a high-capability/resolution scanner was purchased. Nevertheless, the average and median costs are similar, with a range of €26,000 – €38,000 PPS as a typical workstation cost. When an integral conveyor system is included, the cost is higher.
Pinned insect lines show a greater variability across the range of reported establishment costs. Insect lines are one area subject to much recent innovation in attempts to increase throughput, and thus a greater variety of novel equipment solutions have been purchased and tried. It’s not possible to give a typical cost for establishing a pinned insect line, except to say that for static (low throughput) solutions the equipment costs are typically low – basically a few thousand PPS for camera(s) and lighting, whereas introducing automation via a conveyor system for higher throughput substantially increases costs (by an order of magnitude).
For the other collection categories, insufficient data was returned to give any credible picture of establishment costs. One outlier worthy of note is a setup composed of a specialised fluorescence/brightfield slide scanner and research microscope for digitisation of microscope slides. This cost more than €150,000 PPS.
In common across all institutions and regardless of digitisation workflow/capability is the observation that establishment costs focus almost solely on equipment purchase and to a lesser extent on costs of acquisition and upgrade. Few non-equipment elements of the expected costs of establishment – such as building/workspace renovation costs, new furniture, electrical work, etc – were reported. This suggests either that such costs are not frequently incurred or (more likely) that such costs are unknown or cannot be accurately accounted for after the fact.
Space requirements for equipment range from 10 m² to 65 m², with average and median of 29 m² and 25 m² respectively. 15 m² – 20 m² seems to be a typical amount of space needed for these kinds of digitisation facilities, with conveyor systems needing larger spaces.
Finally, depreciation periods for such equipment are typically stated as 5 or 7 years, indicating that respondents consider this to be a reasonable lifetime for such investments (even if actual lifetimes are sometimes longer).
Establishment costs are one-off costs, normally funded out of capital budget, infrastructure development or project grants. Depreciation is therefore used as an element of the fixed costs calculation to give a truer reflection of the actual cost of digitising specimens. Depreciation costs vary, depending on the original establishment cost and the chosen depreciation period.
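A minimal sketch of this calculation, assuming straight-line depreciation with no residual value (the template's exact method may differ), is shown below. The establishment cost is a hypothetical figure; the 5- and 7-year periods are those typically stated by respondents.

```python
def annual_depreciation(establishment_cost: float, period_years: int) -> float:
    """Straight-line annual depreciation, assuming no residual value."""
    if period_years <= 0:
        raise ValueError("period_years must be positive")
    return establishment_cost / period_years


# Hypothetical workstation costing 35,000, depreciated over 5 or 7 years.
over_five = annual_depreciation(35_000.0, 5)
over_seven = annual_depreciation(35_000.0, 7)
print(over_five, over_seven)
```

A longer depreciation period lowers the annual fixed cost, and therefore the per-item cost, for the same equipment.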
Fixed costs are unrelated to the volume of specimens digitised. However high or low the rates of digitisation, fixed costs remain the same. Table
Fixed staff cost made up the largest percentage of total fixed costs. Some institutions factor staff into fixed costs (e.g. NHMUK where digitisation staff are largely on long term contracts) while others consider it a variable cost depending on the finance structure that supports the role. Every institution reported fixed-cost staff except for RBINS and every institution reported variable cost staff except for the NHM. Among the institutions that reported fixed-cost staff, the average number of staff was 0.84 with a maximum of 2.5 and the total annual labour cost ranged from €1,798 to €124,025 PPS, the highest case of which was MNHN’s outsourced workflow for ReColNat.
There were two sources of variable costs that were measured in this analysis – variable cost labour and the cost of consumables. Table
Where labour is considered a variable cost, it makes up a significantly larger percentage of variable costs than consumables (although the potential for double-counting should be taken into consideration). Labour costs were calculated by number of staff, their average gross monthly salary and the length of their working week. The average number of variable-cost staff (excluding the NHM who reported none) among the remaining workflows was 1.54, with a maximum of 4, indicating that it may be more feasible for many institutions to employ variable-cost staff than a team of full-time fixed-cost staff.
For the six institutions with variable-cost staff, total annual fixed labour cost ranged from €18,727 to €123,264 PPS. One of RBGK’s workflows included national insurance payments and superannuation in its calculations and was removed from this analysis because it was not comparable with the other workflows.
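Labour costs of the kind described above can be sketched from the three reported inputs: number of staff, average gross monthly salary and length of the working week. The formulas below are an assumption on our part (in particular the 52-week year), not the exact calculation in the costbook template, and all figures are hypothetical.

```python
def hourly_rate(monthly_salary: float, weekly_hours: float, weeks_per_year: float = 52.0) -> float:
    """Gross hourly rate implied by a monthly salary and a working week (assumed formula)."""
    return monthly_salary * 12 / (weekly_hours * weeks_per_year)


def labour_cost_per_batch(hours_per_batch: float, rate_per_hour: float) -> float:
    """Labour cost of digitising one batch of 100 objects."""
    return hours_per_batch * rate_per_hour


def annual_labour_cost(staff_fte: float, monthly_salary: float) -> float:
    """Total annual labour cost for a team, expressed in full-time equivalents."""
    return staff_fte * monthly_salary * 12


# Hypothetical team: 1.5 staff on 3,000 per month, 37.5-hour week, 10 hours per batch.
rate = hourly_rate(3_000.0, 37.5)
batch = labour_cost_per_batch(10.0, rate)
annual = annual_labour_cost(1.5, 3_000.0)
print(round(rate, 2), round(batch, 2), annual)
```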
The cost for consumables per batch of 100 objects (single specimens or containers) ranged from zero to €54.49 PPS. The specific consumables used for each project were not named in every case, so it is not possible to identify precisely what the costs are or the reason for this wide range in consumables cost. The two reported cases of fungarium had a much higher cost for consumables than other specimens (Fig.
Fig.
Direct comparison of the reported rates of digitisation between institutions is not possible as each has different setups and team compositions, as illustrated in Table
These differences in workflow and the level of capture can be seen in the throughput within specimen groups. After removing the single case of automated outsourcing due to its exponentially higher throughput, the remaining 22 workflows showed a wide range of throughputs where more than one case was reported, particularly for microscope slides and pinned insects (Fig.
Institutions also vary in the number of staff dedicated to digitisation, ranging from 0.1 to 4.8 people. As labour makes up the largest percentage of digitisation costs, it is important to understand labour’s impact on throughput. Contrary to expectations, a larger staff did not necessarily result in a linear increase in throughput. (Fig.
Herbarium specimens showed a slight association between team size and throughput. However, the throughput of pinned insects varied widely on teams of one from 1,737 to 114,700 specimens annually, with the largest team of 3.8 returning the smallest throughput. While semi-automated processes did tend to show a higher throughput, the two cases of manual processes for pinned insects showed a throughput of 21,818 and 1,736. While the one case of an herbarium semi-automated workflow did yield one of the highest throughputs (52,800), the highest was a manual workflow (62,400).
These differences may be due to the depth of information collected in the digitisation process. While it is hard to make direct comparison with workflows, both LUOMUS and NHMUK have developed high throughput workflows for pinned insects (
The time required to digitise a batch of 100 objects (single specimens or containers) is affected by multiple factors, including: layout of the institutions, storage facilities, equipment available, etc. There were 18 reported cases of time spent across all specimen types – NHM and RBINS did not provide any time data. The median time spent digitising 100 objects was 9.88 hours, with a range from 2.10 to 217.67 hours. RBGK’s microscope slides, the high outlier, were far more time consuming than any other specimen type and were removed from further analysis.
The two palaeontological cases had a wide range, with one requiring 41.67 hours per 100 objects and the other double at 83.33 (Fig.
Time was also estimated for each stage of the digitisation process – curation, image capture, image processing, data capture and preservation. In general, curation was the most time-consuming step in the process across most projects and specimen types (Table
In order to assess the cost per item, three cases were excluded: an RBGK project that included national insurance and pension payments in its cost analysis; RBGK’s case of microscope slide digitisation, which had a far higher cost per item than all other cases (€381.26 PPS); and a UTARTU case that did not provide cost data. This left 19 cases.
The median cost per item across all cases was €2.10 PPS, ranging from €0.53 PPS to €34.22 PPS. Again, the range between the two cases of palaeontological digitisation proved to be the widest while pinned insects and herbarium were relatively consistent. The median cost per item for herbarium was €2.78 PPS and for pinned insects was €1.06 PPS (Fig.
In the two cases where the digitisation process was fully automated – MNHN’s outsourced ReColNat workflow and UTARTU’s palaeontological collection – cost per item was reduced considerably (Fig.
While six out of seven institutions returned costbooks categorized by specimen type, RBINS returned costbooks categorized by method of digitisation and size of the item being digitised. While this makes it difficult to compare with other institutions, it does provide insights into different aspects of digitisation costs by showing which methods of digitisation are more costly than others.
For example, 3D imaging is the most expensive digitisation method, with a very low throughput offset by the quality of the image captured. Transcribing metadata and 2D photo captures of insect boxes are the least expensive and have the highest throughput. Interestingly, µCT scanning has the highest annual total cost because of high fixed depreciation costs for X-ray equipment (€63,571 a year). However, the average cost per item remains relatively low because µCT achieves a throughput that offsets the increased costs. Fig.
In conjunction with this costbook analysis,
The time per item for transcription ranged widely, from around 30 seconds to 41 minutes to fully transcribe label data on a specimen. This large range is due to the range of information that is included in the transcription process, the method used and the amount of quality assurance required. For example, georeferencing adds significantly to the time required for transcription, particularly if the label includes only a vague location description. Some case studies reported that they did not include georeferencing because of limitations on either time or funding.
Some of the case studies provided examples of either outsourcing transcription to a service like Alembo, using a crowdsourcing platform like DigiVol or testing an automation tool like Google Vision. In each of these cases, staff resources were saved by not requiring museum labour resources for the actual transcription. However, there were, in each case, time and money trade-offs for the increased need for project management, volunteer recruitment, quality checks and/or development resources needed to carry out the project.
The analysis showed that, consistent with the other digitisation components studied in this report, time and cost can vary significantly depending on collection type, staff resources and method deployed.
The process of data collection for this study revealed complexities in gathering and assessing accurate cost data. First, there were inconsistencies in how workflows are named and categorised. In asking for the specimen type, one institution used ‘mycological’ and another used ‘fungarium’ to describe digitising their fungi collection. The first institution categorised this workflow as a herbarium collection and the latter as ‘Other’. This is indicative of limitations and inconsistencies in the terminologies used to describe collections and, subsequently, how they are categorised and analysed. For the purposes of this study, both were categorised as ‘fungarium’.
An inconsistent approach to describing collections of physical specimens is a wider challenge that the natural science community is attempting to address. While many efforts have been made within and across institutions to generate and share collection descriptions data, the lack of common standards, data model and vocabularies remains a significant barrier to making these datasets comparable and interoperable. The terminology issues described above are a result of this lack of consistency and standardisation across institutional practices.
The Biodiversity Information Standards organisation TDWG, (
Secondly, different workflows were broken out into separate cost books. However, some institutions recorded the same number of employees across multiple workflows and, in some cases, the same time and costs associated with different collections. It is unclear if these were separate but identical costs that could thus be summed, or if they were the same costs and thus a double counting of the same data.
ICEDIG recommends working towards harmonisation of approaches to costing digitisation. This will become more important as various kinds of decision about digitisation are made e.g., prioritisation, allocation of certain types of mass digitisation to specific facilities, budgeting, authorization of on-demand digitisation requests, etc.
For categories of collection where digitisation has been carried out by a significant number of institutions, it’s reasonable to look at the spread of costs achieved and to focus on transferring knowledge and learning points from those institutions of low cost to those where costs are higher, in an effort to increase cost efficiencies.
For categories of collection where digitisation has been carried out by only a few institutions, the aim should be to spread best practice to institutions embarking on digitisation of these categories, as a means to avoid repeating past mistakes and to accelerate progress towards efficient (low-cost) digitisation across institutions in those categories.
Recommendations on capital equipment choices, whilst probably appropriate for DiSSCo to give guidance on, are out of scope of the present document.
Based on this costbook exercise, an ambitious baseline for mass digitisation of pinned insects and herbarium sheets would be less than €0.50 PPS per item. This is based on a very limited sample of institutions and workflows, so it should be taken as indicative only. There is not enough data to make suggestions on baseline costs for digitising other specimens, but in order to meet DiSSCo’s mass digitisation goals we need to encourage and support continuous improvements to drive that cost down and to increase throughput without increasing per item cost. In practice, also, digitisation projects vary widely, and the degree of data captured should relate to the project aims: where more data is most appropriate (e.g., a key project aim is full georeferencing or some kind of analytical treatment of an object) it may well be appropriate to accept a higher baseline cost.
In addition to the discussion points above we recommend the following:
Focus on harmonisation of costing approach: standardisation of the methodology for gathering and reporting costs. We recognise that many institutes will have difficulty gathering and providing detailed cost information and that a simpler costing approach may be required.
Focus on cost improvements (efficiencies): recommend setting a target mass digitisation cost (per specimen) for different types of collection. If we had to set it today, what would we set it at? A strong focus on cost improvement would be one of several means of accelerating progress in mass digitisation.
Consider how we can transfer best practice between institutes and digitisation teams.
Track digitisation costs over time as standard: we currently have limited data on digitisation costs and if more institutes started recording this data we could better identify effective and ineffective practice.
Anthropological, Palaeontological, Mineralogical and non-insect Invertebrates collections were not included in the scope of ICEDIG digitisation research. While non-insect invertebrates are a major collection type, they were erroneously omitted from the scope of
The costbook work in ICEDIG will be inherited and expanded upon by the DiSSCo Prepare project, specifically in Tasks 4.1 and 4.2, the “Costbook for DiSSCo” and “Cost model for charging services”, and their corresponding reports.
While not directly working on a costbook, SYNTHESYS+ will be gathering and assessing cost data as part of the new Virtual Access workpackage (
In the subsections that follow, we offer some further considerations that other projects in the DiSSCo Programme portfolio should take into account but they apply to any organised large scale digitisation of collections.
The current method for collecting, aggregating and analysing data from different institutions, based on completing pre-formatted spreadsheet templates, becomes cumbersome as the number of responding institutions and the quantity of data increase. Significant manual work is involved both for the institutions in filling templates and for analysts in working with the returned data.
As we noted when considering implementation of the costbook template (see
Regardless of whether Airtable is the specific correct product to adopt, the key learning point is that reliance on old-style spreadsheet products, distributed and managed as files among participants is no longer necessarily the most flexible, efficient or sustainable approach to gathering, collating, analysing and using actual cost information. The recommendation here is that DiSSCo should consider alternatives to the Excel/Google spreadsheets approach for modern management of cost information. However, any change from using commonly used software to a new webform or database will require sufficient support to ensure it is fit for purpose.
Recommendation: DiSSCo must evaluate and adopt modern alternative(s) to traditional spreadsheet approaches for the management of cost information.
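To illustrate what managing cost data in a single structured location could look like, the following is a minimal sketch of a one-to-many Institution/Facility/Fixed Costs/Variable Costs model with a simple roll-up. It is not the costbook template or any Airtable schema: the class names, fields and figures are hypothetical and chosen for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Facility:
    """One digitisation line (e.g. a herbarium line) within an institution."""
    name: str
    fixed_costs: list[float] = field(default_factory=list)     # annual items, in PPS
    variable_costs: list[float] = field(default_factory=list)  # annual items, in PPS

    def total_cost(self) -> float:
        return sum(self.fixed_costs) + sum(self.variable_costs)


@dataclass
class Institution:
    """An institution holding one or more digitisation facilities."""
    name: str
    facilities: list[Facility] = field(default_factory=list)

    def total_cost(self) -> float:
        # Roll-up: aggregate across all facilities of the institution.
        return sum(f.total_cost() for f in self.facilities)


# Hypothetical figures, for illustration only.
museum = Institution("Example Museum", [
    Facility("Herbarium line", fixed_costs=[1000.0, 500.0], variable_costs=[2500.0]),
    Facility("Pinned insect line", fixed_costs=[800.0], variable_costs=[1200.0]),
])

for facility in museum.facilities:       # drill-down per facility
    print(facility.name, facility.total_cost())
print(museum.name, museum.total_cost())  # roll-up per institution
```

Holding the records in one structure of this kind means aggregation and drill-down are a matter of querying, rather than of collecting and merging separate files.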
Several currencies have been used throughout the cost gathering and analysis work. The NHM UK entered their data in £ sterling. Other institutions entered their data in € euros. For summation, conversions were done to the EC’s PPS Purchasing Power Standard. However, we failed to foresee that we might want to do some analytical calculations, for example stating specific cost components, such as depreciation, as proportions (%) of a total annual cost. Doing so involves going back and re-manipulating specific parts of the data.
A more helpful approach would be to convert from the currency used for data entry to PPS for each data item entered, at the time of entry. This would facilitate the kind of calculation described above.
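As an illustration of converting at the time of entry, the sketch below stores each cost item both as entered and in PPS, so that proportions of a total can be computed directly. The conversion factors are placeholder values, not real purchasing power parities, and the function and field names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical conversion factors: national currency units per 1 PPS.
# Placeholder values only; real factors would come from official statistics.
PPS_FACTORS = {"UK": 0.8, "FI": 1.25, "FR": 1.0}


@dataclass(frozen=True)
class CostEntry:
    label: str
    amount: float      # as entered, in national currency
    country: str
    amount_pps: float  # converted once, at the time of entry


def enter_cost(label: str, amount: float, country: str) -> CostEntry:
    """Record a cost item and convert it to PPS immediately."""
    return CostEntry(label, amount, country, amount / PPS_FACTORS[country])


def share_of_total(entries, label):
    """Proportion (%) of the total PPS cost contributed by one component."""
    total = sum(e.amount_pps for e in entries)
    part = sum(e.amount_pps for e in entries if e.label == label)
    return 100.0 * part / total


# Hypothetical annual fixed costs for one facility.
entries = [
    enter_cost("depreciation", 800.0, "UK"),
    enter_cost("fixed staff", 7200.0, "UK"),
]
depreciation_share = share_of_total(entries, "depreciation")
print(f"Depreciation: {depreciation_share:.1f}% of annual fixed costs")
```

Because every item already carries its PPS value, component shares and cross-institution sums need no second pass over the original currency data.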
Recommendation: In cost gathering, budgeting and accounting, DiSSCo should convert,
As we noted in the
It is evident from anecdotal comments received during the task that practices for recording and breaking out costs, levels of detail of cost records and maturity of accounting for work vary considerably among the responding institutions.
Two elements to communicate best practices about:
Best practice accounting procedures, so that the quality and level of detail/accuracy of costing data improve.
Innovations that lead to higher efficiencies/throughputs and lower costs.
How then should DiSSCo distil, promote and support dissemination of best practices from established workflows in institutions with high efficiencies and low costs to other institutes that might benefit?
Costs must be treated separately from charges. A cost model is not the same as a charging or business model, and the latter is not part of the present task. Nevertheless, in the end, cost calculations cannot be considered in isolation from a business/charging/organisational model, because of the influence of DiSSCo governance decisions and policy on requirements for digitisation, data access and availability. Digitisation can be required to a certain level. Some data may be more immediately available than other data, according to scientific demand and difficulty to retrieve (faster and easier versus slower and more time-consuming).
In-depth analysis of potential business models is described in
Any business model must, however, take both depreciation and amortization into account.
Depreciation is the process of allocating the capital costs of a tangible asset (such as digitisation equipment or storage systems) over time. It’s a measure of how much of the value of an asset has been consumed to a point in time (usually, the end of an accounting period). Note though, that usage of such equipment can usually extend well beyond the depreciation period. Depreciation is well understood and, especially for IT infrastructure, is typically allocated over three or four years using a straight-line method (i.e. the same amount in each year).
Depreciation is used in statutory accounting for matching costs against income and hence for calculating annual profit or loss. Its use in management accounting (as considered here) is as a means of reflecting the true cost of digitising specimens in years following those in which a digitisation facility was established.
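The straight-line method described above can be sketched as follows. The equipment cost and the four-year period are hypothetical inputs, and the optional residual value is an assumption added for completeness rather than something the costbook prescribes.

```python
def straight_line_depreciation(capital_cost, years, residual_value=0.0):
    """Return the annual depreciation charges (the same amount each year)."""
    if years <= 0:
        raise ValueError("years must be positive")
    annual = (capital_cost - residual_value) / years
    return [annual] * years


def book_value(capital_cost, years, elapsed_years, residual_value=0.0):
    """Remaining value of the asset after a number of elapsed years."""
    annual = (capital_cost - residual_value) / years
    # Equipment may stay in use after it is fully depreciated;
    # its book value then simply stays at the residual value.
    return max(residual_value, capital_cost - annual * elapsed_years)


# Hypothetical digitisation station costing 36,000 (PPS), depreciated over 4 years.
schedule = straight_line_depreciation(36000.0, 4)
print(schedule)
print(book_value(36000.0, 4, elapsed_years=2))
print(book_value(36000.0, 4, elapsed_years=6))
```

In management accounting terms, the annual charge is what would be added to each year's fixed costs when working out the cost of digitising specimens in the years after the facility is established.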
Amortization is the process of allocating the costs of an intangible asset such as data over time (its ‘useful life’). The purpose is to match the costs of creating and maintaining data to the value earned from using that data. Or to put it another way, to ensure that expenses are not incurred in maintaining data with no useful value. Like depreciation, accounting for amortization in multi-year business plans for digitisation is good practice. Because of the multi-stakeholder characteristics of the DiSSCo governance and business model, this is a topic DiSSCo must pay attention to – however this is an area of high complexity where evidence is likely to improve over time.
Accounting for amortization in DiSSCo must match the expense of acquiring, preserving and maintaining ‘FAIR’ digitised specimen/collection data with the value of the use that data receives over time, usually in a linear fashion over the period of ‘useful life’. Such value, however, can be hard to measure in financial terms - the value of research, education/training and other uses is not usually measured financially, partly because there are no accepted standard methods for doing so. Proxy measures can be useful, such as the number and impact of scientific publications achieved from having the data available, or the number and value of new research grants enabled by digitisation. Such metrics must be tracked from an early stage by the Digitisation Dashboard application.
We know the useful life of physical specimens in collections can easily be measured in decades or hundreds of years. But we also know the usefulness of both individual specimens and collections of specimens varies enormously, according to the scientific and societal questions of the day. What is the useful life of Digital Specimens and Digital Collections? For arriving at a practical basis for valuation and amortization, we must model several scenarios where amortization periods are set at say, 10, 25 and 50-year intervals.
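A scenario comparison of this kind can be sketched in a few lines. The total data cost and the annual benefit used here are invented figures, and the benefit is assumed to be expressible as a single yearly amount, which, as noted above, is itself hard to establish in practice.

```python
def annual_amortization(data_cost, useful_life_years):
    """Linear amortization: the same charge in each year of useful life."""
    if useful_life_years <= 0:
        raise ValueError("useful_life_years must be positive")
    return data_cost / useful_life_years


# Hypothetical inputs, for illustration only.
DATA_COST = 1_000_000.0       # cost of creating and maintaining the data (PPS)
ASSUMED_ANNUAL_BENEFIT = 50_000.0  # proxy-measured value of use per year (PPS)

scenarios = {
    years: annual_amortization(DATA_COST, years) for years in (10, 25, 50)
}
# For each scenario, does the assumed yearly benefit cover the yearly charge?
covered = {
    years: ASSUMED_ANNUAL_BENEFIT >= charge for years, charge in scenarios.items()
}

for years, charge in scenarios.items():
    print(f"{years:>2} years: {charge:>9,.0f} per year, covered: {covered[years]}")
```

The longer the assumed useful life, the lower the annual charge that measured benefits have to balance, which is why the choice of amortization period matters for the business model.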
In future, large-scale (mass) and more ‘bespoke’ digitisation can both be operated more frequently on a digitisation-on-demand basis, i.e. fulfilling demands for specimen information by immediately digitising it and making it available on request on efficient digital platforms. There are arguments that this is more cost-effective: adapting words from elsewhere, we could say that immediate digitisation is better than storage, meaning that it is more cost-effective to rapidly digitise and deliver only what is requested than to systematically and slowly digitise and store everything that is collected. In practice, however, experience to date of systematic digitisation is that its benefits are not always predictable – there is a strong element of serendipity, e.g. in use of collections data alongside other data via aggregators; and there can be a ‘critical mass’ of data for certain kinds of research (‘big data’ approaches). Sometimes, demand does not exist until data is made available, and data availability can enable new research paradigms and stimulate future demand. NHMUK’s Digital Collections Programme, for example, tracks citations of digital specimen data – these data have not been created on demand, but usage (and therefore benefit/impact) is growing year on year.
Once digitised, the value of specimen data does not decay quickly. Indeed, the value can even be increased as digitised specimen data is improved and supplemented with links to other information. There are costs associated with this. First, the costs of digitisation; second, the key cost of storage/preservation/serving over long time periods; and third, additional costs associated with data improvement and supplementation. There must be enough steady and measurable benefit over long periods built into the DiSSCo business model to balance costs. An additional complexity is the question of what ‘body’ of data it is meaningful or accurate to apply amortization across – the ‘value’ or benefit of data tends to increase in the context of other data, whether through an increase in the size of the same dataset; additional data from related collections datasets; or data from other sources and of other types/content e.g. climate data. While each digitisation project may look at its own dataset for amortization and to estimate costs, the benefits and value do not accrue in isolation. Thus, the approach towards amortizing costs of data for DiSSCo must be examined very carefully and kept under review over time.
Considering the complexities of the digitisation process, and its variability among institutions and between different types of collections, we conclude that time spent is an essential parameter informing costing information. Other key parameters are labour rates, consumables and fixed cost elements such as heating and lighting, space rental, etc. Actual costs vary from one institution/country to another and our template offers calculators based on simple inputs. Gathered costs can be normalised to take account of different purchasing power of money in different countries.
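How these parameters combine into a cost per item can be sketched as below. This is not the calculator from the costbook template: the formula is a simplified assumption (variable cost per item plus fixed costs spread over annual throughput), and all input values are hypothetical and taken to be already expressed in PPS.

```python
def cost_per_item(hours_per_100, hourly_rate, consumables_per_100,
                  annual_fixed_costs, annual_throughput):
    """Simplified cost per digitised item.

    Variable part: labour time and consumables, both given per 100 objects.
    Fixed part: annual fixed costs (space, heating, lighting, depreciation,
    fixed staff) spread evenly over the items digitised in the year.
    """
    if annual_throughput <= 0:
        raise ValueError("annual_throughput must be positive")
    variable = (hours_per_100 * hourly_rate + consumables_per_100) / 100.0
    fixed = annual_fixed_costs / annual_throughput
    return variable + fixed


# Hypothetical facility running at its full capacity of 50,000 items a year.
at_capacity = cost_per_item(
    hours_per_100=2.5, hourly_rate=20.0, consumables_per_100=5.0,
    annual_fixed_costs=10_000.0, annual_throughput=50_000,
)
# The same facility with only half as many specimens ready for digitisation.
half_capacity = cost_per_item(
    hours_per_100=2.5, hourly_rate=20.0, consumables_per_100=5.0,
    annual_fixed_costs=10_000.0, annual_throughput=25_000,
)
print(f"At capacity:   {at_capacity:.2f} per item")
print(f"Half capacity: {half_capacity:.2f} per item")
```

Under these assumptions the variable part is unchanged when throughput falls, while the fixed part per item doubles, which is one way to see why matching specimen supply to facility capacity matters for cost.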
Optimal digitisation cost is achieved when the volume and availability of specimens ready for digitisation matches the capacity of the digitisation facility. Having enough specimens ready means the digitisation capacity can be effectively utilised and the highest throughput can be achieved, thus leading to the lowest cost (
What an institution wants to know is: When can certain kinds of digitisation be achieved for specific levels of investment? When does it become practical/economic to start digitising a collection? What does it cost to invest for digitisation and to reach a certain level for a collection?
The gathered cost information begins to inform answers to such questions. We have made several recommendations to be carried forward elsewhere in the DiSSCo Programme e.g., as specific work items in the DiSSCo Prepare project, for consideration by the DiSSCo Coordination and Support Office and the DiSSCo General Assembly.
We express our thanks and acknowledgement to the following individuals who assisted with this report and the underlying data:
Hannu Saarenmaa (UH), Ana Casino (CETAF), Xavier Vermeersch (CETAF), Karsten Gödderz (CETAF), Luc Willemse (Naturalis), Michel Guiraud (MNHN), Agnes Wijers (PIC) and Jeroen Bloothoofd (PIC) for contributions towards conception, design and review of the costbook template.
Louise Allan (NHM) for attempting the completion of a trial costbook sheet to help us iron out difficulties.
Quentin Groom (APM), Mathias Dillen (APM), Anne Koivunen (LUOMUS), Kari Lahti (LUOMUS), Sarah Philips (RBGK), Louise Allan (NHM), Veljo Runnel (UTARTU) and Vanessa Demanoff (MNHN) for filling and returning 22 completed templates.
Contribution types are drawn from CRediT -
Digitisation scenarios can be characterised along two axes: capability and capacity (or scale).
On the capability axis a spectrum of possibilities for the organisation of digitisation ranges from temporary or permanent inhouse facilities to fully outsourced contracts of digitisation undertaken by commercial companies. On all points of the spectrum, there can be various proportions of professional and volunteer digitisers contributing effort and affecting operating costs accordingly.
On the capacity axis, digitisation activities can range from small-scale, one-off bespoke projects to digitise specific specimens, collections or parts of a collection through to large-scale, long-term mass digitisation programmes aiming to digitise complete holdings of an institution. At multiple points on this axis digitisation-on-demand can also range from sporadic one-off digitisations (special cases) to continuous routine requests for digitisation.
At the time of writing the present article there is no citation available for the proposed MIDS standard. Readers are advised to refer to the Biodiversity Information Standards (TDWG) website,
Cost of consumables per 100 specimens.
Fixed and variable costs as percentages (%) of overall annual costs.
Range of annual throughput per person by specimen type.
Total Monthly Throughput by Total Staff. Orange = herbarium; Blue = pinned insects.
Hours taken to digitise 100 objects.
Returned costbooks versus stated capability to digitise from
RBGK 'Other' = fungi collection. MNHN 'Herbarium sheets' = two workflows (day-to-day digitisation in the museum, and Recolnat project workflow). MNHN 'Other' = marine invertebrates collection. UTARTU 'Other' = lichens and fungi. NHMUK 'Pinned insects' = two workflows (standard workflow with label removal, and ALICE workflow with label remaining).
Cost per Item (€ PPS).
Cost per item by digitisation method.
Average cost per item (€ PPS) compared to total annual throughput per person (Items).
GBIF “preserved specimens” mapped to natural history collection types: The results of a search of the GBIF data portal carried out on 26th November 2019 to ascertain the proportion of preserved specimens falling into each of the major natural history collection types. Search filtering on the term “preserved specimen” yielded a total of 166,367,960 results. Within these results, the major taxonomic groups can be mapped to collection types as shown.
Main natural history collection types | Percentage of GBIF preserved specimens |
|
47% |
|
46% |
|
4% |
Establishment costs (PPS) for herbarium sheet and pinned insect digitisation capabilities.
Herbarium line (n = 7 stations) | Pinned insect line (n = 5 stations) | |
Minimum equipment cost | €12,937 | €4,109 |
Maximum equipment cost | €40,670 | €40,816 |
Average cost | €35,593 | €17,729 |
Median cost | €35,447 | €8,808 |
Fixed costs as percentage (%) of overall annual digitisation costs.
Institution | Herbarium line | Pinned insect line |
APM | 14.9% | - - |
LUOMUS | 15.6% | 42.8% |
MNHN | 73.8% (inhouse) 15.2% (ReColNat) | 65.3% |
NHMUK | 98.8% | 100% (ALICE) 100% (Standard) |
RBGK | 46.2% | - - |
UTARTU | 96% | 16.1% |
Component costs as percentage (%) of annual fixed costs (average).
Component | Herbarium line | Pinned insect line |
Depreciation | 7.6% | 10% |
Space charge | 7.6% | 6.3% |
Fixed staff cost | 53% | 50.4% |
Overheads | 27.2% | 29.9% |
Other costs | 4.7% | 3.3% |
Variable cost as percentage (%) of overall annual costs.
Institution | Herbarium line | Pinned insect line |
APM | 85.1% | - - |
LUOMUS | 84.4% | 57.2% |
MNHN | 26.2% (inhouse) 84.8% (ReColNat) | 34.7% |
NHMUK | 1.2% | 0% |
RBGK | 53.8% | - - |
UTARTU | 4% | 83.9% |
Fixed and variable costs as percentages (%) of overall annual costs.
Institution | Herbarium line | Pinned insect line | ||
Fixed costs | Variable costs | Fixed costs | Variable costs | |
APM | 14.9% | 85.1% | - - | - - |
LUOMUS | 15.6% | 84.4% | 42.8% | 57.2% |
MNHN | 73.8% (inhouse) 15.2% (ReColNat) | 26.2% 84.8% | 65.3% | 34.7% |
NHMUK | 98.8% | 1.2% | 100.0% (ALICE) 100.0% (Standard) | 0.0% 0.0% |
RBGK | 46.2% | 53.8% | - - | - - |
UTARTU | 96.0% | 4.0% | 16.1% | 83.9% |
Workflow type and staff counts to operate.
Legend: [<fixed staff count>, <variable staff count>]
Institution | Herbarium line | Pinned insect line |
APM | Manual [0,1] | |
LUOMUS | Semi-automated [0.1,3] | Semi-automated [0.1,1] |
MNHN | Manual (inhouse) [1,1] Automated (ReColNat) [3,3] | Manual [0.8,3] |
NHMUK | Manual [1.12,0] | Semi-automated (ALICE) [1.12,0] Semi-automated (Standard) [1.12,0] |
RBGK | Manual [2.5,2] | |
UTARTU | Manual [0.2,0] | Manual [0.1,1] |
Hours spent at each stage of the digitisation process per 100 objects.
Institution | Country | Specimen Type | Curation | Image Capture | Image Processing | Data Capture | Preservation |
UTARTU | Estonia | Minerals | 50.00 | 8.33 | 8.33 | 8.33 | - |
UTARTU | Estonia | Palaeontological | 50.00 | 8.33 | 16.67 | 8.33 | - |
MNHN | France | Vertebrates | 30.00 | 10.00 | 1.67 | 8.33 | 0.83 |
MNHN | France | Marine invertebrate | 15.83 | 14.17 | 2.50 | 8.33 | 0.83 |
MNHN | France | Palaeontological | 15.83 | 14.17 | 2.50 | 8.33 | 0.83 |
UTARTU | Estonia | Fungarium | 6.67 | 6.67 | 6.67 | 6.67 | 6.67 |
UTARTU | Estonia | Herbarium | 6.67 | 6.67 | 6.67 | 6.67 | 6.67 |
MNHN | France | Pinned insects | 4.33 | 1.67 | 1.67 | 2.00 | 0.08 |
MNHN | France | Herbarium | 3.33 | 0.83 | 0.83 | 2.50 | 0.63 |
RBGK | UK | Fungarium | 2.83 | 2.00 | 0.15 | - | 0.33 |
LUOMUS | Finland | Spirit material | 2.00 | 2.00 | 2.00 | 2.00 | 2.00 |
MNHN | France | Herbarium | 1.75 | 0.17 | 0.02 | 0.15 | 0.08 |
UTARTU | Estonia | Pinned insects | 1.67 | 3.33 | 0.83 | 3.83 | 0.03 |
LUOMUS | Finland | Pinned insects | 1.03 | 0.67 | - | 0.33 | 0.07 |
RBGK | UK | Herbarium | 0.92 | 0.70 | 0.15 | 0.20 | 2.47 |
APM | Belgium | Herbarium | 0.25 | 1.33 | - | 4.00 | - |
LUOMUS | Finland | Herbarium | 0.17 | 0.83 | 0.17 | 1.33 | 0.17 |
Median throughput by digitisation process
Automated | Semi-Automated | Manual | |
Median Monthly Throughput per Person | 7,902 | 5,837 | 1,200 |
Median Cost per Item | €2.49 | €0.97 | €5.94 |
Pros and cons of Airtable for costbook work.
Pros |
Data structure and interfaces support one to many relationships in the Institution/Facility/Fixed Costs/Variable Costs data model, which would probably require scripting (with associated security/permissions challenges) in Excel or Google Sheets; Supports calculated fields, enabling spreadsheet calculations to be replicated; Provides basic form interfaces for data entry and grid interfaces for data management and querying – this should be intuitive for new and experienced users; Accessible online for submitting and managing data; Provides an API for programmatic access (e.g., custom forms, power business intelligence reports); Data managed in a single location, which: Reduces data management overheads (e.g. chasing down multiple Excel files and extracting data from each); Enables aggregation (roll-up) and analysis (drill-down) across institutions and facilities; Enables future design changes without having to distribute new Excel files and handle legacy versions |
Cons |
Cannot display calculated fields in form view, only grid view; Cannot edit an existing record in form view, only grid view; Native form views are quite simplistic and linear; Must have a different data entry form for each table, rather than a consolidated form where, for example, one can add a facility, and then multiple variable costs records, without leaving the form; Airtable is not free. The paid option is needed to gain access to all functions. |
Cost Template Data Extraction Script
VBA script
Short VBA procedure for extracting data from multiple Excel template sheets into a flattened structure.
File: oo_417898.txt
Costbook Template
Excel Spreadsheet
This costbook template contains separate calculators for establishment (upfront) costs, for fixed costs of digitisation and for variable costs. We strongly recommend modifying the costbook template to allocate named ranges to cells before it is used again.
File: oo_417899.xlsx
Cost Books - Flattened Data
Excel spreadsheet
File: oo_444909.xlsx
Cost Books - Original Responses
Excel Spreadsheet
The original 22 responses from six ICEDIG collections-holding institutions (APM, LUOMUS, MNHN, NHM, RBGK, UTARTU).
File: oo_444910.xlsx
Cost books - RBINS
Excel Spreadsheet
Thirteen costbooks from RBINS covering technique-based digitisation costs (e.g. µCT, photogrammetry, structured light and multispectral imaging).
File: oo_444912.xlsx