Research Ideas and Outcomes : Data Management Plan
Corresponding author: Dora Ann Lange Canhos (dora@cria.org.br)
Received: 23 Jun 2017 | Published: 27 Jun 2017
© 2017 Dora Ann Lange Canhos
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Canhos D (2017) Data Management Plan: Brazil's Virtual Herbarium. Research Ideas and Outcomes 3: e14675. https://doi.org/10.3897/rio.3.e14675
The goal of the Brazil Virtual Herbarium is to facilitate the identification of taxonomic and geographic information gaps of plants and fungi of Brazil. The system displays the status of online data for all valid species in the List of Species of the Brazilian Flora, including those without any record. The system also compares the Brazilian states where specialists indicate that the species occurs with the states that have occurrence points in Brazil's Virtual Herbarium, highlighting the gaps. This data management plan was prepared as part of a pilot project run on behalf of the International Development Research Centre (Canada) on data management policy for development funders (https://doi.org/10.3897/rio.2.e8880).
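The comparison between the states indicated by specialists and the states with occurrence points can be illustrated as a set difference. This is only a minimal sketch: the function name `state_gaps` and the state codes are hypothetical and are not taken from the BVH system.

```python
def state_gaps(expert_states, record_states):
    """Compare states listed by specialists with states that have occurrence points."""
    expert = set(expert_states)
    recorded = set(record_states)
    return {
        # States where specialists say the species occurs but no points are online.
        "missing_records": sorted(expert - recorded),
        # States with points online that specialists did not list.
        "unexpected_records": sorted(recorded - expert),
    }


# Hypothetical example: the state codes are illustrative only.
gaps = state_gaps(["SP", "MG", "RJ"], ["SP", "BA"])
print(gaps)
```

Under this sketch, the first list corresponds to the geographic gaps the system highlights, and the second list points to records a curator may want to review.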
herbaria, botany, data management plan, IDRC, infrastructure, research data management
BVH uses the speciesLink network, an aggregator of species occurrence records, as its information system. Most data providers are herbaria from Brazil and abroad that collect, preserve, and document the occurrence of specimens in nature, but some datasets refer to observations rather than to collected specimens. Herbaria from abroad contribute data on samples collected in Brazil.
Each textual data record may include:
Data records may be incomplete, as one of the aims of BVH is to help herbaria in improving data quality.
Images of the specimen (voucher or live) associated with the textual record may also be available as a separate file.
The data model used is the Darwin Core 2 standard (see http://rs.tdwg.org/dwc). Data providers can use practically any software, and data is sent as raw data to a PostgreSQL database. Software in use today includes Brahms, BioCase, IPT, DiGIR Provider, Firebird, MS-Access, MS-Excel, PostgreSQL, Sonnerat, and speciesBase. Data is accessed through an on-line search interface and can be viewed as an HTML file, plotted in maps or charts, or downloaded in formats compatible with MS-Excel 2007 (.xlsx) or MS-Excel 2003 (.xls), or as a UTF-8 tab-delimited text file.
Standard image formats include TIFF, JPEG, and PNG.
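As an illustration of working with the tab-delimited text download, the sketch below parses such content with Python's standard `csv` module. The column names are Darwin Core terms, but the sample rows and the helper name `read_tab_delimited` are invented for this example, and the actual columns in a speciesLink download may differ.

```python
import csv
import io


def read_tab_delimited(text):
    """Parse tab-delimited text into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


# Hypothetical sample: Darwin Core column names, invented values.
sample = (
    "catalogNumber\tscientificName\tstateProvince\n"
    "0001\tExample species\tSão Paulo\n"
    "0002\tExample species\t\n"
)
rows = read_tab_delimited(sample)
print(len(rows), rows[0]["stateProvince"])
```

For a downloaded file, the same reader can be applied to `open(path, encoding="utf-8", newline="")` instead of an in-memory string.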
Yes.
An important concept is that each data provider is responsible for its own data. Any modification or correction of possible errors must be made by the data provider, which then sends updates to the network. speciesLink indexes the contents of standard fields that the data provider makes freely and openly available to all interested parties.
Each data record has (or should have) the date the specimen was collected and the date it was identified, and each dataset also has the date it last sent data to the network. speciesLink does not control versions, that is, it does not store versions over time, because updating is dynamic (5 to 15 datasets per day) and the network grows at an average rate of 30 to 40 thousand new data records per month. A data indexer retrieves data every night from updated datasets, and this data is processed throughout the day.
What speciesLink offers for citation is a clear indication of the source, date, and time the records were retrieved from the network.
Darwin Core, the data model used, is fully documented and follows a common structure that has been used by biological collections for more than 200 years. New data fields are added over time as a result of advancements in science.
The Darwin Core data model, established in 2003, is used to facilitate interoperability between different data sources and has evolved over time through TDWG (Biodiversity Information Standards). Data quality is assessed through a number of tools and applications, and a report is prepared indicating suspect, inconsistent, and incomplete data to help curators identify possible errors.
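The kind of check that could feed such a report can be sketched as follows. This is not the speciesLink tooling, which is not described here. The required fields, the coordinate range test, and the name `check_record` are assumptions chosen for illustration. The field names are Darwin Core terms.

```python
# Assumed choice of fields that a complete record should carry.
REQUIRED = ("scientificName", "eventDate", "stateProvince")


def check_record(record):
    """Return a list of issues (incomplete, inconsistent, suspect) for one record."""
    issues = []
    for field in REQUIRED:
        if not (record.get(field) or "").strip():
            issues.append(f"incomplete: {field} is empty")
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat not in (None, "") and lon not in (None, ""):
        try:
            lat_f, lon_f = float(lat), float(lon)
        except ValueError:
            issues.append("inconsistent: coordinates are not numeric")
        else:
            if not (-90 <= lat_f <= 90 and -180 <= lon_f <= 180):
                issues.append("suspect: coordinates out of range")
    return issues


# Hypothetical record with a missing date and an impossible latitude.
print(check_record({
    "scientificName": "Example species",
    "eventDate": "",
    "stateProvince": "Bahia",
    "decimalLatitude": "95",
    "decimalLongitude": "-41.0",
}))
```

Returning labelled issues rather than rejecting records matches the stated aim of helping curators find possible errors, since correction remains with the data provider.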
Darwin Core
speciesLink currently uses about 15 TB of total storage, most of it for images. Our currently available disk space is about 22 TB, which we expect to be sufficient for the next three years of the project. It is important to note that storage space is only one requirement among others, such as servers, disk lifetime, disk speed, server memory, and database technologies.
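The storage figures above imply an upper bound on growth, worked out in the short calculation below. The bound is derived here from the three stated numbers and is not a figure given by the project.

```python
used_tb = 15       # from the text: about 15 TB currently in use
available_tb = 22  # from the text: about 22 TB of available disk space
years = 3          # from the text: expected to suffice for three years

headroom_tb = available_tb - used_tb
max_growth_per_year = headroom_tb / years  # implied bound, not a stated figure

print(f"Headroom: {headroom_tb} TB; "
      f"sustainable growth up to {max_growth_per_year:.2f} TB/year")
```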
All public information systems developed and maintained by CRIA, including speciesLink, are stored at the Internet Data Center (IDC) maintained by the Brazilian National Research and Educational Network (RNP) in Brasília. The systems are managed by CRIA in Campinas through a Virtual Private Network. A diagram of the architecture is shown in Fig.
As can be seen in the diagram, there is a server in Brasília and another in Campinas responsible for backup. Backups of the databases and systems are carried out daily in Brasília and stored on disk. Once a week the backups are transferred to Campinas. As an additional safety measure, the backups are written to tape every month, and once every 6 months one copy is physically stored at Embrapa Informática Agropecuária, an institution based at the State University of Campinas.
The images are stored in a SAN (storage area network) in Brasília, managed by specialized image software and backed up every day to our backup server in Campinas.
Each data provider (herbarium) is responsible for modifying, correcting, and sending data to the network. In the case of national herbaria, a software developed by CRIA named spLinker
Each herbarium maintains its own data together with the voucher and sends only a subset of this data to the network. Each herbarium must therefore be responsible for the long-term preservation of the data and of the associated voucher. All data publicly shared through speciesLink is managed and maintained by CRIA. One threat is the discontinuity of support to CRIA and to the herbaria.
The speciesLink network uses international standards that are also used by similar e-infrastructures worldwide. A user interface for searching and analyzing data is in place as are web services.
All data sent to the speciesLink network can be downloaded as xls, xlsx, and tab-delimited text files. Data is also served through web services and IPT DwC-A archives.
All on-line data is available through a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) license. Users are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) under the following terms:
speciesLink is a well-known e-infrastructure among the scientific community in Brazil, as it has been on-line for 16 years. Of the herbaria that are part of BVH, 95% are associated with graduate courses, an important community that routinely uses speciesLink. Over 400 million plant records were used on-line in 2015 and 212 million in 2016 (June 21, 11:22). This represents 41 times the data available.
To disseminate new developments and increase speciesLink’s visibility, CRIA maintains a blog and a Facebook account.
Each data provider (herbarium) is responsible for managing its own data, and all data sent to speciesLink is managed by CRIA.
Small herbaria depend largely on projects for digitization. Without project support, the entry of new data and further work on its quality may be discontinued or slow down. CRIA, which is responsible for speciesLink, also depends on projects to maintain its personnel, so a lack of support may likewise lead to discontinuity. A change of the Principal Investigator would certainly represent a loss, but the project’s coordinator has organized a steering committee to establish strategies and evaluate results, so such a change would not represent discontinuity.
The overall cost of maintaining speciesLink is about 400 thousand US dollars a year.
Sensitive or confidential data, determined as such by each herbarium, is not sent to the network. Either the records are excluded or specific data fields are marked as blocked at the origin.
All data sent to the network is necessarily open. However, if any data field of a specific record is blocked (such as geographic coordinates), users are informed that it was blocked. In this way, the system distinguishes blocked data from no data. Users are able to identify the existence of data and can request it directly from the herbarium.
All data sent to the network is shared under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) license. A non-binding memorandum of understanding, signed between each herbarium and CRIA, states the obligations and responsibilities of CRIA and the Data Provider.
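One way to represent the distinction between blocked data and no data is a field value that carries an explicit blocked flag. The sketch below is an illustration only. The class `FieldValue`, its attributes, and the display strings are hypothetical and do not reflect how speciesLink stores or presents this information.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldValue:
    """A record field that may be present, absent, or blocked at the origin."""
    value: Optional[str] = None
    blocked: bool = False

    def display(self) -> str:
        if self.blocked:
            # Data exists at the herbarium but was withheld by the provider.
            return "[blocked by data provider]"
        return self.value if self.value else "[no data]"


# Hypothetical record: coordinates blocked, locality never filled in.
record = {
    "decimalLatitude": FieldValue(blocked=True),
    "locality": FieldValue(),
    "stateProvince": FieldValue("Bahia"),
}
for name, field in record.items():
    print(name, "->", field.display())
```

Keeping the flag separate from the value lets a user see that data exists and request it from the herbarium, which an empty field alone would not reveal.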