Data Management Plan: Brazil's Virtual Herbarium

The goal of the Brazil Virtual Herbarium is to facilitate the identification of taxonomic and geographic information gaps of plants and fungi of Brazil. The system displays the status of online data for all valid species in the List of Species of the Brazilian Flora, including those without any record. The system also compares the Brazilian states where specialists indicate that the species occurs with the states that have occurrence points in Brazil's Virtual Herbarium, highlighting the gaps. This data management plan was prepared as part of a pilot project run on behalf of the International Development Research Centre (Canada) on data management policy for development funders (https://doi.org/10.3897/rio.2.e8880).
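The state-level comparison described above amounts to a set difference between the states listed by specialists and the states that have occurrence points online. The following is a minimal sketch under that reading, not the system's actual implementation; the function name, the result keys, and the example state codes are hypothetical.

```python
def state_gaps(expected_states, observed_states):
    """Compare states listed by specialists with states that have occurrence points."""
    expected = set(expected_states)
    observed = set(observed_states)
    return {
        # Listed by specialists, but no occurrence point online (a geographic gap).
        "missing_records": sorted(expected - observed),
        # Occurrence points online, but the state is not listed by specialists.
        "unlisted_states": sorted(observed - expected),
        # Both sources agree.
        "confirmed": sorted(expected & observed),
    }


# Hypothetical example data, chosen for illustration only.
gaps = state_gaps(["SP", "MG", "RJ"], ["SP", "BA"])
```

A species with no record at all would simply have an empty set of observed states, so every expected state is reported as a gap.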

observations and not collection of specimens. Herbaria from abroad are contributing data on samples collected in Brazil.

Each textual data record may include:
• what was collected/observed (species name and who identified the specimen, date identified);
• who collected/observed (collector name and number);
• when (date: DD/MM/YYYY);
• where a specimen was collected or observed (country, state, county, geographic coordinates, precision, description of locality);
• collection code and number;
• whether it is a type specimen;
• observations (such as barcode).
Data records may be incomplete, as one of the aims of BVH is to help herbaria improve data quality.
Images of the specimen (voucher or live) associated with the textual record may also be available as a separate file.
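As an illustration of the record content listed above, the sketch below models one textual record and reports which fields are still empty. The mapping of each item to a Darwin Core term name is an assumption made here for illustration; the plan lists only the kinds of information, not the exact field names used by the network. The example values are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class OccurrenceRecord:
    # What was collected/observed
    scientificName: Optional[str] = None
    identifiedBy: Optional[str] = None
    dateIdentified: Optional[str] = None  # DD/MM/YYYY
    # Who collected/observed
    recordedBy: Optional[str] = None
    recordNumber: Optional[str] = None
    # When
    eventDate: Optional[str] = None  # DD/MM/YYYY
    # Where
    country: Optional[str] = None
    stateProvince: Optional[str] = None
    county: Optional[str] = None
    decimalLatitude: Optional[float] = None
    decimalLongitude: Optional[float] = None
    coordinatePrecision: Optional[float] = None
    locality: Optional[str] = None
    # Collection code and number, type status, observations
    collectionCode: Optional[str] = None
    catalogNumber: Optional[str] = None
    typeStatus: Optional[str] = None
    occurrenceRemarks: Optional[str] = None


def missing_fields(record: OccurrenceRecord) -> List[str]:
    """Return the names of the fields that have no value yet."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Hypothetical, deliberately incomplete record.
record = OccurrenceRecord(
    scientificName="Genus species",
    recordedBy="A. Collector",
    eventDate="15/03/2010",
    country="Brazil",
    stateProvince="SP",
)
missing = missing_fields(record)
```

Making every field optional reflects the statement that records may be incomplete; a completeness check then becomes a report for the curator rather than a reason to reject the record.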

What file formats will your data be collected in?
The data model used is Darwin Core. Standard image formats include TIFF, JPEG, and PNG.
Will these formats allow for data re-use, sharing and long-term access to the data?
Yes.
What conventions and procedures will you use to structure, name and version-control your files to help you and others better understand how your data are organized?
An important concept is that each data provider is responsible for its own data. Any modification or correction of possible errors must be made by the data provider, who then sends updates to the network. speciesLink indexes the contents of standard fields that the data provider makes freely and openly available to all interested parties.
Each data record has (or should have) the date the specimen was collected and the date it was identified, and each dataset also has the date it last sent data to the network. speciesLink does not control versions, that is, it does not store versions over time, because updating is dynamic (5 to 15 datasets per day) and the network grows at an average rate of 30 to 40 thousand new data records per month. A data indexer retrieves data from updated datasets every night, and this data is processed throughout the day.
What speciesLink offers under citation is a clear indication of the source, date, and time the records were retrieved from the network.
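Because no versions are stored, a citation has to carry the retrieval context itself. The sketch below shows one way such a string could be assembled from the source, date, and time; the function name and the output layout are assumptions for illustration and do not reproduce the citation format actually generated by speciesLink. The provider names are hypothetical.

```python
from datetime import datetime


def format_citation(network, providers, retrieved_at):
    """Build a citation string stating source, date, and time of retrieval."""
    sources = "; ".join(sorted(providers))
    day = retrieved_at.strftime("%Y-%m-%d")
    time = retrieved_at.strftime("%H:%M")
    return f"{network} ({sources}), retrieved on {day} at {time}"


# Hypothetical providers and retrieval moment.
text = format_citation(
    "speciesLink",
    ["Herbarium B", "Herbarium A"],
    datetime(2016, 1, 15, 9, 30),
)
```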

Documentation and Metadata
What documentation will be needed for the data to be read and interpreted correctly in the future?
Darwin Core, the data model used, is fully documented and follows a common structure that has been used by biological collections for more than 200 years. New data fields are added over time as a result of advancements in science.

Storage and Backup
What are the anticipated storage requirements for your project, in terms of storage space (in megabytes, gigabytes, terabytes, etc.) and the length of time you will be storing it?
speciesLink currently uses about 15 TB of total storage, most of it for images. Our currently available disk space is about 22 TB, which we expect to be sufficient for the next three years of the project. It is important to note that storage space is only one requirement among others, such as servers, disk lifetime, disk speed, server memory, and database technologies.
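The figures above imply a simple headroom calculation. The sketch below works it through under the assumption of constant monthly growth; the plan does not state a storage growth rate, so the rate used in the example call is hypothetical.

```python
USED_TB = 15.0        # current usage stated in the plan
AVAILABLE_TB = 22.0   # available disk space stated in the plan
HORIZON_MONTHS = 36   # "the next three years"

# Space left before the disks are full.
headroom_tb = AVAILABLE_TB - USED_TB

# Largest average growth that still fits within the three-year horizon.
max_growth_tb_per_month = headroom_tb / HORIZON_MONTHS


def months_until_full(used_tb, available_tb, growth_tb_per_month):
    """Months of capacity left, assuming constant growth (an assumption)."""
    if growth_tb_per_month <= 0:
        raise ValueError("growth must be positive")
    return (available_tb - used_tb) / growth_tb_per_month


# Hypothetical growth rate of 0.25 TB per month.
remaining = months_until_full(USED_TB, AVAILABLE_TB, 0.25)
```

With 7 TB of headroom, the three-year estimate holds only if average growth stays below roughly 0.19 TB per month.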
How and where will your data be stored and backed up during your research project?
As can be seen in the diagram, there is a server in Brasília and another in Campinas responsible for backup. Backups of the databases and systems are carried out daily in Brasília and stored on disk. Once a week the backups are transferred to Campinas. As an additional safety measure, the backups are copied to tape every month, and once every 6 months one copy is physically stored at Embrapa Informática Agropecuária, an institution based at the State University of Campinas. The images are stored in a SAN (storage area network) in Brasília, managed by specialized image software, and backed up every day to our backup server in Campinas.
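The backup routine described above has four tiers: daily, weekly, monthly, and six-monthly. The sketch below expresses that schedule as a function of the calendar date. The plan gives only the frequencies, so the specific triggers used here (Sunday for the weekly transfer, the first day of the month for tape, January and July for the off-site copy) are assumptions made for illustration.

```python
from datetime import date
from typing import List


def backup_tasks(day: date) -> List[str]:
    """List the backup tasks due on a given day under the assumed triggers."""
    tasks = [
        "daily database and system backup to disk in Brasília",
        "daily image backup from the SAN to Campinas",
    ]
    if day.weekday() == 6:  # Sunday (assumed day for the weekly transfer)
        tasks.append("weekly transfer of backups to Campinas")
    if day.day == 1:  # first of the month (assumed day for tape)
        tasks.append("monthly copy of backups to tape")
        if day.month in (1, 7):  # assumed months for the off-site copy
            tasks.append("six-monthly tape copy stored off-site")
    return tasks


# 1 January 2023 was a Sunday, so all five tasks fall on that day.
new_year = backup_tasks(date(2023, 1, 1))
ordinary_day = backup_tasks(date(2023, 1, 2))
```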
How will the research team and other collaborators access, modify, and contribute data throughout the project?
Each data provider (herbarium) is responsible for modifying, correcting, and sending data to the network. In the case of national herbaria, software developed by CRIA named spLinker CRIA 2009 is installed; it maps data fields (in accordance with Darwin Core) and enables updates that are sent to regional servers. An indexer goes out every night to look for updates and sends the data to the network manager and to the central repository. Most international collections use IPT (GBIF's Integrated Publishing Toolkit), but CRIA adapts to the system used by the data providers (Fig. 2).

Preservation
Where will you deposit your data for long-term preservation and access at the end of your research project?
Each herbarium maintains its own data together with the voucher and sends only a subset of this data to the network. Each herbarium must therefore be responsible for the long-term preservation of the data and of the associated voucher. All data publicly shared through speciesLink is managed and maintained by CRIA. A threat is the discontinuity of support to CRIA and to the herbaria. The speciesLink network uses international standards that are also used by similar e-infrastructures worldwide. A user interface for searching and analyzing data is in place, as are web services.
All public information systems developed and maintained by CRIA, including speciesLink, are stored at the Internet Data Center (IDC) maintained by the Brazilian National Research and Educational Network (RNP) in Brasília. The systems are managed by CRIA in Campinas through a Virtual Private Network. A diagram of the architecture is shown in Fig. 1, highlighting the backup servers at both sites, CRIA and IDC.

Figure 1. System architecture of the BVH and supporting systems.

Figure 2. Information and data flows in the BVH system.

How will you make sure that documentation is created or captured consistently throughout your project?
The Darwin Core data model, established in 2003, is used to facilitate interoperability between different data sources and has evolved over time through TDWG (Biodiversity Information Standards). Data quality is assessed through a number of tools and applications, and a report is prepared indicating suspect, inconsistent, and incomplete data to help curators identify possible errors.
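A report of this kind can be thought of as a set of checks that sort problems into the three categories named above. The sketch below is a minimal illustration and not the tools actually used by the network: the choice of required fields, the three example checks, and the Darwin Core term names used as dictionary keys are all assumptions, and the example record is hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"  # DD/MM/YYYY, as used in the data records


def parse_date(value):
    """Return a datetime, or None if the value is missing or malformed."""
    try:
        return datetime.strptime(value, DATE_FORMAT)
    except (TypeError, ValueError):
        return None


def assess(record, required=("scientificName", "eventDate", "stateProvince")):
    """Classify problems in one record as incomplete, inconsistent, or suspect."""
    issues = {"incomplete": [], "inconsistent": [], "suspect": []}

    # Incomplete: an assumed required field has no value.
    for name in required:
        if not record.get(name):
            issues["incomplete"].append(name)

    # Inconsistent: the specimen was identified before it was collected.
    collected = parse_date(record.get("eventDate"))
    identified = parse_date(record.get("dateIdentified"))
    if collected and identified and identified < collected:
        issues["inconsistent"].append("identified before collected")

    # Suspect: coordinates outside the valid latitude/longitude range.
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat is not None and lon is not None:
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            issues["suspect"].append("coordinates out of range")

    return issues


# Hypothetical record with one problem of each kind.
report = assess({
    "scientificName": "Genus species",
    "eventDate": "10/05/2012",
    "dateIdentified": "01/01/2010",
    "decimalLatitude": -95.0,
    "decimalLongitude": -47.0,
})
```

The function only reports problems and never alters the record, which matches the principle that corrections are made by the data provider.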