Research Ideas and Outcomes: Guidelines
Corresponding author: Lyubomir Penev (penev@pensoft.net)
Received: 26 Feb 2017 | Published: 28 Feb 2017
© 2017 Lyubomir Penev, Daniel Mietchen, Vishwas Chavan, Gregor Hagedorn, Vincent Smith, David Shotton, Éamonn Ó Tuama, Viktor Senderov, Teodor Georgiev, Pavel Stoev, Quentin Groom, David Remsen, Scott Edmunds
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Penev L, Mietchen D, Chavan V, Hagedorn G, Smith V, Shotton D, Ó Tuama É, Senderov V, Georgiev T, Stoev P, Groom Q, Remsen D, Edmunds S (2017) Strategies and guidelines for scholarly publishing of biodiversity data. Research Ideas and Outcomes 3: e12431. https://doi.org/10.3897/rio.3.e12431
The present paper describes policies and guidelines for scholarly publishing of biodiversity and biodiversity-related data, elaborated and updated during the EU Framework Programme 7 (FP7) project EU BON, on the basis of an earlier version published on Pensoft's website in 2011. The document discusses some general concepts, including a definition of datasets, incentives to publish data, and licenses for data publishing. Further, it defines and compares several routes for data publishing, namely (1) as supplementary files to research articles, which may be made available directly by the publisher; (2) as data deposited in a specialized open data repository and linked from the research article; (3) as a data paper, i.e., a specific, stand-alone publication describing a particular dataset or a collection of datasets; or (4) as integrated narrative and data publishing, through online import/download of data into/from manuscripts, as provided by the Biodiversity Data Journal.
The paper also contains detailed instructions on how to prepare and peer review data intended for publication, listed under the Guidelines for Authors and Reviewers, respectively. Special attention is given to existing standards, protocols and tools to facilitate data publishing, such as the Integrated Publishing Toolkit of the Global Biodiversity Information Facility (GBIF IPT) and the Darwin Core Archive (DwC-A).
A separate section describes the leading data hosting/indexing infrastructures and repositories for biodiversity and ecological data.
Keywords: biodiversity data publishing, data publishing licenses, Darwin Core, Darwin Core Archive, data re-use, data repository
Data publishing in this digital age is the act of making data available on the Internet, so that they can be downloaded, analysed, re-used and cited by people and organisations other than the creators of the data.
The present guidelines are based on an earlier version published in PDF on Pensoft's website in 2011.
The FORCE11 group, dedicated to facilitating change in knowledge creation and sharing and recognising that data should be valued as publishable and citable products of research, has developed a set of principles for publishing and citing such data. The FAIR Data Publishing Group formulated the four FAIR principles of data publishing: data and their accompanying metadata should be Findable, Accessible, Interoperable and Reusable.
A key outcome of FORCE11 is the Joint Declaration of Data Citation Principles.
The Research Data Alliance (RDA) promotes the open sharing of data by building upon the underlying social and technical infrastructure. Established in 2013 by the European Union, the National Science Foundation and the National Institute of Standards and Technology (USA) as well as the Department of Innovation (Australia), it has grown to include some 4,200 members from 110 countries who collaborate through Work and Interest Groups "to develop and adopt infrastructure that promotes data-sharing and data-driven research, and accelerate the growth of a cohesive data community that integrates contributors across domain, research, national, geographical and generational boundaries".
One RDA output, the Scholix Initiative, developed under the RDA/WDS (ICSU World Data System) Publishing Data Services Working Group, is of particular relevance, as it seeks to develop an interoperability framework for exchanging information about the links between scholarly literature and data, i.e., what data underpin literature and what literature references data.
Within RDA, a Biodiversity Data Integration Interest Group has been established, which aims to "increase the effectiveness of biodiversity e-Infrastructures by promoting the adoption of common tools and services establishing data interoperability within the biodiversity domain, enabling the convergence on shared terminology and routines for assembling and integrating biodiversity data."
With regard to biodiversity, several recently published papers emphasise the importance of publishing biodiversity data.
The EU BON project (Building the European Biodiversity Observation Network, grant agreement ENV30845), funded by the European Union's Framework Programme Seven (FP7), was launched to contribute towards the achievement of these challenging tasks within a much wider global initiative, the Group on Earth Observations Biodiversity Observation Network (GEO BON), which itself is a part of the Global Earth Observation System of Systems (GEOSS). A key feature of EU BON is the delivery of near-real-time data, both from on-ground observation and remote sensing, to the various stakeholders, so as to enable greater interoperability of different data layers and systems and to provide access to improved analytical tools and services. Furthermore, EU BON supports biodiversity science-policy interfaces and facilitates political decisions for sound environmental management.
The present paper outlines the strategies and guidelines needed to support the scholarly publishing and dissemination of biodiversity data, that is, publishing through academic journal networks.
A dataset is understood here as a digital collection of logically connected facts (observations, descriptions or measurements), typically structured in tabular form as a set of records, with each record comprising a set of fields, and recorded in one or more computer data files that together comprise a data package. Certain types of research datasets, e.g., a video recording of animal behaviour, will not be in tabular form, although analyses of such recordings may be. Within the domain of biodiversity, a dataset can be any discrete collection of data underlying a paper – e.g., a list of all species occurrences published in the paper, data tables from which a graph or map is produced, digital images or videos that are the basis for conclusions, an appendix with morphological measurements, or ecological observations.
More generally, with the development of XML-based publishing technologies, the research and publishing communities are coming to a much wider definition of data, proposed in the BioMed Central (BMC) position statement on open data: "the raw, non-copyrightable facts provided in an article or its associated additional files, which are potentially available for harvesting and re-use".
As these examples illustrate, while the term "dataset" is convenient and widely used, its definition is vague. Data repositories such as Dryad, wishing for precision, do not use the term "dataset". Instead, they describe data packages to which metadata and unique identifiers are assigned. Each data package comprises one or more related data files, these being data-containing digital files in defined formats, to which unique identifiers and metadata are also assigned. Nevertheless, the term "dataset" is used below, except where a more specific distinction is required.
For practical reasons, we propose a clear distinction between static data, which represent specific completed compilations of data upon which the analyses and conclusions of a given scientific paper may be based, and curated data, which belong to a large data collection (usually called a "database") with ongoing goals and curation, for example the various bioinformatics databases that curate ever-growing amounts of nucleotide sequences.
Curated data, on the other hand, are usually hosted on external servers or in data hosting centres. A primary goal of the data publishing process in this case is to guarantee that these data are properly described, up to date, available to others under appropriate licensing schemes, peer-reviewed, interoperable and, where appropriate, linked from a research article or a data paper at the time of publication. Especially in cases where the long-term viability of the curated project may be insecure (e.g., in the case of grant-funded projects), a static snapshot of the data should additionally be archived in a trusted repository at the time of publication.
Data publishing has become increasingly important and already affects the policies of the world's leading science funding frameworks and organizations — see for example the NSF Data Management Plan Requirements, the data management policies of the National Institutes of Health (NIH), Wellcome Trust, or the Riding the Wave (How Europe Can Gain From the Rising Tide of Scientific Data) report submitted to the European Commission in October 2010. More generally, the concept of "open data" is described in the Protocol for Implementing Open Access Data, the Open Knowledge/Data Definition, the Panton Principles for Open Data in Science, and the Open Data Manual. There are several incentives for authors and institutions to publish data.
There are four main routes for scholarly publication of data, as outlined above, most of which are available with various journals and publishers.
Within these main data publishing modes, Pensoft developed a specific set of applications designed to meet the needs of the biodiversity community. Most of these were implemented in the Biodiversity Data Journal and its associated ARPHA Writing Tool (AWT).
Best practice recommendations
This section originates from a draft set of Data Citation Best Practice Guidelines developed for publication by David Shotton, with assistance from colleagues at Dryad and elsewhere, and from earlier papers concerning data citation mechanisms.
The well-established norm for citing genetic data, for example, is simply to cite the GenBank identifier (accession number) in the text. Similar usage is also commonplace for items in other bioinformatics databases. The latest developments in the implementation of the data citation principles, however, strongly recommend that references to data be included in reference lists, in the same way as literature references.
For such data in data repositories, each published data package and each published data file should always be associated with a persistent unique identifier. A Digital Object Identifier (DOI) issued by DataCite or CrossRef should be used wherever possible. If this is not possible, the identifier should be one issued by the data repository or database, and should be in the form of a persistent and resolvable URL. As an example, the use of DOIs in the Dryad Data Repository is explained on the Dryad wiki.
Data citations may relate either to the author's own data, or to data created and published by others ("third-party data"). In the former case, the dataset may have been previously published, or may be published for the first time in association with the article that is now citing it. All these types of data should, for consistency, be cited in the same manner.
Best practice recommendations
As is the norm when citing another research article, any citation of a data publication, including a citation of one's own data, should always have two components: an in-text citation statement, and a formal data reference in the article's reference list.
We recommend that the in-text citation statement also contains a separate citation of the research article in which the data were first described, if such an article exists, with its own in-text reference pointer to a formal article reference in the paper's reference list, unless the paper being authored is the one providing that first description of the data. If the in-text citation statement includes the DOI for the data (a strongly desirable practice), this DOI should always be presented as a dereferenceable URI, as shown below. Further to this, both DataCite and CrossRef recommend displaying DOIs within references as full URLs, which serve a similar function as a journal volume, issue and page number do for a printed article, and also give the combined advantages of linked access and the assurance of persistence.
For example, Dryad recommends always citing both the article in association with which the data were published and the data themselves.
The data reference in the article's reference list should contain the minimal components recommended by the FORCE11 Data Citation Synthesis Group.
These components should be presented in whatever format and punctuation style the journal specifies for its references.
The following example demonstrates in general terms what is required.
In-text citation:
“This paper uses data from the [name] data repository at https://doi.org/***** (Jones et al. 2008a), first described in Jones et al. 2008b.”
Data reference and article reference in reference list:
Jones A, Bloggs B, Smith C (2008a). <Title of data package>. <Repository name>. https://doi.org/#####. [Version and/or date of access].
Jones A, Saul D, Smith C (2008b). <Title of journal article>. <Journal> <Volume>: <Pages>. https://doi.org/#####.
Note that the authorship and the title of the data package may, for valid academic reasons, differ from those of the author's paper describing the data: indeed, to avoid confusion of what is being referenced, it is highly desirable that the titles of the data package and of the associated journal article are clearly different.
Requirements for data citation in Pensoft's journals
1. When referring to the author's own newly published data, cited from within the paper in which these data are first described, the citation statement and the data reference should take the following form:
The following example demonstrates what is required.
In-text citation:
“The data underpinning the analysis reported in this paper were deposited in the Dryad Data Repository at https://doi.org/10.5061/dryad.t63mn (Macías-Hernández et al. 2016).”
AND/OR
"The data underpinning the analysis reported in this paper were deposited in the Global Biodiversity Information Facility (GBIF) at http://ipt.pensoft.net/resource?r=montenegrina&v=1.5 (the URI should be used as identifier only in cases when DOI is not available) (
Data reference in reference list:
Macías-Hernández N, de la Cruz López S, Roca-Cusachs M, Oromí P, Arnedo MA (2016) Data from: A geographical distribution database of the genus Dysdera in the Canary Islands (Araneae, Dysderidae). Dryad Digital Repository. https://doi.org/10.5061/dryad.t63mn [Version and/or date of access].
AND/OR
Feher Z, Szekeres M (2016) Geographic distribution of the rock-dwelling door-snail genus Montenegrina Boettger, 1877 (Mollusca, Gastropoda, Clausiliidae). v1.5. ZooKeys. Dataset/Occurrence, deposited in GBIF. https://doi.org/10.15468/###### OR http://ipt.pensoft.net/resource?r=montenegrina&v=1.5 (the latter to be used when a DOI is not available). [Version and/or date of access].
2. When acknowledging re-use in the paper of previously published data (including the author's own data) that is associated with another published journal article, the citation and reference should take the same form, except that the full correct DOI should be employed, and that the journal article first describing the data should also be cited:
The following example demonstrates what is required.
In-text citation:
“The data underpinning this analysis were obtained from the Dryad Data Repository at https://doi.org/10.5061/dryad.t63mn (Macías-Hernández et al. 2016b), first described in Macías-Hernández et al. (2016a).”
Data reference and article reference in reference list:
Macías-Hernández N, de la Cruz López S, Roca-Cusachs M, Oromí P, Arnedo MA (2016a) A geographical distribution database of the genus Dysdera in the Canary Islands (Araneae, Dysderidae). ZooKeys 625: 11-23. https://doi.org/10.3897/zookeys.625.9847.
Macías-Hernández N, de la Cruz López S, Roca-Cusachs M, Oromí P, Arnedo MA (2016b) Data from: A geographical distribution database of the genus Dysdera in the Canary Islands (Araneae, Dysderidae). Dryad Digital Repository. https://doi.org/10.5061/dryad.t63mn [Version and/or date of access].
3. When acknowledging re-use of previously published data (including the author's own data) that has NO association with a published research article, the same general format should be adopted, although a reference to a related journal article clearly cannot be included:
The following real example demonstrates what is required.
In-text citation:
“The present paper used data deposited by the Zoological Institute of the Russian Academy of Sciences in the Global Biodiversity Information Facility (GBIF) at https://doi.org/10.15468/c3eork (Volkobitsh et al. 2017).”
Data reference in reference list:
Volkobitsh M, Glikov A, Khalikov R (2017) Catalogue of the type specimens of Polycestinae (Coleoptera: Buprestidae) from research collections of the Zoological Institute, Russian Academy of Sciences. Zoological Institute, Russian Academy of Sciences, St. Petersburg, deposited in GBIF. https://doi.org/10.15468/c3eork. [Version and/or date of access].
One of the basic postulates of the Panton Principles is that data publishers should define clearly the license or waiver under which the data are published, so that re-use rights are clear to potential users. They recommend use of the most liberal licenses, or of public domain waivers, to prevent legal and operational barriers for data sharing and integration. For clarity, we list here the short version of the Panton Principles:
1. When publishing data, make an explicit and robust statement of your wishes regarding re-use.
2. Use a recognised waiver or license that is appropriate for data.
3. If you want your data to be effectively used and added to by others, they should be open as defined by the Open Knowledge/Data Definition; in particular, non-commercial and other restrictive clauses should not be used.
4. Explicit dedication of data underlying published science to the public domain, e.g., via PDDL or CC0, is strongly recommended.
A domain-specific implementation of the open access principles for biodiversity data was elaborated during the EU project pro-iBiosphere and resulted in the widely endorsed Bouchout Declaration for Open Biodiversity Knowledge Management. Further, the EU project EU BON analysed the current copyright legislation and data policies in various European countries and elaborated a set of best practice recommendations.
In practice, a variety of waivers and licenses exist that are specifically designed for, and appropriate for, the treatment of data, as listed in the table below.
Data publishing licenses recommended by Pensoft.
Data publishing license | URL |
Open Data Commons Attribution License | http://www.opendatacommons.org/licenses/by/1.0/ |
Creative Commons CC-Zero Waiver | http://creativecommons.org/publicdomain/zero/1.0/ |
Open Data Commons Public Domain Dedication and License | http://www.opendatacommons.org/licenses/pddl/1-0/ |
The default data publishing license used by Pensoft is the Open Data Commons Attribution License (ODC-By), which is a license agreement intended to allow users to freely share, modify, and use the published data(base), provided that the data creators are attributed (cited or acknowledged).
As alternatives, two other instruments, namely the Creative Commons CC0 waiver (also cited as “CC-Zero” or “CC-zero”) and the Open Data Commons Public Domain Dedication and Licence (PDDL), are also STRONGLY encouraged for use in Pensoft journals. According to the CC0 waiver, "the person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighbouring rights, to the extent allowed by law. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission."
Publication of data under a waiver such as CC0 avoids potential problems of "attribution stacking" when data from several (or possibly many) sources are aggregated, remixed or otherwise re-used, particularly if this re-use is undertaken automatically. In such cases, while there is no legal requirement to provide attribution to the data creators, the norms of academic citation best practice for fair use still apply, and those who re-use the data should reference the data source, as they would reference others' research articles.
The Open Data Commons Open Database License (ODbL), an attribution and share-alike license, is NOT recommended for use in Pensoft's journals, because it is very difficult to comply with the share-alike requirement in scholarly publishing. Nonetheless, it may be used as an exception in particular cases.
Many widely recognized open access licenses are intended for text-based publications to which copyright applies; they are not intended for, and are not appropriate for, data or collections of data that do not carry copyright. Creative Commons licenses apart from the CC0 waiver (e.g., CC-BY, CC-BY-NC, CC-BY-NC-SA, CC-BY-SA), as well as the GFDL, GPL, BSD and similar licenses widely used for open source software, are NOT appropriate for data, and their use for data associated with Pensoft journal articles is strongly discouraged.
Authors should explicitly inform the publisher if they want to publish data associated with a Pensoft journal article under a license that is different from the Open Data Commons Attribution License (ODC-By), Creative Commons CC0, or Open Data Commons Public Domain Dedication and Licence (PDDL).
Any set of data published by Pensoft, or associated with a journal article published by Pensoft, must always clearly state its licensing terms in both a human-readable and a machine-readable manner.
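As an illustration, the sketch below (all values hypothetical) shows one way to satisfy both requirements within EML-based metadata, the format used by the GBIF IPT described later in these guidelines: the license is named in free text for human readers, while the canonical license URL makes the statement machine-resolvable.

```xml
<!-- Hypothetical EML fragment: the license is stated as human-readable
     text, while the canonical license URL makes it machine-resolvable. -->
<intellectualRights>
  <para>
    This work is licensed under the Open Data Commons Attribution
    License (ODC-By) 1.0:
    <ulink url="http://www.opendatacommons.org/licenses/by/1.0/">
      <citetitle>ODC-By 1.0</citetitle>
    </ulink>
  </para>
</intellectualRights>
```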
Where data are published by a public data repository under a particular license, and subsequently associated with a Pensoft research article or data paper, Pensoft journals will accept that repository license as the default for the published datasets.
Images, videos and similar "artistic works" are usually covered by copyright "automatically", unless specifically placed in the public domain by use of a public domain waiver such as CC0. Where copyright is retained by the creator, such multimedia entities can still be published under an open data attribution license, while their metadata can be published under a CC0 waiver.
Databases can contain a wide variety of types of content (images, audiovisual material, and sounds, for example, as well as tabular data, which might all be in the same database), and each may have a different license, which must be separately specified in the content metadata. Databases may also automatically accrue their own rights, such as the European Union Database Right, although no equivalent database right exists in the USA. In addition, the contents of a database, or the database itself, can be covered by other rights not addressed here (such as private contracts, trademark over the name, or privacy rights / data protection rights over information in the contents). Thus, authors are advised to be aware of potential problems for data re-use from databases, and to clear other rights before engaging in activities not covered by the respective license.
Open data repositories (public databases, data warehouses, data hosting centres) are subject- or institution-oriented infrastructures, usually based at large national or international institutions. They provide data storage and preservation according to widely accepted standards, and free access to their data holdings for anyone to use and re-use under the minimum requirement of attribution, or under an open data waiver such as CC0. We do NOT include here, and do NOT recommend, repositories that provide data only upon permission or through other methods of human-mediated registration.
Depositing data in internationally recognised repositories brings several advantages.
There are several directories of data repositories relevant to biodiversity and ecological data, such as re3data or those listed in the Open Access Directory.
A very useful resource that brings together information on journal data policies, repositories and standards, grouped by domain, type of data and organisation, is BioSharing.
Such repositories can be used to host data associated with a published data paper, as explained below. For their own data, authors are advised to use an internationally recognised, trusted (normally ISO-certified), specialized repository (see the following sections).
There are several aggregators and registries of taxonomic data, which differ in their content, policies and methods of data submission.
The Global Biodiversity Information Facility (GBIF) was established in 2001 and is now the world's largest multilateral initiative for enabling free and open access to biodiversity data via the Internet. It comprises a network of 54 countries and 39 international organisations that contribute to its vision of "a world in which biodiversity information is freely and universally available for science, society, and a sustainable future". It seeks to fulfil this mission by promoting an international data infrastructure through which institutions can publish data according to common standards, thus enabling research that had not been possible before. The GBIF network facilitates access to over 704 million species occurrences in 30,894 datasets sourced from 867 data-publishing institutions (as of January 2017).
GBIF is not a repository in the strict sense, but a distributed network of data publishers and local data hosting centres that publish data based on community-agreed standards for the exchange/sharing of primary biodiversity data. At a global scale, discovery of and access to data are facilitated through the GBIF data portal. Pensoft facilitates the publishing of data and metadata to the GBIF network through Pensoft’s IPT Data Hosting Center, which is based on the GBIF Integrated Publishing Toolkit (IPT).
The Darwin Core Archive (DwC-A; see also http://rs.tdwg.org/dwc/terms/guides/text/index.htm) is a self-contained package for exchanging biodiversity data: one or more text files of data records are accompanied by a descriptor file (meta.xml), which documents how the files and their columns map to Darwin Core terms, and by an EML metadata document. The format is defined in the Darwin Core Text Guidelines. Darwin Core is no longer restricted to occurrence data; together with the more generic Dublin Core metadata standard (on which it is modelled), it is used by GBIF and others to encode data about organism names, taxonomies, species information and, more recently, sample data (i.e., data from ecological/environmental investigations that are typically quantitative and adhere to standardised protocols, so that changes and trends in populations can be detected).
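To make the structure concrete, here is a minimal, hypothetical meta.xml descriptor (a sketch following the Darwin Core Text Guidelines, not a complete production example) for an archive whose single occurrence.txt file has an identifier column followed by four Darwin Core terms:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal, hypothetical DwC-A descriptor: maps the columns of
     occurrence.txt to Darwin Core terms; eml.xml holds the metadata. -->
<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence"
        encoding="UTF-8" fieldsTerminatedBy="\t"
        linesTerminatedBy="\n" ignoreHeaderLines="1">
    <files>
      <location>occurrence.txt</location>
    </files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
  </core>
</archive>
```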
GBIF has produced a series of documents and supporting tools that focus primarily on data publishing using the Darwin Core standard. Guides are available for publishing occurrence data, taxon checklists, sample-based data and resource metadata.
Besides the GBIF Integrated Publishing Toolkit, two additional tools have been developed for producing Darwin Core Archives.
Darwin Core Archive (DwC-A) files can be used to publish the data underlying any taxonomic revision or checklist, either through the GBIF IPT or as supplementary files (see below).
As of version 2.2, the GBIF IPT incorporates use of DOIs allowing data publishers to automatically connect with either DataCite or EZID for DOI assignment. GBIF will issue DOIs for all newly published datasets where absent while recognizing and displaying publisher-assigned DOIs for existing datasets. The GBIF IPT now also requires publishers to select one of three standardised machine-readable data waivers or licenses (CC0, CC-BY, CC-BY-NC) for their data to clarify the conditions for re-use.
Images can be deposited in generic repositories, such as Zenodo, figshare or Flickr. There are also specialized repositories for biodiversity images, such as Morphbank.
There are relatively few repositories dealing with phylogenetic data; among these, we recommend TreeBASE (see also the section on genomic data below).
Pensoft journals collaborate with four repositories for genomic data, albeit with the assumption that, no matter where gene sequence data are deposited, they should ultimately also be submitted to GenBank. Data and metadata formatting should comply with the Genomic Standards Consortium (GSC) sample metadata guidelines, allowing data interoperability across the wider genomics community. Inclusion of the hyperlinked accession numbers in the article is a prerequisite for publication in Pensoft journals. The most important repositories for genomic data include GenBank and the other databases of the International Nucleotide Sequence Database Collaboration (INSDC): the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ).
Best practice recommendations for biodiversity genomic data
Metabolomics
Metabolomics data should be deposited in any of the member databases of the MetabolomeXchange data aggregation and notification consortium. Partners include, for example, the EMBL-EBI MetaboLights repository and the NIH Metabolomics Workbench, which are data archives for metabolomics experiments and derived information.
Proteomics
Proteomics data should be deposited with any of the members of the ProteomeXchange consortium, following the MIAPE (Minimum Information About a Proteomics Experiment) guidelines. The founding members of ProteomeXchange are PRIDE, the PRoteomics IDEntifications database at EMBL-EBI, and PeptideAtlas at the Institute for Systems Biology in Seattle, USA. The other two repositories in ProteomeXchange are MassIVE and jPOST.
Dryad Data Repository
Pensoft encourages authors to deposit data underlying biological research articles in the Dryad Data Repository in cases where no suitable, more specialized public data repository (e.g., GBIF for species-by-occurrence data and taxon checklists, or GenBank for genome data) exists. Dryad is particularly suitable for depositing data packages consisting of different types of data, for example datasets of species occurrences, environmental measurements and others.
Pensoft supports Dryad and its goal of enabling authors to publicly archive sufficient data to support the findings described in their journal articles. Dryad is a safe, sustainable location for data storage, and there are no restrictions on data format. Note that data deposited in Dryad are made available for re-use through the Creative Commons CC0 waiver, detailed above.
Data deposition in Dryad is subject to a small charge, which authors or their institutions should arrange directly with Dryad.
Data can be deposited with Dryad before or at the time of manuscript submission to the journal, or after manuscript acceptance but before submission of the final, ready-for-layout version for publication. Nonetheless, authors should always aim to deposit data before submission of the manuscript, so that the data can be linked both from and to the manuscript and made freely available for peer review.
Data deposition at Dryad is integrated with the workflow in Pensoft's ARPHA Journal Publishing System. The acceptance letters automatically emailed by Pensoft's journals on the day of acceptance of a manuscript contain instructions on how to upload data underpinning the article to Dryad, if desired by the authors (see this blog post for details).
Once you deposit your data package, it receives a unique and stable identifier, namely a DataCite DOI. Individual data files within this package are given their own DOIs, based on the package DOI, as are subsequent versions of these data files, as explained under DOI usage on the Dryad wiki. You should include the appropriate Dryad DOIs in the final text of the manuscript, both in the in-text citation statement in the Data Resources section and in the formal data reference in your paper's reference list, as explained and exemplified above. This is very important: if the data DOI does not appear in the final published article, the article's connection to the underlying data is greatly weakened.
More information about depositing data in Dryad can be found at http://www.datadryad.org/repo/depositing.
You may wish to take a look at some example data packages in Dryad to see how data packages related to published articles are displayed, such as https://doi.org/10.5061/dryad.7994 and https://doi.org/10.5061/dryad.8682.
Data deposited in Dryad in association with Pensoft journal articles will be made public immediately upon publication of the article.
Zenodo
Zenodo is a research data repository launched in 2013 by the EU-funded OpenAIRE project and CERN to provide a place for researchers to deposit datasets of up to 50 GB in any subject area. Zenodo's code is open source and built on the Invenio digital library framework, which is also open source. The work in progress, open issues and roadmap are shared openly on GitHub, and contributions to any aspect are welcomed from anyone. All metadata are openly available under a CC0 waiver, and all open content is accessible through open APIs.
Zenodo assigns a DataCite DOI to each stored research object, or uses the original DOIs of the articles or research objects, if available. Scientists may use Zenodo to store any kind of data that can thereafter be linked to and cited in research articles.
The repository allows non-open-access materials to be uploaded but not displayed in public; their metadata, however, are freely available under the CC0 waiver.
The Biodiversity Heritage Library (BHL) is a searchable archive of scanned public domain books and journals. Originally, BHL focused mostly on the historical biodiversity literature; now, however, materials that are still under copyright can also be incorporated through agreements with publishers. Pensoft journals harvest the BHL content for mentions of taxon names and display the original sources through the Pensoft Taxon Profile tool. Bibliographic metadata of the articles published in Pensoft's journals are submitted to BHL on the day of publication. On top of the BHL content, Roderick Page from the University of Glasgow built BioStor, an open source application that searches and displays BHL articles by article metadata and individual pages.
The Biodiversity Literature Repository (BLR) is an open community repository at Zenodo, built by Plazi and Pensoft to archive articles, images and data in the biodiversity domain. Plazi uploads article PDFs and other materials extracted from legacy literature through their GoldenGATE Imagine tool. Pensoft journals automatically archive in BLR all biodiversity-related articles, supplementary files and individual images, through web services, on the day of publication. The uploaded materials are archived at Zenodo under their own DOIs, if existing, or are assigned Zenodo DOIs.
The Bibliography of Life (BoL) was created by the EU FP7 project ViBRANT to search, retrieve and store bibliographic references, and is currently maintained by Pensoft and Plazi. BoL consists of the search and discovery tool ReFindit and a repository for bibliographic references harvested from the literature, RefBank.
Online publishing allows an author to provide data sets, tables, video files, or other information as supplementary information files associated with papers, or to deposit such files in one of the repositories described above, which can greatly increase the impact of the submission. For larger biodiversity datasets, authors should consider the alternative of submitting a separate data paper (see description below).
Submission of data to a recognised data repository is encouraged as a superior and more sustainable method of data publication than submission as a supplementary information file with an article. Nevertheless, Pensoft will accept supplementary information files if authors wish to submit them with their articles and demonstrate that no suitable repository exists. Details for uploading such files are given in Step 4 of the Pensoft submission process (example from ZooKeys) available through the “Submit a Manuscript” button on any of the Pensoft journal websites.
By default, the maximum file size for each supplementary information file that can be uploaded onto the Pensoft web site is 50 MB. If you need more than that, or wish to submit a file type not listed below, please contact Pensoft's editorial office before uploading.
When submitting a supplementary information file, the descriptive information requested on the submission form should be completed.
All supplementary information files should be referenced explicitly by file name within the body of the article, e.g. “See Supplementary File 1: Movie 1 for a recording of the original data used to perform this analysis”.
The ARPHA Writing Tool and the journals currently based on it (Biodiversity Data Journal, Research Ideas and Outcomes, One Ecosystem, and BioDiscovery) provide the functionality to cite the supplementary materials through in-text citations in the same way as figures, tables or references are cited.
Ideally, the supplementary information file formats should not be platform specific, and should be viewable using free or widely available tools. Suitable formats exist for each category of supplementary material: documentation, animations, images, movies and datasets.
File names should use standard file extensions (as in “Supplementary-Figure-1.png”). Please also make sure that each supplementary information file contains a single data type, or a single table, figure, image or video.
To facilitate comparisons between different pieces of evidence, it is common to produce composite figures or to concatenate originally separate recordings into a single audio or video file. We do not recommend this practice: it is often simpler to open the two (or more) raw files in question and to view and manipulate them side by side, and such concatenation is a barrier to re-use. Likewise, we do not recommend providing metadata in non-editable ways (e.g., adding a letter or an arrow to bitmap images or video frames), which also complicates re-use (e.g., translation into another language, or zooming in for additional detail).
Best practice recommendations
This specific functionality is available in the ARPHA Writing Tool (AWT) and is currently used in the "Materials" subsection of the "Taxon treatment" section in the "Taxonomic paper" template of the Biodiversity Data Journal. Darwin Core-compliant specimen records can be imported in structured form into the manuscript text in three ways: manually, through the user interface; semi-automatically, by uploading a spreadsheet template; or automatically, by importing records from external sources via their record identifiers.
While the first two methods of data import are self-explanatory and easily followed from the instructions in the user interface, the third deserves a more detailed description, as it is still unique in the data publishing landscape.
The workflow has been thoroughly described from the user's perspective in a blog post and in an accompanying paper.
What is a data paper
A data paper is a scholarly journal publication whose primary purpose is to describe a dataset or a group of datasets, rather than to report a research investigation.
The description should include several important elements (usually called metadata, or “description of data”) that document, for example, how the dataset was collected, which taxa it covers, the spatial and temporal ranges and regional coverage of the data records, provenance information concerning who collected and who owns the data, details of which software (including version information) was used to create the data, or could be used to view the data, and so on.
Most Pensoft journals welcome the submission and publication of data papers, which can be indexed and cited like any other research article, thus bringing registration of priority, a permanent publication record, recognition and academic credit to the data creators. In other words, the data paper is a mechanism to acknowledge efforts in authoring ‘fit-for-use’ and enriched metadata describing a data resource. The general objective of data papers in biodiversity science is to describe all types of biodiversity data resources, including environmental data resources.
An important feature of data papers is that they should always be linked to the published datasets they describe, and that link (a URL, ideally resolving a DOI) should be published within the paper itself. Conversely, the metadata describing the dataset held within data archives should include the bibliographic details of the data paper once that is published, including a resolvable DOI. Ideally, the metadata should be identical in the two places — the data paper and the data archive — although this may be difficult to achieve with some archive metadata templates, so that there may be two versions of the metadata. This is why referring to the data paper DOI is so important.
How to write and submit a data paper
In principle, any valuable dataset hosted in a trusted data repository can be described in a data paper and published following these Guidelines. Each data paper consists of a set of elements (sections), some of which are mandatory and some not. An example of such a list of elements needed to describe primary biodiversity data is available in the section Data Papers Describing Primary Biodiversity Data below.
Sample data papers which can be used as illustration of the concept can be downloaded from several Pensoft journals, for example, ZooKeys (examples), or Biodiversity Data Journal (examples).
All claims in a data paper should be substantiated by the associated data. If the methodology is standard, please explain in what respects your data are unique and merit publication in the form of a data paper.
Alternatively, if the methodology used to acquire the data differs significantly from established approaches, please consider submitting your data to an open repository and associating them with a standard research article or a data paper, in which these methodologies can be more fully explained.
At the time of submission of the data paper manuscript, the data described should be freely available online in a public repository under a suitable data license, so that they can be peer-reviewed, retrieved anonymously for re-use, resampling and redistribution by anyone for any purpose, subject to one condition at most — that of proper attribution using scholarly norms (see the Data Publishing Licenses and How to Cite Data sections, above). The repository, or at least one public mirror thereof, should not be under the control of the submitting authors. The relevant data package DOIs or accession numbers, as well as any special instructions for acquiring and re-publishing the data, should be included in the submitted data paper manuscript.
The procedures for data retrieval should be described, along with the mechanisms for updating and correcting information. This can be achieved by referencing an existing description if that is up to date, citable in its exact version, and publicly accessible on the web.
All methodological details necessary to replicate the original acquisition of the raw data have to be included in the data paper, along with a description of all data processing steps undertaken to transform the raw data into the form in which the data have been deposited in the repository and presented in the paper. Authors should discuss any relevant sources of error and how these have been addressed.
In addition to data papers describing new data resources, data papers describing legacy data are also welcome, as long as the current version of these is publicly accessible and can be cited. If possible, authors should outline possible re-use cases, taking into account that future uses of the data might involve researchers from different backgrounds. We encourage the provision of tools to facilitate visualization and re-use of the data.
For primary biodiversity (species-by-occurrence) data, authors are strongly encouraged to use the data publishing workflow of the GBIF Integrated Publishing Toolkit (IPT), described below. From the IPT, data paper manuscripts can be generated in rich text format (RTF) directly from the metadata.
A more universal and innovative approach is the conversion of the Ecological Metadata Language (EML) file available from the IPT, or from other data platforms such as DataONE or LTER, into a data paper manuscript in the ARPHA Writing Tool.
Primary biodiversity data as defined by GBIF are "Digital text or multimedia data records detailing facts about the instance of occurrence of an organism, i.e. on the what, where, when, how and by whom of the occurrence and the recording".
Currently, the majority of primary biodiversity data consists of species-by-occurrence data records available from published sources and/or natural history collections. Other types of primary biodiversity data that merit publication are observational data and multimedia resources in biodiversity.
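The "what, where, when, how and by whom" of such a record can be illustrated with a single, invented occurrence expressed in Simple Darwin Core XML (all names and values are hypothetical; the comments map each term to the GBIF definition):

```xml
<!-- Invented occurrence record; comments map each Darwin Core term
     to the what/where/when/how/by whom of the GBIF definition. -->
<dwr:SimpleDarwinRecord
    xmlns:dwr="http://rs.tdwg.org/dwc/xsd/simpledarwincore/"
    xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
  <dwc:scientificName>Dysdera sp.</dwc:scientificName>       <!-- what -->
  <dwc:decimalLatitude>28.291</dwc:decimalLatitude>          <!-- where -->
  <dwc:decimalLongitude>-16.621</dwc:decimalLongitude>
  <dwc:eventDate>2015-04-17</dwc:eventDate>                  <!-- when -->
  <dwc:basisOfRecord>HumanObservation</dwc:basisOfRecord>    <!-- how -->
  <dwc:recordedBy>J. Jones</dwc:recordedBy>                  <!-- by whom -->
</dwr:SimpleDarwinRecord>
```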
Authoring metadata through the GBIF Integrated Publishing Toolkit (IPT)
The GBIF Integrated Publishing Toolkit (IPT) facilitates authoring of metadata based on the GBIF Metadata Profile (GMP) that was developed to standardise how biodiversity data resources are described for discovery through the GBIF network. For further information, see the GBIF Metadata Profile, Reference Guide and GBIF Metadata Profile, How-to guide.
The GMP conforms to the Ecological Metadata Language (EML) specification, with some additional terms drawn from the Natural Collections Descriptions (NCD) set of terms for describing natural history collections and from ISO 19139, the North American Profile of ISO 19115:2003 — Geographic Information — Metadata. The GMP elements, together with their descriptions, are listed below.
The structure of a data paper largely resembles that of a standard research paper. However, it must contain several specific elements. These elements are listed in the table below.
Structure of a data paper and its mapping from GBIF IPT Metadata Profile elements.
Section/Sub-Section headings of the data paper describing primary biodiversity data |
Mapping from GBIF IPT Metadata Profile elements, and formatting instructions |
<TITLE> |
Derived from the ‘title’ element. Format: a centred sentence without a full stop (.) at the end. |
<Authors> |
Derived from the ‘creator’, ‘metadataProvider’ and ‘AssociatedParty’ elements. From these elements, combinations of ‘first name’ and ‘last name’ are derived, separated by commas (,). Corresponding affiliations of the authors are denoted with numbers (1, 2, 3, ...) superscripted at the end of each last name. If two or more authors share the same affiliation, it will be denoted by use of the same superscript number. Format: centred. |
<Affiliations> |
Derived from the ‘creator’, ‘metadataProvider’ and ‘AssociatedParty’ elements. From these elements, combinations of ‘Organisation Name’, ‘Address’, ‘Postal Code’, ‘City’, ‘Country’ constitute the affiliation. |
<Corresponding authors> |
Derived from the ‘creator’ and ‘metadataProvider’ elements. From these elements, ‘first name’, ‘last name’ and ‘email’ are derived. Email addresses are written in parentheses (). In the case of more than one corresponding author, these are separated by commas. If the creator and metadataProvider are the same, the creator is denoted as the corresponding author. Format: indented from both sides. |
<Received, Revised, Accepted, and Published dates> |
These will be inserted manually by the Publisher of the data paper, to indicate the dates of original manuscript submission, revised manuscript submission, acceptance of manuscript and publication of the manuscript as a data paper in the journal. |
<Citation> |
This will be inserted manually by the Publisher of the data paper. It will be a combination of Authors, Year of data paper publication (in parentheses), Title, Journal Name, Volume, Issue number (in parentheses), and DOI of the data paper, in both native and resolvable HTTP format. |
<Abstract> |
Derived from the ‘abstract’ element. Format: indented from both sides. |
<Keywords> |
Derived from ‘keyword’ element. Keywords are separated by commas (,). |
<Introduction> |
Free text. |
<Taxonomic Coverage> |
Derived from the Taxonomic Coverage elements, namely ‘general taxonomic coverage description’, ‘taxonomicRankName’, ‘taxonomicRankValue’ and ‘commonName’. |
<Spatial Coverage> |
Derived from the Spatial Coverage elements. These elements are ‘general geographic description’, ‘westBoundingCoordinate’, ‘eastBoundingCoordinate’, ‘northBoundingCoordinate’, ‘southBoundingCoordinate’. |
<Temporal Coverage> |
Derived from the Temporal Coverage elements, namely ‘beginDate’ and ‘endDate’. |
<Project Description> |
Derived from project elements as described in the GBIF Metadata Profile. These elements are ‘title’ of the project, ‘personnel’ involved in the project, ‘funding sources’, ‘StudyAreaDescription/descriptor’, and ‘designDescription’. |
<Natural Collections Description> |
Derived from the NCD elements as described in the GBIF Metadata Profile. These elements are ‘parentCollectionIdentifier’, ‘collectionName’, ‘collectionIdentifier’, ‘formationPeriod’, ‘livingTimePeriod’, ‘specimenPreservationMethod’, and ‘curatorialUnit’. |
<Methods> |
Derived from the methods elements as described in the GBIF Metadata Profile. These elements are ‘methodStep/description’, ‘sampling/studyExtent/description’, ‘sampling/samplingDescription’, and ‘qualityControl/description’. |
<Dataset descriptions> |
Derived from the physical and other elements as described in the GBIF Metadata Profile. These elements are ‘objectName’, ‘characterEncoding’, ‘formatName’, ‘formatVersion’, ‘distribution/online/URL’, ‘pubDate’, ‘language’, and ‘intellectualRights’. |
<Additional Information> |
Derived from ‘additionalInfo’ element. |
<References> |
Derived from ‘citation’ element. This element assumes a reference to a research article or a web link, cited in the metadata description. |
The GBIF Integrated Publishing Toolkit (IPT) makes it easy to share different types of biodiversity-related information: primary taxon occurrence data (also known as primary biodiversity data), taxon checklists, sample-based data, and general metadata about data sources. An IPT instance, as well as the data and metadata registered through the IPT, is connected to the GBIF Registry, indexed for access via the GBIF network and portal, and made accessible for public use.
The IPT is a server-side software tool that allows users to author metadata, map databases or upload text files that conform to the Darwin Core standard, to install extensions and vocabularies that allow for richer content and, ultimately, to register datasets for publication and sharing through GBIF. IPT operators undertake the responsibility of running and maintaining an Internet server, which should remain online and addressable. Any set of metadata can be downloaded from any IPT (version 2.0.2+) in RTF format in the form of a data paper manuscript.
Therefore, data authors can either publish through an existing IPT installation or set up their own instance.
GBIF provides a list of existing IPT installations supporting the authoring of data papers and a user manual for the IPT.
Once you have decided to publish your data and generate a data paper manuscript through the GBIF IPT, a few simple rules should be considered.
Generation of data paper manuscripts in RTF using the GBIF IPT
As described in the previous section, data creators are able to author data paper manuscripts in various ways. However, to lower the technical barrier and make the process easy to adopt, a conversion tool that automatically exports metadata to an RTF manuscript is available in IPT 2.0.2+. The step-by-step process for generating a data paper manuscript from the metadata is described below.
Automated generation of data paper manuscripts from Ecological Metadata Language (EML) files
An innovative approach, similar to that which converts EML metadata into RTF, is the direct conversion of an EML file (supported versions 2.2.0 and 2.2.1) downloaded from the GBIF IPT into a data paper manuscript in the ARPHA Writing Tool.
Metadata descriptions of primary biodiversity data used in the GBIF Metadata Profile (GMP) and the Integrated Publishing Toolkit (IPT) are based primarily on the Ecological Metadata Language (EML) specification. Therefore, the same basic elements and the overall data paper structure explained in the previous section can also be used to describe ecological and environmental data. As a result, data papers for ecological and environmental data will have a basic structure similar to that of papers on primary biodiversity data. Authors are encouraged to include additional elements (sections) in the manuscripts if they expect this to improve the description of the specifics of their environmental and ecological data. The main difference is that ecological and environmental data cannot be processed through the GBIF IPT; hence, they should be deposited in another public data hosting centre listed in the section Open Data Repositories, for example DataONE, the LTER Network, PANGAEA or Dryad.
Authors intending to publish data papers describing ecological and environmental data are advised to deposit the data in one of the repositories listed above, to author the metadata in EML, and then to follow steps analogous to those described above for primary biodiversity data.
Alternatively, EML metadata files (versions 2.2.0 and 2.2.1) hosted in DataONE and LTER can automatically be converted into a data paper manuscript using the ARPHA Writing Tool import workflow described in the previous section.
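For orientation, the hypothetical EML fragment below shows the coverage elements from which the Taxonomic, Spatial and Temporal Coverage sections of a data paper are derived (element names follow the EML specification; all values are invented):

```xml
<!-- Invented EML coverage fragment; these elements map to the Taxonomic,
     Spatial and Temporal Coverage sections of a data paper. -->
<coverage>
  <geographicCoverage>
    <geographicDescription>Canary Islands, Spain</geographicDescription>
    <boundingCoordinates>
      <westBoundingCoordinate>-18.2</westBoundingCoordinate>
      <eastBoundingCoordinate>-13.3</eastBoundingCoordinate>
      <northBoundingCoordinate>29.5</northBoundingCoordinate>
      <southBoundingCoordinate>27.6</southBoundingCoordinate>
    </boundingCoordinates>
  </geographicCoverage>
  <temporalCoverage>
    <rangeOfDates>
      <beginDate><calendarDate>1998-01-01</calendarDate></beginDate>
      <endDate><calendarDate>2015-12-31</calendarDate></endDate>
    </rangeOfDates>
  </temporalCoverage>
  <taxonomicCoverage>
    <generalTaxonomicCoverage>Spiders of the family Dysderidae</generalTaxonomicCoverage>
    <taxonomicClassification>
      <taxonRankName>genus</taxonRankName>
      <taxonRankValue>Dysdera</taxonRankValue>
    </taxonomicClassification>
  </taxonomicCoverage>
</coverage>
```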
Pensoft journals require, as a condition for publication, that genome data supporting the results in a paper be archived in an appropriate public archive, and accession numbers must be included in the final version of the paper. Sufficient additional metadata (such as sample locations, individual identities, etc.) should also be provided to allow easy repetition of the analyses presented in the paper. For best practice in following community metadata standards, see the many data-type-specific standards and checklists provided by the Genomic Standards Consortium, particularly the MIxS standards.
DNA sequence data should be archived in GenBank or another public database of the INSDC consortium. Expression data should be submitted to the Gene Expression Omnibus or an equivalent database, whereas phylogenetic trees should be submitted to TreeBASE. More idiosyncratic data, such as microsatellite allele frequency data, can be archived in a more flexible digital data repository such as Dryad or Knowledge Network for Biocomplexity (KNB).
Barcode Data Release Papers
Barcode-of-Life COI (mitochondrially encoded cytochrome c oxidase subunit 1) sequence data can be published in the form of a data paper, as announced by the Consortium for the Barcode of Life (CBOL) and illustrated by several published sample papers.
Definition: A BARCODE Data Release Paper is a short manuscript that announces and documents the public release, to a member of the International Nucleotide Sequence Database Collaboration (INSDC, which includes GenBank, ENA and DDBJ), of a significant body of data records that meet the BARCODE data standards.
Contents: BARCODE Data Release Papers are meant to announce and document the public availability of a significant body of new DNA barcodes. The barcode records should therefore be a coherent set of records that provides noteworthy new research capabilities for a taxonomic group, ecological assemblage or specified geographic region. Authors should explain the rationale for creating a comprehensive library of BARCODE data for that taxonomic group, ecological habitat, and/or geographic region. If the data have been collected as part of a larger, longer-term research project, the manuscript should explain the wider project and its planned use of the data for taxonomic, biogeographic, evolutionary, and/or applied research, or for other purposes.
The BARCODE Data Release Paper manuscript should describe the released data records and their provenance.
The manuscript should provide summaries of data density and quality, such as those shown in the table below.
Suggested data fields for a BARCODE Data Release Paper
Average number of records per species |
Range of records per species | Min-Max
Average sequence length (and Min/Max) |
Range of intraspecific variation* | Min-Max
Median variation within species* | X%
Range of divergence between closest species-pairs** | Min-Max
Median divergence between closest species-pairs** |
* Calculated as the arithmetic average of all K2P distances between specimens in each species.
** Closest species pairs refers to each species and the species with which it has the least divergent barcode sequence. The true phylogenetic sister-species may not be included in the dataset, and could have a lower interspecies divergence.
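For orientation, the K2P distances referred to above follow the standard Kimura two-parameter formula (a well-known result, restated here rather than taken from CBOL's text):

```latex
% Standard Kimura two-parameter (K2P) distance between two aligned sequences:
% P = proportion of sites differing by a transition,
% Q = proportion of sites differing by a transversion.
d_{\mathrm{K2P}} = -\frac{1}{2}\,\ln\!\left[\left(1 - 2P - Q\right)\sqrt{1 - 2Q}\right]
```

The per-species statistics in the table are then averages, medians and ranges over such pairwise distances.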
Manuscripts should also include an Appendix with a table that presents the individual records in detail.
Review Criteria: In addition to the general Guidelines for Reviewers listed in the next section, CBOL recommends that reviewers apply a set of evaluation criteria specific to BARCODE Data Release Papers, and suggests that authors anticipate such evaluation.
An increasing number of software tools also merit description in scholarly publications. The structure of the data paper proposed below for such software tools is largely based on the Description of a Project (DOAP) RDF schema and XML vocabulary developed by Edd Dumbill to describe software projects, in particular open-source ones. The main difference, however, is that the data paper aims to describe the software product rather than the software source code; data papers of this kind are addressed mainly to end users of the software and less to developers and software engineers.
Software citation principles have been developed by the FORCE11 Software Citation Working Group based on an adaptation of the FORCE11 Data Citation Principles. The six principles are abstracted here (
Based on an analysis of several use cases such as publishing a software paper or publishing papers that cite software, basic metadata requirements were identified: unique identifier, software name, author(s), contributor role, version number, release date, location/repository, indexed citations, software license, description, keywords.
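As an illustration only, these requirements could be captured in a simple record such as the following Python sketch; the field names and values (reusing the hypothetical "BioDiv" tool from the table below) are assumptions, not a formal metadata standard.

```python
# Hypothetical minimal software citation record covering the metadata
# requirements listed above; all identifiers and values are placeholders.
software_record = {
    "identifier": "https://example.org/software/biodiv",  # unique, resolvable ID
    "name": "BioDiv",
    "authors": ["A. Author", "B. Developer"],
    "contributor_roles": {"A. Author": "developer", "B. Developer": "maintainer"},
    "version": "1.2.0",
    "release_date": "2017-02-01",
    "repository": "https://example.org/repo/biodiv",
    "license": "MIT",
    "description": "Web-based tool for calculation of biodiversity indexes",
    "keywords": ["biodiversity", "indexes", "web tool"],
}
```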
While detailed specifications and recommendations around metadata standards were beyond the scope of the working group, DOAP is mentioned together with some other, more recent community initiatives. It is expected that a new working group will take these software citation principles forward by supporting potential implementers and developing metadata standards, following the example of the FORCE11 Data Citation Working Group (
According to DOAP, the major properties of a software tool description include elements such as homepage, developer, programming language and operating system. Other properties include: implements specification, anonymous root, platform, browse, mailing list, category, description, helper, tester, short description, audience, screenshots, translator, module, documenter, wiki, repository, name, repository location, language, service endpoint, created, download mirror, vendor, old homepage, revision, download page, license, bug database, maintainer, blog, file-release, and release.
A basic version of a DOAP description can be generated using an online tool called doapamatic.
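Alternatively, such a description can be assembled programmatically. Below is a minimal sketch using the Python rdflib library to express a few core DOAP properties for the hypothetical "BioDiv" tool; the URIs are placeholders, and the property selection is illustrative rather than exhaustive.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, FOAF

DOAP = Namespace("http://usefulinc.com/ns/doap#")

g = Graph()
g.bind("doap", DOAP)
g.bind("foaf", FOAF)

project = URIRef("https://example.org/software/biodiv")  # hypothetical tool
g.add((project, RDF.type, DOAP.Project))
g.add((project, DOAP.name, Literal("BioDiv")))
g.add((project, DOAP.homepage, URIRef("https://example.org/biodiv")))
g.add((project, DOAP["programming-language"], Literal("Python")))
g.add((project, DOAP.os, Literal("Linux")))
g.add((project, DOAP.shortdesc,
       Literal("Web-based tool for calculation of biodiversity indexes")))

# Developer described as a FOAF person, as is conventional in DOAP files
developer = URIRef("https://example.org/people/a-author")
g.add((developer, RDF.type, FOAF.Person))
g.add((developer, FOAF.name, Literal("A. Author")))
g.add((project, DOAP.developer, developer))

print(g.serialize(format="turtle"))
```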
A sample structure of a Software Description paper has been used by Pensoft since 2011 (
Metadata elements (based on EML and DOAP) to be included in a data paper describing a software tool
Section/sub-section heading of the Software Description paper | Mapping from the available EML and DOAP metadata elements (a few other elements have been added to provide a better mapping to the data paper structure), with formatting guidelines
<TITLE> | Derived from the 'name' element. This must be extended to a concise description of the software tool and its implementation, e.g. "BioDiv, a web-based tool for calculation of biodiversity indexes". Format: a centred sentence without a full stop (.) at the end.
<Authors> | Derived from the 'developer' and 'maintainer' elements and, where appropriate, 'helper', 'tester' and 'documenter'. From these elements, combinations of 'first name' and 'last name' are derived, separated by commas (,). Corresponding affiliations of the authors are denoted with superscript numbers (1, 2, 3, ...) at the end of each last name. If two or more authors share the same affiliation, it is denoted by the same superscript number. Format: centred.
<Affiliations> | Derived from the 'developer', 'maintainer' and 'helper' elements. From these, combinations of 'Organisation Name', 'Address', 'Postal Code', 'City' and 'Country' constitute the affiliation.
<Corresponding authors> | Derived from any of the 'developer', 'maintainer', 'helper', 'tester' and 'documenter' elements. From these elements, 'first name', 'last name' and 'email' are derived. Email addresses are written in parentheses (). If there is more than one corresponding author, they are separated by commas. Format: indented from both sides.
<Received, Revised, Accepted, and Published dates> | Inserted manually by the publisher of the data paper to indicate the dates of original manuscript submission, revised manuscript submission, acceptance of the manuscript and its publication as a data paper in the journal.
<Citation> | Inserted manually by the publisher of the data paper. A combination of authors, year of data paper publication (in parentheses), title, journal name, volume, issue number (in parentheses), and DOI of the data paper in both native and resolvable HTTP format (see the sketch after this table).
<Abstract> | Derived from the 'short description' element. Format: indented from both sides.
<Keywords> | Keywords should reflect the most important features of the tool and its areas of implementation, separated by commas (,).
<Introduction> | Free text.
<Project Description> | Derived from the 'description' element; if applicable, it should also include sub-elements such as the 'title' of the project, the 'personnel' involved in the project, the 'funding' sources, and other appropriate information.
<Web Location (URIs)> | Derived from the elements 'homepage', 'wiki', 'download page', 'download mirror', 'bug database', 'mailing list', 'blog' and 'vendor'.
<Technical specification> | Derived from the elements 'platform', 'programming language', 'operating system' (if OS-specific), 'language' and 'service endpoint'.
<Repository> | Derived from the elements 'repository type' (CVS, SVN, Arch, BK), 'repository browse uri' (CVS, SVN, BK), 'repository location' (SVN, BK, Arch), 'repository module' (CVS, Arch) and 'repository anonymous root' (CVS).
<License> | Derived from the 'license' element.
<Implementation> | Derived from the 'implements specification' and 'audience' elements; please remember that this section is of primary interest to end users and should be written in detail, if possible including use cases, citations and links.
<Additional Information> | Any kind of helpful additional information may be included.
<Acknowledgements> | Lists all acknowledgements at the authors' discretion.
<References> | Includes literature references and web links cited in the text.
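As referenced in the <Citation> row above, the following is a minimal sketch of how the publisher-assembled citation string could be composed. The exact punctuation and the example values are illustrative, following the element description rather than a formal Pensoft citation style.

```python
def format_citation(authors, year, title, journal, volume, issue, doi):
    """Combine authors, year (in parentheses), title, journal name, volume,
    issue number (in parentheses), and the DOI in both native and
    resolvable HTTP format, as described for the <Citation> element."""
    return (f"{', '.join(authors)} ({year}) {title}. {journal} "
            f"{volume}({issue}). doi: {doi}, https://doi.org/{doi}")

# Example with placeholder values (hypothetical authors and DOI):
# format_citation(["Author A", "Developer B"], 2017,
#                 "BioDiv, a web-based tool for calculation of biodiversity indexes",
#                 "Biodiversity Data Journal", 5, 1, "10.3897/BDJ.0000")
```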
Data papers describing data resources, or manuscripts linked to open data resources that underpin the scientific analyses, submitted to Pensoft journals will be subjected to peer review according to the respective journal's policies (e.g., conventional pre-publication anonymous, non-anonymous, or entirely open and public, including post-publication review) as a routine method to enhance the completeness, truthfulness and accuracy of the descriptions of the relevant data resources, thereby improving their use and uptake. A specific feature of the ARPHA-XML journal publishing workflow used by the Biodiversity Data Journal, Research Ideas and Outcomes (RIO Journal), One Ecosystem and others is the so-called pre-submission peer review, which can be organised by the authors or by the journal's editorial office while the manuscript is still being authored in the ARPHA Writing Tool.
Peer review of data papers is expected to evaluate the completeness and quality of the description (metadata) of the dataset(s), as well as the publication value of the data. This may include the appropriateness and validity of the methods used, compliance with applicable standards during the collection, management and curation of the data, and compliance with appropriate metadata standards in the description of the data resources. To be accurate and useful, metadata need to be as complete and descriptive as possible.
Reviewers should consider the following aspects: (a) the quality of the manuscript, (b) the quality of the data, and (c) the consistency between the description in the data paper and the repository-held metadata relating to the data resource itself.
Peer review of the data themselves remains problematic in current scholarly publishing practice, for several reasons:
Several Pensoft journals offer an additional data auditing and correction service, which may be a solution for authors or institutions that are particularly concerned about data quality and re-use.
Best practice recommendations for evaluating data papers or manuscripts that are submitted together with the underlying data are summarised below.
We thank several colleagues who commented on or contributed to an earlier version of the draft: Tim Robertson and Kyle Braak (GBIF), and Todd Vision and Peggy Schaeffer (Dryad). We also thank all our authors, reviewers, editors and partners for their support in testing and using these data publishing guidelines and workflows. Special thanks are due to Donat Agosti, Terry Catapano, Guido Sautter and Willi Egloff from Plazi (Switzerland) for years of successful collaboration and friendship, and to Robert Mesibov (Tasmania) and Florian Wetzel (Museum für Naturkunde, Berlin), who provided open pre-submission peer reviews of the manuscript.
The present guidelines were elaborated through the FP7-funded project EU BON: Building the European Biodiversity Observation Network, grant agreement ENV30845, and constitute Milestone MS842. V. Senderov's PhD is financed through the EU Marie Skłodowska-Curie Programme BIG4 project, Grant Agreement Nr. 642241.