Corresponding author: Lyubomir Penev (
The present paper describes policies and guidelines for scholarly publishing of biodiversity and biodiversity-related data, elaborated and updated during the Framework Program 7 EU BON project, on the basis of an earlier version published on Pensoft's website in 2011. The document discusses some general concepts, including a definition of datasets, incentives to publish data and licenses for data publishing. Further, it defines and compares several routes for data publishing, namely as (1) supplementary files to research articles, which may be made available directly by the publisher, or (2) published in a specialized open data repository with a link to it from the research article, or (3) as a data paper, i.e., a specific, stand-alone publication describing a particular dataset or a collection of datasets, or (4) integrated narrative and data publishing through online import/download of data into/from manuscripts, as provided by the Biodiversity Data Journal.
The paper also contains detailed instructions on how to prepare and peer review data intended for publication, listed under the Guidelines for Authors and Reviewers, respectively. Special attention is given to existing standards, protocols and tools to facilitate data publishing, such as the Integrated Publishing Toolkit of the Global Biodiversity Information Facility (GBIF IPT) and the DarwinCore Archive (DwC-A).
A separate section describes the leading data hosting/indexing infrastructures and repositories for biodiversity and ecological data.
The present guidelines were elaborated through the FP7 funded project
Data publishing in this digital age is the act of making data available on the Internet, so that they can be downloaded, analysed, re-used and cited by people and organisations other than the creators of the data (
Data hosting, long-term preservation and archiving
Documentation and metadata
Citation and credit to the data authors
Licenses for publishing and re-use
Data interoperability standards
Format of published data
Software used for creation and retrieval
Dissemination of published data
The present guidelines are based on an earlier version published in PDF on Pensoft's website in 2011 (
The FORCE11 group, dedicated to facilitating change in knowledge creation and sharing and recognising that data should be valued as publishable and citable products of research, has developed a set of principles for publishing and citing such data. The
Data should be
Data should be
Data should be
Data should be
A key outcome of
The Research Data Alliance (RDA) promotes the open sharing of data by building upon the underlying social and technical infrastructure. Established in 2013 by the European Union, the National Science Foundation and the National Institute of Standards and Technology (USA) as well as the Department of Innovation (Australia), it has grown to include some 4,200 members from 110 countries who collaborate through Work and Interest Groups "to develop and adopt infrastructure that promotes data-sharing and data-driven research, and accelerate the growth of a cohesive data community that integrates contributors across domain, research, national, geographical and generational boundaries" (
Data Description Registry Interoperability Model
Persistent Identifier Type Registry
Workflows for Research Data Publishing: Models and Key Components
Bibliometric Indicators for Data Publishing
Dynamic Data Citation Methodology
Repository Audit and Certification Catalogues
One RDA output, the
Within RDA, a
With regard to biodiversity, some recently published papers emphasise the importance of publishing of biodiversity data (
The
The present paper outlines the strategies and guidelines needed to support the scholarly publishing and dissemination of biodiversity data, that is, publishing through the academic journal networks.
A dataset is understood here as a digital collection of logically connected facts (observations, descriptions or measurements), typically structured in tabular form as a set of records, with each record comprising a set of fields, and recorded in one or more computer data files that together comprise a data package. Certain types of research datasets, e.g., a video recording of animal behaviour, will not be in tabular form, although analyses of such recordings may be. Within the domain of biodiversity, a dataset can be any discrete collection of data underlying a paper – e.g., a list of all species occurrences published in the paper, data tables from which a graph or map is produced, digital images or videos that are the basis for conclusions, an appendix with morphological measurements, or ecological observations.
More generally, with the development of XML-based publishing technologies, the research and publishing communities are coming to a much wider definition of data, proposed in the BioMed Central (BMC) position statement on open data: "the raw, non-copyrightable facts provided in an article or its associated additional files, which are potentially available for harvesting and re-use" (
As these examples illustrate, while the term "dataset" is convenient and widely used, its definition is vague. Data repositories such as
For practical reasons, we propose a clear distinction between static data that represent specific completed compilations of data upon which the analyses and conclusions of a given scientific paper may be based, and curated data that belong to a large data collection (usually called a "database") with ongoing goals and curation, for example the various bioinformatics databases that curate ever growing amounts of nucleotide sequences (
Curated data, on the other hand, are usually hosted on external servers or in data hosting centres. A primary goal of the data publishing process in this case is to guarantee that these data are properly described, up to date, available to others under appropriate licensing schemes, peer-reviewed, interoperable, and where appropriate linked from a research article or a data paper at the time of publication. Especially in cases where the long-term viability of the curated project may be insecure (e.g. in the case of grant funded projects) (
Data publishing has become increasingly important and already affects the policies of the world's leading science funding frameworks and organizations — see for example the
There is a widespread conviction that data produced using public funds should be regarded as a common good, and should be openly published and made available for inspection, interpretation and re-use by third parties.
Open data increases transparency and the overall quality of research; published datasets can be re-analyzed and verified by others.
Published data can be cited and re-used in the future, either alone or in association with other data.
Open data can be integrated with other datasets across both space and time.
Data integration increases recognition and opportunities for collaboration.
Open data increases the potential for interdisciplinary research, and for re-use in new contexts not envisaged by the data creator.
Needless duplication of data-collecting efforts and associated costs will be reduced.
Published data can be indexed and made discoverable, browsable and searchable through internet services (e.g. Web search engines) or more specific infrastructures (e.g., GBIF for biodiversity data).
Collection managers can trace usage and citations of digitized data from their collections.
Data creators, and their institutions and funding agencies, can be credited for their work of data creation and publication through the conventional channels of scholarly citation; priority and authorship are established in the same way as with the publication of a research paper.
Datasets and their metadata, and any related data papers, may be inter-linked into research objects, to expedite and mutually extend their dissemination, to the benefit of the authors, other scientists in their fields, and society at large.
Published data may be structured as "Linked Data", by which term is meant data accessible using RDF, the
There are four main routes for scholarly publication of data, most of which are available with various journals and publishers:
Supplementary files underpinning a research paper and available from the journal's website.
Data hosted at external repositories but linked back from the research article it underpins.
Stand-alone description of the data resource in the form of scholarly publication (e.g., Data Paper, or Data Note - see, for example,
Data published within the article text and downloadable from there in the form of structured data tables or as a result of text mining. This "integrated data publishing" approach has been implemented by the
Within these main data publishing modes, Pensoft developed a specific set of applications designed to meet the needs of the biodiversity community. Most of these were implemented in the Biodiversity Data Journal and its associated
Import of primary biodiversity data from Darwin Core compliant spreadsheets, or manually via a Darwin Core editor, into manuscripts and their consequent publication in a structured and downloadable format (
Direct online import of Darwin Core compliant primary biodiversity data from
Import of multiple occurrence records of voucher specimens associated with a particular Barcode Index Number (BIN) (
Automated generation of data paper manuscripts from Ecological Metadata Language (EML) metadata files stored at
Automated export of the occurrence data published in BDJ into
Automated export of the taxonomic treatments published in BDJ into Darwin Core Archive. The DwC-A is freely available for download from each article that contains taxonomic treatments data.
Novel article types in the ARPHA Writing Tool and its associated journals (Biodiversity Data Journal, Research Ideas and Outcomes (RIO Journal), and One Ecosystem): Monitoring Schema, IUCN Red List compliant Species Conservation Profile (
Nomenclatural acts modelled and developed in BDJ as different types of taxonomic treatments for plant taxonomy.
Markup and display of biological collection codes against the
Workflow integration with the
Workflow integration with the
Automated archiving of all articles published in Pensoft's journals in the
For any form of data publishing, follow the
Follow the
Deposition of data in an established international repository is always to be preferred to supplementary files published on a journal's website.
Smaller data files, especially those directly underpinning an article, should also be deposited at a data repository and linked from the article. We recommend, however, that these also be published as supplementary file(s) to the related article, so that the article and its associated data are additionally preserved and presented together.
If a specialized and well-established repository for a given kind of data exists, it should be preferred over non-specialized ones (see also section "Data Deposition in Open Repositories" below for finer detail), for example:
Primary biodiversity data (species-by-occurrence records) should be deposited through the
Sample-based biodiversity data (e.g., species abundances from monitoring or inventory studies) should be deposited through the
Genomic data should be deposited at any of the three
Barcoding and metabarcoding data should be deposited at the
Metagenomic data should be deposited at
Protein sequence data should be deposited at
X-ray microtomography (micro-CT) scans should be deposited at
Phylogenetic data should be deposited at
Heterogeneous datasets, or data packages containing various data types should be deposited in generalist repositories, for example
Repositories not mentioned above or in the "Data Deposition in Open Repositories" section below may be used at the discretion of the author, if they provide long-term preservation of various data types, persistent identifiers for datasets, discoverability, open access to the data, and a well-proven sustainability record.
Digital Object Identifiers (DOIs)
Exceptional cases in which publication of data is not possible, or in which some of the data remain closed or obfuscated, should be discussed with the publisher in advance. In such cases, the authors should provide an open statement explaining why restrictions on open data publishing are necessary. The authors' statement should be published together with the article.
This section originates from a
The well-established norm for citing genetic data, for example, is that one simply cites the GenBank identifier (accession number) in the text. Similar usage is also commonplace for items in other bioinformatics databases. The latest developments in the implementation of the data citation principles, however, strongly recommend that references to data be included in the reference lists, in the same way as literature references (
For such data in data repositories, each published data package and each published data file should always be associated with a persistent unique identifier. A Digital Object Identifier (DOI) issued by
Data citations may relate either to the author's own data, or to data created and published by others ("third-party data"). In the former case, the dataset may have been previously published, or may be published for the first time in association with the article that is now citing it. All these types of data should, for consistency, be cited in the same manner.
As is the norm when citing another research article, any citation of a data publication, including a citation of one's own data, should always have two components:
An
A formal
We recommend that the in-text citation statement also contains a separate citation of the research article in which the data were first described, if such an article exists, with its own in-text reference pointer to a formal article reference in the paper's reference list, unless the paper being authored is the one providing that first description of the data. If the in-text citation statement includes the DOI for the data (a strongly desirable practice), this DOI should always be presented as a dereferenceable URI, as shown below. Further to this, both DataCite and CrossRef recommend displaying DOIs within references as full URLs, which serve a similar function as a journal volume, issue and page number do for a printed article, and also give the combined advantages of linked access and the assurance of persistence (
For example, Dryad recommends always citing both the article in association with which data were published and the data themselves (Fig.
The data reference in the article's reference list should contain the minimal components recommended by the FORCE11 Data Citation Synthesis Group (
Author(s)
Year
Dataset Title
Data Repository or Archive
Global Persistent Identifier
Version, or Subset, and/or Access Date
These components should be presented in whatever format and punctuation style the journal specifies for its references.
The following example demonstrates in general terms what is required.
“This paper uses data from the [
Jones A, Bloggs B, Smith C (2008a). <Title of data package>. <Repository name>. doi: https://doi.org/#####. [Version and/or date of access].
Jones A, Saul D, Smith C (2008b). <Title of journal article>. <Journal> <Volume>: <Pages>. doi: https://doi.org/#####.
Note that the authorship and the title of the data package may, for valid academic reasons, differ from those of the author's paper describing the data: indeed, to avoid confusion about what is being referenced, it is highly desirable that the titles of the data package and of the associated journal article are clearly different.
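As an illustration only, the reference pattern shown above can be assembled programmatically from the six minimal components. The following Python sketch is not part of any journal's tooling: the class and function names are hypothetical, the punctuation simply mirrors the generic example above, and the DOI used in the usage note is a placeholder. It also normalises a DOI, however supplied, to the dereferenceable `https://doi.org/` form recommended earlier in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataReference:
    """Minimal components of a data reference (field names are illustrative)."""
    authors: List[str]
    year: str
    title: str
    repository: str
    identifier: str            # a DOI, bare or already in URL form
    version_or_access: str = ""


def doi_to_url(doi: str) -> str:
    """Return the DOI as a dereferenceable https://doi.org/ URL."""
    doi = doi.strip()
    prefixes = ("https://doi.org/", "http://doi.org/",
                "https://dx.doi.org/", "http://dx.doi.org/", "doi:")
    for prefix in prefixes:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return "https://doi.org/" + doi


def format_reference(ref: DataReference) -> str:
    """Join the components in the order used by the generic example."""
    parts = [
        f"{', '.join(ref.authors)} ({ref.year}).",
        f"{ref.title}.",
        f"{ref.repository}.",
        f"doi: {doi_to_url(ref.identifier)}.",
    ]
    if ref.version_or_access:
        parts.append(f"[{ref.version_or_access}].")
    return " ".join(parts)
```

Calling `format_reference(DataReference(["Jones A", "Bloggs B"], "2008a", "Example data package", "Example Repository", "doi:10.5555/example", "Version 1"))` yields a single reference string in the pattern above; a journal with a different reference style would only need a different `format_reference`.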
1. When referring to the author's own
The citation statement of data deposition should be included in the body of the paper, in a
In addition, the formal data reference should be included in the paper's reference list, using the journal's recommended reference format.
The following example demonstrates what is required.
“The data underpinning the analysis reported in this paper were deposited in the Dryad Data Repository at
AND/OR
"The data underpinning the analysis reported in this paper were deposited in the Global Biodiversity Information Facility (GBIF) at
Macías-Hernández N, de la Cruz López S, Roca-Cusachs M, Oromí P, Arnedo MA (2016) Data from: A geographical distribution database of the genus Dysdera in the Canary Islands (Araneae, Dysderidae). Dryad Digital Repository.
AND/OR
Feher Z, Szekeres M (2016): Geographic distribution of the rock-dwelling door-snail genus Montenegrina Boettger, 1877 (Mollusca, Gastropoda, Clausiliidae). v1.5. ZooKeys. Dataset/Occurrence deposited in the GBIF. doi: https://doi.org/10.15468/###### OR
2. When acknowledging re-use in the paper of
A statement of usage of the previously published data, with citation of the data source(s) and of the related journal article(s), should be placed in a separate section named
In addition, the formal data reference and a formal reference to the related journal article should be included in the paper's reference list, using the journal's recommended reference format.
The following example demonstrates what is required.
“The data underpinning this analysis were obtained from the Dryad Data Repository at
Macías-Hernández N, de la Cruz López S, Roca-Cusachs M, Oromí P, Arnedo MA (2016) A geographical distribution database of the genus Dysdera in the Canary Islands (Araneae, Dysderidae). ZooKeys 625: 11-23.
Macías-Hernández N, de la Cruz López S, Roca-Cusachs M, Oromí P, Arnedo MA (2016) Data from: A geographical distribution database of the genus Dysdera in the Canary Islands (Araneae, Dysderidae). Dryad Digital Repository.
3. When acknowledging re-use of
A statement of usage of previously published data, with citation of the data source(s), should be placed in a separate section named
In addition, the formal data reference should be included in the paper's reference list, using the journal's recommended reference format for data citation.
The following real example demonstrates what is required.
“The present paper used data deposited by the Zoological Institute of the Russian Academy of Sciences in the Global Biodiversity Information Facility (GBIF) at
Volkobitsh M, Glikov A, Khalikov R (2017) Catalogue of the type specimens of Polycestinae (Coleoptera: Buprestidae) from research collections of the Zoological Institute, Russian Academy of Sciences. Zoological Institute, Russian Academy of Sciences, St. Petersburg, deposited in GBIF.
One of the basic postulates of the
When publishing data, make an explicit and robust statement of your wishes regarding re-use.
Use a recognized waiver or open publication license that is appropriate for data.
If you want your data to be effectively used and added to by others, it should be fully "open" as defined by the
Explicit dedication of data underlying published science into the public domain via PDDL or CC-Zero is strongly recommended and ensures compliance with both the
A domain-specific implementation of the open access principles for biodiversity data was elaborated during the EU project
Promoting the understanding that primary biodiversity data are facts and therefore NOT a subject of copyright; they belong to the public domain, independent of their source;
Requiring explicit statements that clearly place biodiversity data in the public domain, by applying a standardised waiver for any eventual copyright or database protection right, for example
To the maximum possible extent, rendering printed materials, PDFs, and other non-machine-actionable biodiversity data and narratives, into machine-readable and harvestable formats.
In practice, a variety of waivers and licenses exist that are specifically designed for and appropriate for the treatment of data, as listed in Table
The default data publishing license used by Pensoft is the
As an alternative, the other licenses or waivers, namely the
Publication of data under a waiver such as
The Attribution-ShareAlike
Many widely recognized open access licenses are intended for text-based publications to which copyright applies, and are not intended for, and are not appropriate for, data or collections of data which do not carry copyright. Creative Commons licenses apart from CC-Zero waiver (e.g.,
Authors should explicitly inform the publisher if they want to publish data associated with a Pensoft journal article under a license that is different from the
Any set of data published by Pensoft, or associated with a journal article published by Pensoft, must always clearly state its licensing terms in both a human-readable and a machine-readable manner.
Where data are published by a public data repository under a particular license, and subsequently associated with a Pensoft research article or data paper, Pensoft journals will accept that repository license as the default for the published datasets.
Images, videos and similar "artistic works" are usually covered by copyright "automatically", unless specifically placed in the public domain by use of a public domain waiver such as
Databases can contain a wide variety of types of content (images, audiovisual material, and sounds, for example, as well as tabular data, which might all be in the same database), and each may have a different license, which must be separately specified in the content metadata. Databases may also automatically accrue their own rights, such as the
Open data repositories (public databases, data warehouses, data hosting centres) are subject- or institution-oriented infrastructures, usually based at large national or international institutions. These provide data storage and preservation according to widely accepted standards, and provide free access to their data holdings for anyone to use and re-use under the minimum requirement of attribution, or under an open data waiver such as the
Advantages of depositing data in internationally recognised repositories include:
Visibility: Making your data available online (and linking it to the publication) provides an independent way for others to discover your work.
Citability: all data you deposit will receive a persistent, resolvable identifier that can be used in a citation, as well as listed on your CV.
Workload reduction: if you receive individual requests for data, you can simply direct them to the files in the repository.
Preservation: your data files will be safely archived in perpetuity.
Impact: other researchers have more opportunities to use and cite your work.
There are several directories of data repositories relevant to biodiversity and ecological data, such as
A very useful resource that puts together information on journal data policies, repositories, and standards grouped by domain, type of data, and organisation is
Such repositories could be used to host data associated with a published data paper, as explained below. For their own data, authors are advised to use an internationally recognised, trusted (normally ISO-certified), specialized repository (see
There are several aggregators and registries of taxonomic data, which differ in their content, policies and methods of data submission.
The
GBIF is not a repository in the strict sense, but a distributed network of data publishers and local data hosting centres that publish data based on community-agreed standards for exchange/sharing of primary biodiversity data. At a global scale, discovery and access to data is facilitated through the
The
One or more data files keeping all records of the particular dataset in a tabular format such as a comma-separated or tab-separated list;
The archive descriptor (meta.xml) file describing the individual data file columns used, as well as their mapping to DwC terms; and
A metadata file describing the entire dataset which GBIF recommends to be based on
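The three components listed above can be illustrated with a short Python sketch that packs a tab-separated data file, a `meta.xml` descriptor and a metadata file into one zip archive. This is a simplified illustration under stated assumptions, not a substitute for the GBIF IPT: the function name `build_archive` and the file names `occurrence.txt` and `eml.xml` are choices made here, the metadata document is supplied by the caller and is not generated or checked, every column name is assumed to be a term in the `http://rs.tdwg.org/dwc/terms/` namespace, and values are assumed to contain no tabs or line breaks.

```python
import csv
import io
import zipfile

DWC = "http://rs.tdwg.org/dwc/terms/"


def build_archive(rows, columns, eml_xml):
    """Return the bytes of a zip holding occurrence.txt, meta.xml and eml.xml."""
    # 1. The data file: one header line, then one tab-separated line per record.
    data = io.StringIO()
    writer = csv.writer(data, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([row.get(name, "") for name in columns])

    # 2. The descriptor: maps each column index to a Darwin Core term.
    fields = "\n".join(
        f'    <field index="{i}" term="{DWC}{name}"/>'
        for i, name in enumerate(columns)
    )
    meta = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">\n'
        '  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"\n'
        f'        ignoreHeaderLines="1" rowType="{DWC}Occurrence">\n'
        '    <files><location>occurrence.txt</location></files>\n'
        '    <id index="0"/>\n'
        f'{fields}\n'
        '  </core>\n'
        '</archive>\n'
    )

    # 3. The archive: data file, descriptor and dataset-level metadata together.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("occurrence.txt", data.getvalue())
        archive.writestr("meta.xml", meta)
        archive.writestr("eml.xml", eml_xml)
    return buffer.getvalue()
```

The first column is treated as the record identifier (`<id index="0"/>`), so in practice it would hold a term such as `occurrenceID`. An archive produced this way should still be checked with a dedicated validator before publication.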
The format is defined in the
GBIF has produced a series of documents and supporting tools that focus primarily on data publishing using the Darwin Core standard. Guides are available for publishing:
Besides the GBIF Integrated Publishing Toolkit, there are two additional tools developed for producing Darwin Core Archives:
The
The
The
As of version 2.2, the
Images can be deposited at generic repositories, such as
There are relatively few repositories dealing with phylogenetic data, of which we recommend the following:
Pensoft journals collaborate with four repositories for genomic data, albeit with the assumption that, wherever gene sequence data are deposited, they should ultimately also be submitted to
The
Always aim to deposit data before submission of the manuscript, so that they can be linked to and from the manuscript and are made freely available for peer review. Even if the data are not yet public during the review process, reviewer access is available via NCBI.
Gene sequences should always be published in
A paper dealing with gene sequences should always contain the GenBank accession numbers, and where possible should use the
When including gene sequences deposited in other repositories, authors should provide hyperlinked identifiers (e.g. accession numbers) of those records in the manuscript text.
It is strongly recommended to publish large genomic databases, or separate species genomes, or barcode reference libraries in the form of data papers or "BARCODE data release papers". A BARCODE data release paper is a short manuscript that announces and documents the public deposit to a member of the INSDC of a significant body of data records that meets the
Metabolomics data should be deposited in any of the member databases of the
Proteomics data should be deposited in any of the members of the
Pensoft encourages authors to deposit data underlying biological research articles in the
Pensoft supports
Data deposition in
Data can be deposited with
The data deposition at
Once you deposit your data package, it receives a unique and stable identifier, namely a DataCite DOI. Individual data files within this package are given their own DOIs, based on the package DOI, as do subsequent versions of these data files, as explained under
More information about depositing data in
You may wish to take a look at some example data packages in
Data deposited in
The repository allows non-open-access materials to be uploaded but not displayed in public, except for their metadata which are freely available under the
The
The
Online publishing allows an author to provide data sets, tables, video files, or other information as supplementary information files associated with papers, or to deposit such files in one of the repositories described above, which can greatly increase the impact of the submission. For larger biodiversity datasets, authors should consider the alternative of submitting a separate data paper (see description below).
Submission of data to a recognised data repository is
By default, the maximum file size for each supplementary information file that can be uploaded onto the Pensoft web site is 50 MB. If you need more than that, or wish to submit a file type not listed below, please contact Pensoft's editorial office before uploading.
When submitting a supplementary information file, the following information should be completed:
File format (including name and a URL of an appropriate viewer if the format is unusual).
Title of the supplementary information file (the authorship will be assumed to be the same as for the paper itself, unless explicitly stated otherwise).
Description of the data, software listings, protocols or other information contained within the supplementary information file.
All supplementary information files should be referenced explicitly by file name within the body of the article, e.g. “See Supplementary File 1: Movie 1 for a recording of the original data used to perform this analysis”.
The
Ideally, the supplementary information file formats should not be platform specific, and should be viewable using free or widely available tools. Suitable file formats are:
For supplementary documentation:
RTF (Rich Text Format)
PDF (Adobe Acrobat; ISO 32000-1)
HTML (Hypertext Markup Language)
XML (Extensible Markup Language)
For animations:
SWF (Shockwave Flash)
DHTML (Dynamic HTML)/HTML5
For images:
SVG (Scalable Vector Graphics)
GIF (Graphics Interchange Format)
JPEG/JFIF (JPEG File Interchange Format)
PNG (Portable Network Graphics)
TIFF (Tagged Image File Format)
For movies:
MOV (QuickTime)
MPG (MPEG)
OGG (an open and free multimedia container format)
WebM (an open and free multimedia container format)
For datasets:
CSV (Comma separated values)
TSV (Tab separated values)
The file names should use the standard file extensions (as in “Supplementary-Figure-1.png”). Please also make sure that each supplementary information file contains one particular data type, or is of a single table, figure, image or video.
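Authors who wish to check their files before upload could use a small script along the following lines. It is a sketch under assumptions rather than a description of Pensoft's submission system: the function name is hypothetical, the mapping from the formats listed above to file extensions (for example `.jpg`/`.jpeg`, `.tif`/`.tiff`) is our own reading of the list, and "50 MB" is interpreted as 50 × 1024 × 1024 bytes. A reported problem does not mean the file is unacceptable, only that the editorial office should be contacted first.

```python
from pathlib import Path
from typing import List

# Assumption: "50 MB" is read here as 50 MiB.
MAX_BYTES = 50 * 1024 * 1024

# Assumed extensions for the formats listed in the guidelines.
ALLOWED_EXTENSIONS = {
    ".rtf", ".pdf", ".html", ".xml",                              # documentation
    ".swf",                                                       # animations
    ".svg", ".gif", ".jpg", ".jpeg", ".png", ".tif", ".tiff",     # images
    ".mov", ".mpg", ".ogg", ".webm",                              # movies
    ".csv", ".tsv",                                               # datasets
}


def check_supplementary_file(name: str, size_bytes: int) -> List[str]:
    """Return a list of points to raise with the editorial office (empty if none)."""
    problems = []
    extension = Path(name).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"file type '{extension or 'none'}' is not in the listed formats")
    if size_bytes > MAX_BYTES:
        problems.append(f"file size {size_bytes} bytes exceeds the default limit")
    return problems
```

For instance, `check_supplementary_file("Supplementary-Figure-1.png", 1024)` returns an empty list, whereas an `.xls` spreadsheet or a 60 MB movie each produce one entry.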
To facilitate comparisons between different pieces of evidence, it is common to produce composite figures or to concatenate originally separate recordings into a single audio or video file. We do
Open data formats should be preferred over proprietary ones (for example, for spreadsheets, CSV should always be preferred over XLS).
Always follow community-accepted standards within the respective scientific domain (if such exist) when formatting data files, because this will make your data interoperable with other data in the same domain.
To maximise interoperability, plain-text data files should be UTF-8 encoded with no embedded line breaks.
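The last recommendation can be checked mechanically. The sketch below, offered as one possible approach and not as a prescribed tool, decodes a delimited text file as UTF-8 and reports every field that contains an embedded line break; the function name and the wording of the messages are hypothetical.

```python
import csv
import io
from typing import List


def check_plain_text_table(raw: bytes, delimiter: str = ",") -> List[str]:
    """Report encoding problems and fields containing embedded line breaks."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as error:
        return [f"not valid UTF-8 (first bad byte at offset {error.start})"]

    problems = []
    # newline="" keeps line endings intact so that the csv module can
    # recognise line breaks that sit inside quoted fields.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    for record_number, record in enumerate(reader, start=1):
        for field_number, value in enumerate(record, start=1):
            if "\n" in value or "\r" in value:
                problems.append(
                    f"record {record_number}, field {field_number}: embedded line break"
                )
    return problems
```

For tab-separated files, pass `delimiter="\t"`. The header line counts as record 1, so the reported record numbers correspond to logical records, not physical lines of the file.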
For species-by-occurrence data, the authors are strongly encouraged to publish these through the
For species-by-occurrence data published as supplementary files to the article, authors should use Darwin Core compliant spreadsheets or tabular text files (
This specific functionality is available in the
manually through the Darwin Core compliant HTML editor embedded in the AWT,
from a Darwin Core compliant spreadsheet template (for example, from an Excel spreadsheet; the template is available in the AWT through the link
automatically, through web services from online biodiversity data platforms (GBIF, Barcode of Life, iDigBio, and PlutoF).
While the first two methods of data import speak for themselves and one could easily implement them following the instructions on the user interface, the third one deserves a more detailed description, as it is still unique in the data publishing landscape.
The workflow has been thoroughly described from the user's perspective in a
At one of the supported data portals (
Depending on the portal, the user finds either the occurrence identifier of the specimen, or a database record identifier of the specimen record, and copies that into the respective upload field of the ARPHA system (Fig.
After the user clicks on "Add," a progress bar is displayed, while the specimens are being uploaded as material citations.
The new material citations are rendered in both human- and machine-readable DwC format in the Materials section of the respective Taxon treatment and can be further edited in AWT, or downloaded from there as a CSV file.
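The idea of one record being available both as machine-readable Darwin Core and as a human-readable material citation can be sketched as follows. This is not the ARPHA implementation: the function names, the `term: value` rendering and the sample record are hypothetical, and only the Darwin Core term names used as column headers are taken from the standard.

```python
import csv
import io
from typing import Dict, List


def records_to_csv(records: List[Dict[str, str]], columns: List[str]) -> str:
    """Serialise records as CSV with Darwin Core terms as column headers."""
    output = io.StringIO()
    writer = csv.DictWriter(
        output, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)      # missing terms are written as empty fields
    return output.getvalue()


def render_material_citation(record: Dict[str, str], columns: List[str]) -> str:
    """Render one record as a readable 'term: value' string, skipping empty terms."""
    return "; ".join(
        f"{name}: {record[name]}" for name in columns if record.get(name)
    )
```

With a record such as `{"occurrenceID": "urn:example:1", "scientificName": "Aus bus"}` and the columns `occurrenceID`, `scientificName` and `country`, the first function produces a two-line CSV with an empty `country` field, and the second omits the empty term from the rendered citation.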
A data paper is a scholarly journal publication whose primary purpose is to describe a dataset or a group of datasets, rather than to report a research investigation (
to provide a citable journal publication that brings scholarly credit to data creators,
to describe the data in a structured human-readable form, and
to bring the existence of the data to the attention of the scholarly community.
The description should include several important elements (usually called metadata, or “description of data”) that document, for example, how the dataset was collected, which taxa it covers, the spatial and temporal ranges and regional coverage of the data records, provenance information concerning who collected and who owns the data, details of which software (including version information) was used to create the data, or could be used to view the data, and so on.
Most Pensoft journals welcome submission and publication of data papers, which can be indexed and cited like any other research article, thus bringing registration of priority, a permanent publication record, recognition, and academic credit to the data creators. In other words, the data paper is a mechanism to acknowledge efforts in authoring ‘fit-for-use’ and enriched metadata describing a data resource. The general objective of data papers in biodiversity science is to describe all types of biodiversity data resources, including environmental data resources.
An important feature of data papers is that they should always be linked to the published datasets they describe, and that link (a URL, ideally resolving a DOI) should be published within the paper itself. Conversely, the metadata describing the dataset held within data archives should include the bibliographic details of the data paper once that is published, including a resolvable DOI. Ideally, the metadata should be identical in the two places (the data paper and the data archive), although this may be difficult to achieve with some archive metadata templates, so that there may be two versions of the metadata. This is why referring to the data paper DOI is so important.
In principle, any valuable dataset hosted in a trusted data repository can be described in a data paper and published following these Guidelines. Each data paper consists of a set of elements (sections), some of which are mandatory and some not. An example of such a list of elements needed to describe primary biodiversity data is available in the section Data Papers Describing Primary Biodiversity Data below.
Sample data papers which can be used as illustration of the concept can be downloaded from several Pensoft journals, for example, ZooKeys (
All claims in a data paper should be substantiated by the associated data. If the methodology is standard, please explain in what respects your data are unique and merit a publication in the form of a data paper.
Alternatively, if the methodology used to acquire the data differs significantly from established approaches, please consider submitting your data to an open repository and associating them with a standard or data paper, in which these methodologies can be more fully explained.
At the time of submission of the data paper manuscript, the data described should be freely available online in a public repository under a suitable data license, so that they can be peer-reviewed, retrieved anonymously for re-use, resampling and redistribution by anyone for any purpose, subject to one condition at most — that of proper attribution using scholarly norms (see the Data Publishing Licenses and How to Cite Data sections, above). The repository, or at least one public mirror thereof, should not be under the control of the submitting authors. The relevant data package DOIs or accession numbers, as well as any special instructions for acquiring and re-publishing the data, should be included in the submitted data paper manuscript.
The procedures for data retrieval should be described, along with the mechanisms for updating and correcting information. This can be achieved by referencing an existing description if that is up to date, citable in its exact version, and publicly accessible on the web.
All methodological details necessary to replicate the original acquisition of the raw data have to be included in the data paper, along with a description of all data processing steps undertaken to transform the raw data into the form in which the data have been deposited in the repository and presented in the paper. Authors should discuss any relevant sources of error and how these have been addressed.
In addition to data papers describing new data resources, data papers describing legacy data are also welcome, as long as the current version of these is publicly accessible and can be cited. If possible, authors should outline possible re-use cases, taking into account that future uses of the data might involve researchers from different backgrounds. We encourage the provision of tools to facilitate visualization and re-use of the data.
For primary biodiversity (species-by-occurrence) data, authors are
A more universal and innovative approach is conversion of the Ecological Metadata Language (EML) file available from IPT or other data platforms, such as DataONE or LTER, into data paper manuscripts in the ARPHA Writing Tool (Fig.
Primary biodiversity data as defined by
Currently, the majority of primary biodiversity data consists of species-by-occurrence data records available from published sources and/or natural history collections. Other types of primary biodiversity data that merit publication are observational data and multimedia resources in biodiversity.
The GBIF Integrated Publishing Toolkit (IPT) facilitates authoring of metadata based on the GBIF Metadata Profile (GMP) that was developed to standardise how biodiversity data resources are described for discovery through the GBIF network. For further information, see the
The GMP conforms to the Ecological Metadata Language (EML) specification with some additional terms drawn from the Natural Collections Descriptions (NCD) set of terms for describing natural history collections and the ISO 19139: North American Profile of ISO 19115:2003 - Geographic Information - Metadata. The GMP elements, together with their descriptions, are listed below.
The structure of a Data Paper largely resembles that of a standard research paper. However, it must contain several specific elements. These elements are listed in Table
The
The IPT is a server-side software tool that allows users to author metadata, map databases or upload text files that conform to the Darwin Core standard, to install extensions and vocabularies to allow for richer content and, ultimately, to register datasets for publication and sharing through GBIF. IPT operators take on the responsibility of running and maintaining an Internet server, which should remain online and addressable. Any set of metadata can be downloaded from any IPT (version 2.0.2+) into RTF format in the form of a data paper manuscript (Fig.
Therefore, data authors have the following options:
Install and run an IPT instance, registering it with GBIF.
Use an account on the Pensoft IPT Data Hosting Centre at
Approach any other existing IPT operator and seek to host data through them.
GBIF provides a
Once you have decided to publish your data and generate a data paper manuscript through the GBIF IPT, please consider the following simple rules:
The metadata within one IPT generated archive must describe only one core set of biodiversity data (e.g., either occurrence data, a taxon checklist, or sample data), that is uploaded through the IPT, indexed in the GBIF Data Portal, and published in Darwin Core Archive Format. The IPT will generate an RTF manuscript that will describe the core dataset. The link to the core dataset will appear in your manuscript under the heading “Data published through GBIF”.
Additional datasets that relate to the core one, e.g., ecological or environmental data, can also be briefly described within the same resource and linked through the "External links" field of the IPT. Those datasets will appear in the section “External datasets” of your manuscript.
It is possible to open a resource and enter the respective metadata for it without upload of a core dataset. This option should be used to describe a dataset that has been already uploaded to a repository (e.g., data previously indexed through GBIF for which you have a GBIF link). In this case, you will need to insert the link(s) to the dataset(s) into the "External links" field of the IPT.
The option explained in point 3 above can also be used to describe non-digital natural history collections.
We strongly recommend uploading a core set of biodiversity data through the IPT in Darwin Core Archive format, which facilitates not only publication of your data but also their sharing and integration with other data, and hence their re-use and dissemination.
As described in the previous section, data creators will be able to author data paper manuscripts in various ways. However, to lower the technical barrier and make the process easy to adopt, a conversion tool to automatically export metadata to an RTF manuscript is available in IPT 2.0.2+. The step-by-step process for generating a data paper manuscript from the metadata is described below:
The Data Creator completes the metadata for a biodiversity resource dataset using the metadata editor in IPT 2.0.2+. IPT assigns the Persistent Identifier to the authored metadata.
Once the metadata are complete to the best of the author's ability, a data paper manuscript may be generated automatically from these metadata using the automated tool available within IPT 2.0.2+ (for RTF download from the dataset webpage, see Fig.
The author checks the created manuscript, completing the textual Introduction or other appropriate sections, and then submits it for publication in the data paper section of an appropriate Pensoft journal through the online submission system (except for the Biodiversity Data Journal, One Ecosystem or RIO Journal, as these accept manuscripts in a different format).
The manuscript undergoes peer review according to the journal's policies and the Guidelines for Reviewers of data papers (below). After review, and in case of acceptance, the manuscript is returned to the author by the editor along with the reviewers' and editorial comments for any required pre-publication modifications.
The corresponding author inserts all accepted corrections or additions recommended by the reviewers and the editor in the metadata (not the manuscript of the paper), thereby improving the metadata for the data resource itself. Once the metadata have been improved, the final revised version of the data paper manuscript can then be created using the same automated metadata-to-manuscript conversion tool within IPT 2.0.2+ that was used to create the initially submitted draft (RTF download, see Fig.
After manual re-insertion of the text of the Introduction, the revised data paper manuscript can then be submitted to the journal for final review and subsequent acceptance decision.
Once the manuscript is accepted, it goes to a proofing stage, at which point submission, revision, acceptance and publication dates are added by the publisher, and a Digital Object Identifier (DOI) is assigned to the data paper. This facilitates persistent accessibility of the online scholarly publication.
Once the final proofs are approved by the author, the data paper is published in four different formats: (a) semantically enhanced HTML to provide interactive readings and links to external resources, (b) PDF, (c) final published XML to be archived in PubMedCentral and other archives to facilitate machine readability and future data mining, and eventually also (d) print format identical to the PDF version.
After publication, the DOI of the data paper is linked with the Persistent Identifier of the metadata document registered in the GBIF Registry, which is given in the data paper. This provides multiple cross-linking between the data resource, its corresponding metadata and the corresponding data paper.
Depending on the journal's policies and scope, the published data paper will be actively disseminated through the world's leading indexers and archives, including Web of Knowledge (ISI), PubMedCentral, Scopus, Zoological Record, Google Scholar, CAB Abstracts, Directory of Open Access Journals (DOAJ), EBSCOHost, and others.
An innovative approach, similar to that which converts EML metadata into RTF, is the direct conversion of an EML file (supported versions 2.1.1 and 2.1.0) downloaded from GBIF IPT (Fig.
The users of ARPHA need to save a dataset's metadata as an EML file (versions 2.1.1 and 2.1.0, support for other versions is being continually updated) from the website of the respective data provider (see Fig.
Click on the "Start a manuscript" button in AWT and then select "Biodiversity Data Journal" and the "Data paper (Biosciences)" template (Fig.
Upload this file via the "Import a manuscript" function on the AWT interface (Fig.
Continue with updating and editing and finally submit your manuscript inside AWT.
Metadata descriptions of primary biodiversity data used in the GBIF Metadata Profile (GMP) and the Integrated Publishing Toolkit (IPT) are based primarily on the
Authors intending to publish data papers describing ecological and environmental data are advised to use the following steps:
Deposit your data in an ISO-certified public (international or institutional) repository.
Write a data paper manuscript following the structure of the sample data paper, adding additional elements/sections to the manuscript if these are necessary to describe the specifics of your dataset(s).
Add the permanent link(s) in the manuscript to the particular dataset(s) hosted in the repository you have chosen.
Submit the data paper to an appropriate Pensoft journal.
Once the paper is accepted and published, enter the bibliographic reference and the DOI of the data paper in the relevant metadata field of your data package in the repository that hosts your data.
Alternatively, EML metadata files (versions 2.1.1 and 2.1.0) hosted in DataONE and LTER can automatically be converted into a data paper manuscript using the ARPHA Writing Tool import workflow described in the previous section (see also
Pensoft journals require, as a condition for publication, that genome data supporting the results in the paper should be archived in an appropriate public archive, and accession numbers must be included in the final version of the paper. Sufficient additional metadata (such as sample locations, individual identities, etc.) should also be provided to allow easy repetition of analyses presented in the paper. For best practice in following community metadata standards, see the many data-type specific standards and checklists provided by the
DNA sequence data should be archived in
Barcode-of-Life COI (mitochondrial encoded
The BARCODE Data Release Paper manuscript should describe:
The scope of taxonomic, ecological, and geographic coverage;
The sources of voucher specimens;
The sampling and laboratory protocols used;
The processes used to identify the species to which voucher specimens belong.
The manuscript should provide summaries of data density and quality such as those shown in Table
Manuscripts should also include an Appendix with a table that presents:
The taxonomic identification (a formal species name or a provisional species label in a public database);
The collecting locality to a reasonable level of precision;
The voucher specimen identifier in the format required in the BARCODE data standard;
The accession number in GenBank, EMBL or DDBJ; and
The Barcode of Life Data Systems (BOLD) record number (optional).
An increasing number of software tools also merit description in scholarly publications. The structure of the data paper proposed below for such software tools is largely based on the
Software citation principles have been developed by the
Importance: Software should be considered a legitimate and citable product of research.
Credit and Attribution: Software citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the software.
Unique Identification: A software citation should include a method for identification that is machine actionable, globally unique, interoperable, and recognizable.
Persistence: Unique identifiers and metadata describing the software and its disposition should persist.
Accessibility: Software citations should facilitate access to the software itself and to its associated metadata, documentation, data, and other materials.
Specificity: Software citations should facilitate identification of, and access to, the specific version of software that was used.
Based on an analysis of several use cases such as publishing a software paper or publishing papers that cite software, basic metadata requirements were identified: unique identifier, software name, author(s), contributor role, version number, release date, location/repository, indexed citations, software license, description, keywords.
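As a minimal sketch, the basic metadata requirements listed above could be checked before a software paper is submitted. The field names below are hypothetical Python-style keys chosen for this illustration; they are not a vocabulary defined by the working group or by DOAP, and the example record is invented.

```python
# Hypothetical keys, one per basic metadata requirement named in the text.
REQUIRED_FIELDS = (
    "unique_identifier", "software_name", "authors", "contributor_role",
    "version_number", "release_date", "location", "indexed_citations",
    "software_license", "description", "keywords",
)


def missing_fields(metadata: dict) -> list:
    """Return the required fields that are absent or empty, in listed order."""
    return [name for name in REQUIRED_FIELDS if not metadata.get(name)]


# Invented example record; 'keywords' is deliberately left out.
example = {
    "unique_identifier": "10.1234/example.software",
    "software_name": "BioDiv",
    "authors": ["A. Example"],
    "contributor_role": "developer",
    "version_number": "1.0.0",
    "release_date": "2017-01-01",
    "location": "https://example.org/biodiv",
    "indexed_citations": ["Example A (2017) A hypothetical paper."],
    "software_license": "MIT",
    "description": "A web-based tool for calculation of biodiversity indexes",
}
print(missing_fields(example))
```

A check of this kind only confirms that each element is present; whether the identifier is persistent or the version is the one actually used still needs human judgement.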
While the provision of detailed specifications and recommendations around metadata standards was beyond the scope of the working group, DOAP is mentioned together with some other more recent community initiatives. It is expected that a new working group will take these software citation principles forward by supporting potential implementers and developing metadata standards, following the example of the FORCE11 Data Citation Working Group (
According to DOAP, major properties of a software tool description include elements such as homepage, developer, programming language and operating system. Other properties include: Implements specification, anonymous root, platform, browse, mailing list, category, description, helper, tester, short description, audience, screenshots, translator, module, documenter, wiki, repository, name, repositorylocation, language, service endpoint, created, download mirror, vendor, old homepage, revision, download page, license, bug database, maintainer, blog, file-release, and release.
A basic version of a DOAP description can be generated using an online tool called
A sample structure of a Software Description paper was introduced by Pensoft in 2011 and has been in use since (
Data papers describing data resources — or manuscripts linked to open data resources that underpin the scientific analyses — that are submitted to Pensoft journals will be subjected to peer review according to the respective journal's policies (e.g., conventional pre-publication anonymous, non-anonymous, or entirely open and public, including post-publication review) as a routine method to enhance the completeness, truthfulness and accuracy of the descriptions of the relevant data resources, thereby improving their use and uptake. A specific feature of the ARPHA-XML journal publishing workflow used by the
Peer review of data papers is expected to evaluate the completeness and quality of the dataset(s) description (metadata), as well as the publication value of data. This may include the appropriateness and validity of the methods used, compliance with applicable standards during collection, management and curation of data, and compliance with appropriate metadata standards in the description of the data resources. In order to allow for accuracy and usefulness, metadata needs to be as complete and descriptive as possible.
Reviewers will consider the following aspects: (a) the quality of the manuscript, (b) the quality of the data, and (c) the consistency between the description within the data paper and the repository-held metadata relating to the data resource itself.
Peer review of the data is rather problematic in the current scholarly publishing practice. There are several reasons for that:
Authors are not sufficiently trained in and accustomed to the good practices of formatting and describing their data.
Reviewers do not pay sufficient attention to data reviews. A proper review of large datasets may appear simply impossible due to the volume of work.
Editors are not sufficiently experienced in data review, which often requires specific training in data management.
Data are of different types and specificities, which makes it harder to find suitable reviewers or editors.
Data standards to consider as a "rule-to-follow" are at different levels of development and adoption by different communities.
Several Pensoft journals offer an additional service for auditing and correcting data, which might be a solution for those authors or their institutions who really care about data quality and re-use.
Best practice recommendations for evaluating data papers or manuscripts that are submitted together with the underlying data are summarised below.
Does the manuscript conform to the focus and scope of this journal?
Does the manuscript contain unpublishable — for example fraudulent or pseudoscientific — content?
Does the manuscript contain sufficiently detailed information to merit publication?
Do the title, abstract and keywords accurately reflect the contents of the manuscript?
Is the manuscript internally consistent and suitably organized?
Is the manuscript written in grammatically and stylistically correct English?
Are the methods relevant to the study and adequately described?
Did the authors cite most of the literature pertinent to the subject?
Are relevant non-textual media (e.g. tables, figures, audio, video) used to an appropriate extent and in a suitable manner?
Have abbreviations and symbols been properly defined?
Are the illustrations of sufficient quality?
Does the manuscript put the data resource being described properly into the context of prior research, citing pertinent articles and datasets?
Are conflicts of interest, relevant permissions and other ethical issues addressed in an appropriate manner?
Are the data freely and openly available under an appropriate Creative Commons license or waiver?
Is the repository to which the data are submitted appropriate for the nature of the data?
Are the data completely and consistently recorded within the dataset(s)?
Does the data resource cover scientifically important and sufficiently large region(s), time period(s) and/or group(s) of taxa to be worthy of a separate publication?
Are the data consistent internally and described using applicable standards (e.g. in terms of file formats, file names, file size, units and metadata)?
Are the methods used to process and analyse the raw data, thereby creating processed data or analytical results, sufficiently well documented that they could be repeated by third parties?
Are the data plausible, given the protocols? Authors are encouraged to report any tests undertaken to address this point.
Does the manuscript provide an accurate description of the data?
Does the manuscript properly describe how to access the data?
Are the methods used to generate the data (including calibration, code and suitable controls) described in sufficient detail?
Is the dataset sufficiently unique to merit publication as a data paper?
Are the use cases described in the data paper consistent with the data presented? Would other possible use cases merit comment in the paper?
Have possible sources of error been appropriately addressed in the protocols and/or the paper?
Is anything missing in the manuscript or the data resource itself that would prevent replication of the measurements, or reproduction of the figures or other representations of the data?
Are all claims made in the manuscript substantiated by the underlying data?
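Some of the data-quality questions above (completeness, internal consistency, plausibility) lend themselves to simple automated pre-checks that authors or reviewers could run before a manual review. The following is a minimal sketch only. The choice of required Darwin Core terms is an assumption made for illustration, and the date test is deliberately stricter than Darwin Core, which also allows date ranges and partial dates in eventDate.

```python
from datetime import date

# Assumed minimum set of Darwin Core terms for this illustration.
REQUIRED_TERMS = ("occurrenceID", "scientificName", "basisOfRecord")


def _in_range(value, low, high) -> bool:
    """True if value can be read as a number within [low, high]."""
    try:
        return low <= float(value) <= high
    except (TypeError, ValueError):
        return False


def check_record(record: dict) -> list:
    """Return a list of human-readable problems found in one record."""
    problems = [f"missing {term}" for term in REQUIRED_TERMS
                if not record.get(term)]
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat not in (None, "") and not _in_range(lat, -90, 90):
        problems.append("decimalLatitude out of range")
    if lon not in (None, "") and not _in_range(lon, -180, 180):
        problems.append("decimalLongitude out of range")
    event_date = record.get("eventDate")
    if event_date:
        try:
            # Accepts full calendar dates such as 2015-06-01 only.
            date.fromisoformat(event_date)
        except (TypeError, ValueError):
            problems.append("eventDate is not an ISO 8601 calendar date")
    return problems


# Invented records for demonstration.
good = {"occurrenceID": "urn:example:1", "scientificName": "Aus bus",
        "basisOfRecord": "PreservedSpecimen", "decimalLatitude": "42.7",
        "decimalLongitude": "23.3", "eventDate": "2015-06-01"}
bad = {"scientificName": "Aus bus", "decimalLatitude": "95",
       "decimalLongitude": "23.3", "eventDate": "01/06/2015"}
print(check_record(good))
print(check_record(bad))
```

Checks like these flag formal problems only; they complement expert judgement on taxonomic and methodological soundness and do not replace it.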
We thank several colleagues who commented on or contributed to an earlier version of the draft: Tim Robertson and Kyle Braak (GBIF), Todd Vision and Peggy Schaeffer (Dryad). We also thank all our authors, reviewers, editors and partners for their support in testing and using these data publishing guidelines and workflows. Special thanks are due to Donat Agosti, Terry Catapano, Guido Sautter and Willi Egloff from Plazi (Switzerland) for the years-long successful collaboration and friendship, and to Robert Mesibov (Tasmania) and Florian Wetzel (Museum für Naturkunde, Berlin) who provided open pre-submission peer reviews to the manuscript.
The present guidelines were elaborated through the FP7 funded project
Recommendation of Dryad to cite both the original article in association with which the data were published and the data themselves.
Import of specimen records from GBIF, BOLD, iDigBio and PlutoF into ARPHA manuscripts.
The user interface of the ARPHA Writing Tool through which single or multiple specimen records from GBIF, BOLD, iDigBio and PlutoF are imported through records identifiers.
Occurrence records and taxonomic treatments (if present in the article), published in the Biodiversity Data Journal, are exported in two separate Darwin Core Archives (DwC-A) and are available for direct download or harvesting via web services.
The metadata from the GBIF Integrated Publishing Toolkit (IPT) can be downloaded as RTF or EML files and submitted to Pensoft's journals as data paper manuscripts.
Automated creation of data paper manuscripts from Ecological Metadata Language (EML) metadata in the ARPHA Writing Tool.
Selection of the journal and "Data Paper (Biosciences)" template in the ARPHA Writing Tool.
Import of a data paper manuscript from EML file in the ARPHA Writing Tool.
Data publishing licenses recommended by Pensoft.
|
|
Open Data Commons Attribution License |
Creative Commons CC-Zero Waiver |
Open Data Commons Public Domain Dedication and License |
Structure of a data paper and its mapping from GBIF IPT Metadata Profile elements.
|
|
<TITLE> | Derived from the ‘title’ element. Format: a centred sentence without a full stop |
<Authors> | Derived from the ‘creator’, ‘metadataProvider’ and ‘AssociatedParty’ elements. From these elements, combinations of ‘first name’ and ‘last name’ are derived, separated by commas (,). |
<Affiliations> | Derived from the ‘creator’, ‘metadataProvider’ and ‘AssociatedParty’ elements. |
<Corresponding authors> | Derived from the ‘creator’ and ‘metadataProvider’ elements. |
<Received, Revised, Accepted, and Published dates> | These will be inserted manually by the Publisher of the data paper, to indicate the dates of original manuscript submission, revised manuscript submission, acceptance of manuscript and publication of the manuscript as a data paper in the journal. |
<Citation> | This will be inserted manually by the Publisher of the data paper. |
<Abstract> | Derived from the ‘abstract’ element. Format: indented from both sides. |
<Keywords> | Derived from ‘keyword’ element. Keywords are separated by commas (,). |
<Introduction> | Free text. |
<Taxonomic Coverage> | Derived from the Taxonomic Coverage elements. |
<Spatial Coverage> | Derived from the Spatial Coverage elements. These elements are ‘general geographic description’, ‘westBoundingCoordinate’, ‘eastBoundingCoordinate’, ‘northBoundingCoordinate’, |
<Temporal Coverage> | Derived from the Temporal Coverage elements namely, ‘beginDate’ and ‘endDate’. |
<Project Description> | Derived from project elements as described in the GBIF Metadata Profile. |
<Natural Collections Description> | Derived from project NCD elements as described in the GBIF Metadata profile. These elements are ‘parentCollectionIdentifier’, ‘collectionName’, ‘collectionIdentifier’, ‘formationPeriod’, ‘livingTimePeriod’, ‘specimenPreservationMethod’, and ‘curatorialUnit’. |
<Methods> | Derived from methods elements as described in the GBIF Metadata Profile. |
<Dataset descriptions> | Derived from physical and other elements as described in the GBIF Metadata Profile. |
<Additional Information> | Derived from ‘additionalInfo’ element. |
<References> | Derived from ‘citation’ element. |
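The mapping in the table above can be sketched in a few lines. This is not the conversion code used by the IPT or the ARPHA Writing Tool; the function names and the sample EML document are invented for illustration, and only a handful of sections are covered. The sketch assumes the usual EML layout in which the dataset element and its children carry no namespace prefix.

```python
import xml.etree.ElementTree as ET

# Invented, heavily shortened EML document used only for this illustration.
SAMPLE_EML = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">
  <dataset>
    <title>Spiders of a hypothetical reserve</title>
    <creator>
      <individualName><givenName>Ana</givenName><surName>Example</surName></individualName>
    </creator>
    <metadataProvider>
      <individualName><givenName>Ben</givenName><surName>Sample</surName></individualName>
    </metadataProvider>
    <abstract><para>Occurrence records collected in 2015.</para></abstract>
    <keywordSet><keyword>Araneae</keyword><keyword>occurrence</keyword></keywordSet>
    <coverage>
      <temporalCoverage><rangeOfDates>
        <beginDate><calendarDate>2015-05-01</calendarDate></beginDate>
        <endDate><calendarDate>2015-09-30</calendarDate></endDate>
      </rangeOfDates></temporalCoverage>
    </coverage>
  </dataset>
</eml:eml>"""


def _person_names(dataset, tag):
    """Collect 'first name last name' strings from all elements named tag."""
    names = []
    for party in dataset.findall(tag):
        given = party.findtext("individualName/givenName", default="").strip()
        sur = party.findtext("individualName/surName", default="").strip()
        full = f"{given} {sur}".strip()
        if full:
            names.append(full)
    return names


def eml_to_sections(eml_text: str) -> dict:
    """Map a few EML elements to data paper sections, as in the table."""
    dataset = ET.fromstring(eml_text).find("dataset")
    authors = []
    # The EML element is spelled 'associatedParty' (lower-case initial).
    for tag in ("creator", "metadataProvider", "associatedParty"):
        for name in _person_names(dataset, tag):
            if name not in authors:  # list each person once
                authors.append(name)
    dates = "coverage/temporalCoverage/rangeOfDates"
    begin = dataset.findtext(f"{dates}/beginDate/calendarDate", default="")
    end = dataset.findtext(f"{dates}/endDate/calendarDate", default="")
    return {
        "Title": dataset.findtext("title", default="").strip(),
        "Authors": ", ".join(authors),
        "Abstract": " ".join(p.text.strip()
                             for p in dataset.findall("abstract/para") if p.text),
        "Keywords": ", ".join(k.text.strip()
                              for k in dataset.findall("keywordSet/keyword") if k.text),
        "Temporal Coverage": f"{begin} to {end}" if begin and end else "",
    }


sections = eml_to_sections(SAMPLE_EML)
for heading, text in sections.items():
    print(f"<{heading}> {text}")
```

Sections such as the Introduction are free text and have no metadata counterpart, which is why they are re-inserted manually after each automated conversion.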
Suggested data fields for a BARCODE Data Release Paper
|
|
Range of records per species | Min-Max |
Average sequence length (and Min/Max) | |
Range of intraspecific variation* | Min-Max |
Median variation within species* | X% |
Range of divergence between closest species-pairs** | Min-Max |
Median divergence between closest species-pairs** |
* Calculated as the arithmetic average of all K2P distances between specimens in each species.
** Closest species pairs refers to each species and the species with which it has the least divergent barcode sequence. The true phylogenetic sister-species may not be included in the dataset, and could have a lower interspecies divergence.
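The within-species figures in the table rest on Kimura two-parameter (K2P) distances. For two aligned sequences with a proportion P of transitional and Q of transversional differences, the distance is d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)). The sketch below computes this distance and the per-species averages described in the first footnote. The function names and the toy sequences are assumptions made for illustration; established barcoding software should be preferred for real analyses.

```python
import math
from itertools import combinations
from statistics import mean, median

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """K2P distance of two aligned sequences; gaps and ambiguity codes are skipped."""
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1   # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # Raises ValueError for saturated pairs, where the distance is undefined.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def mean_intraspecific(sequences: list) -> float:
    """Arithmetic average of all pairwise K2P distances within one species."""
    if len(sequences) < 2:
        raise ValueError("at least two sequences are needed")
    return mean(k2p_distance(a, b) for a, b in combinations(sequences, 2))


def summarise(by_species: dict) -> dict:
    """Min, max and median of the per-species averages (species with >= 2 records)."""
    values = [mean_intraspecific(seqs) for seqs in by_species.values()
              if len(seqs) >= 2]
    return {"min": min(values), "max": max(values), "median": median(values)}


# Toy alignment, invented for demonstration.
toy = {"sp1": ["AAAAAAAAAA", "GAAAAAAAAA"],
       "sp2": ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAC"]}
print(summarise(toy))
```

Multiplying the results by 100 gives the percentage values normally reported in such tables.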
Metadata elements (based on EML and DOAP) to be included in a data paper describing a software tool
|
|
<TITLE> | Derived from the ‘name’ element. This must be extended to a concise description of the software tool and its implementation, e.g.: “BioDiv, a web-based tool for calculation of biodiversity indexes”. Format: This is a centred sentence without full stop (.) at the end. |
<Authors> | Derived from the ‘developer’, ‘maintainer’ and eventually ‘helper’, ‘tester’, and ‘documenter’. From these elements, combinations of ‘first name’ and ‘last name’ are derived, separated by commas (,). Corresponding affiliations of the authors are denoted with numbers (1, 2, 3,...) in superscript at the end of each last name. If two or more authors share the same affiliation, it will be denoted by use of the same superscript number. Format: centred. |
<Affiliations> | Derived from the ‘developer’, ‘maintainer’ and ‘helper’. From these elements, combinations of ‘Organisation Name’, ‘Address’, ‘Postal Code’, |
<Corresponding authors> | Derived from any of the ‘developer’, ‘maintainer’, ‘helper’, ‘tester’, and |
<Received, Revised, Accepted, and Published dates> | These will be inserted manually by the Publisher of the data paper to indicate the dates of original manuscript submission, revised manuscript submission, acceptance of manuscript and publishing of the manuscript as data paper in the journal. |
<Citation> | This will be inserted manually by the Publisher of the data paper. It will be a combination of Authors, Year of data paper publication (in parentheses), Title, Journal Name, Volume, Issue number (in parentheses), and DOI of the data paper in both native and resolvable HTTP format. |
<Abstract> | Derived from the ‘short description’ element. Format: indented from both sides. |
<Keywords> | Keywords should reflect most important features of the tool and areas of implementation, and should be separated by commas (,). |
<Introduction> | Free text. |
<Project Description> | Derived from ‘description’ element; if applicable, it should also include sub-elements such as ‘title’ of the project, ‘personnel’ involved in the project, ‘funding’ sources, and other appropriate information. |
<Web Location (URIs)> | Derived from the elements ‘homepage’, ‘wiki’, ‘download page’, ‘download mirror’, ‘bug database’, ‘mailing list’, ‘blog’, ‘vendor’ |
<Technical specification> | Derived from the elements ‘platform’, ‘programming language’, |
<Repository> | Derived from the elements ‘repository type’ (CVS, SVN, Arch, BK), |
<License> | Derived from the ‘license’ element |
<Implementation> | Derived from ‘Implements specification’ and ‘audience’ elements; please remember that this section is of primary interest to end users, and should be written in detail, if possible including use cases, citations and links. |
<Additional Information> | Any kind of helpful additional information may be included. |
<Acknowledgement> | Lists all acknowledgments at the authors' discretion. |
<References> | Includes literature references and web links cited in the text. |
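The <Citation> element described in the table combines a fixed set of parts, ending with the DOI in both native and resolvable HTTP form. A minimal sketch of such a formatter follows. The punctuation between the parts is an assumption, as the exact style is set by the publisher, and the example values, including the DOI, are invented.

```python
def doi_url(doi: str) -> str:
    """Turn a native DOI into its resolvable HTTP form."""
    return f"https://doi.org/{doi}"


def format_citation(authors, year, title, journal, volume, issue, doi) -> str:
    """Combine the citation parts in the order given in the table."""
    author_text = ", ".join(authors)
    return (f"{author_text} ({year}) {title}. {journal} {volume}({issue}). "
            f"doi: {doi}, {doi_url(doi)}")


# Invented example values.
citation = format_citation(
    ["Example A", "Sample B"], 2017, "A hypothetical software tool",
    "Hypothetical Data Journal", 5, 2, "10.1234/example.1")
print(citation)
```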