Research Ideas and Outcomes : Project Report
Corresponding author: Lyubomir Penev (penev@pensoft.net)
Received: 05 Apr 2017 | Published: 05 Apr 2017
© 2017 Lyubomir Penev, Teodor Georgiev, Peter Geshev, Seyhan Demirov, Viktor Senderov, Iliyana Kuzmova, Iva Kostadinova, Slavena Peneva, Pavel Stoev
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Penev L, Georgiev T, Geshev P, Demirov S, Senderov V, Kuzmova I, Kostadinova I, Peneva S, Stoev P (2017) ARPHA-BioDiv: A toolbox for scholarly publication and dissemination of biodiversity data based on the ARPHA Publishing Platform. Research Ideas and Outcomes 3: e13088. https://doi.org/10.3897/rio.3.e13088
The ARPHA-BioDiv Toolbox for Scholarly Publishing and Dissemination of Biodiversity Data is a set of standards, guidelines, recommendations, tools, workflows, journals and services, based on the ARPHA Publishing Platform of Pensoft, designed to ease scholarly publishing of biodiversity and biodiversity-related data that are of primary interest to the EU BON and GEO BON networks. ARPHA-BioDiv is based on the infrastructure, knowledge and experience gathered in the years-long research, development and publishing activities of Pensoft, upgraded with novel tools and workflows that resulted from the FP7 project EU BON.
The transformation from human- to machine-readability of published content is a key feature of the dramatic changes experienced by academic publishing in the last decade. Non-machine readable PDFs, either digitally born or scanned from paper prints, require significant additional effort of post-publication markup and data extraction into a structured form, in order to address issues of interoperability and reuse of publications and data (
The next stage of development of integrated narrative and data publishing was landmarked by the Biodiversity Data Journal (BDJ) and its associated authoring tool, ARPHA Writing Tool (AWT), launched within the ViBRANT EU Framework Seven (FP7) project (
The third stage of Pensoft's effort towards open science publishing was the launch of the Research Ideas and Outcomes (RIO) journal that publishes all outputs of the research cycle, beginning with research ideas; project proposals; data and software management plans; data; methods; workflows; software; and going all the way to project reports; research and review articles, using the most transparent, open and public peer review process (
Eventually, all these years spent in development of novel approaches to publication of biodiversity data resulted in a set of standards, guidelines, workflows, tools, journals and services which we define here as ARPHA-BioDiv: A Toolbox for Scholarly Publishing and Dissemination of Biodiversity Data (Fig.
The market for online collaborative writing tools has long been dominated by Google Docs. However, being too generic, it has not met the specific demands of academic publishing and, in recent years, several start-ups have developed platforms and services to fill this growing gap in the publishing market. Examples include Overleaf (originally WriteLaTeX), Authorea, ShareLaTeX and others, most of them based on LaTeX but differing in complexity and in their features for manuscript writing. For people unfamiliar with LaTeX, the learning curve is steep, which explains the comparatively restricted usage, mostly centred around the LaTeX community. Currently, none of the above-mentioned tools provides all the components of an end-to-end authoring, peer review and publishing pipeline. For instance, most tools lack a peer review system and rely on integrations with well-established platforms, such as Editorial Manager, ScholarOne, or others.
ARPHA has emerged as the first ever publishing platform to support the full life cycle of a manuscript, from authoring through submission, peer review, publication and dissemination, within a single, fully Web- and XML-based, online collaborative environment. The acronym ARPHA stands for "Authoring, Reviewing, Publishing, Hosting and Archiving" - all in one place, for the first time. The most distinct feature of ARPHA, amongst others, is that it consists of two interconnected but independently functioning journal publishing platforms. Thus, it can provide to journals and publishers either of the two or a combination of both services by enabling a smooth transition from the conventional, document-based workflows to fully XML-based publishing (Fig.
ARPHA consists of two independent journal publishing workflows: (1) ARPHA-XML, where the manuscript is written and processed via ARPHA Writing Tool and (2) ARPHA-DOC, where the manuscript is submitted and processed as document file(s).
ARPHA-XML: Entirely XML- and Web-based, collaborative authoring, peer review and publication workflow;
ARPHA-DOC: Document-based submission, peer review and publication workflow.
The two workflows use a one-stop login interface and a common peer-review and editorial manuscript tracking system. The XML-based workflow in use at Biodiversity Data Journal (BDJ) was the first of its kind back in 2013 and has since seen continuous refinement over the course of more than three years of active use by the biodiversity research community. It is also now used by the Research Ideas and Outcomes (RIO), One Ecosystem and BioDiscovery journals. The second, file-based submission workflow, is currently used by ZooKeys, PhytoKeys, MycoKeys, Journal of Hymenoptera Research, Nature Conservation, Deutsche Entomologische Zeitschrift, Zoosystematics and Evolution, NeoBiota and other journals, published by Pensoft.
At the core of the ARPHA-XML workflow is the collaborative online manuscript authoring module called ARPHA Writing Tool (AWT). AWT’s innovative features allow for upfront markup, automation and structuring of the free-text content during the authoring process, import/download of structured data into/from human-readable text, automated export and dissemination of small data, on-the-fly layout of composite figures and import of literature and data references from online resources. ARPHA-XML is also perhaps the first journal publishing system that allows for submission of complex manuscripts via a dedicated API.
The generic and domain-specific features of ARPHA (used for publication and dissemination of biodiversity data via the ARPHA-BioDiv toolbox) are listed in Table
Generic features of the ARPHA Journal Publishing Platform
FEATURE | ARPHA-DOC | ARPHA-XML |
ARPHA is a combination of software platform and a wide range of associated services. | X | X |
ARPHA serves individual journals or multiple journal platforms. | X | X |
Integrated with the industry-leading indexing and archiving platforms (see list) through web services, APIs and data exchange protocols. | X | X |
Individual journal website design. | X | X |
Customisable submission module. | X | X |
Peer review and editorial management system. | X | X |
Peer review process customisable by journal. It can be conventional (either single-blind or double-blind), community-sourced, or public. | X | X |
Online collaborative authoring tool (ARPHA Writing Tool, abbreviated AWT, formerly Pensoft Writing Tool, abbreviated PWT), closely integrated with submission, peer review, production and dissemination tools. | | X |
Collaborative work on a manuscript with co-authors; external contributors, such as mentors; pre-submission reviewers; linguistic and copy editors; or colleagues. The external contributors are not listed as co-authors of the manuscript. | | X |
Large set of pre-defined, but flexible article templates covering many types of research outcomes. | | X |
Online search and import of literature or data references; cross-referencing of in-text citations; import of tables; upload of images and multimedia; assembling images for display as composite figures. | | X |
Automated technical validation step (it can be triggered by authors any time) checks the manuscript for consistency and for compliance with the JATS standard as well as the journal's requirements. | | X |
Human-based, interactive pre-submission technical check and validation tool helps authors to proceed with their manuscripts to a form almost ready for publication. | | X |
Pre-submission external peer review(s) performed during the authoring process. The pre-submission peer reviews are submitted together with the manuscript to prompt editorial evaluation and publication. | | X |
For the editor's convenience, peer reviews in ARPHA are automatically consolidated into a single online file that makes the editorial process straightforward, easy and comfortable. | | X |
In the ARPHA-XML workflow, authors can publish updated versions of their articles anytime. | | X |
Automated archiving of all published articles in Zenodo and CLOCKSS on the day of publication. | X | X |
Domain-specific features of the ARPHA-BioDiv toolbox used for publication and dissemination of biodiversity data
FEATURE | ARPHA-DOC | ARPHA-XML |
Markup and visualisation of all taxon names used in the text. | X | X |
Markup and visualisation of taxon treatments following the TaxPub XML schema (an extension of the Journal Archiving Tag Suite (JATS) used by PubMed and PubMedCentral). | X | X |
Markup and automated mapping of geo-coordinates of geographical locations. | X | X |
Markup and visualisation of biological collection codes against the Global Registry of Biological Repositories (GRBIO) vocabulary ( | X | X |
Pre-publication registration of new taxa in ZooBank, IPNI or Index Fungorum (as relevant). | X | X |
Dynamic, real-time creation of online profile for each taxon name mentioned in an article through the Pensoft Taxon Profile tool. | X | X |
Automated linking through the Pensoft Taxon Profile tool of each taxon name mentioned in an article to various biodiversity resources (GBIF, Encyclopedia of Life, Biodiversity Heritage Library, the National Center for Biotechnology Information (NCBI), GenBank and Barcode of Life, PubMed, PubMedCentral, Google Scholar, the International Plant Names Index (IPNI), MycoBank, Index Fungorum, ZooBank, PLANTS, Tropicos, Wikispecies, Wikipedia, Species-ID and others). | X | X |
Workflow integration with the GBIF Integrated Publishing Toolkit (IPT) for deposition, publication and permanent linking between data and articles of primary biodiversity data (species-by-occurrence records), checklists and their associated metadata. | X | X |
Workflow integration with the Dryad Data Repository for deposition, publication and permanent linking between data and articles of datasets other than primary biodiversity data (e.g. ecological observations, environmental data, genome data and other data types). | X | X |
Export of XML-based metadata and TaxPub XMLs of the papers to PubMedCentral. | X | X |
Automated export of all taxon treatments (new taxa and re-descriptions, including images) to Encyclopedia of Life. Example: http://eol.org/pages/21232877/overview. | X | X |
Automated export of all taxon treatments (new taxa and re-descriptions) to Plazi TreatmentBank. Example: http://tb.plazi.org/GgServer/html/B07E9CD77F60DCC65C10A381F6E3BBF0 | X | X |
Automated export of all taxon treatments (new taxa and re-descriptions), including images, keys, etc. to the Wiki repository Species-ID. Example: http://species-id.net/wiki/Spigelia_genuflexa. | X | X |
Automated export of the occurrence data published in BDJ into Darwin Core Archive (DwC-A) format (see also | | X |
Automated export of the taxonomic treatments published in BDJ into Darwin Core Archive. The DwC-A is freely available for download from each article's webpage that contains taxonomic treatments data. | | X |
Automated export and archiving of images from the published articles in Zenodo. Images from biodiversity journals are imported into the Biodiversity Literature Repository (BLR) of Zenodo. | X | X |
Import of Darwin Core-compliant primary biodiversity data from spreadsheet templates or via a manual Darwin Core editor and consequent publication in a structured downloadable format ( | | X |
Direct online import of Darwin Core-compliant primary biodiversity data from GBIF, Barcode of Life, iDigBio, and PlutoF into manuscripts ( | | X |
Multiple import of voucher specimen records associated with a particular Barcode Index Number (BIN) ( | | X |
Automated generation of data paper manuscripts from Ecological Metadata Language (EML) metadata files stored at GBIF Integrated Publishing Toolkit (GBIF IPT), DataONE and the Long Term Ecological Research Network (LTER) ( | | X |
Novel article types in ARPHA Writing Tool: Taxonomic Paper, Data Paper, Software Description, Monitoring Schema, Ecosystem Inventory, Ecosystem Service Inventory, Ecosystem Service Models, Species Conservation Profile, compliant with the IUCN Red List ( | X* | X |
Nomenclatural acts modelled and developed in BDJ as different types of taxonomic treatments for plant taxonomy. | | X |
Automated archiving of all biodiversity articles in the Biodiversity Literature Repository (BLR) of Zenodo. | X | X |
Research articles have been the traditional containers for scientific results for several centuries, and this holds even more for research books. The Internet era brought disruptive changes to academic publishing, one of which was a challenge to the notion of the research article as the only valid output of scientific endeavour. As a result, novel article formats started to proliferate in an attempt to publish additional research objects from across the research cycle, such as methods, data and software. Pensoft pioneered several novel article formats with the launch of the Biodiversity Data Journal. Currently, the ARPHA Writing Tool supports nearly fifty article formats (Fig.
A data paper is a scholarly journal publication whose primary purpose is to describe a dataset or a group of datasets, rather than report a research investigation. As such, it contains facts about data, rather than hypotheses and arguments in support of those hypotheses based upon data, as found in a conventional research article (for details, see
Examples from: ZooKeys, Biodiversity Data Journal, PhytoKeys, Nature Conservation.
The Article template is available for Biodiversity Data Journal, One Ecosystem, Research Ideas and Outcomes (RIO), BioDiscovery.
A publication that describes software or an online platform. It contains a link to an openly accessible code (for details, see
Examples from: Biodiversity Data Journal.
Customisable templates are available for Biodiversity Data Journal, Research Ideas and Outcomes (RIO), One Ecosystem and BioDiscovery.
A description of an R Package including information on its purpose, installation and usage. The code should be openly available and a link to it should be present in the article.
The Article template is available for Biodiversity Data Journal, One Ecosystem, Research Ideas and Outcomes (RIO).
A brief description of a monitoring schema including information on the monitored system component; its location; indicators used; spatial and temporal scales; purpose of the monitoring programme; and potential application of the resulting data.
The Article template is available for Research Ideas and Outcomes (RIO) and One Ecosystem.
A publication of a single or multiple IUCN species assessment report(s) imported and edited in an IUCN-compliant species template.
Examples from: Biodiversity Data Journal.
The Article template is available for Biodiversity Data Journal.
An assessment report of alien or invasive species following an IUCN-compliant species template. After publication, the article can be exported to the Global Invasive Species Database (GISD).
The Article template is available for Biodiversity Data Journal.
A brief description of a specific ecosystem type; its structures; processes and functions; abundant species; biodiversity; anthropogenic pressures; and management options. Data could result from, for example, direct observations, monitoring programmes, modelling or literature and database reviews.
The Article template is available for One Ecosystem.
A brief description of an ecosystem service mapping study or application including information on the purpose of the map; data and methods used (biophysical, economic, social); mapped ecosystem service; mapped beneficiary (ecosystem service potential, flow, demand); spatial and temporal scale and indicators. The resulting maps should be included in the manuscript or uploaded to the ESP Visualisation Tool.
The Article template is available for One Ecosystem.
In 2010, ZooKeys published its 50th issue, "Taxonomy shifts up a gear: New publishing tools to accelerate biodiversity research", in a new format based on pre-publication tagging of biodiversity-specific terms in the article XML and semantic enhancements to the published paper (
Examples of use of the domain-specific XML markup in the published articles.
"Integrated narrative and data publishing", or "integrated data publishing", is a relatively new approach in which data or code are imported in a structured form into the manuscript text and are downloadable from the published article. In biodiversity science, the term was coined and first demonstrated by the Biodiversity Data Journal (BDJ), developed in the course of the EU-funded project ViBRANT (
The ARPHA Writing Tool provides direct online import from external databases using community-accepted standards (e.g. within the biodiversity community, these are Darwin Core, the TaxPub JATS extension and others - see http://www.tdwg.org/standards/). Initially, data import was from CSV spreadsheets or manually via a Darwin Core HTML editor (
Another example of online import of structured text is the ReFindit tool which exists both as a stand-alone application and a plugin in ARPHA Writing Tool. ReFindit locates and imports literature and data references from CrossRef, DataCite, RefBank, Global Names Usage Bank (GNUB) and Mendeley.
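As an illustration of the kind of lookup a reference-import tool performs, the following minimal sketch builds a bibliographic query against the public CrossRef REST API and turns one returned record into a short citation string. It is not ReFindit's own code: the helper names are invented for this example, the sample record is a hand-trimmed stand-in for a real API response, and no network request is sent.

```python
import json
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"  # public CrossRef REST API


def build_query_url(free_text, rows=5):
    """Return a CrossRef search URL for a free-text reference string."""
    params = {"query.bibliographic": free_text, "rows": rows}
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


def format_reference(item):
    """Turn one CrossRef work record (a dict) into a short citation string."""
    authors = ", ".join(
        f"{a.get('family', '')} {a.get('given', '')[:1]}".strip()
        for a in item.get("author", [])
    )
    year = item.get("issued", {}).get("date-parts", [[None]])[0][0]
    title = (item.get("title") or [""])[0]
    return f"{authors} ({year}) {title}. https://doi.org/{item['DOI']}"


# Hand-written sample shaped like a CrossRef record (assumption: trimmed to
# two authors and a shortened title, for illustration only).
sample = json.loads("""{
  "DOI": "10.3897/rio.3.e13088",
  "title": ["ARPHA-BioDiv: A toolbox for scholarly publication"],
  "author": [{"family": "Penev", "given": "Lyubomir"},
             {"family": "Georgiev", "given": "Teodor"}],
  "issued": {"date-parts": [[2017, 4, 5]]}
}""")

url = build_query_url("Penev ARPHA-BioDiv toolbox 2017")
citation = format_reference(sample)
print(url)
print(citation)
```

Keeping URL construction separate from record formatting means either half can be swapped for another source, such as DataCite, without touching the other.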
Article content that is tagged and available in TaxPub XML can be harvested by aggregators which can select and pick sub-article elements, such as metadata, taxon treatments, occurrence records, images and others. Several of these aggregators are major players in biodiversity data preservation and management, for example, GBIF, Encyclopedia of Life, Biodiversity Heritage Library, Plazi, Biodiversity Literature Repository at Zenodo, ZooBank, International Plant Names Index, MycoBank, Index Fungorum and many others. The data export in some cases is provided by a featured outbound API. The workflows and aggregators that use the semantically enriched article XMLs are listed in Table
Extraction and delivery of data and content from published articles to aggregators, nomenclators, archives, and indexers.
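To make the idea of harvesting sub-article elements concrete, the sketch below pulls the taxon names out of the treatments in a small TaxPub-style fragment. It is a simplified illustration, not an aggregator's actual harvester: the XML fragment is hypothetical and heavily trimmed, and the namespace URI and element names are given as they are understood to appear in TaxPub-tagged articles, so they should be checked against the schema before reuse.

```python
import xml.etree.ElementTree as ET

# TaxPub namespace URI (assumption: as used in TaxPub-tagged article XML).
TP = "http://www.plazi.org/taxpub"
NS = {"tp": TP}

# Hypothetical, heavily trimmed article fragment for illustration only.
SAMPLE = """<article xmlns:tp="http://www.plazi.org/taxpub">
  <body>
    <tp:taxon-treatment>
      <tp:nomenclature>
        <tp:taxon-name>
          <tp:taxon-name-part taxon-name-part-type="genus">Spigelia</tp:taxon-name-part>
          <tp:taxon-name-part taxon-name-part-type="species">genuflexa</tp:taxon-name-part>
        </tp:taxon-name>
      </tp:nomenclature>
    </tp:taxon-treatment>
  </body>
</article>"""


def harvest_treatment_names(xml_text):
    """Return one {rank: name} dict per taxon treatment found in the XML."""
    root = ET.fromstring(xml_text)
    names = []
    for treatment in root.iter(f"{{{TP}}}taxon-treatment"):
        name = treatment.find("tp:nomenclature/tp:taxon-name", NS)
        if name is None:
            continue  # treatment without a tagged name: nothing to harvest
        parts = {
            part.get("taxon-name-part-type"): (part.text or "").strip()
            for part in name.findall("tp:taxon-name-part", NS)
        }
        names.append(parts)
    return names


treatments = harvest_treatment_names(SAMPLE)
print(treatments)
```

Because the treatments are addressed by element name rather than by position, the same function works on a full article regardless of how much untagged narrative surrounds them.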
All data published in the Biodiversity Data Journal can be downloaded in tabular format (CSV) straight from the article text and re-used by anyone, provided that the original source is cited (Fig.
Export of data from articles published in Biodiversity Data Journal. Species occurrences and other structured data tables can be downloaded in CSV format (green arrow); all species occurrences are also available as Darwin Core Archives and are automatically harvested and indexed by GBIF (red box and arrow).
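A downloaded table of this kind can be reused with a few lines of code. The sketch below is an illustration under stated assumptions rather than a BDJ tool: it reads a small, invented CSV whose column headers follow Darwin Core term names (scientificName, decimalLatitude, decimalLongitude, eventDate) and keeps only the georeferenced records. The placeholder taxon "Aus bus" and all values are hypothetical.

```python
import csv
import io

# Invented three-record table; headers are Darwin Core term names.
CSV_TEXT = """scientificName,decimalLatitude,decimalLongitude,eventDate
Aus bus,42.5,23.25,2015-05-01
Aus bus,,,2015-06-14
Aus bus,42.75,23.5,2016-07-22
"""


def georeferenced(csv_text):
    """Yield (name, latitude, longitude) for rows that carry both coordinates."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        lat, lon = row["decimalLatitude"], row["decimalLongitude"]
        if lat and lon:  # skip records without coordinates
            yield row["scientificName"], float(lat), float(lon)


points = list(georeferenced(CSV_TEXT))
print(points)
```

For a real article, the in-memory string would be replaced by the CSV file downloaded from the article page, and the original source should be cited when the data are reused.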
The present workflow has been created and tested with three different book titles with the support of the EU-funded projects pro-iBiosphere, SCALES and EU BON. It resulted in the launch of the Advanced Books platform of Pensoft, designed to (re-)publish historical or new books in semantically enhanced open access. The workflow is illustrated on the main homepage of Advanced Books at http://ab.pensoft.net and in Fig.
A distinct feature of the ARPHA-XML publishing workflow is the possibility to import complex manuscripts, including metadata, text figures, tables, references, citations and others, via an API available in ARPHA Writing Tool (Fig.
Submission of manuscripts to ARPHA Writing Tool through Application Programming Interface (API).
In order to submit an article via the Pensoft RESTful API, one first has to prepare an XML file according to the Pensoft XML schemas or according to the Ecological Metadata Language (EML) standard (information listed in the link above). An authentication token, obtained from the settings dialogue in ARPHA-BioDiv, is then supplied together with the XML file to the endpoint. If the document is imported successfully, it is created in the respective journal's ARPHA Writing Tool instance, where it can be further edited manually and submitted to the journal.
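The submission step described above might look roughly like the following sketch, which prepares (but deliberately does not send) an HTTP POST carrying the token and the XML document. The endpoint URL and the form field names `token` and `xml` are placeholders invented for this example; the actual values must be taken from the API documentation.

```python
import urllib.request
from urllib.parse import urlencode

# Placeholder endpoint (hypothetical): the real URL is given in the API docs.
ENDPOINT = "https://example.org/arpha/api/import"


def build_import_request(xml_bytes, token, endpoint=ENDPOINT):
    """Prepare, without sending, a POST with the token and the XML document.

    The field names "token" and "xml" are assumptions for illustration.
    """
    body = urlencode({"token": token, "xml": xml_bytes.decode("utf-8")})
    return urllib.request.Request(
        endpoint,
        data=body.encode("ascii"),
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_import_request(b"<document/>", token="YOUR-TOKEN")
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it; omitted here on purpose.
```

Building the request separately from sending it makes it easy to inspect what would be transmitted before any document is created in a journal's ARPHA Writing Tool instance.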
Data papers, often also called “data articles”, “data notes”, or similar, were first established by the journals Ecological Archives (published by the Ecological Society of America*
The data paper should include several important elements (usually called metadata, or “description of data”), for example:
Title, authors and abstract;
Project description;
Methods of data collection;
Spatial and temporal ranges and geographical coverage;
Collectors and owners of the data;
Data usage rights and licences;
Software used to create or view the data.
These metadata, if available and deliverable in machine-readable form (XML, JSON, etc.), can be used to produce a “data paper manuscript” that can be submitted to a journal for peer review and publication. The ARPHA approach to data paper publishing was first demonstrated in 2010 in a joint project of the Global Biodiversity Information Facility (GBIF) and Pensoft. As a result, this partnership created a workflow (Fig.
Creation of data paper manuscripts from Ecological Metadata Language (EML) metadata hosted at the GBIF IPT
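The conversion of metadata into a manuscript can be pictured with the minimal sketch below, which maps a few EML elements onto the fields of a data paper skeleton. It is not the GBIF or Pensoft implementation: the EML record is a hypothetical, heavily trimmed example, the output field names are invented for illustration, and a real conversion would cover many more elements (project description, methods, coverage and so on).

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EML record; real files carry many more elements.
EML = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">
  <dataset>
    <title>Example occurrence dataset</title>
    <creator>
      <individualName><givenName>Ada</givenName><surName>Example</surName></individualName>
    </creator>
    <abstract><para>Occurrence records compiled for illustration.</para></abstract>
    <intellectualRights><para>ODC-By</para></intellectualRights>
  </dataset>
</eml:eml>"""


def eml_to_manuscript(eml_text):
    """Map a few EML dataset elements onto data paper manuscript fields."""
    dataset = ET.fromstring(eml_text).find("dataset")
    authors = []
    for creator in dataset.findall("creator"):
        given = creator.findtext("individualName/givenName", "")
        family = creator.findtext("individualName/surName", "")
        authors.append(f"{given} {family}".strip())
    abstract = " ".join(
        para.text.strip() for para in dataset.findall("abstract/para") if para.text
    )
    return {
        "title": dataset.findtext("title", "").strip(),
        "authors": authors,
        "abstract": abstract,
        "licence": dataset.findtext("intellectualRights/para", "").strip(),
    }


manuscript = eml_to_manuscript(EML)
print(manuscript)
```

The resulting dictionary corresponds to the first and sixth items of the metadata list above (title, authors and abstract; data usage rights and licences); the remaining items would be filled from further EML elements in the same way.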
Recently, the workflow was extended with direct import of EML metadata, downloadable from the GBIF, LTER and DataONE networks, into a data paper manuscript in ARPHA Writing Tool (
The ARPHA-BioDiv toolbox has been developed in the course of several years and its tools, workflows and journals are used routinely by thousands of authors, reviewers, editors and readers worldwide. It is virtually impossible to list here the numerous use cases and approaches that have been tested and successfully implemented over the years (see
One of the major data mobilisation initiatives realised by ARPHA and the Biodiversity Data Journal is the publication of data papers on the largest European animal database 'Fauna Europaea' within a new series "Contributions on Fauna Europaea", launched in 2014. This novel publication model was aimed at assembling in a single collection data papers on different taxonomic groups of higher rank covered by the Fauna Europaea project and accompanying papers highlighting various aspects of this project (gap-analysis, design, taxonomic assessments etc.) (
The LifeWatchGreece special collection "LifeWatchGreece: Research infrastructure (ESFRI) for biodiversity data and data observatories" was published in the Biodiversity Data Journal and currently contains twenty-three papers organised in four sections: (1) Electronic infrastructure and software applications; (2) Taxonomic checklists; (3) Data papers and (4) Research articles (
The journal Research Ideas and Outcomes (RIO) was designed to publish all outputs of the research cycle, from research ideas and grant proposals to data, software, research articles and research collaterals, such as workshop and project reports, guidelines, policy briefs, Wikipedia articles and others (
The legal framework and policies for publishing and re-use of biodiversity data is a subject of primary interest to the biodiversity community and policy-makers. Several EU BON teams and tasks worked on various aspects of the subject which resulted in the following set of documents:
The last two documents summarise the effort and can serve as guidelines and recommendations in the work of the Group on Earth Observations Biodiversity Observation Network (GEO BON) and beyond.
The paper of
This section from the paper of
The recommended data publishing licence used by Pensoft is the Open Data Commons Attribution License (ODC-By), which is a licence agreement intended to allow users to freely share, modify and use the published data(base), provided that the data creators are attributed (cited or acknowledged). This ensures that those who publish their data receive the academic credit that is due.
Alternatively, other licences, namely the Creative Commons CC0 (also cited as “CC-Zero” or “CC-zero”) and the Open Data Commons Public Domain Dedication and Licence (PDDL), are also STRONGLY encouraged for use in the Pensoft journals. According to the CC0 licence, "the person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighbouring rights, to the extent allowed by law. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission."
The Strategies and Guidelines for Scholarly Publishing of Biodiversity Data (
The paper also contains detailed instructions on how to prepare and peer review data intended for publication, listed under the Guidelines for Authors and Reviewers, respectively. Special attention is given to existing standards, protocols and tools to facilitate data publishing, such as the GBIF Integrated Publishing Toolkit (IPT) and the DarwinCore Archive (DwC-A).
Here, we include the table of contents of the document which will give the reader a comprehensive overview of its content (
The Strategies and Guidelines are referred to in the Author Guidelines of Pensoft's journals and are used in their everyday publishing practices.
The current article describes the rationale, overall structure and the key elements of ARPHA-BioDiv. The various elements of ARPHA-BioDiv have been featured in several papers (cited in the respective sections of the present document), guidelines, blog posts and tutorials. Below, some important supporting documentation is listed to help users access this complex system.
In the future, we want to reimagine and reinvent the academic publishing process. At the dawn of academic publishing, papers were written specifically for human consumption. The human mind alone was expected to crunch the data. Now humans rely on computers to store and manipulate the data and verify the correctness of numerical algorithms, whereas our minds focus on the big picture and the story behind the data.
With ARPHA-BioDiv, we have already taken the first few steps in creating articles that can be read both by humans and computers, as has been described so far in this article. However, more can be done. One area of innovation in academic publishing lies in creating linked content - embedding machine-readable database records in each publication that are linked to the world-wide network of linked knowledge hubs. To achieve this goal, we are currently working towards exporting content that has been semantically enriched in a knowledge graph called the Open Biodiversity Knowledge Management System or OpenBioDiv for short (
This will enable the reader of an article, for example, to connect published occurrence data to portals such as GBIF and geographic repositories such as GeoNames. An illustration of the use-value of this integration will be, for example, an accelerated creation of various models, such as species distribution models, based on the article data. Thanks to the linking of the occurrence data in the article to databases, it will be possible to assemble all the elements needed for a species distribution model of the discussed taxon programmatically in an environment such as R. Moreover, the links in themselves are valuable information and can point to "hot" topics, such as "hot" taxa or "hot" figures, having many incoming links to them (
We also believe that a large portion of traditional academic publishing, even if enriched with Linked Data, will be supplemented by nano-publications (
Finally, we believe that publishers are stewards of the world's scientific information and there is knowledge in the totality of the published articles that is not part of any article alone. We are working on artificial intelligence algorithms both from the machine logic domain and from the machine learning domain to discover this hidden knowledge. The authors of tomorrow will have at their disposal not only a tool to format their manuscript, add citations and mark up their data, but also tools that will discover additional information relevant to the authors' ideas and suggest similar research during the authoring phase. And, if we can dream very big, why not have artificial intelligence algorithms sophisticated enough to act as a research assistant during the authoring phase? What a marvellous thought!
The basic infrastructure for importing specimen records was partially supported by the FP7-funded project EU BON - Building the European Biodiversity Observation Network, grant agreement ENV30845. V. Senderov's PhD is financed through the EU Marie Skłodowska-Curie Program Grant Agreement Nr. 642241.
LP - vision and management; TG - technical supervision; PG, SD - software development; VS - online import of specimen records and EML metadata; IK, IK - elaboration of tutorials, promotion and PR support; SP - web design; PS - editorial supervision and project management.
* Only some of the novel article templates (e.g. Data Papers, Software Descriptions and some others) are available in the ARPHA-DOC workflow.