ASAPbio is an initiative that aims to promote the uptake of preprints in the biomedical sciences and other life science communities. It organized an initial workshop in February 2016 that brought together the different stakeholders: researchers, institutions, funders, publishers and others. This was followed by a workshop in May that brought together funders around the concept of preprint services.
In August, a third workshop was held with technology and infrastructure providers to discuss technical aspects of how such services might look and how they would interact with existing standards or platforms. This document is both a report on the results of this third workshop and an exploration of potential next steps.
The use of preprints as a method of scholarly communication varies across research communities. Despite decades of widespread use of arXiv – the preprint server for physics, mathematics, and computer science – preprinting is a relatively unfamiliar concept in the biological sciences. ASAPbio has convened three meetings to discuss how preprints could play a larger role in scientific communication in the life sciences. It organized an initial workshop in February 2016 to bring together junior and senior researchers, journals, and funders (report at
While such variability in preprint servers presents excellent opportunities for innovation, it also generates challenges for discoverability and the adoption of standard practices. ASAPbio has argued that introducing data and screening standards can promote the adoption of best practices for posting preprints among communities of biologists (cf.
ASAPbio has subsequently convened multiple groups to discuss these ideas of aggregation and standardization. The second ASAPbio meeting was a funder workshop in May, the output of which was a request from funders for the “develop[ment of] a proposal describing the governance, infrastructure and standards desired for a preprint service that represents the views of the broadest number of stakeholders” (
The third workshop, the subject of this report, was held in August with technology and infrastructure providers to discuss technical aspects of how such services might look and how they would interact with existing standards and platforms. This technical gathering was aimed at developing a specification to present to funding agencies for five years of financial support of the Central Service, funds for the operation of a community-supported Governance Body, and potentially other costs related to compatibility of operations with the Central Service.
The resulting documentation and the recommendations (see Table
The workshop was preceded by an informal get-together on August 29, 2016 that was combined with a demo session. During the session, the following tools were demonstrated:
Jeff Spies and Brian Nosek of the Center for Open Science (COS) presented
Dan Valen of figshare presented the
Kristen Ratan presented the
On August 30, the workshop proper took place at the
After the organizers framed the workshop, each attendee offered their name, affiliation, and their ambitions for the ASAPbio effort. Table
A video livestream was provided from room A. The corresponding recordings are available on YouTube via
The goal of each breakout session was to brainstorm reasonable implementations that are open source and interoperable with other services. Ideally, the specifications would include details such as estimated development time, development cost, and suitable service providers.
The initial discussion focused on the software architectures of the existing preprint servers: how they currently operate, lessons learned, and some of the design decisions that went into them. Some of these decisions reflected technological and cultural limitations present at the time the platforms were launched. We recognize that the technological and cultural landscape is fluid and that some past considerations may no longer be relevant.
A target figure for the Central Service was deemed to be around 200K submissions per year. The general feeling was that at this number of submissions, scaling was not going to be a computational issue. Scaling of the system should follow standard best practices for design of web systems.
A key to success will be developing and deploying the appropriate APIs and standards for interoperability. It was noted that the current preprint services don’t have many standards in common, with the possible exception of
Multiple APIs will be needed to serve different functions (e.g., ingest, linking to existing publishers, indexing and searching). APIs of existing services are summarized here; more details are available in the
The
Work is underway to incorporate the OAI-PMH and Crossref metadata APIs. Other API options for search, ingest and export are being explored.
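To make the role of an interoperability API concrete, the sketch below shows what harvesting preprint metadata over OAI-PMH could look like. The `verb` and `metadataPrefix` parameters are part of the OAI-PMH protocol itself, but the endpoint URL, set name, and sample response are invented for illustration; a real harvester would issue HTTP requests and page through results via `resumptionToken`.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint; a real harvester would GET this URL and page
# through responses using the resumptionToken element.
BASE_URL = "https://preprints.example.org/oai"

def list_records_url(metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return BASE_URL + "?" + urlencode(params)

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def parse_identifiers(xml_text):
    """Extract record identifiers from an OAI-PMH ListRecords response."""
    root = ET.fromstring(xml_text)
    return [h.findtext(OAI + "identifier") for h in root.iter(OAI + "header")]

# A minimal sample response, standing in for a live HTTP call.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:preprints.example.org:10.1101/000001</identifier></header>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url(set_spec="biology"))
print(parse_identifiers(SAMPLE))
```

Because OAI-PMH responses are plain XML over HTTP, an aggregator could harvest any compliant server with the same few lines, which is precisely the appeal of shared API standards noted above.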
ORCIDs for authors can be included but are currently optional, pending greater adoption
JATS XML is used for metadata
Participants suggested that metadata search could be accomplished via services like Google Scholar (which
The breakout team agreed that dependability in terms of preservation of a manuscript’s intended content and formatting is a requirement. Word is the most common authoring application in the biology community, and PDF is generally perceived as the easiest viewing format. Together, these offer a bare-bones approach to preprints. However, the preprints community must work toward a more robust, versatile approach, because these formats are not desirable as a long-term preprint solution.
Participants recommended that vendors/partners support the bare-bones PDF approach as soon as possible. However, the RFA must also request proposals for the creation of open source conversion tools. Specifically, all submissions to the Central Service will need a consistent metadata schema based on the Journal Article Tag Suite (JATS). In addition, all submissions should go through a conversion process resulting in XHTML or XML for the body of the files. Tools for quickly creating well-designed PDFs from these converted files should also be developed or made available. There was discussion of the current state of technology development as well as the future standards required to meet the needs of the scientific publishing and research community. This process may require additional proofing/correction stages by authors, depending on the degree to which accurate conversion can be achieved.
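As an illustration of what the consistent JATS-based metadata might look like, the sketch below builds a minimal fragment of JATS front matter. The element names (`article-meta`, `article-id`, `title-group`, `contrib-group`) follow the JATS model, but the function, DOI, and author values are invented, and a real converter would emit far more (abstracts, affiliations, dates, license information).

```python
import xml.etree.ElementTree as ET

def jats_front_matter(doi, title, surnames):
    """Build a minimal JATS-style <front> block for a preprint.

    Only a small subset of the JATS article-meta model is shown;
    a real conversion tool would also emit abstracts, affiliations,
    dates, and license metadata.
    """
    article = ET.Element("article")
    front = ET.SubElement(article, "front")
    meta = ET.SubElement(front, "article-meta")
    aid = ET.SubElement(meta, "article-id", {"pub-id-type": "doi"})
    aid.text = doi
    title_group = ET.SubElement(meta, "title-group")
    ET.SubElement(title_group, "article-title").text = title
    contrib_group = ET.SubElement(meta, "contrib-group")
    for surname in surnames:
        contrib = ET.SubElement(contrib_group, "contrib", {"contrib-type": "author"})
        name = ET.SubElement(contrib, "name")
        ET.SubElement(name, "surname").text = surname
    return article

doc = jats_front_matter("10.1101/000001", "An example preprint", ["Doe", "Roe"])
print(ET.tostring(doc, encoding="unicode"))
```

A shared schema of this kind is what allows manuscripts from different ingestion sources to be indexed, searched, and transferred uniformly.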
The most obvious application of automated screening is likely in plagiarism detection. However, available tools are limited by the corpus of literature that the tool can access, making commercial tools the most functional options at present (see comparison in the
Some screening for non-scientific content can also be automated. At arXiv, the process of automatically classifying manuscripts into different categories catches manuscripts that don’t fit into any category; these anomalies are often non-scientific in nature. Adding such flags to content would aid, but not replace, efforts by human screeners. All manuscripts at arXiv and bioRxiv are currently screened by human moderators, and at least one participant speculated this may be necessary for the foreseeable future. Furthermore, participants suggested that the screening process could include feedback to authors, giving them a chance to correct flagged content before resubmission.
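The kind of text-overlap signal that underlies plagiarism flagging can be illustrated with a toy shingle-comparison sketch. This is purely illustrative: commercial tools are far more sophisticated, and as noted above their usefulness depends chiefly on the corpus they can access, not on the comparison itself.

```python
def shingles(text, k=5):
    """Set of k-word shingles (overlapping word windows) in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def overlap_score(candidate, source, k=5):
    """Jaccard similarity between the shingle sets of two documents.

    A high score would flag the candidate for review by a human
    screener; it is a crude signal, not a verdict.
    """
    a, b = shingles(candidate, k), shingles(source, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

In line with the discussion above, such a score would feed a human moderation queue rather than trigger automatic rejection.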
Other automated ethics checks could include requiring authors to check boxes certifying that they have adhered to ethical standards in the preparation of the manuscript. Facial recognition software can be used to identify papers that may contain faces and possibly compromise the identity of human subjects. Signing into the service with an ORCID login may further increase confidence in the quality of papers, at least if the ORCID account is linked to an established record of scholarly publication.
There are other ethics checks (such as detection of inappropriate image manipulation), for which participants felt the technology does not yet reach the ability of humans. Therefore, the participants stressed that it is important for preprint services to create an environment that allows future innovations in automated screening to be added as they are developed. In fact, preprint services could expose a large corpus of manuscripts (including their associated figures) on which new services could be trained. The development of these services could be facilitated by the provision of manuscripts in a structured format rather than as PDFs.
The breakout session began with a discussion of content that screening is intended to prune away. Participants reasoned that plagiarism and spam detection could be assisted with automated technology, which could flag uncertain cases for review by a human. Ethical issues – such as compliance with guidelines governing human and animal subjects and the responsible disclosure of information that could affect public health and/or national security – require more human involvement. Finally, some work – such as pseudo-scientific or inflammatory papers – can only be weeded out by making judgement calls.
Human curation needs to occur both during ingest and after posting. Screening at ingest cannot be expected to catch all problematic manuscripts; it is limited by both budgetary and time constraints, since authors will expect rapid posting. Even without these constraints, screening could approach, but never reach, 100% efficiency. Thus, content must be moderated after posting as well. Takedown policies need to be developed and uniformly implemented. These policies could distinguish between revocation within 24 hours after posting and withdrawal at a later date, similar to
In addition to simply excluding content, a preprint service could filter content based on various measures of quality (similar to search engine results). In all of these cases, it must be made clear that any screening and filtering is not a substitute for peer review, since only the latter is usually tied to relevant expertise.
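The revocation-versus-withdrawal distinction discussed above could be encoded as a simple piece of policy logic. The 24-hour window comes from the discussion; the status labels and function shape are illustrative assumptions, not any server's actual implementation.

```python
from datetime import datetime, timedelta

# Window taken from the policy discussed above; label names are illustrative.
REVOCATION_WINDOW = timedelta(hours=24)

def takedown_status(posted_at, requested_at):
    """Classify a takedown request by how long after posting it arrives.

    Within the revocation window the record could be removed outright;
    afterwards it would be withdrawn, leaving a tombstone notice so the
    scholarly record stays intact.
    """
    if requested_at - posted_at <= REVOCATION_WINDOW:
        return "revoked"
    return "withdrawn"
```

Encoding the policy in one place would help the uniform implementation that participants called for.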
Participants favored a model in which an aggregator does not perform screening redundant to that provided by individual preprint servers, but rather collects content from accredited or certified servers that conform to best practices (similar to
The group supported the widely acknowledged position that research data should be properly managed, archived, and made as widely available as possible. By and large, this is done by dedicated data repositories trusted by the community. Participants in this breakout session favored a model in which, ideally, a preprint service is responsible for maintaining text and figures, but not supplemental datasets. Instead, the best practice would be to deposit these files in separate data repositories such as figshare, Dryad, Dataverse, and Zenodo. This approach would reduce data storage demands on the preprint server and reinforce the concept of data as legitimate independent research objects.
In practice, authors could be prompted to deposit datasets during the preprint submission process (and reminded to update their preprints to include references, including DOIs, to these data). The preprint service could also deposit submitted data on behalf of authors and automatically reference it, but this is likely to be an error-prone process. The group acknowledged that data sharing requirements may be difficult to implement without substantial coordination between the technology platforms of the preprint service and data repositories, for example to synchronize the timing of the release of data and preprints. This issue could be addressed, e.g., with the use of embargo processes normally reserved for journals.
Supplemental text and figures (rather than large datasets) occupy a grey area; they are not easily discoverable and thus not ideal locations for underlying data. They require modest storage resources and also could be considered part of the narrative of the paper. Historically, supplemental files originated for a variety of reasons when publishing moved from print to online: some materials like audio recordings or web applications are simply not printable, there were space limits on print content, and digital repositories for such content were not readily available.
The amount and size of supplementary files were capped because digital storage and online bandwidth were much more limited and expensive than today. Journals still often have space restrictions on the main narrative, e.g. on the length of titles, abstracts and manuscripts or the number and resolution of figures. Many journals, even those which are online-only, also resist the inclusion of non-printable matter in the main text. While preprints per se are not necessarily bound by such limitations, those destined for submission to journals are. Even when the manuscript is intended for a venue that does not enforce such limitations (such as this report), authors may separate information into tables, boxes, appendixes, and external repositories. This behavior is driven by convention, convenience, and the need for the narrative to flow with clarity.
Therefore, it would be onerous to require authors to reformat their manuscripts to include all narrative elements in the main text or to independently deposit files they would otherwise include as supplementary material. Simplicity in the submission process is essential for preprints; the bar must not be set so high as to discourage use of preprints. In sum, participants saw a role for the preprint service in encouraging, but not mandating, best practices for small supplemental files.
Different preprint servers have different approaches to maintaining persistent identifiers (PIDs) for each manuscript version. For example, bioRxiv uses Crossref DOIs, PeerJ Preprints uses DataCite DOIs, and arXiv uses its own set of URIs. Regardless of which approach is used, participants agreed that readers viewing old versions must be made aware of new versions with a highly visible notice. While this feature could be implemented with any PID system, Crossref has established a policy requiring preprint-journal article linking and has created a workflow and tools to support this process.
A more fundamental question is, “what is a version?” This problem has implications both for the management of a preprint server and the assignment of PIDs. Specifically, should each change to the manuscript warrant the assignment of a new PID? On one hand, creating additional PIDs can support the maintenance of a precise scholarly record; on the other, more versions may flood users with confusing information (for example, ORCID receives many complaints about duplicate versions of articles in users’ profiles). To address this issue, Crossref has established a best practice standard of requesting a new DOI only for new versions that contain
It was also noted that centralization by full content mirroring would possibly make propagation and synchronization of multiple preprint versions technically more challenging than with a distributed infrastructure where versions are stored, managed and rendered directly by the respective ingestion servers.
Beyond PID assignment, each new version may warrant other administrative actions at a preprint server, such as automated or human screening or an announcement. At bioRxiv, revisions are subjected to reduced scrutiny compared to the original version. At arXiv, early versions are announced (via email etc.) but later versions are not.
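The per-version bookkeeping discussed in this section (a new DOI only for substantive revisions, and announcement of early but not later versions) could be sketched as follows. The ".v" DOI suffix style, field names, and the boolean "significant change" flag are all illustrative assumptions; in practice, deciding what counts as significant is the hard judgment call.

```python
from dataclasses import dataclass

@dataclass
class Version:
    doi: str        # persistent identifier for this version
    number: int     # 1-based version number
    announce: bool  # whether this version is announced to subscribers

def register_version(base_doi, prior, significant_change):
    """Sketch of per-version bookkeeping for a preprint record.

    A new DOI only for significant changes follows the Crossref best
    practice noted above; announcing only the first version mirrors
    arXiv's behaviour. The ".v" suffix style is illustrative.
    """
    number = (prior.number + 1) if prior else 1
    if prior and not significant_change:
        doi = prior.doi  # minor correction: keep the existing DOI
    else:
        doi = f"{base_doi}.v{number}"
    return Version(doi=doi, number=number, announce=(number == 1))
```

Keeping this logic explicit would let readers of any version be pointed reliably at the newest one, as participants required.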
Several open questions remain. First, how should preprint servers support or display the version history of
Many tools already index preprints. These include
New search tools (such as one that could expose content within a Central Service) could be built on Apache Lucene, an open source search engine project. Platforms that use Lucene (
The exposure of metadata and especially full-text data in more than one place complicates the aggregation of metrics, which are important to both authors and service providers for demonstrating the impact of both individual articles and the platform as a whole. For example, PubMed is not
Participants raised the question of whether metrics are important to begin with, as they do not accurately reflect article quality. Metrics are purposefully not publicly displayed on arXiv. The biology community is
The charge to the group for this session included the following: “
Participants felt that the service should be able to interoperate with a variety of potential future tools (for example, overlay journals, alerting systems, commenting and annotation systems, services that perform English language or figure polishing, content classification). Therefore, participants raised the question of which entities or services (beyond the sources of manuscript ingestion) should be able to contribute content or metadata.
Current preprint servers such as bioRxiv and PeerJ Preprints have already worked on pipelines to facilitate submission of manuscripts to journals. The “common denominator” for transferring manuscripts to journals is currently FTP, but content management systems do not adhere to any universal metadata format for ingested manuscripts. Rather, conversion to JATS is performed toward the end of the preparation process. Participants argued that if conversion to rich documents (e.g. first HTML, then perhaps structured XML later) was performed before peer review, transfer between servers and publishers could be eased. In the future, manuscripts could also be transferred by APIs rather than FTP.
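What an API-based transfer (as opposed to FTP) might carry can be sketched as a structured payload in which JATS metadata travels alongside the manuscript files. All field names and the payload shape are hypothetical; in a real system this would be POSTed to a journal's deposit endpoint, likely as multipart/form-data.

```python
import json

def build_transfer_payload(doi, title, jats_metadata_xml, files):
    """Assemble a manuscript-transfer payload for a hypothetical
    journal-deposit API (all field names are illustrative).

    Shipping JATS metadata with the files is what would spare the
    receiving content management system from re-keying or re-converting
    the manuscript.
    """
    return {
        "preprint_doi": doi,
        "title": title,
        "metadata_format": "JATS",
        "metadata": jats_metadata_xml,
        "files": [{"name": name, "role": role} for name, role in files],
    }

payload = build_transfer_payload(
    "10.1101/000001",
    "An example preprint",
    "<article-meta>...</article-meta>",
    [("manuscript.docx", "manuscript"), ("fig1.tiff", "figure")],
)
print(json.dumps(payload, indent=2))
```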
Much of the discussion focused on the issue of licensing, which could have profound effects on the technological development of future preprint services. For example, full-text search, automatic extraction of links to data, and the development of commenting or aggregation platforms may be inhibited by restrictive licensing. Some participants felt that it is time to “seize the day” and mandate that all content in the Central Service be licensed uniformly, under CC-BY or compatible. Others expressed concern that a categorical license might stifle adoption of preprints by alienating journals and potentially dissuading scientists; reasons included control of content and the possibility that further dissemination of the preprint by third parties could compromise later formal publication of the material by the author. Voluntary selection of CC-BY is currently low (~1% in arXiv, ~20% in bioRxiv), though lack of understanding of the consequences may be a factor, as well as the way choices are presented. Access to text and data mining is distinct from license selection in arXiv and bioRxiv, as it is addressed by ‘fair use’ laws and explicit statements on the preprint server. If a mandatory CC-BY policy were enacted, participants felt that funding agencies and other institutions would need to mandate deposit into the Central Service in order to overcome authors’ fears of disqualification from journals.
After the breakout groups, participants made general comments summing up their impressions of the day’s discussions and the role of a potential Central Service. There was agreement across all participants that preprints provide an opportunity to accelerate the communication of science and to encourage downstream experiments in data sharing and research communication. Furthermore, a modular, open service could not only help to make preprints more discoverable, useful, and available for innovative development, but also to incentivize their adoption as a respected form of communication in the life sciences. Several themes and concerns emerged from these discussions:
Participants emphasized that the core technologies for almost all of the services described already exist. Thus, projects that bring together existing groups and services in a modular way are likely to be efficient solutions. Participants felt that the workshop itself was a demonstration of the value of allowing many voices and players to contribute, and the development of a Central Service should also be a cooperative effort.
Given that an ecosystem of preprint services already exists, future initiatives for preprints should fill in gaps by promoting sustainable, community-governed projects and by providing services that do not yet exist. One such area is in the development of tools for moving beyond the PDF as a standard format. The presentation of articles in XML or HTML would increase access for both machines and humans, especially those using mobile and assistive devices. Indeed, institutions that receive federal funding in the US must ensure that all users, regardless of disability, are able to access content (
Despite the general positive outlook for preprints in biology, there were areas of tension in participants’ opinions. One of these areas was the timing of implementation of future services. Some participants favored a forward-looking service that would require the development of new technologies for its operation, while others cautioned that the lag time involved would dissipate current momentum behind preprints in biology. On the other hand, settling on suboptimal standards or technologies could hold back preprints. The notion of staging or phasing the development of services (like conversion to XML or HTML) was brought up several times during the meeting as a middle road. Similar concerns applied to content licensing. Proponents of open licensing argued that it is essential for developing a corpus of literature that promotes innovation in scholarly communication and accelerates the progress of science by enabling text mining and other non-narrative forms of display. Other participants cautioned that mandating such licensing could dissuade authors from early deposition of results and discourage journals from adopting preprint-friendly policies. Again, it was suggested that licensing could be phased in over time, or that non-perpetual licenses could be employed to ease the transition toward content that is more open.
Perhaps the largest point of contention was the extent to which a service should centralize the roles of the preprint ecosystem. Some participants favored a service that could ingest manuscripts from multiple sources (including from researchers themselves) and directly display all manuscripts to readers. The arguments in favor of this model are that 1) restricting features of the service is inefficient or unnecessary, 2) if the service works entirely in the “background,” its identity and presence may be unclear to researchers, jeopardizing its ability to sustain itself long-term, and 3) if properly governed and sustainably planned, a centralized service could provide a stable, long-term complement to an ecosystem of preprint-related services. Other participants favored a more limited service that would not duplicate the current functionality of existing servers; instead, they favored the development of a scalable distributed infrastructure relying on interoperability rather than exclusively on centralization (such as community-recognized submission policies, metadata schemas, or search engines) that could augment existing players in the ecosystem. Proponents of this model argued that supporting a centralized service that performs all of the functions of a preprint server (ingestion from authors, display, etc.) could become “one ring to rule them all” and might squelch competition and innovation in the preprint ecosystem. In this context, the governance model of the centralized service becomes important for weighing the relative importance of interoperability, innovation and other criteria on a regular basis and with input from the respective communities.
Concerns were also raised about the potential for overspecification. First, participants stressed that different communities have different needs, and there is also debate within communities about best practices. Furthermore, setting rigid metadata, formatting, or screening standards now might restrict future growth. To address these concerns, participants suggested that standards could be implemented in a modular way so that individual communities could control their own use of the service. Additionally, these standards should be periodically revisited and incrementally modified to reflect changing needs. Finally, participants noted that while these issues are important, a poor outcome might result from taking no action; they therefore cautioned against “overthinking” these issues to the extent of delaying forward movement toward a next-generation preprint service.
Participants emphasized that the culture of researchers is an important element to consider in selecting an implementation. Adoption of preprints in general or any given service in particular will depend on the rewards and incentives that face researchers. Thus, input from members of the scientific community is needed. In addition to this guidance, an objective analysis of current researcher behaviors is needed. These two streams of feedback should be evaluated on an ongoing basis to help the preprint service develop over time.
The authors of this report recommend that the following issues and principles drive the development of the Central Service.
The term “preprint server” originated in the early days of the internet when the storage methodology was an important piece of a preprint service’s architecture. The technology platform and the preprint service are no longer necessarily tied together in a one-to-one relationship. Many preprint services may use the same technology platform(s), and service providers may arise that handle both technology and production support for several preprint services. As technology and production layers become more modular, other elements of the publishing system can also be separated. For example, journals provide a peer review and curation layer on top of content that could be hosted elsewhere. However, researchers tend to associate the act of sharing their work with a publisher, generally a trusted brand. Separating disclosure layers from editorial ones (such as those provided by journals) will require significant cultural change.
While the idea of a central preprint service has appeal across stakeholders, this appeal is modulated by details of the potential specifications. For instance, a central indexing and search front end (PubMed or Google-style) would be acceptable to most stakeholders because it usefully centralizes indexing and search. Some feel, however, that such a service would be largely redundant with existing (albeit potentially less sustainable) community-provided search engines (such as
In our current preprint ecosystem, any new opportunities for content or innovation need to be negotiated and implemented across multiple systems. This issue can be addressed by creating appropriate centralized services and by defining standards that make distributed resources fully interoperable. A real advantage of interoperable, mirrored or unified full-text repositories would be the ability to easily layer on new services and tools. At last, we could visualize and work with the biomedical literature as a whole, rather than as fragments distributed across multiple platforms. We also would have the opportunity to increase the efficiency of the system, supporting aspects of the workflow that users currently like from different platforms while removing others that are less favorable (e.g., having to re-enter the same information multiple times; having figures and text separate for the reader; difficulty in porting articles from one platform to the next). The trick is to ensure that the service is as easy to use as possible for human authors and readers without closing doors to evolution into a better system for producing and mining text and data.
We believe that any services developed should “meet researchers where they are now.” The interfaces and functions of the service should, at least initially, be predictable and similar to existing tools. The service should place minimal burdens on authors and readers. If any additional burdens are required (for example, additional metadata entry) their benefits should be clearly explained to authors. The service should be open to innovation, and the way that the tool evolves should be driven by the user community. Developers should remember the motivations of researchers (credit, career progression, and convenience).
To fulfil the principle of researcher-focused design, the initial implementation of the service should fit into current author and reader workflows. This includes initial support for Word and PDF files and smooth (i.e., one-click) interaction between preprints and downstream journals. However, full text in tagged format (either JATS or XHTML) will be an important future development.
Nevertheless, we believe that ASAPbio has a unique opportunity to facilitate community investment in improving document converters and central tools/services to use and manage them. Beyond this, responders to the RFA should have the option to extend the service with new features that have yet to be considered.
ASAPbio should work to bring together the life sciences community around the idea of preprints and to define standards for preprint services in this discipline. In doing so, ASAPbio should build on the experience of communities experienced with preprints (such as physics) while also signalling the value of preprints to other communities where they are not yet the norm (such as chemistry). ASAPbio should also help to catalyze partnerships in the publishing ecosystem among preprint servers, the Central Service, journals, and tool developers.
The contents of this article are solely the responsibility of the authors and do not necessarily represent the official views of the NIH.
Attendees of the ASAPbio Technical Workshop (* denotes remote attendees)
First name | Last name | Affiliation |
John | Chodacki | California Digital Library |
Tim | Clark | Harvard |
Alf | Eaton* | PeerJ |
Martin | Fenner* | DataCite |
James | Fraser | UCSF and ASAPbio organizer |
Lee | Giles | Penn State and CiteSeerX |
Darla | Henderson | ACS/ChemRxiv |
Robert | Kiley | Wellcome Library |
Thomas | Lemberger | EMBO, SourceData |
Jennifer | Lin | Crossref |
Maryann | Martone | UCSD, NCMIR, Hypothes.is |
Johanna | McEntyre | Europe PMC, EMBL-EBI |
Bill | McKinney | Dataverse, Harvard |
Daniel | Mietchen | NIH |
Brian | Nosek | COS |
Laura | Paglione | ORCID |
Mark | Patterson | eLife |
Jessica | Polka | ASAPbio |
Kristen | Ratan | Coko Foundation |
Louise | Page | PLOS |
John | Sack | HighWire |
Ugis | Sarkans | ArrayExpress/BioStudies, EMBL-EBI |
Richard | Sever | Cold Spring Harbor Laboratory, bioRxiv |
Jeff | Spies | SHARE, COS |
Carly | Strasser | Moore Foundation |
Ron | Vale | UCSF and ASAPbio organizer |
Dan | Valen | figshare |
Simeon | Warner | arXiv, Cornell University Library |
Ioannis | Xenarios* | Swiss Inst. of Bioinformatics |
Documentation of the breakout sessions in notes and video. Links to the start of each session in the YouTube video are provided for convenience, but the entire video recording can also be viewed at
Session ID | Session title | Link to session notes | Video of session start time (h:mm:ss) | Link to video of session | Video of report-back start time (h:mm:ss) | Link to video of report-back |
1A |
|
|
1:37:20 |
|
2:56:47 |
|
1B |
|
|
3:12:09 |
|
||
2A |
|
|
3:40:50 |
|
4:46:05 |
|
2B |
|
|
4:52:08 |
|
||
3A |
|
|
5:09:07 |
|
6:00:47 |
|
3B |
|
|
6:06:58 |
|
||
4A |
|
|
6:41:39 |
|
7:39:49 |
|
4B |
|
|
7:51:08 |
|
Principles and recommendations for preprint technology development
Preprints are meant to facilitate and accelerate scholarly communication. Preprint services should encourage open science best practices.
Meet researchers where they are now. Accommodate existing workflows and formats while moving toward best practices over time. Remember the motivations of researchers (including credit, career progression, and convenience).
Take advantage of available technology. Preprint technology should be built quickly in a way that can be extended and expanded in the long term by many parties. Allow preprints to be transferred to journals in formats that fit journal workflows. |
Focus on standards. Use schema.org-compatible meta-tags and recognized API standards such as OAI-PMH or equivalent. Use the standard persistent identifiers adopted by the community so that we can systematically link up resources, people, and organizations. For example, include person identifiers, document identifiers, identifiers for data, etc., and authenticate them to the extent possible.
Make markup consistent. Engage with JATS4R or similar initiatives and follow existing recommendations on tagging.
Develop open technologies. Permissive, open licenses on software should be strongly encouraged, and serve as the default for new code written for any ASAPbio projects.
Encourage best practices for screening. Manuscripts must be screened by humans before posting, and takedown policies need to be implemented in a standardized fashion.
Stay simple. Accept submissions in Word format and display them in PDF from day 1. The originally submitted files should also be retained and made accessible for mining and processing.
Support open source conversions. Request and support the creation of an open-source document conversion tool from popular formats like Word and LaTeX to consistent markup (JATS and/or XHTML).
Develop machine screening algorithms. To learn from the process, require all manuscripts (accepted and rejected) to be collected along with their screening status to form a database of content; use this to improve machine screening algorithms.
Streamline transfers. Support simple transfer of articles to traditional journal workflows.
Promote data sharing. The service should make it easy for authors to refer readers to data, software and other relevant materials. Encourage and facilitate deposition of data in appropriate repositories. Directly accommodate deposition of supplementary files (such as figures, movies, and text), which should be given their own unique identifiers and be preserved and indexed appropriately. |