Corresponding author: Stephanie Simms (
The workshop took place on February 20 in Edinburgh, UK, as part of the 12th International Digital Curation Conference (February 20-23, 2017). The workshop materials are available via
With growing interest in active, actionable data management plans, the Digital Curation Centre (DCC) and University of California Curation Center (UC3) at the California Digital Library took the opportunity of the IDCC17 conference to convene a workshop on the topic. The aim of the workshop was to understand different stakeholder requirements and bring together a diverse international group to develop specific use cases for machine-actionable data management plans. The workshop participants included 47 people from 16 different countries, representing funders, developers, librarians, service providers, and the research community. In practical brainstorming exercises, the groups discussed use cases focused on interoperability with research systems, leveraging persistent identifiers, evaluation and monitoring, and repository and institutional perspectives, and prioritized future work. The DCC and UC3 will use the workshop outputs to implement and pilot the use cases in the DMPRoadmap platform.
Data management plans (DMPs) are becoming commonplace across the globe as a result of funders requiring them with grant proposals, but they are not being employed in ways that truly support the research enterprise. The current manifestation of a DMP, a static document often created before a project begins, only contributes to the perception that DMPs are an annoying administrative exercise. What they really are, or at least should be, is an integral part of research practice, since today most research across all disciplines involves data, code, and other digital components.
Conversations about the need for machine-actionable*
To advance the idea of machine-actionable DMPs (maDMPs), we conducted a landscape survey of existing tools and standards and began
This report represents the outputs of the
Why are you motivated / excited / required to work on data management / DMPs?
What are your pain points?
What do you hope to get out of this workshop?
The responses revealed several areas of common interest, including the perception of the DMP as a hub or connector for different services, the potential to use the DMP as an advocacy and training tool to support researchers, and the desire to share the information in DMPs dynamically with a variety of research stakeholders and information systems. Full responses are available on Zenodo (
We introduced eight broadly defined topics for the IDCC workshop and asked participants to vote on them. The topics are listed below in order of participant interest. Participants organized themselves into groups to develop and prioritize use cases for the top four topics, while another small group took on evaluation and monitoring. The topics are all interconnected, and the use cases reflect this; e.g., use cases involving ORCID IDs cut across multiple groups. We summarized the use cases for each topic, placing some of them under topics that were not explicitly covered during the workshop but where they naturally fall.
Interoperability with other research systems
Leveraging persistent identifiers (PIDs)
Institutional use cases
Repository use cases
Data discovery and reuse
Evaluation and monitoring
Disciplinary tailoring and recommender systems
Publishing DMPs
The central theme of the workshop was interoperability and exchange of information across research systems. Groups considered a variety of systems, including Current Research Information Systems (CRIS) for managing project details, funder systems, electronic lab notebooks (ELNs), active storage and repositories, and publisher systems. The need for common standards emerged as a top priority and is the main use case below. Another priority area is some form of integration with funder systems, since funders drive many of the requirements. The ultimate goal is achieving interoperability with a range of systems used by different stakeholders throughout the research lifecycle.
A nice-to-have, but not necessary, next step involves developing a common interface and default implementation in a variety of programming languages to enable a common way of accessing information in maDMPs. As a consequence, all tools and systems involved in processing research data can be extended easily to be able to provide and access information to/from a DMP. For example, a workflow engine can add provenance information to the maDMP, a file format characterization tool can supplement it with identified file formats, and a repository system can automatically pick suitable content types for submission and later automatically identify applicable preservation strategies.
Creating a common interface will increase the interoperability between systems and enable continuous testing of the availability of systems and referenced resources. It will also enable validation of the information provided, for example, by checking whether a provided DOI links to an existing dataset, whether file hashes match their provenance traces, or whether a license was specified.
Furthermore, it will improve interoperability between repositories, especially in cases where a project generates different kinds of data that may be deposited in multiple repositories. maDMPs would maintain the links to all these individual datasets, thereby preserving the context in which the results were produced. Currently, such datasets are connected through associations with a dedicated publication, but this approach does not work well across multiple publications from a project, when there are no dedicated publications, when outputs other than data are shared (e.g., software), or when some of the data ends up in supplementary files to publications.
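The validation checks mentioned above could be sketched roughly as follows. This is a minimal illustration, not an agreed interface: the field names (`doi`, `license`, `sha256`) and the example DOI are hypothetical, and the DOI check is purely syntactic, whereas a real implementation would also need to confirm that the DOI resolves to an existing dataset.

```python
import hashlib
import re

# Loose syntactic pattern for a DOI; it does not prove the DOI resolves.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def sha256_of(content: bytes) -> str:
    """Return the hexadecimal SHA-256 digest of a file's content."""
    return hashlib.sha256(content).hexdigest()


def validate_dataset(entry: dict, content: bytes) -> list:
    """Return a list of problems found in one (hypothetical) maDMP dataset entry."""
    problems = []
    if not DOI_PATTERN.match(entry.get("doi", "")):
        problems.append("missing or malformed DOI")
    if not entry.get("license"):
        problems.append("no license specified")
    if entry.get("sha256") != sha256_of(content):
        problems.append("file hash does not match recorded hash")
    return problems


# Illustrative data only; the DOI below is a made-up placeholder.
payload = b"temperature,depth\n4.2,100\n"
good = {"doi": "10.1234/example.1", "license": "CC-BY-4.0",
        "sha256": sha256_of(payload)}
bad = {"doi": "not-a-doi", "license": "", "sha256": "0" * 64}

print(validate_dataset(good, payload))
print(validate_dataset(bad, payload))
```

Returning a list of problems rather than a single pass/fail flag lets a tool report every issue in one pass, which fits the idea of continuous, automated checking.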
Additional requirements in this area include that maDMPs:
Must make use of existing vocabularies (e.g., Crossref Funder Registry) and ontologies whenever possible to enable machine actionability;
Must employ common exchange formats and protocols (e.g., JSON), including lightweight protocols for big data operations;
Must be open to support new data types, models, and descriptions;
Must link to data and identifiable entities such as people, repositories, and licenses, thus enabling validation and scalability;
Should be available in a format that can be rendered for human use;
Should accommodate versioning to support actively updated DMPs.
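To make these requirements more concrete, the sketch below shows one way a versioned plan could be serialized to JSON and also rendered for human readers. The class names, field names, and identifier values are illustrative assumptions; no common maDMP standard is implied.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class PlanVersion:
    number: int
    note: str
    datasets: list = field(default_factory=list)


@dataclass
class MaDMP:
    title: str
    funder_id: str       # hypothetical: an ID from a funder registry
    contact_orcid: str   # hypothetical: the contact person's ORCID iD
    versions: list = field(default_factory=list)

    def add_version(self, note, datasets):
        """Append a new version instead of overwriting the previous one."""
        version = PlanVersion(len(self.versions) + 1, note, list(datasets))
        self.versions.append(version)
        return version

    def to_json(self):
        """Machine-actionable form: every version, serialized as JSON."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    def render(self):
        """Human-readable form of the latest version."""
        if not self.versions:
            raise ValueError("plan has no versions yet")
        latest = self.versions[-1]
        lines = [
            f"{self.title} (version {latest.number}: {latest.note})",
            f"Funder: {self.funder_id}",
            f"Contact: {self.contact_orcid}",
        ]
        lines += [f"Dataset: {d['title']} [{d['license']}]"
                  for d in latest.datasets]
        return "\n".join(lines)


# Placeholder identifiers, for illustration only.
plan = MaDMP("Example project", "funder:EXAMPLE", "0000-0000-0000-0000")
plan.add_version("submitted with proposal", [])
plan.add_version("first dataset deposited",
                 [{"title": "Survey results", "license": "CC0-1.0"}])

print(plan.render())
print(plan.to_json())
```

Keeping every version in the record, rather than only the latest state, is one simple way to support actively updated DMPs while preserving what was originally submitted.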
All participants view PIDs as a key ingredient in the transition to maDMPs because they would enable information to be passed across existing workflows and systems to plan resources, connect outputs, and automate reporting and monitoring. However, they noted that PID education, for researchers and other stakeholders, is a prerequisite for maximizing their potential value. This is not an insurmountable hurdle, as many researchers already recognize some PIDs, e.g., DOIs in the context of article citations, or field-specific ones like GenBank accession numbers. Even so, basic data literacy training should include a primer on PIDs: what they are, how to use them appropriately, and why they are important and useful.
PIDs enable automated associations that support reproducibility, data discovery and reuse, tracking usage and impact of research outputs for professional advancement, infrastructure funding, etc. Specific examples of PIDs that contribute to reproducibility include
Use PIDs to prepopulate sections of a DMP for which information is already available elsewhere (e.g., identifiers about the institutions, funders, people, infrastructure, and resources involved). The respective resources could then be notified about reuse;
Notify a repository or other infrastructure provider (e.g., supercomputing or sequencing center, ethical review board) when named in a DMP. Include key information, e.g., volume of data, file formats, licensing, and expected timeline;
Derive a description or identity for objects that do not yet exist, using machine-sourced or machine-generated metadata (e.g., about a dataset that will be generated and deposited in the future);
Notify a funder and/or institution when a dataset is deposited in a named repository and relay metadata and any associated IDs;
Notify authorities (institutional or governmental) when legal or policy requirements for data management (e.g., cases of reportable diseases) have been met;
Notify a research project when legal or policy requirements referred to in their DMP have been updated;
Notify a researcher when a repository accepts preservation responsibility for a dataset;
Pass information about grants, projects, and/or research outputs across profile systems to alleviate the need for manual entry. Automatically generate a CV or Biosketch;
Notify a research project when a new release of a software library they are using is available;
Identify publication of data (e.g., associated with a journal article or stand-alone data publication);
Allow project A to track reuse when the DMP of project B lists project A as a dependency;
Have the DMP of project A notify the DMP of project B when project A has additional datasets available of the kind that project B has started to reuse;
Aggregate DMPs for active, past, or upcoming research projects at the level of institutions, funders, repositories, authorities, instructors, etc. or even topics (e.g., public health emergencies like the Zika outbreak) for reporting, mining, teaching, and planning of future research and infrastructure.
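As an illustration of the dependency-based reuse tracking described in the use cases above, the following sketch builds a reverse index from a set of plans. The `project_id` and `depends_on` fields and the identifier values are hypothetical; in practice the identifiers would be PIDs.

```python
from collections import defaultdict


def reuse_index(plans):
    """Map each project ID to the sorted IDs of projects that depend on it."""
    reused_by = defaultdict(set)
    for plan in plans:
        for dependency in plan.get("depends_on", []):
            reused_by[dependency].add(plan["project_id"])
    return {pid: sorted(users) for pid, users in reused_by.items()}


# Illustrative plans; the identifiers are placeholders, not real PIDs.
plans = [
    {"project_id": "project:A"},
    {"project_id": "project:B", "depends_on": ["project:A"]},
    {"project_id": "project:C", "depends_on": ["project:A"]},
]

index = reuse_index(plans)
print(index)
```

With such an index, project A can see which projects reuse its outputs, and the same lookup tells a notification service which projects to alert when project A publishes additional datasets.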
Institutions (especially universities) are significant stakeholders in the research data management (RDM) landscape and often have data management policies and/or DMP requirements of their own. Many participants brought this perspective in their roles as university administrators, data librarians, and technologists. They noted myriad challenges related to connecting people, resources, systems, and policies within an institution, as well as providing training and outreach services. Capacity planning was another high-priority application of maDMPs. Next steps for DCC and UC3 include modeling the flow of information within some pilot institutions to understand what can be passed between DMPs and existing systems (e.g., Offices of Research, library, institutional repository (IR), and faculty profile systems) and testing the use cases below.
Help researchers choose appropriate tools: e.g., high performance computing (HPC), ELNs, secure storage, file transfer services;
Connect researchers with ethics review, auditing, reporting, and other institutional workflows pertinent to the research described in a DMP;
Connect researchers with training opportunities and consultation services; flag incomplete answers and offer training/advice;
Map DMPs to domain-specific workflows to offer tailored templates and guidance;
Connect researchers with IT services: get updates on data needs as a project progresses; budget for RDM; forecast storage and preservation needs and plan other infrastructure;
Integrate DMPs with active data storage (both local and cloud), computation, and ELNs to inform storage allocation/purchase and enable immediate access to data;
Integrate DMPs with access management for data and associated research outputs.
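The capacity-planning use case above could, for example, be supported by aggregating the storage estimates recorded in individual plans. The sketch below assumes a hypothetical `expected_tb_by_year` field holding each project's estimated data volume in terabytes per year; the numbers are invented for illustration.

```python
from collections import Counter


def forecast_storage(plans):
    """Sum the expected data volume (terabytes) per year across all plans."""
    totals = Counter()
    for plan in plans:
        for year, terabytes in plan.get("expected_tb_by_year", {}).items():
            totals[year] += terabytes
    return dict(sorted(totals.items()))


# Invented estimates for two hypothetical projects.
plans = [
    {"project_id": "project:A", "expected_tb_by_year": {2018: 2, 2019: 5}},
    {"project_id": "project:B", "expected_tb_by_year": {2019: 1, 2020: 10}},
]

forecast = forecast_storage(plans)
print(forecast)
```

Because the estimates live in the plans themselves, the forecast can be recomputed whenever a researcher updates a DMP as the project progresses.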
An integral component of this and other use cases is the need to make DMPs an open resource. Ideally, they should be publicly available in line with open, transparent, reproducible research objectives—open DMPs would demonstrate good research practice and facilitate data discovery and reuse. We acknowledge that the culture change toward greater openness is slow and uneven across the academy, and so at the very least DMPs should be shared within an institutional setting*
Data repositories play a key role in the long-term management of data, ensuring that it is preserved and remains accessible. The majority of DMP requirements ask researchers to identify an intended data repository, recognizing that repositories are more suitable than other commonly used media (e.g., hard drives, project websites). It is rare, however, for repositories to play an active role in the data management planning process. One instructive exception is the Natural Environment Research Council (NERC) in the UK, which has designated data centers. NERC-funded researchers identify which one they will deposit in at the grant application stage and then
Connecting researchers with community-curated lists such as
Mining extant DMPs for a specific funder and/or discipline and making recommendations based on the top-cited repositories;
Once the data type is identified in a DMP, information could be provided on the top 10 repositories where data of this kind has been deposited in the past. Additional filters could be offered to highlight trusted digital repositories and those that assign PIDs (as in the re3data catalogue);
It could be useful to filter by repositories used by the researcher’s own institution.
Any recommender services should have different functionality for data generators and reusers, instructors, tool developers, institutions, funders, and others.
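A very simple version of the repository recommendation idea above can be expressed as a frequency count over existing plans. The sketch below is an assumption-laden illustration: the `datasets`, `data_type`, and `repository` fields and the repository names are hypothetical, and the optional `allowed` set stands in for filters such as trusted repositories or those used by the researcher's own institution.

```python
from collections import Counter


def recommend_repositories(plans, data_type, limit=10, allowed=None):
    """Return the repositories most often named for a data type, most cited first."""
    counts = Counter(
        dataset["repository"]
        for plan in plans
        for dataset in plan.get("datasets", [])
        if dataset["data_type"] == data_type
        and (allowed is None or dataset["repository"] in allowed)
    )
    return [name for name, _ in counts.most_common(limit)]


# Invented plans with placeholder repository names.
plans = [
    {"datasets": [{"data_type": "sequence", "repository": "Repository A"}]},
    {"datasets": [{"data_type": "sequence", "repository": "Repository A"},
                  {"data_type": "image", "repository": "Repository C"}]},
    {"datasets": [{"data_type": "sequence", "repository": "Repository B"}]},
]

top = recommend_repositories(plans, "sequence")
filtered = recommend_repositories(plans, "sequence", allowed={"Repository B"})
print(top)
print(filtered)
```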
Of the nearly 40,000 DMPs that have been written so far with DMPonline and the DMPTool, very few are available in ways that would help people, machines, or institutions find out about the research and data they describe. In the following, we consider approaches to using maDMPs for discoverability and reuse; the approaches also rely on making DMPs public, versioned, and aware of PIDs. These four dimensions each confer benefits for data discovery in their own right, but each of the possible combinations—which can also be explored independently—increases the benefits substantially.
Finding out about updates to the record associated with a PID mentioned in a DMP is hard when the DMP is available only on paper or in an unstructured format like PDF, although the manual process outlined above could be repeated at additional points in time, as is the case with the
The current manifestation of DMPs is not well suited to automated compliance checks, but this is a critical need for review processes to scale. When DMPs are not evaluated for quality and a poor DMP is perceived to be of no consequence, policies are quickly undermined. Participants focused on determining checkpoints and making recommendations for more structured DMP content. They also noted that funders as well as reviewers need training in DMP evaluation, and that evaluation rubrics would help everyone assess plans (cf.
This is among the most challenging issues, yet it ranks as a high priority. All stakeholders emphasized the need to offer relevant guidance at appropriate points throughout the research lifecycle. The current approach instead asks broad, unstructured questions at the planning stage, when few details are known for sure, and presents generic RDM best practices as guidance. Data management strategies can vary dramatically between and even within disciplines, and so the prevailing wisdom has been to leave it to researchers and/or research communities to determine their own standards and best practices. Only a handful of communities that benefit from standardization have cohered around common practices (e.g., genomics research), and some others are beginning to follow suit (e.g., fMRI brain imaging). Although the culture change is slow and precisely targeted guidance may never be available for all disciplines, there are opportunities to hook into some existing systems and databases. There was consensus during the workshop about the urgent need to experiment with serving up more helpful guidance and to improve the DMP experience for everyone.
To this end, the DMPRoadmap project is developing some pilots. Repository recommender services (described above) are one obvious area for experimentation. RDA funding is available to test an integration of the RDA metadata standards catalogue. Tagging or filtering by community/disciplinary affiliation might facilitate these efforts.
Biosharing.org is a disciplinary partner with an API that can be used to connect researchers to a curated database of resources for the biomedical and environmental research communities as they are writing a plan. The
Additional opportunities should be identified and drawn into the maDMP discussion. This is also an important consideration when developing common standards for DMPs; i.e., we need an expressive format with lots of optional fields to accommodate different disciplines.
There are growing trends towards both informal sharing and formal publication of DMPs. Opening DMPs brings many benefits and is something we actively encourage. DMPs can be aggregated and mined to identify trends or aid discoverability and reuse of data. They also serve as a useful training resource; many institutions refer to “good” examples to help other researchers get started.
Resource type and DMP publishing options: Numerous stakeholders expressed a need to define a resource type for DMPs to distinguish them from datasets and other research outputs (e.g., when deposited together in repositories). This should be supported by DataCite and other common metadata schemas.
There is a strong desire to assign DOIs to DMPs in order to link DMPs with related outputs of a project such as publications, datasets, and software (see Leveraging PIDs). This would aid reproducibility, as the context of the research and all the outputs could be shared together. It is necessary to think through the implications of assigning DOIs to DMPs, however, especially if we aim to support a lifecycle approach with dynamic updating. At a minimum, there should be two versions of record for a DMP: one submitted with a grant proposal and a second one at the grant closeout/reporting stage. Another way to think about this could be in the context of software versioning: every commit has an ID on GitHub or elsewhere, but a DOI only gets assigned to the subset of versions submitted to a repository or publisher. This also dovetails with ongoing efforts in the RDA and elsewhere to define best practices for citing dynamic data.
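The software-versioning analogy can be illustrated with a short sketch in which every revision of a plan has its own ID but only revisions tagged as milestones are selected as versions of record, i.e., candidates for a DOI. The `rev` and `milestone` fields and the milestone names are hypothetical, and no actual DOI registration is performed.

```python
def versions_of_record(history, milestones=("proposal", "closeout")):
    """Return the revisions tagged with a milestone; only these would get a DOI."""
    return [rev for rev in history if rev.get("milestone") in milestones]


# Invented revision history for one plan.
history = [
    {"rev": 1, "milestone": "proposal"},
    {"rev": 2},
    {"rev": 3},
    {"rev": 4, "milestone": "closeout"},
]

selected = versions_of_record(history)
print([rev["rev"] for rev in selected])
```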
Various entities are testing the idea of DMPs as a publishable unit to promote greater openness and enhance their value to researchers. For instance, DMPs form
DMPRoadmap already supports sharing DMPs within an institution or openly in a
This document presents a list of community-generated maDMP use cases. It also articulates a consensus about the need for a common standard for maDMPs to enable future work in this area. At the RDA 9th Plenary meeting in Barcelona during the
Recommendations regarding DMPs will also be made via the European Commission’s
The DCC and UC3 will continue to pursue international collaborations related to DMPRoadmap through pilot projects. As part of an iterative process for developing, implementing, testing, and refining these use cases, they will model domain-specific and institutional pilot projects to determine what information can realistically move between stakeholders, systems, and research workflows. There is some existing funding to support a subset of this work; the organizations are actively seeking additional sources of funding to carry the project forward.
Existing funding includes an RDA Europe collaboration award to support embedding the Metadata Standards Directory and biosharing.org resources into the DMPRoadmap platform. The biosharing integration will support the biomedical research community and taps into larger initiatives such as ELIXIR. OpenAIRE funding will support an export to Zenodo feature, and EUDAT will contribute to further API development.
Another disciplinary pilot project involves partnering with the NSF-funded BCO-DMO to use its
In addition to the upcoming RDA meeting, we will circulate these use cases with the FORCE11 FAIR DMPs group and identify additional opportunities to connect with international maDMP initiatives as well as working groups in related pursuits (e.g., controlled vocabularies, decision trees for data management). We will continue to collect feedback on these use cases and facilitate discussions about how to prioritize our next steps as a community, ideally through use-driven experimentation in multiple directions.
The list of participants is given in Table
We acknowledge and appreciate the contributions of those who participated in the workshop. Many additional individuals and organizations have helped shape the ideas and use cases herein. This project is supported by an RDA/US Data Share fellowship awarded to S. Simms, sponsored through a grant from the Alfred P. Sloan Foundation #G-2014-13746.
The Data Documentation Initiative defines
Anecdotal evidence suggests that sharing within institutions is feasible: DMPTool administrators report a ready willingness among researchers to share DMPs within their institution, although not always with a wider public audience, and DMPonline users have requested an institutional sharing functionality.
List of workshop participants.
Name | Institution | Country |
Alex Ball | University of Bath | United Kingdom |
Amber Leahey | Scholars Portal | Canada |
Andy Riddick | British Geological Survey | United Kingdom |
Benjamin Faure | INIST-CNRS | France |
Bev Jones | University of Lincoln | United Kingdom |
Brian Riley | California Digital Library | United States of America |
Chuck Humphrey | Portage Network | Canada |
Daniel Spichtinger | European Commission | Belgium |
Daniella Lowenberg | California Digital Library | United States of America |
David McElroy | Birkbeck, University of London | United Kingdom |
Falco Hüser | Technical University of Denmark | Denmark |
Fernando Aguilar Gómez | IFCA | Spain |
Fernando Rios | Johns Hopkins University | United States of America |
Gene Melzack | The University of Sydney | Australia |
Heila Pienaar | University of Pretoria | South Africa |
James Wilson | UCL | United Kingdom |
Jari Friman | Helsinki University Library | Finland |
Jez Cope | University of Sheffield | United Kingdom |
Jimmy Angelakos | EDINA, University of Edinburgh | United Kingdom |
John Chodacki | California Digital Library | United States of America |
Joshua Finnell | Los Alamos National Laboratory | United States of America |
Kevin Ashley | Digital Curation Centre | United Kingdom |
Lisa Johnston | University of Minnesota | United States of America |
Margreet Bloemers | ZonMw | Netherlands |
Mari Elisa "Mek" Kuusniemi | University of Helsinki | Finland |
Marie-Christine Jacquemot | INIST-CNRS | France |
Marisa Perez | Universidad Autónoma de Madrid | Spain |
Marisa Strong | California Digital Library | United States of America |
Marta Teperek | University of Cambridge | United Kingdom |
Michael Heeremans | University of Oslo | Norway |
Michael Moosberger | Dalhousie University | Canada |
Myriam Mertens | Ghent University | Belgium |
Patrick McCann | University of St Andrews | United Kingdom |
Pedro Principe | University of Minho | Portugal |
Peter McQuilton | University of Oxford | United Kingdom |
Peter Neish | The University of Melbourne | Australia |
Poppy Townsend | Centre for Environmental Data Analysis | United Kingdom |
Rachael Kotarski | British Library | United Kingdom |
Ray Carrick | EDINA, University of Edinburgh | United Kingdom |
Rob Hooft | DTL | Netherlands |
Roman Ujbanyai | VUB | Slovakia |
Sarah Jones | Digital Curation Centre | United Kingdom |
Stephanie Simms | California Digital Library | United States of America |
Thilo Paul-Stueve | Kiel University | Germany |
Tomasz Miksa | TU Wien | Austria |
Weiwei Shi | University of Alberta Libraries | Canada |
William Michener | University of New Mexico | United States of America |