Research Ideas and Outcomes
Research Article
Corresponding author: Caroline Fischer (c.fischer@utwente.nl)
Academic editor: Tamara Heck
Received: 10 May 2022 | Accepted: 14 Sep 2022 | Published: 17 Nov 2022
© 2022 Caroline Fischer, Simon David Hirsbrunner, Vanessa Teckentrup
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Fischer C, Hirsbrunner SD, Teckentrup V (2022) Producing Open Data. Research Ideas and Outcomes 8: e86384. https://doi.org/10.3897/rio.8.e86384
Open data offer the opportunity to economically combine data into large-scale datasets, fostering collaboration and re-use in the interest of treating researchers’ resources, as well as study participants, with care. Whereas advantages of utilising open data might be self-evident, the production of open datasets also challenges individual researchers. This is especially true for open data that include personal data, for which higher requirements have been legislated. Mainly building on our own experience as scholars from different research traditions (life sciences, social sciences and humanities), we describe best-practice approaches for opening up research data. We reflect on common barriers and strategies to overcome them, condensed into a step-by-step guide focused on actionable advice in order to mitigate the costs and promote the benefits of open data on three levels at once: society, the disciplines and individual researchers. Our contribution may prevent researchers and research units from re-inventing the wheel when opening data and enable them to learn from our experience.
Keywords: research data, data sharing, data repositories, open science
Modern computing facilities have given momentum to analysing large quantities of human-centred data in many fields of research. Whereas a high number of small individual research grants and projects has promoted the acquisition of many small-scale datasets, the collection of human-centred data can be challenging for study participants, indicating a need to collect data as parsimoniously as possible. Amongst other advantages, open data offer the opportunity to economically combine data into large-scale datasets, thereby fostering collaboration and re-use of data to treat researchers’ resources, as well as study participants, with care. We build on this idea of open data and explore the challenges and benefits of producing open data when studying humans and their behaviour through the lens of distinct research fields: life sciences, humanities and social sciences. We thereby draw on our own experience as open scholars, as well as the literature.
Below, we characterise practices, processes and challenges of opening and sharing scientific research material, and we provide guidelines for researchers to make their data accessible for re-use. We aim to contribute to a growing, interdisciplinary body of literature that treats open science both as a research object and a practice.
In the remainder of this article, we first discuss the characteristics of open data, what qualifies as open data and which formats are reusable. Secondly, we analyse key steps in the research process to produce open data, such as gaining participants’ consent, preparing data and metadata and choosing a repository. We specifically acknowledge the effort and resources needed for these steps and highlight the availability of assistance at universities and other institutions that support the production of open data. Finally, we conclude by discussing the impact of producing open data. In doing so, we analyse long-term and short-term consequences on the societal, disciplinary and individual levels to facilitate decision-making regarding open data practices within academic institutions.
For a long time, the handling of data has been an integral part of scientific practice. More recently, data have been characterised as the key resource.
Data
'Datum' (sg.) and 'data' (pl.) are derived from the Latin word for 'given' or 'something given'. 'Data' typically combine three qualities: a material, a symbolic and a pragmatic one. From the material perspective, data can take both analogue and digital forms. The focus here is on digital data, which are pieces of information based on binary, electrical pulses. They can be sent from one location to another as a signal or they can remain in one place, for example, when stored on a medium.
Research Data
Definitions vary, but research data are commonly understood as all data that are created, collected or re-used in the course of scientific activity, irrespective of their form or disciplinary origin.
For example, in neuroscience and many other life sciences, datasets are often highly modular. Methods that are used to measure brain function produce data as heterogeneous as lists of voltage values (in the case of an electroencephalogram) on the one hand and greyscale values in three-dimensional image spaces (in the case of functional magnetic resonance imaging) on the other hand. If they are to be re-used, these data points need metadata containing references to underlying technology and settings to structure the data streams in a coherent way.
In many social science fields, such as political science, public administration, organisational studies or management, data are often structured along multiple levels, referring for example to countries, organisations within these countries and individuals within countries and organisations. Such a multi-level structure is only useful when links between levels can be clearly identified (i.e. which citizen belongs to which country).
Moreover, qualitative research tends to work with highly unstructured data, which include artefacts, such as field notes, memos, official documents, images, movie clips, diagrams and tabular data. These data typically include identifiable information about concrete individuals, places or events.
To summarise, we pragmatically understand research data as any data produced in scientific processes and/or used in scientific processes. That might also include big data originally produced by social media companies and repurposed for science or data produced in private sector marketing research. Such datasets might be subject to different restrictions concerning data sharing.
Open Research Data
What does 'openness' of data mean in a research context? This question has been answered in many different ways depending on one's understanding of openness in science.
By making data available in any form, stakeholders contribute to making research more transparent and, in the best case, reproducible. However, not all open datasets are equally useful for secondary use.
Nevertheless, in this article we understand the intention of 'opening' as making data accessible to the respective research community in a more controlled manner, taking into consideration specific methodological and content-related criteria. This means data access might be restricted to scientific use only, or researchers may need to state their research interest or intended methodology before gaining access. Focusing on this kind of openness is foremost a pragmatic decision to limit the scope of this article, because sharing data with amateur scientists and other stakeholders often implies a different quality of accessibility, for example, the need to summarise, plot or explain data rather than just publishing a raw dataset. Such regulated access can be organised by a research data centre or an institutional repository, for example, by offering on-site access to data, via a remote desktop or by sharing only parts of a dataset. Although these restrictions might impede the notion of openness, especially when working with sensitive data, such as personal data or data from vulnerable groups, this is often the only way to open up data at least partially.
The production of open data does not start with sharing a dataset, but usually with the beginning of each research cycle. In every step of a research project, opening datasets then requires additional considerations: measures need to be taken and barriers overcome. The following section discusses these elements for producing open data along the ideal type of a linear research process. In reality, however, some of the described activities and considerations may overlap or occur in a different order.
We suggest considering the possibility of opening up research data as early as the planning phase of a project. Although data might be opened up ex-post, such a procedure is often problematic, as participants' consent to publication of their data might lack the necessary legal provisions, or formatting the datasets for interoperability may be time-consuming.
Researchers should be aware that the handling, collection and use of research data are nowadays highly regulated by national and international legislation. Whenever research data involve information about individual human beings, production and use of the data are currently, for example, regulated by the General Data Protection Regulation (GDPR) for the European Union. The framework stipulates provisions such as the residence principle, the right to data transfer, the obligation to protect data through system design, a right of complaint and the sanctioning of violations.
As individual researchers cannot always easily ensure compliance with current or future laws, many funding agencies and ethics boards of universities have taken on this responsibility. Grantees are asked to develop detailed data management plans (DMPs) in the application or planning phase. The requirements for these DMPs increasingly address and are favourable to the issue of open data. The German Research Foundation (Deutsche Forschungsgemeinschaft, DFG), for example, states that grantees should consider whether and what research data could be relevant for other research contexts and how these data can be made available for reuse. Research data should be "made available as soon as possible".
As of 2021, funding requirements are even more formalised regarding an open data policy 'by default'. While it is not possible to opt out of these obligations completely, researchers may opt out of the requirement to provide open access to data and metadata as long as an explanation is given. The implementation of these funding policies is increasingly supported by infrastructure initiatives and networks; examples in this context are the German national research data infrastructure with its research data centres and the plans for a European Open Science Cloud.
Apart from funding bodies, ethics boards of research institutions may also require researchers to produce DMPs. This is mostly the case when research activities involve individuals. When phenomena are studied on the meso or macro level (e.g. studying countries, regions or economic sectors instead of individuals), however, there are still many empirical research projects that plan their data management in an emergent process running in parallel with the project, especially when no ethics approval has to be obtained in advance.
A DMP usually describes which kind and amount of data will be collected in which way and how data will then be stored. Several templates (e.g. for Horizon Europe, from the DFG or other national funders) and even apps (e.g. DMPTool) exist, making it easy to compile a DMP. Step-by-step guides to preparing a data management plan have been published by many research support units at universities and also by the Open Science Framework (OSF). Writing a DMP prior to data collection helps to pre-plan the different steps of data collection, analysis and storage, so that requirements for data storage (e.g. a specific form of participants’ informed consent requested by a repository) can be implemented. Many organisational or disciplinary data repositories and data centres are ready to give advice on this planning process. Alternatively, organisational research support teams can be contacted to find out which regulations and aspects need to be taken into account.
Even if not requested, researchers - in our opinion - should always use a DMP to ensure that important opportunities are not missed, for example, with regard to opening up data. DMPs are also helpful for establishing important steps in a research process within a team and for avoiding conflict later in case team members have different interests, for example, with regard to opening up the dataset.
Despite all this help being available, a DMP will not be written in a day - especially if researchers lack experience. Depending on the level of detail requested by a funder, an ethics committee or other stakeholders, the mere writing of a DMP might take several days - not to mention planning the data collection and storage as such. Hence, it is important to start this activity well ahead of any deadlines and planned data collection start dates. If a pre-registration (i.e. submitting your full research plan to a registry to separate confirmatory from exploratory analyses) or a registered report (i.e. submitting your full research plan to a journal for peer review and possible in-principle acceptance of the resulting manuscript) is an option for the project, a DMP is mandatory and a good starting point for reflecting on the expected structure of the data and intended outcomes.
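For illustration, the sketch below captures typical DMP sections as a machine-readable file. All field names and values are hypothetical placeholders for this example, not any funder's official template.

```python
import json

# Minimal, hypothetical skeleton of the sections a DMP typically covers.
dmp = {
    "project": "Example study on civic participation",
    "data_description": {
        "types": ["survey responses (CSV)", "interview transcripts (TXT)"],
        "estimated_volume": "~2 GB",
    },
    "collection": {
        "methods": ["online questionnaire", "semi-structured interviews"],
        "personal_data": True,
        "informed_consent": "separate opt-in for open data publication",
    },
    "storage_and_sharing": {
        "repository": "institutional repository or, e.g., Zenodo/OSF",
        "licence": "CC BY 4.0",
        "embargo": "12 months, if required by the funded analyses",
    },
    "responsibilities": {"data_steward": "PI", "review": "annually"},
}

# Write the plan alongside the project so it can evolve with it.
with open("dmp.json", "w") as f:
    json.dump(dmp, f, indent=2)
```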
Individuals participating in empirical studies need to give their consent to data collection procedures and they need to be informed about the context of the research. How consent needs to be acquired and the scope of information that needs to be given are regulated by the GDPR in the EU context*.
These obligations also have consequences for the open data production process. The informed consent form (ICF) must include information on the intention to make the data, or parts of them, accessible as an open dataset. The ICF should further include detailed information about the location the datasets will be uploaded to (e.g. the open data repository). Crucially, current legislation also specifies a requirement to use an explicit opt-in procedure (i.e. participants actively select and agree to a certain procedure regarding data collection) instead of an opt-out (i.e. participants are automatically enrolled in the procedure and need to explicitly state their wish to not take part).
When informing participants about the study and asking them for their consent, we learned that it is useful to separate the opt-in for general participation from the opt-in for opening up (parts of) the dataset later on. This way, open data constitute no threat to sample size, and data from participants who are concerned about opening up data can still be used in analyses. However, such a procedure might result in different datasets (a complete (closed) and a reduced (open) dataset), rendering attempts to reproduce the reported results from the open dataset alone futile.
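As a minimal sketch of this two-dataset procedure, the following snippet derives the open subset from the full dataset; the file and column names (open_data_consent, name, email, ip_address) are assumptions for this example.

```python
import pandas as pd

# Hypothetical full dataset: one row per participant, with a separate
# opt-in flag for open data publication, as suggested above.
full = pd.read_csv("study_data_full.csv")

# Keep only participants who explicitly opted in to data sharing ...
open_subset = full[full["open_data_consent"] == True].copy()

# ... and drop direct identifiers that must never be published.
open_subset = open_subset.drop(columns=["name", "email", "ip_address"])

open_subset.to_csv("study_data_open.csv", index=False)
print(f"{len(open_subset)} of {len(full)} participants consented to open data.")
```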
Obtaining informed consent from individuals is not always easy, as illustrated by members of vulnerable groups who might not be able to understand the given information about a study. For example, individuals suffering from severe cognitive decline, under-age participants or individuals with a limited capacity to communicate, such as coma patients, might not be able to opt into a study in an informed way. Solutions are highly dependent on the particular individual; sometimes informed consent can be given by parents or other guardians. In other cases, information in simple language and a corresponding consent form might be sufficient.
Another issue to reflect upon is the fact that making participants aware of open data might bias their behaviour within a study. They might be hesitant to talk about sensitive issues, fearing that their data might later be de-anonymised. Additionally, insufficient information on which research questions will be pursued based on their data may make them cautious about participating. This may especially apply if participants are not motivated to participate in research in general, but have an interest in a certain research topic and are only willing to participate if the use of their data is restricted to advancing this topic. On the other hand, opening up their data might also motivate individuals to participate, as their efforts can have a higher impact when re-used for several research projects.
While planning data collection, some thought should be given to which data are and are not necessary to collect. Collected data should be as rich as necessary and as sparse as possible to make the best use of the time participants invest. Additionally, researchers should be aware of metadata that can be automatically collected by certain devices. For example, smartphones, which are increasingly used for research on human behaviour, have passive sensors (such as geolocation via GPS). These can provide interesting and valuable insights for research projects focusing on movement patterns, for example, but can also pose a risk of revealing sensitive information (such as participants visiting specialist medical healthcare centres). If these data are central to the research conducted, researchers should plan pre-processing strategies (see next section) to remove any potentially identifying information from the data that are to be opened before release.
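As an illustration of such a pre-processing strategy, the following sketch coarsens hypothetical GPS data before release; all file and column names are assumptions for this example.

```python
import pandas as pd

# Hypothetical GPS trace with latitude/longitude columns.
trace = pd.read_csv("gps_trace.csv")

# Coarsen coordinates to two decimal places (roughly 1 km resolution),
# so that broad movement patterns are preserved while exact locations
# (e.g. a specific clinic) are no longer recoverable.
trace["lat"] = trace["lat"].round(2)
trace["lon"] = trace["lon"].round(2)

# Drop device-level metadata that could act as a fingerprint.
trace = trace.drop(columns=["device_id"], errors="ignore")

trace.to_csv("gps_trace_open.csv", index=False)
```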
Depending on the clarity of the DMP, extensive cleaning of the data (i.e. removing data points that do not reflect the data acquisition pipeline as described in the DMP due to, for example, hardware failure or human error) should not be necessary. Still, technical errors or disruptions in data acquisition - such as those experienced by many researchers during the initial wave of the Covid-19 pandemic, when participants could not be invited into the lab - can lead to alterations in individual data files (e.g. due to changes in software versions installed on shared hardware used for measurements) that are time-consuming to repair for an individual researcher working on opening their dataset. Importantly, whenever possible, the dataset should only be cleaned to the degree that erroneous data points are removed, but the data have not been fully transformed from raw data into a summary format (for example, calculating means instead of reporting the initially collected values). This way, researchers can perform a broad range of data analyses on the open dataset, which maximises the utility of the time invested in the first place. Based on the type of data, metadata need to be added which sufficiently describe the data to enable follow-up analyses. In the case of neuroscientific data, for example, the BIDS standard (Brain Imaging Data Structure) provides such a community convention for organising data and metadata.
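As a minimal, hypothetical sketch of a BIDS-style layout, the snippet below creates the required dataset description and a metadata sidecar for one (not included) functional imaging file; consult the current BIDS specification for the full set of required fields.

```python
import json
from pathlib import Path

# Minimal BIDS-style skeleton for one subject's resting-state fMRI run.
root = Path("my_open_dataset")
(root / "sub-01" / "func").mkdir(parents=True, exist_ok=True)

# Required top-level description of the dataset.
(root / "dataset_description.json").write_text(json.dumps({
    "Name": "Example resting-state dataset",
    "BIDSVersion": "1.8.0",
    "License": "CC0",
}, indent=2))

# JSON sidecar: machine-readable acquisition metadata for the NIfTI file
# my_open_dataset/sub-01/func/sub-01_task-rest_bold.nii.gz (not created here).
(root / "sub-01" / "func" / "sub-01_task-rest_bold.json").write_text(json.dumps({
    "TaskName": "rest",
    "RepetitionTime": 2.0,   # seconds
    "EchoTime": 0.03,        # seconds
}, indent=2))
```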
Eventually, the choice of the file format is important to render the dataset accessible to a broad audience and for a maximum duration. For some data types and depending on the field, certain file formats are the quasi-standard (for example, (compressed) NIfTI for neuroimaging data) and can be processed by all standard software packages. If the data do not require a certain file format or if no clear disciplinary standards are available, the simplest representation that preserves any structure inherent to the data should be chosen, and proprietary formats should be avoided, as they make it harder for researchers lacking the necessary software licences to open and process a file (e.g. data formats that can only be opened with certain statistics software). Adding an easily accessible, human- as well as machine-readable description to the data makes it easy for researchers browsing a repository to decide whether the dataset fits their research question.
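A minimal sketch of such a conversion, assuming a hypothetical SPSS file survey.sav and using pandas (whose read_spss function relies on the pyreadstat package):

```python
import pandas as pd

# Read a proprietary statistics file and re-save it as plain CSV,
# so no software licence is needed to open it.
df = pd.read_spss("survey.sav")        # hypothetical input file
df.to_csv("survey.csv", index=False)

# A short, human- and machine-readable description alongside the data.
with open("survey_README.txt", "w") as f:
    f.write("survey.csv: one row per respondent; "
            "variable names and coding documented in the codebook.\n")
```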
However, preparing data for publication might also entail translating text data (stemming, for example, from documents, interviews or surveys) to make them usable for a broader international audience. Translations can introduce biases, especially when not done by a native speaker, and professional translators are expensive and not always affordable. Ideally, these costs related to opening up a dataset are factored in early in a project.
The standardisation of quantitative data might also induce a loss of information. If data are released in a summarised form to save space or to preserve anonymity - for example, if means are calculated over repeated measurements of a variable - a subset of analyses depending on the original raw measurements can no longer be run, thereby impeding attempts to reproduce previously reported results. However, modern data transfer rates and online server space are steadily increasing, so releasing the full raw data should be the default. For neuroimaging data, a growing number of online databases that invite the submission of raw data, such as OpenNeuro, allow sharing of big datasets without having one's own server readily available.
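The toy example below illustrates what is lost when only per-participant means are released instead of the raw repeated measurements; all values are invented for demonstration.

```python
import pandas as pd

# Repeated measurements per participant (long format) ...
raw = pd.DataFrame({
    "participant": ["p1", "p1", "p1", "p2", "p2", "p2"],
    "reaction_time": [512, 498, 734, 450, 461, 455],
})

# ... versus the summarised release: per-participant means.
summary = raw.groupby("participant", as_index=False)["reaction_time"].mean()

# The summary still permits group comparisons, but trial-level analyses
# (e.g. variability, outlier handling, learning effects) become impossible.
print(summary)
```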
A different problem in making datasets available for reproduction on various systems pertains to different statistical analysis packages, or different versions of these packages, leading to different results.
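One mitigation is to ship a record of the exact analysis environment together with the dataset. A minimal sketch follows; the package list is an assumption for this example.

```python
import sys
from importlib.metadata import version

# Record the exact analysis environment so others can reproduce results
# with the same package versions; ship this file with the open dataset.
packages = ["numpy", "pandas", "scipy"]  # the packages your analysis uses

with open("analysis_environment.txt", "w") as f:
    f.write(f"python=={sys.version.split()[0]}\n")
    for pkg in packages:
        f.write(f"{pkg}=={version(pkg)}\n")
```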
Although researchers should choose a place for publication and long-term storage of their data early in the project, data storing and sharing as such is the last step in the endeavour of producing open data. In general, selecting a repository depends on many different considerations: regulations by the employing organisation, regulations by funders, disciplinary conventions and normative considerations, for example, the location of the server, whether the repository is owned by a non-profit organisation and restrictions pertaining to the participants. If restrictions are needed - for example, limiting access to 'scientific use only' - so-called research data centres might be the best option. These data centres allow researchers to analyse restricted data on-site at a workplace in the data centre, compile scientific use files that differ from public use files or execute the code of external researchers and share results instead of raw data (see, for an example from the social sciences, the Secure Data Center of the German GESIS). If a certain discipline is to be addressed and no restrictions are needed, the best-known repository in the discipline is always a good choice. Usually, disciplinary associations or networks can give advice on preferred repositories. Multi-purpose archives are, for example, the Open Science Framework, Dataverse and Zenodo. Many universities and research organisations have also developed their own archives or are collaborating with other institutions to build a common infrastructure. Usually, these institutional repositories are intended to ensure the long-term storage of data that is nowadays often requested by funders. Advice on general data repositories and the organisational infrastructure can usually be obtained from a research organisation’s library or research support staff. They will also advise on a suitable data licence (e.g. CC0 or CC BY) to ensure the least restriction possible, as well as on data protection, if needed.
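As an illustration of a deposit workflow, the sketch below uses Zenodo's REST API; the endpoints and metadata fields follow its public documentation (https://developers.zenodo.org) but should be verified against the current version before use, and all file names, titles and author names are placeholders.

```python
import requests

TOKEN = "YOUR_ZENODO_TOKEN"  # personal access token; keep out of version control
BASE = "https://zenodo.org/api"

# 1. Create an empty deposition.
dep = requests.post(f"{BASE}/deposit/depositions",
                    params={"access_token": TOKEN}, json={}).json()

# 2. Upload the data file into the deposition's file bucket.
with open("survey.csv", "rb") as fp:
    requests.put(f"{dep['links']['bucket']}/survey.csv",
                 data=fp, params={"access_token": TOKEN})

# 3. Attach minimal metadata (title, authors, licence).
metadata = {"metadata": {
    "title": "Example open survey dataset",
    "upload_type": "dataset",
    "description": "Anonymised survey responses; see README.",
    "creators": [{"name": "Doe, Jane"}],
    "license": "cc-by-4.0",
}}
requests.put(f"{BASE}/deposit/depositions/{dep['id']}",
             params={"access_token": TOKEN}, json=metadata)

# Publishing (POST .../actions/publish) is left as a deliberate, reviewed step.
```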
A responsible attitude towards data, long-term goals of researchers and lab funding will sometimes even lead to the decision that an embargo period is necessary – or that producing and maintaining open datasets in a specific case creates a disproportionate burden for the researchers compared to the low possible value and prospect of data re-use. Embargoes can be issued if researchers acquiring data want to or need to make sure that a certain set of analyses can be finalised on the respective dataset before others can publish results based on it. This can be realised, for example, by setting an embargo date when depositing the dataset, a feature most repositories support.
Issuing embargoes can be sensible if a certain set of analyses is part of a grant that funded the acquisition of the data, or if students working towards a degree depend on publishing their results first. Towards this end, embargo periods can incentivise opening up datasets in areas where research funding is largely spent on data acquisition and new funding directly depends on high-tier publications - situations that often lead to fully closed datasets when embargoes are not an option.
With the advent of data publications, datasets can themselves be published as citable research outputs in their own right.
The production of open data is not an end in itself, and, especially when we revisit the resources that need to be invested to open up datasets, one might ask whether it is worth the effort. However, there are multiple rationales to open and share data in the research context.
Societal level
On the societal level, we might, for example, focus on the participants of studies. Data collection always puts a burden on participants: they need to spend their time answering questions or attending lab experiments. In the social sciences, the so-called "oversurveying" of society has been highlighted.
Open data are, in the long run, less burdensome for society because scholars studying similar questions might build on the same dataset instead of collecting data on their own. In the social sciences, some organisations have long been producers of open datasets which can be exploited for many different research questions. Socioeconomic panel studies (such as the German Socio-Economic Panel), public opinion surveys (such as the Eurobarometer) or data that are available from national statistical offices (such as the British Office for National Statistics) are efficient ways of collecting and using representative country-level data. In psychology and neuroscience, open data are slowly gaining traction after concerns were raised regarding the replicability of several seminal findings (the so-called replication crisis). By now, extensive data-sharing initiatives, such as the Human Connectome Project, have been founded; these often foster the usage of the available datasets by hosting competitions on a broad range of research questions (for example, ADHD-200).
Apart from the burden on study participants, data collection is expensive and, at the same time, research organisations often lack resources. A more efficient use of resources - most commonly tax money - can be achieved by teaming up for data collection and sharing data. In this way, open data support the human right to information: everyone may enjoy the benefits of scientific progress and its applications (Article 15 of the International Covenant on Economic, Social and Cultural Rights). Sharing data, not only in small circles, but opening them up widely, can also help to re-distribute resources to countries in which research might not be well funded and researchers might not be able to collect specific forms of data on their own. Especially when data collection is bound to expensive technology, such as in neuroscience, worldwide use of datasets can help to tear down knowledge barriers and distribute knowledge more equitably. In a similar way, open data may also be valuable as an instrument for increasing public trust in the results of scientific research in the face of controversy.
Disciplinary level
The most direct impact of open data, however, might be observed in the development of a discipline. Open data thus affect both research and education. In terms of education, open datasets enable teaching based on real data and practising data analysis on real datasets. Realistic problem-based learning thus becomes possible, and students are better prepared than when working with artificial teaching datasets.
In terms of research, open data enable innovation and make findings more robust. Whereas innovation is driven by combining datasets and answering new comparative questions, the robustness of a finding can only be assessed by re-running the same analysis on different datasets. Moreover, replications become much easier when data are opened up. The more data-intensive the research, the more replicability depends on the availability and accessibility of the original datasets for re-analysis. Specifically, when using behavioural data based on task paradigms, slight variations in the implementation of the paradigms are the norm between research groups. If these data are openly available, researchers can investigate the robustness of results across task variations, ultimately informing task choice for future studies.
Individual level
Open data - as discussed above - come with several costs for individual researchers or research teams. We argue that, in general, a high intrinsic motivation is needed to publish open datasets. However, we also observe a growing impact of data publications: they count as separate publications rather than just an appendix to an article and might attract additional citations. Still, the value of data publications differs between disciplines, and they might be more attractive in the life sciences than in the social sciences. Apart from these metrics, which still count in the recruitment and promotion of academics, open datasets might also generate appreciation and help to build or strengthen an individual network. That, in turn, might lead to further collaboration and advance a career. Publishing open data might add to the positive reputation of a scholar, either because the discipline relies on shared datasets or because published results can be verified. Being perceived as an expert in open data publication might ultimately also lead to new career opportunities in research support and management.
Under current incentive structures, producing open data offers substantial benefits for societies and research fields as a whole, whereas the individual researchers doing the work can be at a disadvantage compared to their peers working on closed data sources. Planning for, acquiring and preparing datasets for public release is a serious effort that researchers need to commit to. In comparison, producing closed datasets is less time-consuming, as these datasets do not need to be made easily accessible and enriched with metadata. With hiring decisions depending on the number and visibility of publications, and funding agencies incentivising the acquisition of new data instead of encouraging the use of existing, extensive open datasets, researchers investing time in optimising their datasets for public release can thus lose ground.
Still, apart from the obvious benefits for scientific and societal progress, open research practice is increasingly considered a hallmark of "science done right". In this regard, individual researchers may benefit from the boosted visibility of their work. Greater transparency supports replicability, thus corroborating the trust that is indispensable in scientific discourse. New developments, such as citable data publications, stronger datafication in all scientific disciplines and policy-making towards transparency and replicability by disciplinary associations, national laws and funders, also act as incentives for opening up scientific work. Hence, researchers who wish to incorporate open science into their repertoire will find good conditions to do so, with ample help available online and within research organisations, as well as academic networks that facilitate building routines for future projects.
Our step-by-step guide summarises the experiences of practitioners across different disciplines and strives to serve as a resource for researchers seeking orientation on how to open up their data. Given the clear benefits for society and disciplines, but also for individual researchers, and with current developments in funding, policy-making and hiring decisions increasingly favouring open science approaches, opening up data should be viewed as the default – where ethically unobjectionable – with clear guidance for scholars entering a field.
We want to thank reviewers and editors for their valuable input. We also want to thank Gwen Schulte (DIPF) for proofreading this article.
The publication of this article was kindly supported by RIO. We would like to thank RIO and Wikimedia Deutschland for enabling this collection.
The authors were funded by the Open Science Fellows Programme by Wikimedia Deutschland, VolkswagenStiftung and Stifterverband in 2017/2018 (Caroline Fischer and Vanessa Teckentrup) and 2018/2019 (Simon Hirsbrunner).
The authors are listed in alphabetical order; all authors contributed equally to the manuscript.
GDPR Article 3(2) applies in two cases when a data processor processes personal data of individuals who are present in the EU, namely:
1. when the processor offers goods or services to the data subject (marketplace principle) or
2. when the data processing serves to observe the data subject's behaviour (observation principle). With this provision, the applicability of European data protection law is no longer linked to the establishment of the controller, but also depends on the location of the data subject in the EU.