Research Ideas and Outcomes
Workshop Report
Corresponding authors: Elizabeth Adams (elizabeth.adams@uky.edu), Natalie Raia (nraia@arizona.edu), Isaac Wink (isaac.wink@uky.edu), Doug Curl (doug@uky.edu)
Received: 09 Apr 2025 | Published: 28 Apr 2025
© 2025 Elizabeth Adams, Natalie Raia, Saebyul Choe, Isaac Wink, Doug Curl
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Adams E, Raia N, Choe S, Wink I, Curl D (2025) Engaging state geological surveys in implementing data stewardship practices: a pilot workshop at the Kentucky Geological Survey. Research Ideas and Outcomes 11: e155393. https://doi.org/10.3897/rio.11.e155393
State geological surveys create and steward valuable long-term earth and environmental science datasets and often serve as physical archives for material samples. Often funded directly through state legislatures, these agencies face varying degrees of support, nuanced regulations and public-serving missions that direct their research and day-to-day operations. Scientists at state geological surveys produce a range of outputs: datasets, which may be stored internally, in an institutional repository or disseminated to broader community repositories, and publications, which may include both grey and peer-reviewed literature. This paper discusses a workshop held at the Kentucky Geological Survey to introduce researchers to data management, sharing and stewardship practices and to better understand obstacles to implementing such practices.
Keywords: research data management, data stewardship, data lifecycle, geoscience, geological survey
A one-day workshop was held on 10 May 2024, at the Kentucky Geological Survey facility on the University of Kentucky campus in Lexington, Kentucky.
Funding agencies, academic publishers and federal, state and institutional policies are rapidly shaping the open research data landscape. For over a century, academic research data have typically been included as supplementary material in journal publications (and sometimes are not included at all). These data may lack proper formatting and metadata needed to be reusable. With the research community shifting towards making data more findable, accessible, interoperable and reusable (FAIR;
State geological surveys produce and maintain long-term geological and environmental datasets and data products. Many surveys also maintain physical sample collections and face backlogs of samples and associated analogue and digital legacy data, which may cover 100 years or more of research and discovery. State geological surveys are highly variable in terms of their missions and access to funding, limiting their ability to universally adopt open science practices. Geological survey researchers may publish white papers, other grey literature and data papers in addition to peer-reviewed publications and they may face state-mandated data archival policies that are in tension with funder and scholarly publisher guidelines and community best practices for data. Migrating data across decades of changing standards, technologies and ideologies is made difficult not only by the volume of data, but also by missing metadata, varied file formats and data structures, mismatched metadata fields across similar datasets and a lack of citation guidelines. Resources, programmatic solutions and staff knowledge are often lacking, creating a large backlog of data that must be modernised to fully realise the public’s investment in those data.
Community organisations such as Earth Science Information Partners (ESIP), the International Science Council’s Committee on Data (CODATA) and the Research Data Alliance (RDA) create spaces for the development of technical standards and best practices that shape open-science practices worldwide. Nevertheless, the direct involvement of geological survey employees in these groups may fall outside the scope of their formal job duties. As a result, knowledge transfer to surveys is limited and perspectives from these unique research entities are underrepresented in national and international data working groups.
To facilitate this knowledge transfer and address this “missing middle” gap, a workshop was developed for researchers at the Kentucky Geological Survey (KGS) to assess current understandings of the data-sharing landscape, hear about current practices and discuss challenges to the adoption of data management best practices. This workshop introduced the framework for FAIR and open science policies, based on federal policies such as the Office of Science and Technology Policy’s Nelson Memo (which directs federal agencies to ensure data are made publicly available within two years of data collection;
KGS was established in 1854 as a part of Kentucky’s state government before joining the University of Kentucky in 1932. Today, KGS is a research centre under the University of Kentucky’s Office of the Vice President for Research. The survey is responsible for researching Kentucky's geology and natural resources and disseminating those results to its many stakeholders. KGS's mission and purpose are similar to those of many state surveys, but its ability and strategy to execute its mission are unique. KGS has a robust in-house digital infrastructure that allows its data, samples and publications to be offered free to the public through its website, digital map layers, web map services and search query applications. This means KGS has the autonomy and knowledge to enforce data management best practices, a rarity amongst many surveys and other state-level organisations. Internal standards have been designed to address the data needs of each of the organisation’s five sections (geologic hazards, water resources, energy and minerals, geologic mapping and geoscience information management). Researchers are responsible for managing their project data under the guidance of the geoscience information management section, an internal digital data working group and the director. They are also expected to submit their research publications and related data to UKnowledge, the institutional repository at the University of Kentucky, to preserve access and register their research products with digital object identifiers (DOIs) to promote findability.
KGS is also the state-mandated repository of oil and gas data (along with related documentation and well-bore samples) and groundwater data. The KGS Earth Analysis Research Library (EARL) houses many of the physical geologic collections and data that fulfil this mandate. Beyond the state-mandated collections, EARL holdings include rock cores, limestone samples, coal samples, soil cores and T-probe soil samples from across Kentucky.
Given the time and financial scope of the project, as well as the existence of a nascent KGS data working group engaged in open science topics, 20 KGS researchers were recruited for the workshop. Two informal pilot activities (a preliminary survey and two focus groups) were conducted to assist in scoping and developing the primary workshop content and materials (Suppl. material
The workshop was held on 10 May 2024, on the University of Kentucky campus. All KGS research staff with active data projects that met the criteria for inclusion in external repositories (i.e. adequate metadata, non-proprietary file formats and non-confidential data) were invited. Seven researchers attended the one-day workshop. With respect to career stage, one participant identified as “exploring”, five identified as “establishing” and one identified as “maintaining”. Participants self-reported active research in the following topics: geochronology, geographic information systems (GIS), geomorphology and surface processes, hazards, hydrogeology, remote sensing, seismology and geophysics, tectonics and structure and 3D geological modelling. In accordance with the project protocol reviewed and approved by the University of Kentucky IRB (IRB #93757), no identifying information from participants will be released.
The agenda for the one-day workshop was divided into morning and afternoon segments (Table
Schedule for the Kentucky Geological Survey's data management workshop, held on 10 May 2024, at the University of Kentucky.
9:00 a.m.–10:00 a.m. | Open Science Overview
10:10 a.m.–10:30 a.m. | Break
10:30 a.m.–11:30 a.m. | Making Data and Software FAIR
11:30 a.m.–1:00 p.m. | Lunch/Listening Session
1:00 p.m.–2:00 p.m. | Data Publication Exercise
2:00 p.m.–2:30 p.m. | Debrief and Next Steps for KGS
2:30 p.m. | Adjourn
In the afternoon, a hands-on data publication exercise was conducted in a computer lab at the University of Kentucky’s William T. Young Library, where KGS researchers explored repositories relevant to their own data. Since the participants work in various geo-disciplines, this exercise instructed them to identify the most suitable repositories for their data as well as the submission requirements. Throughout the session, authors N. Raia, S. Choe and E. Adams were available for support and to answer questions. Post-session, the group reconvened to discuss the challenges and successes of the exercise and what the researchers envisioned would help them and their colleagues in data publication endeavours.
Six weeks after the workshop, participants were emailed an eleven-question post-workshop survey. The questions focused on identifying behavioural changes resulting from the workshop and what portions of the workshop were most and least effective. Specific focus was given to the researchers’ confidence level when interacting with external domain repositories and the different steps of the data lifecycle.
Several key themes emerged over the course of Q&A sessions and the open lunch discussion. These are summarised below with key quotes from participants, which have been lightly edited for clarity where necessary.
Participants gained a new awareness of federal open-science policies, such as the Nelson Memo (
By the end of the workshop, participants better grasped the importance of searching for and engaging with repositories for their data early in the research data lifecycle. The afternoon data publication exercise was key to instilling this lesson (described in the Benefits of Group Data Publication Exercise section). During the discussion, one participant positively described their experience and how the exercise of submitting data to a disciplinary repository revealed missing metadata in their existing datasets, which spurred change in the data collection process: “We're not collecting enough attributes of the data to make it usable. Unless you try to publish with it, you don't realize that you're missing something, right? You can go back and update the database if you can, or at least start collecting the new data in a different way”.
The discussion also revealed areas where misunderstandings of fundamental concepts related to data publishing persist. For instance, while it is best practice to submit data to a single domain or generalist repository, a few participants believed they could submit the same dataset to multiple repositories, with one asking, “My question is, is there any reason why you would suggest to people not to post data, like the same dataset, in multiple repositories”? To meet publisher requirements that datasets be deposited in an appropriate repository and have a DOI, some researchers may submit data to generalist repositories (which typically assign DOIs quickly and have limited to no curatorial review processes) during submission of a related scholarly journal publication. Some researchers may do this as a quick solution, intending to later re-submit the same data to domain repositories when they have more time to wait for expert curatorial review and ingestion of the data. This practice is ill-advised, as it leads to duplicate DOIs for the same data, hinders accurate data provenance and linking (which skews usage and impact metrics) and can create research ethics concerns.
Regarding the usage of persistent identifiers for physical geological samples, participants were unclear about whether to create and use International Generic Sample Numbers (IGSNs) for samples destroyed during analysis and what happens to IGSNs once samples are exhausted. One participant asked, “[If you] have [the samples] set up in the repository and then there's none left… Like, the rest of it is destroyed. It [the IGSN] wouldn't go away”? The workshop facilitators explained that IGSNs are persistent beyond the physical lifespan of a sample and are designed to facilitate permanent linkage between the sample metadata and derived research products, regardless of whether the sample continues to physically exist.
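The facilitators' point about IGSN persistence can be illustrated with a minimal sketch (a hypothetical data model, not the IGSN registry's actual API or schema): when a sample is exhausted, its identifier record is never deleted; only its availability metadata changes, so links to derived research products remain resolvable.

```python
# Minimal sketch (hypothetical data model, not the IGSN registry's actual
# API) of why an IGSN persists after a sample is exhausted: the identifier
# record is never deleted; only its availability metadata changes.
samples = {
    "IGSN:XX123456": {                      # illustrative IGSN, not a real one
        "material": "limestone core",
        "available": True,
        "derived_products": ["10.5072/example-geochem-dataset"],
    }
}

def mark_exhausted(igsn: str) -> None:
    """Record that the physical sample is gone; keep the metadata record."""
    samples[igsn]["available"] = False

mark_exhausted("IGSN:XX123456")
# The record and its linkage to derived products survive the sample itself.
print(samples["IGSN:XX123456"]["derived_products"])
```

The design mirrors the persistent-identifier principle discussed in the workshop: destruction of the physical object is a metadata event, not a deletion event.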
Concerns about misrepresenting sample availability and the proliferation of research metrics were discussed, with participants reaching an overall understanding of the role persistent identifiers play in ensuring the citability of work and credit for all contributors. “It's gonna be interesting with… so many metrics.… There's already a lot of, like, author metrics, and it just seems like another stick to beat him with, in some ways. It's like, here's a bunch more metrics. How are you doing in this one? How are you doing in that one? …Somebody's gonna roll all this together at some point in some new index. And that's gonna be weird and interesting too. It happens. Yeah. We have the H-index, we have the I-10, you know”.
KGS researchers expressed mixed feelings about UKnowledge as a host for their data. Their concern stemmed from: (1) a desire to abide by KGS policies requiring them to use their local KGS server to host data and (2) a limited understanding of how UKnowledge and KGS are linked. Researchers believed they were being good data stewards by trusting internal policies and data infrastructure that follow FAIR principles. This limited exposure to larger university services, however, hinders the discoverability of their research products. Researchers were also concerned that their work was less accessible or visible through UKnowledge and would not be cited correctly, resulting in a lower H-index. Similar sentiments were expressed regarding data repositories and journals, in which original datasets and publications may not be properly attributed due to the lack of consistent policies and guidelines, the inability to cross-check legacy data and the compilation of data published under a single DOI.
Citation metrics were of great concern for the researchers. The creation of grey literature or datasets without mechanisms to ensure that the resulting citations contribute to their professional metrics made researchers reluctant to host their data in external repositories. One participant described the challenges of receiving credit for data and publications included in highly-cited review papers and noted similar concerns about credit adequately propagating to datasets: “This is why review papers get lots of citations. Write a review paper and then somebody comes and cites the review papers instead of digging down through everything. It's the same thing [that] happen[s] with datasets… How does credit get transferred”?
Participants expressed frustration with a lack of community and disciplinary norms and protocols related to publication authorship and dataset contributors. The lack of consistent policies and education on how data should be credited surfaced in two scenarios: co-author attribution and linking manuscripts to data. A participant noted, “People don't write all their co-authors either. That's the other thing. Because they believe that they're the ones submitting the data. They don't need to put everyone else on there. Authors are not linked either. So, if we have the authors and link them, that's great, but if you don't, then it's like there's nothing much we can do because we don't know... Sometimes people send the submissions. We don't know if your manuscript is [published], this is in publication, it's in review. But that doesn't mean anything to us because we can't actually access your manuscript”. In some cases, it may be appropriate for data submissions to include all authors who are listed in a related manuscript. All authors should be linked via ORCID, but they often are not, as the submitter may not have their co-authors’ ORCID iDs or may opt to exclude them. Additionally, many publishers request that authors submit data prior to manuscript publication to ensure that supplementary data are archived in a repository. Often, data submissions are never updated with the publication DOIs and the DOIs of the datasets are not formally cited or referenced within the manuscript.
Concerns about citations are not unique to earth-science researchers, and data repositories are still in the process of implementing scalable citation workflows. A promising infrastructure component is the DataCite Metadata Schema (which UKnowledge uses for its records;
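As an illustration of how such linkages can be expressed (a hypothetical sketch using DataCite Metadata Schema concepts, not UKnowledge's or any repository's actual record; all identifiers below are placeholders), a DataCite-style metadata entry uses relatedIdentifier fields to connect a dataset DOI to its associated article and nameIdentifier fields to credit contributors via ORCID iDs:

```python
# Hypothetical DataCite-style metadata record (DataCite Metadata Schema
# concepts) linking a dataset DOI to a related journal article and crediting
# a contributor via an ORCID iD. All identifiers are illustrative placeholders.
dataset_record = {
    "doi": "10.5072/example-dataset",        # 10.5072 is a reserved test prefix
    "creators": [
        {
            "name": "Doe, Jane",
            "nameIdentifiers": [{
                "nameIdentifier": "https://orcid.org/0000-0000-0000-0000",
                "nameIdentifierScheme": "ORCID",
            }],
        },
    ],
    "relatedIdentifiers": [
        {
            # Points readers and citation-tracking services from the
            # dataset to the article it supplements.
            "relatedIdentifier": "10.5072/example-article",
            "relatedIdentifierType": "DOI",
            "relationType": "IsSupplementTo",
        },
    ],
}

def linked_articles(record: dict) -> list:
    """Return DOIs of articles this dataset supplements."""
    return [
        r["relatedIdentifier"]
        for r in record.get("relatedIdentifiers", [])
        if r.get("relationType") == "IsSupplementTo"
    ]

print(linked_articles(dataset_record))
```

When both the dataset record and the article reference each other this way, citation-tracking services can propagate credit in both directions, which speaks directly to the participants' concerns about credit transfer.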
Common themes in data publication discussions were: (1) inconsistent metadata requirements, (2) policy and (3) reusability. Researchers at KGS work in various earth-science disciplines. While many are able to follow FAIR and open research guidelines, there are limitations. Those who work in carbonate (karst) caves, for example, cannot disclose cave locations under National Park Service and other federal regulations. Furthermore, state agencies use specific data formats that are not compatible with one another, leading to reduced interoperability. For example, the groundwater data in the Groundwater Data Repository, which is maintained by KGS, are sourced from the Kentucky Division of Water. To maintain crosswalks between these departmental databases, KGS has limited the types of data formats it supports, which can sometimes reduce the usability of the data for the public. One participant remarked, “It'll take you three months to just organize it and make it usable…”. By contrast, other data resources that are not bound by these regulations can repackage data to make them more user-friendly.
The group data publication exercise marked a shift in participants’ confidence in their ability to understand the data publication and data repository landscape. Participants found it challenging to navigate the domain repository landscape and expressed frustration when repositories required creation of accounts in order to access information about data submission and retrieval.
Prior to the workshop, participants noted that repositories have different requirements and guidelines for data submissions based on data type and discipline. Some require membership or only accept data from participating countries or institutions: “You have to be part of their membership, or you have to be in that country in order to publish and get a whole bunch of data, is that right”? Restrictions narrow the list of accessible repositories, which are further culled based on the types of data they accept: “… A lot of people come and deposit whatever data they have… and you have to find the one [repository] that's appropriate for you”. In addition, repository curators do not specialise in all data types and disciplines, making it more challenging for researchers to find the appropriate repositories for their data: “If there isn't [another domain-specific data repository], then I don't know how to help you, because it's not what you do”.
Some participants expressed hesitancy about the credibility of repositories. One participant, for instance, expressed frustration that guidelines and requirements listed on repository webpages do not necessarily align with funder requirements for data publishing: “I'm always just a little bit suspicious about, you know, is this actually reflecting current policies for any of these smaller repositories? And I've even seen this with some of the domain-specific repositories that the NIH recommends. They provide sort of a similar table where they provide information there. And then I go and I actually read the policies on that repository's website. And in my view, they don't actually reflect what the NIH has put out there. They'll say they take open submissions, but then I don't see any open submission portal anywhere on the repository”. While the National Institutes of Health (NIH) is focused on biology and health, it is a major national funding agency that parallels geoscience agencies. Participants also found that, in general, representatives for many repositories are difficult to contact and many never respond: “I try contacting them, and I hear no response from them”. Facing tight budgets and limited staff, repositories may be hard-pressed to keep up with user support amidst a deluge of to-be-curated data submissions, driven by new funder and publisher requirements and a scientific community undertrained in data management practices.
The data publication exercise helped workshop participants consider revising the timing of their data workflows and, more specifically, engaging with repositories earlier in the research data lifecycle. Previously, participants were more likely to reach out to repositories and submit data when their manuscript was under review, as required by many journals. One participant noted, “I am more forward-thinking of how and where my data products will be shared”. Another participant stated, “I really do think that there is no substitute for just establishing that contact early on”. They also noted, “Just make sure that there's a human on the other end of that who can say, yes, this data looks good”.
Post-workshop, participants were more inclined to reach out to repositories early in the publication process, to speak with UKnowledge librarians and to seek advice from other KGS staff on making their data public.
Prior to the workshop, KGS researchers had some basic knowledge of UKnowledge and the general services of research librarians. Presentations by K. Bachman-Johnson and I. Wink offered more detail on the services that the University of Kentucky offers to its researchers, including publishing datasets and creating stronger data management plans (DMPs) for research. In fact, following the workshop, several participants met one-on-one with I. Wink to discuss avenues for publishing their datasets, establishing ongoing relationships that persist at the time of this publication.
In future iterations of this workshop or for those considering implementing similar workshops within their organisations, we emphasise two essential lessons learned: (1) in institutions where libraries exist, engage with research librarians at the beginning of the planning process and (2) consider whether IRB approval is needed.
Research librarians are experts in information retrieval, organisation and management, and they specialise in particular academic fields, offering training on topics including scholarly publishing, citing and archiving research outputs and digital tools and resources. Research librarians likely already have training materials tailored to organisational needs and resources and these types of workshops can foster closer relationships between researchers and research librarians to enable ongoing conversations, learning and better research dissemination outcomes. Research librarians are also uniquely poised to communicate researcher use-case needs and challenges to information and library science communities to drive further research and innovation in knowledge infrastructure development.
IRB approval should be considered if any evaluation work involves participant feedback and if the results are expected to be shared with the public. The IRB approval process protects both the researcher and participants from risks associated with human subjects research, including the misuse of personal data. Workshop facilitators will need to work closely with their institutions to determine what guidelines apply to their research designs and what kinds of applications they should submit. Several months of lead time should be budgeted to work through this process before soliciting information from potential participants.
All slides used in this workshop are available in
We would like to thank all the participants in the workshop and pre-workshop pilot activities for their time and for contributing valuable insights to this discussion. This workshop was made possible through the support of an Earth Science Information Partners (ESIP) 2023 FUNding Friday Award.
Information about the research team's scoping activities before designing the primary workshop content and materials.