Research Ideas and Outcomes : Project Report
Corresponding authors: Matthew Murray (thematthewmurray@gmail.com), Megan O'Donnell (mno@iastate.edu)
Received: 26 Nov 2019 | Published: 27 Dec 2019
© 2019 Matthew Murray, Megan O'Donnell, Mark Laufersweiler, John Novak, Betty Rozum, Santi Thompson
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Murray M, O'Donnell M, Laufersweiler MJ, Novak J, Rozum B, Thompson S (2019) A survey of the state of research data services in 35 U.S. academic libraries, or "Wow, what a sweeping question". Research Ideas and Outcomes 5: e48809. https://doi.org/10.3897/rio.5.e48809
This report shares the results of a Spring 2018 survey of 35 academic libraries in the United States in regard to the research data services (RDS) they offer. An executive summary presents key findings while the results section provides detailed information on the answers to specific survey questions related to data repositories, metadata, workshops, and policies.
research data, data management, research data services, academic libraries, survey
The Greater Western Library Alliance (GWLA) is a consortium of 39*
A breakdown of the GWLA member institutions at the time of the survey based on full-time student enrollment, land grant status, and research ranking. Enrollment numbers were provided by GWLA; land grant status and research rankings were obtained from 2017 IPEDS data.
*As The Claremont Colleges are not coded as a single entity in IPEDS they were excluded from the research ranking analysis.
Full-time enrollment | # of institutions | # land grant | R1 Ranking |
---|---|---|---|
7,000 - 9,999 | 3 | 0 | 2* |
10,000 - 19,999 | 7 | 4 | 7 |
20,000 - 29,999 | 15 | 8 | 15 |
30,000 - 39,999 | 8 | 1 | 8 |
40,000 - 99,999 | 4 | 2 | 4 |
100,000 and above | 1 | 0 | 1 |
GWLA activities include programs for scholarly communication, interlibrary loan, shared electronic resources, cooperative collection development, digital libraries, staff development and continuing education.*
If the GWLA membership were to work collectively in the future to, for example, share expertise, develop shared repositories, or advocate for minimal core competencies in data management and curation, they would need information about the data services and resources offered by each institution in the consortium. A subgroup of the Task Force was established in 2017 to discover and document the data practices and policies of each GWLA institution.
The subgroup agreed to develop and administer a survey to collect this information from member libraries. The survey was developed in Fall 2017 and sent to the 38 GWLA Library Deans/Directors in February 2018. The final report, included here, was submitted to the GWLA Deans/Directors in May 2019.
Key findings from the survey are presented here. These include library instruction, data repositories, digital preservation, metadata, policies and plans, campus research data services beyond the library, and library organization.
This section only asked about groups unaffiliated with the library. There is likely overlap between services offered by libraries and other groups on campus. However, the survey did not ask for this information.
The most common RDS (research data services) offered by groups on campus unaffiliated with the library are:
The least common RDS offered by groups on campus unaffiliated with the library:
The most common groups offering RDS on campus beyond the library are:
The survey and its results are organized into five blocs:
The full list of survey questions can be found in our OSF repository.
The survey was distributed in March and April of 2018. Invitations to participate were sent to the deans/library heads of the 38 GWLA institutions. Thirty-six of the thirty-eight institutions responded to the survey. However, two of the institutions only provided partial responses, one of which had to be discarded as only 12% of the survey was completed. The other partial response was complete enough to include in the majority of the analysis.
Analysis of the survey answers was only performed at the question level. This was done for two primary reasons. First, the survey was not designed for research or to preserve institutional or individual anonymity. The task was to find out what 38 institutions were doing and how they were doing it. Second, fact checking of major outliers (e.g. a report of 5000 library staff) indicated that the self-reported statistics were inaccurate. They had to be discarded, which prevented cross-analysis by library and parent institution demographics. The survey also did not yield sufficient data for any meaningful analysis of annual software and storage costs. Even with these restrictions, the authors believe that the data presented in this report are useful to other academic libraries that are exploring or building up research data services.
As the survey was not anonymous, the authors have decided that access to raw data, and to most coded data, should be restricted to GWLA institution members. Only answers to questions that contain public information have been shared. Links to these data sets can be found within the text.
This section provides a detailed summary of the survey answers and offers no interpretation of or conclusions about the results. Only positive and negative answers were counted; blank responses were discarded. All percentages are rounded to one decimal place. Where indicated, ‘n’ is equal to the number of institutions that answered the question and/or the number in the subset the question applied to. The results presented here can be used by GWLA members and other academic libraries as a baseline snapshot of the RDS offered by U.S. academic libraries at the time of the survey deployment.
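The counting rule described above (count only positive and negative answers, discard blanks, round to one decimal place) can be illustrated with a minimal Python sketch. The `summarize` helper and the example answers are hypothetical and are not the authors' analysis code; the raw survey responses are not public.

```python
from typing import Iterable, Tuple


def summarize(responses: Iterable[str]) -> Tuple[int, int, float]:
    """Return (yes_count, n, percent_yes) for one survey question.

    Only "yes" and "no" answers are counted; blank (or any other)
    responses are discarded, so they do not contribute to n.
    """
    answered = [r.strip().lower() for r in responses if r]
    answered = [r for r in answered if r in ("yes", "no")]
    n = len(answered)
    yes = sum(1 for r in answered if r == "yes")
    # Percentages are rounded to one decimal place, as in the report.
    percent = round(100 * yes / n, 1) if n else 0.0
    return yes, n, percent


# Hypothetical answers: 24 "yes", 11 "no", and 3 blanks that are ignored.
example = ["yes"] * 24 + ["no"] * 11 + [""] * 3
print(summarize(example))  # (24, 35, 68.6)
```

Because blanks are dropped before counting, n can differ from question to question, which is why each table in this report states its own n.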
This section asked for contact information and library and parent institution employment numbers. Results from the demographics section show that GWLA members vary widely in size and staffing levels. However, the data gathered from survey respondents in this section contained multiple major outliers (e.g. 5000 FTE library employees or 38 FTE university faculty) which prevented analysis by institution or library size. An overview of the GWLA member institutions is available in Table
This section asked respondents to provide information on their institution’s RDM teaching activities and services offered, as well as information on campus groups external to the library that offer RDM services.
Which Research Data Instruction services are offered at your institution? Table
The number and percentage of libraries that provide various types of research data instruction (n=34).
Instruction service | # institutions | % institutions |
---|---|---|
Course-related instruction | 34 | 100.0% |
Workshops | 34 | 100.0% |
Consultations | 33 | 97.1% |
Please supply titles or topics for the research data workshops your institution provides.
Answers from thirty-two institutions were analyzed (n=32). Workshop titles/topics were assigned up to 2 topic codes based on the information provided (Fig.
Workshop topic code frequencies. Up to two topic codes were applied to each workshop (n=160). Topic codes are defined as follows: Carpentry: a data or software Carpentry workshop; Cleaning: data cleaning and related techniques; Coding: how to work with data via command line or in a specific language; General: the basics of data management; GIS: geographic information system or spatial data/tools; Grants: the word "grants" or the name of a funding agency was explicitly mentioned in the workshop's title or description; HPC: high performance computing; Locate: focused on how to search and locate datasets; Metadata: metadata and data documentation; Mining: focused on text and data mining; Org: data organization; Other: misc. topics or unclassifiable; Plans: data management plans; Repository: addresses a specific repository, how to use a repository, or data repositories in general; Reproducibility: focused on research reproducibility; StorageSec: data storage and/or security tools and topics; Tool: focused on how to use tools related to data and data management (see Fig.
Which of the following research data metadata services does your library provide? (see Table
The number and percentage of libraries that provide different types of RDS metadata services (n=35).
Metadata services | # institutions | % institutions |
---|---|---|
Help researchers understand metadata and standards to describe their research data | 24 | 68.6% |
Help researchers identify appropriate metadata standards | 23 | 65.7% |
Help researchers apply metadata standards | 17 | 48.6% |
Other | 9 | 25.7% |
None | 5 | 14.3% |
Please identify which, if any, research data services are currently provided by other groups on your campus. Table
The number and percentage of institutions that have RDS services offered by groups external to the library (n=35).
Service | # institutions | % institutions |
---|---|---|
Statistical software support | 23 | 65.7% |
Data analysis support | 23 | 65.7% |
Active research data storage and backup solutions | 22 | 62.9% |
GIS and geospatial analysis | 21 | 60.0% |
Data visualization support | 20 | 57.1% |
Assistance locating data storage and backup solutions | 20 | 57.1% |
Dataset purchase, acquisitions, subscriptions | 14 | 40.0% |
Database design and management | 13 | 37.1% |
Data mining | 12 | 34.3% |
Templates or boilerplate for Data Management Plans (DMPs), such as DMPTool... | 12 | 34.3% |
Assistance with completing Data Management Plans | 11 | 31.4% |
Locating and using existing data (includes identifying and suggesting repositories) | 10 | 28.6% |
Topic or How-To Guides | 8 | 22.9% |
File organization and naming conventions | 7 | 20.0% |
Other (with free-text entry) | 3 | 8.6% |
Metadata assistance | 2 | 5.7% |
None (i.e. none of these services are offered) | 2 | 5.7% |
If Research Data Services are currently provided by other groups on your campus institution please identify the groups offering the services.
Groups were assigned a type code based on the names and descriptions provided (Fig.
Types of campus groups that provide RDS (n=103). Type codes are defined as follows: Admin: a campus administrative unit that does not fall into any other category; Center: research centers or institutes excluding HPC groups; Dept: Departments or colleges; HPC: High Performance Computing and research computing units including HPC run by IT units; Individuals: Individual staff, faculty, students, etc.; IT: Information Technology associated with the entire campus, colleges, or departments excluding HPC groups; Lab: Various labs on campus that do not fall into any other category; Research Office: Groups that oversee university research; Other: Groups that cannot be categorized under any other code.
Disciplinary categorization of campus groups that provide RDS (n=34). Discipline codes are defined as follows: Bio: Groups that specialize in biology, including health and medicine; Bio/Stats: Groups that specialize in biology and statistics; Data: no specific discipline but has the word 'data' in the name; GIS: Groups that specialize in spatial and GIS (Geographic Information Systems) data; Humanities: Groups specializing in humanities; Social/Stats: Groups that specialize in statistics and social science; SocialSci: Groups specializing in social science; Stats: Groups specializing in statistics.
Library and institutional research data policies Table
The number and percentage of libraries and institutions (university/college) that currently have policy(ies) that address research data (n=35).
Policy holder | # institutions | % institutions |
---|---|---|
Libraries that currently have policy(ies) that address research data | 7 | 20.0% |
Institutions that currently have policy(ies) that address research data | 20 | 57.1% |
Follow up questions asked for links to library and institutional policies if they were public. These are presented in Suppl. material
Does your library's or institution's strategic plan or mission address research data services? (n=34)
This section asks respondents questions related to the scope and governance of their institution’s data repository. Thirty institutions responded to all the questions in this bloc, while 5 institutions (14.3%) indicated that they did not have a repository that accepted data and skipped to the last question bloc (Library Organization).
Do you have a research data repository or a repository that accepts research data? (n=35)
Does your institution have a dedicated repository for research data or is the same platform used for both data and other scholarly materials? (n=30)
Do your institutional repository and data repository system share staff? (n=30*)
* This question was accidentally given to all survey respondents, regardless of whether they had a separate data repository.
Which of the following are addressed in the policies and/or information pages for the data repository? Table
The number and percentages of institutions that cover various use case topics in their data repository policies or information pages (n=30).
Policy/information topic | # institutions | % institutions |
---|---|---|
Who can deposit | 26 | 86.7% |
File types / file formats | 22 | 73.3% |
Sensitive data | 18 | 60.0% |
File size limits | 16 | 53.3% |
Volume or number of file limits | 9 | 30.0% |
Retention periods | 9 | 30.0% |
Required files or documentation | 8 | 26.7% |
Other criteria | 7 | 23.3% |
None (i.e. no policies) | 2 | 6.7% |
Which of the following are addressed in your data repository’s policies and/or information pages? Table
The number and percent of institutions which include legal documents in their data repository policies or information pages (n=30).
Legal topic | # institutions | % institutions |
---|---|---|
Terms of Deposit | 17 | 56.7% |
None | 7 | 23.3% |
Other | 6 | 20.0% |
End User Terms of Agreement | 1 | 6.0% |
Which "stages" of data does your data repository accept? Table
The number and percent of institutions that accept data in different lifecycle stages (n=30).
Data stage | # institutions | % institutions |
---|---|---|
"Live" / "active" / "raw" data | 9 | 30.0% |
"Final data" | 27 | 90.0% |
"Published data" | 27 | 90.0% |
Are embargo periods available for deposited data? (n=30)
Does your institution limit how long data may be embargoed? Table
The number and percent of institutions that allow various embargo lengths. Only institutions which have a repository that accepts data and embargo periods answered this question (n=24).
Embargo periods | # institutions | % institutions |
---|---|---|
0-6 months | 0 | 0.0% |
7-12 months | 1 | 4.2% |
13-24 months | 3 | 12.5% |
More than 24 months | 2 | 8.3% |
No limit | 18 | 75.0% |
What is your library's preservation strategy for the data in the repository? Table
The number and percent of institutions with each level of preservation practice (n=30).
None: We placed libraries in this category if they indicated that they had no strategy or that their strategy was under development.
Low: We placed libraries in this category if they indicated that they backed up data in some way but were taking no other active preservation measures to ensure the ongoing viability of the data. Example responses include “repository content is backed up and check sums are run nightly”.
High: We placed libraries in this category if they indicated that they placed their data into a preservation system such as the now defunct DPN or if they provided a description of processes to verify file and format integrity.
Level of Preservation | # institutions | % institutions |
---|---|---|
High - Strategies that exceed “backing up” content | 12 | 40.0% |
Low - Strategy of “backing up” content | 9 | 30.0% |
None | 8 | 26.7% |
N/A | 1 | 3.3% |
This section asked respondents to provide more technical details about their data repository’s operations, costs, and metadata capabilities. Only respondents that indicated that their institution has a repository that accepts data answered this bloc. For this section n=29 as one of the respondents did not complete this section of the survey.
What deposit model is used for research data? Table
The number and percentage of institutions that use different deposit models for their repositories. Some institutions selected more than one answer for this question (n=29).
Deposit model | # institutions | % institutions |
---|---|---|
Mediated | 19 | 65.5% |
Mediated only | 14 | 48.3% |
Unmediated | 13 | 44.8% |
Unmediated only | 8 | 27.6% |
Both | 5 | 17.2% |
Other | 3 | 10.3% |
For the purpose of this question, mediated was defined as "subject to review/changes and approval" and unmediated was defined as "no review or approval needed." The "other" choice was a free-text box. Two of the free-text answers indicated that there were plans to move to a mediated deposit model and one provided details on a hybrid model.
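Because some institutions selected more than one answer, the categories in the deposit model table overlap. The Python sketch below shows one way such multi-select answers could be tallied. It assumes that "Mediated only" means mediated was selected without unmediated, and that "Both" means both were selected; the report does not spell out these rules, and the four example responses are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def tally_deposit_models(responses: Iterable[Set[str]]) -> Counter:
    """Count overlapping and exclusive deposit-model categories."""
    counts: Counter = Counter()
    for selected in responses:
        mediated = "mediated" in selected
        unmediated = "unmediated" in selected
        # Overlapping totals: every institution that ticked the option.
        if mediated:
            counts["Mediated"] += 1
        if unmediated:
            counts["Unmediated"] += 1
        # Exclusive breakdown (assumed definitions, see lead-in).
        if mediated and unmediated:
            counts["Both"] += 1
        elif mediated:
            counts["Mediated only"] += 1
        elif unmediated:
            counts["Unmediated only"] += 1
        if "other" in selected:
            counts["Other"] += 1
    return counts


# Hypothetical multi-select answers from four institutions.
responses = [
    {"mediated"},
    {"mediated", "unmediated"},
    {"unmediated"},
    {"mediated", "other"},
]
counts = tally_deposit_models(responses)
n = len(responses)
for category, count in counts.items():
    print(f"{category}: {count} ({100 * count / n:.1f}%)")
```

Under these assumed definitions the reported counts are internally consistent: 14 (mediated only) + 5 (both) = 19 (mediated), and 8 (unmediated only) + 5 (both) = 13 (unmediated).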
Which licenses are available? Choose all that apply. Table
The number and percentage of institutions that reported using various licenses for data in their repositories (n=29).
Licenses | # institutions | % institutions |
---|---|---|
CC-0 | 23 | 79.3% |
CC-BY | 17 | 58.6% |
Other | 17 | 58.6% |
(c) All Rights Reserved | 8 | 27.6% |
GNU General Public License (GPL) | 4 | 13.8% |
MIT | 3 | 10.3% |
Apache | 2 | 6.9% |
Mozilla Public License 2.0 | 2 | 6.9% |
Does your repository assign persistent identifiers to datasets? (n=29)
What type of software does your data repository use? (n=29)
Software as a Service (SaaS), aka cloud-based software: 14 (48.3%)
Self-hosted, aka local installation: 15 (51.7%)
What is the name of the platform (software) that the data repository runs on? Table
The number and percentage of institutions that use specific software platforms for their data repositories. Twenty-nine institutions responded to this question; however, two institutions reported two different software platforms, so n=31 for the percentages calculated in this table.
Name of Software | # institutions | % institutions |
---|---|---|
DSpace | 10 | 32.3% |
Digital Commons | 9 | 29.0% |
Dataverse | 5 | 16.1% |
Samvera | 3 | 9.7% |
Islandora | 2 | 6.5% |
Figshare | 1 | 3.2% |
Home grown | 1 | 3.2% |
Two institutions reported using two platforms for their repositories. So, while the number of institutions responding to this question is 29, the number of repositories is 31.
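The choice of denominator matters here: the percentages are shares of the 31 reported repositories, not of the 29 responding institutions. A short Python sketch using the counts from the platform table reproduces them; the variable names are illustrative only.

```python
# Platform counts as reported in the table above.
platform_counts = {
    "DSpace": 10,
    "Digital Commons": 9,
    "Dataverse": 5,
    "Samvera": 3,
    "Islandora": 2,
    "Figshare": 1,
    "Home grown": 1,
}

responding_institutions = 29
# Two institutions reported two platforms each, so repositories outnumber
# institutions and the denominator is the number of repositories.
n_repositories = sum(platform_counts.values())

shares = {
    name: round(100 * count / n_repositories, 1)
    for name, count in platform_counts.items()
}

for name, share in shares.items():
    print(f"{name}: {platform_counts[name]} ({share}%)")
print("extra platforms reported:", n_repositories - responding_institutions)
```

Dividing by 29 instead would give shares that sum to more than 100%, since each of the two dual-platform institutions would be counted twice in the numerator.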
Please provide the names of metadata schema(s) used to describe data in the repository. Table
The number of institutions reporting the use of metadata schemas or vocabularies in their data repositories (n=29). Six of the 29 institutions that answered this question reported using more than one metadata schema. For this reason only frequencies are reported in Table
Metadata Schema | # institutions |
---|---|
Dublin Core | 22 |
DataCite | 8 |
DDI | 6 |
ISA-Tab Specifications | 6 |
ISO 639-1 | 6 |
ISO 3166-1 | 6 |
NCBI Taxonomy | 6 |
OBI Ontology | 6 |
Virtual Observatory | 6 |
EML | 4 |
Qualified Dublin Core | 4 |
Custom | 2 |
N/A | 2 |
Domain | 1 |
FGDC | 1 |
ISO 19115 | 1 |
METS | 1 |
PREMIS | 1 |
Do library staff create or assist researchers with the creation of record metadata? (n=29)
For the purpose of this question: Record metadata is metadata that is searchable and harvest-able.
Do library staff create or assist researchers with the creation of documentation metadata? (n=29)
For the purpose of this question: Documentation metadata is metadata that exists to help others comprehend and reuse the data, such as a readme file.
This section addresses library staffing for Research Data Services and the job titles of library staff who provide RDS.
How many staff at your library provide research data services? Table
The number of institutions reporting the number of library employees that provide RDS (n=34).
# of Employees | # of Institutions |
---|---|
0 | 1 |
0.5 | 1 |
1 | 4 |
2 | 7 |
3 | 8 |
4 | 3 |
4.5 | 1 |
5 | 3 |
6 | 3 |
7 | 1 |
8 | 1 |
15 | 1 |
Answers from thirty-four institutions were analyzed (n=34). The most frequent number of staff providing RDS was 3 (reported eight times), while the average was 3.59. Among institutions with staff that provide RDS, the highest number of staff reported was 15 and the lowest was 0.5.
Please provide an estimate of the combined Full Time Employee (FTE) dedicated to research data services in your library, accounting for the time of all staff involved. Table
The number of institutions reporting each amount of combined full time employees (FTE) dedicated to research data services (n=34).
# of FTE | # of Institutions |
---|---|
0.00 | 2 |
0.10 | 2 |
0.25 | 4 |
0.50 | 2 |
0.60 | 1 |
0.70 | 1 |
0.96 | 1 |
1.15 | 1 |
1.20 | 1 |
1.50 | 7 |
1.80 | 1 |
2.00 | 6 |
2.50 | 2 |
3.00 | 2 |
3.50 | 1 |
Answers from thirty-four institutions were analyzed (n=34). The most frequent amount of combined FTE dedicated to RDS was 1.50 (reported seven times), while the average was 1.34. Among institutions providing RDS, the highest amount of combined FTE reported was 3.5 and the lowest was 0.10.
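The summary statistics above can be recomputed directly from the FTE frequency table. The following Python sketch is an illustration only, not the authors' analysis code. It treats each table row as an (FTE value, number of institutions) pair and derives n, the weighted mean, the mode, and the range among institutions reporting a nonzero FTE.

```python
# (FTE value -> number of institutions), taken from the FTE table above.
fte_counts = {
    0.00: 2, 0.10: 2, 0.25: 4, 0.50: 2, 0.60: 1,
    0.70: 1, 0.96: 1, 1.15: 1, 1.20: 1, 1.50: 7,
    1.80: 1, 2.00: 6, 2.50: 2, 3.00: 2, 3.50: 1,
}

# Total number of institutions that answered the question.
n = sum(fte_counts.values())

# Weighted mean: each FTE value weighted by how many institutions reported it.
mean_fte = sum(value * count for value, count in fte_counts.items()) / n

# Mode: the FTE value reported by the most institutions.
mode_fte = max(fte_counts, key=fte_counts.get)

# Range among institutions that dedicate some FTE to RDS.
nonzero = [value for value in fte_counts if value > 0]
lowest, highest = min(nonzero), max(nonzero)

print(f"n={n}, mean={mean_fte:.2f}, mode={mode_fte:.2f}")
print(f"lowest={lowest:.2f}, highest={highest:.2f}")
```

The mean here includes the two institutions that reported 0.00 FTE; excluding them would raise the average.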
Titles, positions, and departments of RDS staff.
This data was analyzed two different ways. Table
The frequency of words found in job titles from 33 libraries for staff who provide RDS (n=104). The words "and", "of", and "library" were excluded from the analysis. Only words that appeared five or more times are included in this table.
Title word | Frequency |
---|---|
librarian | 56 |
data | 30 |
services | 24 |
digital | 17 |
research | 15 |
science | 11 |
specialist | 10 |
coordinator | 10 |
scholarly | 8 |
head | 8 |
sciences | 7 |
engineering | 7 |
subject | 6 |
management | 6 |
librarians | 6 |
metadata | 5 |
liaison | 5 |
gis | 5 |
director | 5 |
Coded analysis of the job titles of library staff who provide RDS (n=104). Topic codes are defined as follows: Data = included "data" in title; Digital Collections = included "digital collections" in title; Digital Other = "digital" in title but did not include "collections", "research", or "scholarship"; Digital Research = included "digital research" in title; Digital Scholarship = included "digital scholarship" in title; Engineering = included "engineering" in title; GIS = included "GIS" or "geospatial" in title; Government Documents = included "government" in title; IR = included "IR" or name of repository in title; IT = included "IT" in title; Medical/Health Sciences = included "medical" or "health" in title; Metadata = included "metadata" in title; Other = miscellaneous titles that did not fit into other categories; Research = included "research" in title; Scholarly Communications = included "scholarly communication" or "scholarly publishing" in title; Science = included "science" but excluded "health" or "social" in title; Subject/Liaison Librarian = included "subject", "liaison", or a discipline (e.g. "social science") in title; Visualization = included "visualization" in title.
Code | Frequency |
---|---|
Data | 29 |
Subject/Liaison Librarian | 26 |
Research | 12 |
Science | 12 |
Other | 9 |
Scholarly Communications | 9 |
Digital Other | 8 |
Engineering | 7 |
GIS | 7 |
Metadata | 5 |
Digital Scholarship | 4 |
Government Documents | 4 |
Digital Collections | 3 |
IR | 3 |
Medical/Health Sciences | 3 |
Digital Research | 2 |
IT | 2 |
Visualization | 2 |
Is there a Library Committee or Group dedicated to research data services? (n=34)
The authors believe that this report provides a baseline that other institutions can use to compare and measure the research data services they provide at their institutions. Since the survey was limited to the GWLA Libraries, future work could employ similar surveys to capture research data services in other academic libraries in order to gain a fuller understanding of the landscape.
This report, as a snapshot in time, could also be used as a marker for the development of research data services in academic libraries in North America. The GWLA member libraries could be surveyed again in a few years to determine changes in practices regarding research data management in this population of academic libraries.
This report and survey tool have limitations that should be corrected in future work. Some questions were unintentionally ambiguously worded and resulted in data that was difficult or impossible to analyze. Improved testing of the survey by a wider audience and soliciting the services of a survey designer would improve question response and corresponding data quality. That being said, we believe it is important to understand the current state of research data services in order to monitor activity and measure future progress.
Active storage: Fast access to storage space, usually used during the active portion of the data life cycle. Example devices/services: external drive, NAS, cloud storage such as Dropbox or OneDrive.
Amazon S3: Amazon Simple Storage Service. One of the services provided by Amazon Web Services (AWS). (https://aws.amazon.com/s3)
Arkivum: Cloud service and software that offers long term data management and digital preservation. (https://arkivum.com)
AWS: Amazon Web Services. Cloud computing, storage, and other cyber-infrastructure solutions offered by Amazon.
Digital Commons: Cloud hosted solution for publishing, management and showcasing of researchers' scholarly output. (https://www.bepress.com/products/digital-commons)
Digital Preservation Network (DPN): A now defunct service for preserving research outcomes.
DSpace: An open-source software system for creating and hosting an institutional digital repository. (https://duraspace.org/dspace)
Dublin Core: A metadata standard used by libraries consisting of a small set of vocabulary terms that can be used to describe digital and physical resources.
DuraCloud: An open source, hosted service that makes it easy to control where and how your organization preserves content in the cloud. (https://duraspace.org/duracloud/)
FTE: Full Time Equivalent. A unit that indicates the workload of an employee and used in this survey to measure a worker's involvement in RDS activities. An FTE of 1.0 is equivalent to a full-time worker (i.e. 40 hr/week). Paraphrased from Wikipedia.
GWLA: Greater Western Library Alliance. (https://www.gwla.org)
Institution: GWLA member. (https://www.gwla.org/about-gwla/members)
LOCKSS: "Lots of Copies Keep Stuff Safe" Program hosted by the Stanford Libraries promoting best practices for digital preservation. (https://www.lockss.org)
Passive storage: Slow access to space which may require long waits for reading and writing. Used mainly for end of project storage of digital content or for recovery of catastrophic data loss. Example devices/services: tape media, AWS Glacier.
RDM: Research Data Management.
RDS: Research Data Services.
Rosetta: End-to-end digital asset management and preservation solution for libraries, archives, museums and other institutions from ExLibris (https://www.exlibrisgroup.com/products/rosetta-digital-asset-management-and-preservation).
The authors of this report would like to thank everyone who answered the survey on behalf of their institution and the Greater Western Library Alliance for funding the open access publication of this report. Our three peer reviewers, Dr. Kristin Briney, Amy Koshoffer, and Felicity Tayler, supplied feedback that greatly improved the report and also have our thanks.
Lastly, a special thank you to the survey respondent who started a free-text response with "wow, what a sweeping question." You know who you are and you were not wrong.
Survey respondents were informed that results of the survey would be published and made publicly available. This project and survey did not meet the definition of human subject research. As such, it was not subject to Institutional Review Board oversight.
Links to library and university/college research data management policies.
At the time of the survey there were 38 member institutions.
Greater Western Library Alliance, https://www.gwla.org/ and https://www.gwla.org/about-gwla