Research Ideas and Outcomes: Review Article
Corresponding author: Will James Gregg (willjgregg@gmail.com)
Received: 31 Jul 2019 | Published: 05 Aug 2019
© 2019 Will Gregg, Christopher Erdmann, Laura Paglione, Juliane Schneider, Clare Dean
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Gregg WJ, Erdmann C, Paglione LAD, Schneider J, Dean C (2019) A literature review of scholarly communications metadata. Research Ideas and Outcomes 5: e38698. https://doi.org/10.3897/rio.5.e38698
The purpose of this literature review is to identify the challenges, opportunities, and gaps in knowledge with regard to the use of metadata in scholarly communications. This paper compiles and interprets literature in sections based on the professional groups, or stakeholders, within scholarly communications metadata: researchers, funders, publishers, librarians, service providers, and data curators. It then ends with a 'bird's eye view' of the metadata supply chain which presents the network of relationships and interdependencies between stakeholders. This paper seeks to lay the groundwork for new approaches to present problems in scholarly communications metadata.
Scholarly communication, metadata, metadata supply chain, research data management, Metadata 2020
The network of those who prepare and publish scholarly research is made up of individuals who work in many different professions, disciplines, and institutions worldwide. They share a common interest, however, in that “All of those who are involved with the publication of scholarly works have the same end goal: to conduct, facilitate, and/or communicate research” (
After laying out definitions for central terms, this review organizes the literature into different categories based on professional group: researchers, funders, publishers, librarians, systems and service providers, and data curators. It then presents a subset of the literature, which attempts to capture the bigger picture: the network of relationships and interdependencies among all of these groups. The final section discusses new standards, guidelines, and initiatives for improving metadata and offers preliminary suggestions for solving current problems with the scholarly communications metadata supply chain.
The idea for this literature review originates with Metadata 2020’s Researcher Communications project. A number of the members of this project identified a need for a comprehensive review of the challenges, opportunities, and gaps with metadata in scholarly communications with the aim that it would foster further conversations among the stakeholders involved. The review itself was researched and written from August of 2018 to April of 2019. Initial structure and guidance were provided by a sub-group of Metadata 2020 participants, after which the author performed literature searches, summarized articles, and categorized them based on their research topic, output type, and relevance to particular stakeholder groups. The author sought to limit the review to resources published within the last 10 years and attempted to include literature which was written by or about each stakeholder group. In meeting this goal, the review sought resources from outside traditional venues for scholarly publishing; it includes a number of white papers, blog posts, newsletters, videos, and other grey literature (literature that does not pass through traditional academic publishing channels). Throughout the process, the author regularly collaborated with participants of Metadata 2020 in seeking out literature and structuring the content of the review. In March of 2019, a draft of the review was edited by a small number of Metadata 2020 participants. In April, the review was published and opened to comment via RIO journal.
This review assumes at least a basic knowledge of how metadata records are created, shared, and used. Nevertheless, two core terms are defined below to frame the discussion.
Scholarly communication is the process by which research is conducted, transformed into content, and distributed to a wider audience. The majority of the resources featured in this literature review are concerned with the physical and life sciences. The implications of this review, however, can be extended in many cases to the social sciences and humanities which have the same need to describe and share scholarly works. The majority of resources also treat the published journal article as the primary form of scholarly output, though emerging literature increasingly focuses on research data and the research process as opposed to finalized articles.
Metadata in this context is the information that accompanies the various stages and outputs of research. Common to most scholarly research are metadata elements such as author, date, title, subject, language, and standard identifier. In the case of research data, metadata describes specialized aspects such as the geographic location where the data was collected, the name and identifier of the research funder, the institutional affiliation of the researchers, contributors such as editors and data curators, or the number of the grant award funding the research (
Publishers, service providers, researchers, funders, librarians, and data curators are the stakeholders of metadata in scholarly communication. The relationships between these stakeholder groups have serious implications for scholarly communication metadata. For example, funders have requirements for researchers, stipulating that the source of the funding must be stated in their research outputs. Publishers ask researchers to supply metadata about themselves and the subject of their projects during the submission, editorial, and publication process. Vendors, who package content from publishers and re-sell it to libraries and others, use publisher metadata to keep track of their products, including ebooks and journal articles. Librarians then use and enhance metadata supplied by publishers and vendors to make their articles discoverable.
The “supply chain” of metadata is best elaborated in the literature by studies that examine stakeholders’ relationships with one another.
Examinations of the researcher-funder relationship are also present in the literature, most frequently revolving around discussions of research data management (RDM) and sharing. Some propose best practices for RDM and data sharing for publishers, funders, and researchers (
Studies of the relationship between librarians and researchers are also present but somewhat less prevalent. Two authors who examine this relationship in depth,
Yet other research has focused on metadata problems and opportunities for single groups, such as the call of
The articles referenced above and many others share a method of grouping stakeholders by professional or institutional identity.
Academic publishers distribute scholarly research by making journal articles, books, theses, and data available online or in print. Though publisher metadata is sometimes criticized, metadata’s return on investment is increasingly recognized and publishers are projected to invest more in quality metadata. Publishers are also at the forefront of adopting emerging technologies for automatic generation of metadata, including full-text semantic analysis. Other stakeholders stand to benefit from hearing from publishers directly about their day-to-day practices.
Of all the stakeholder groups, calls to improve metadata quality fall perhaps most frequently on publishers. Publishers could “clean up the information that they provide to vendors about their items, which would help vendors create higher quality records” (
On the other hand, there is an observable trend of publishers giving metadata more serious consideration. According to a 2017 survey of industry leaders, 90% of all publishers are planning to invest in metadata over the next three years (
This shift in focus is due in part to the incentives that have been associated with good metadata. Two white papers by Nielsen, a publisher with a presence in the US and UK, found increased sales to be associated with quality metadata. Specifically, a quantitative analysis of the sales of Nielsen’s top 100,000 book titles in the period between July 2015 and June 2016 showed higher sales for titles which included basic metadata, a set of 9 elements from the Book Industry Communication standard: ISBN, title, format/binding, publication date, Book Industry Standards and Communications (BISAC) subject code, retail price, sales rates, cover image, and contributor. Titles with these elements had sales 75% higher than those with incomplete metadata. The study also found that supplemental descriptive metadata, such as author biography, reviews, and title description, gives a boost to sales. On the whole, each element of descriptive metadata boosts sales, with an increase of 72% for titles having 3 descriptive elements over those having none. (
Moreover, investments in developing technologies for metadata promise substantial rewards. Some publishers are exploring AI-generated keywords and abstracts as well as chapter-level metadata which may lead to more granular search functionality (
New technologies pose interesting opportunities for metadata, but their effectiveness is not yet explored in the literature, possibly because of their proprietary nature. The extent of the use of semantic analysis within publishing as a whole has not yet been surveyed. Equally lacking is a deeper look at the roles of traditional catalogers and metadata experts in publishing companies, a study of which could provide more clarity regarding the differences between various publishers in the way that metadata is created and used. Again, proprietary concerns might be a cause, as might the tendency of some publishers to think of themselves as self-contained units rather than as part of a larger metadata ecosystem. The same concerns may prevent adoption of shared controlled vocabularies, which would improve discoverability across publishers.
Scholarly communications service providers (vendors) create tools and platforms to disseminate and facilitate the use of scholarly research. They include library system vendors (such as ExLibris, OCLC, EBSCO), e-retailers (Amazon), publishing services companies (Cenveo, Overleaf), and metrics organizations (Altmetric, BibExcel), among others. Depending on the nature of the service offered, one service provider’s use of metadata will vary significantly from another’s. Such inconsistencies cause problems for resource discovery and for accessing full text content. By contrast, initiatives for open metadata and usage data, transparent pricing and contracts, allowance for community input, and use of well-established international standards to promote interoperability would bring positive change of the sort modeled in other industries.
Metadata problems can arise when indexing practices differ between publishers and service providers. Publishers may index content at a different level than service providers. For instance, a database vendor might index several small articles under one title when a publisher indexes them discretely. Service providers may index some but not all of the articles in a journal issue, or they may index each article in an issue but not the issue itself (
Problems also arise in the area of linking to content. Service providers create platforms, such as Integrated Library Systems (ILS), that allow users to discover bibliographic resources. These platforms feature links to full-text versions of content housed on external websites. The act of clicking on a link and arriving at the correct resource, known as link resolution, can be adversely affected by inconsistencies in metadata between content vendors, publishers, and providers of platforms.
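The effect of such inconsistencies on link resolution can be illustrated with a minimal sketch. It is not drawn from the literature reviewed here: the field names (`issn`, `volume`, `start_page`, `url`), the function names, and all record values are hypothetical placeholders, and real link resolvers are considerably more elaborate. The sketch assumes only that a resolver matches a citation against a list of known full-text targets, and it shows how a match that fails on literal comparison can succeed once both sides are normalized.

```python
def normalize_issn(issn: str) -> str:
    """Reduce an ISSN to one canonical form: no hyphen or spaces, upper-case."""
    return issn.replace("-", "").replace(" ", "").upper()


def match_key(record: dict, normalize: bool) -> tuple:
    """Build the tuple used to compare a citation with a full-text target."""
    issn = str(record["issn"])
    volume = str(record["volume"])
    page = str(record["start_page"])
    if normalize:
        issn = normalize_issn(issn)
        # Treat "05" and "5" as the same volume; keep "0" if that is all there is.
        volume = volume.strip().lstrip("0") or "0"
        page = page.strip()
    return (issn, volume, page)


def resolve(citation: dict, targets: list, normalize: bool):
    """Return the URL of the first matching target, or None if nothing matches."""
    wanted = match_key(citation, normalize)
    for target in targets:
        if match_key(target, normalize) == wanted:
            return target["url"]
    return None


# Hypothetical publisher-side record (placeholder values, not a real journal).
publisher_targets = [
    {"issn": "1234-5678", "volume": "5", "start_page": "101",
     "url": "https://example.org/article/101"},
]

# The same article as described by a hypothetical vendor: no hyphen in the
# ISSN, a zero-padded volume, and a stray trailing space in the page field.
vendor_citation = {"issn": "12345678", "volume": "05", "start_page": "101 "}

strict_result = resolve(vendor_citation, publisher_targets, normalize=False)
normalized_result = resolve(vendor_citation, publisher_targets, normalize=True)
```

Under these assumptions, the literal comparison finds no target while the normalized comparison returns the article URL. Normalization of this kind only papers over formatting differences; it cannot recover from missing or genuinely conflicting values, which is why consistent metadata at the source matters.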
Inconsistent or incomplete use of metadata fields also makes it a challenge to disambiguate similar titles, journals, and authors across distribution systems. If the metadata utilized in these systems is not normalized to authority sources or linked to unique identifiers, it is difficult to tell if two similarly-described resources are in fact the same. This serves as an inconvenience for individual researchers but also impacts our knowledge of the scholarly ecosystem as a whole. One research study detected significant problems in researchers’ use of bibliometrics to analyze author networks because of lack of disambiguation. Analyses which falsely equate authors with the same or similar names distort our understanding of author networks and “may result in ill-informed decisions about research policy and resource allocation” (
Organizations such as the Scholarly Publishing and Academic Resources Coalition (SPARC) and the Confederation of Open Access Repositories (COAR) have developed best practice principles for data repositories which can also be applied to vendor services. Among the principles are calls for open metadata and usage data, transparent pricing and contracts, allowance for community input, and use of well-established international standards to promote interoperability (
One challenge for other stakeholders in understanding the metadata practices of service providers is the large number of providers and tools offered. The literature would benefit from a comparison of different kinds of platforms and tools offered by service providers. Platforms and tools could be classified according to the impact of metadata for collection, curation, or consumption. Librarians in particular benefit when service providers are willing to adopt open metadata and usage data, transparent pricing and contracts, allowance for community input, and use of well-established international standards to promote interoperability.
Researchers generate the ideas and data that precede all publications. As such, researchers are also a starting point for metadata: they use their own metadata while organizing their writing and research data, and are usually responsible for formally submitting metadata to a publisher or data collector when a project is ready to be released to the public.
While researchers understand the value of making their work available to others, literature on the subject draws attention to the fact that the quality of their metadata does not always follow suit. Authors may be required to submit metadata along with traditional articles or monographs. Author-supplied metadata may benefit from the author's expertise, but is also, writes
On the whole, however, recent literature has focused more on researcher-generated metadata for data sets than on metadata submitted for published articles. On the subject of metadata for research data, a consensus emerges that researchers, while invested in the idea of sharing their data and findings, often lag behind in practice. Two studies of researcher perspectives on publication of data find that “researchers frequently fail to make data available, even when they support the idea or are obliged to do so” (
The ability to manage large sets of data is increasingly important for researchers across disciplines. In the sciences, however, educational and training programs for researchers are still catching up. In one case, researchers published their experiences adopting new practices to manage data for the Ocean Health Index project. They found that “the need to improve practices is common if not ubiquitous” among environmental scientists who work with large data sets (
The learning curve which researchers must negotiate to manage research data is not the only barrier for researchers who want to publish their data with appropriate metadata. Researchers must also manage the requirements and standards, sometimes conflicting, of different repositories, journals, and funders. In “Data management assessment and planning tools," Andrew Sallans and Sherry Lake find that “researchers’ current practices appear fragmented largely because funding agencies propagate broad requirements and provide few resources” (
Interestingly, researchers’ reliance on informal sharing might create a disincentive for quality metadata. Among the ways in which researchers share and obtain data, direct contact rates higher than all other methods, including sharing and retrieval via data repository (
Again, specifically in regard to data sets, a lack of shareable metadata has contributed to relatively low citation rates for data sets in published articles. While 49% of researchers thought that citation would be the most appropriate way of giving credit for data consulted (
Researchers have a strong incentive to create good metadata. The quality of metadata associated with their scholarly outputs will affect the number of citations they receive, while utilization of quality metadata early in the research process will facilitate organization and save time down the road. A more selfless incentive is that researchers also want to see knowledge advanced in their fields through better findability and wider access. These interests can be seen at work in the wide adoption of the Open Researcher and Contributor ID (ORCID), an initiative to assign a unique identifier to every author. These unique identifiers prevent confusion when authors change names or share a name with another individual: if a person publishes under the name Michaela Schaal in 2019 and then publishes an article under the name Michaela Petsch in 2020, it will remain clear through use of an ORCID that these names belong to the same person. Similarly, Michaela could publish as Michaela M. Petsch, M. M. Petsch, or any other configuration while remaining unambiguously associated with all her work. Searching by ORCID, researchers can be more certain that they are retrieving all the works of a given author and none from unintended authors.
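The disambiguation described above can be sketched in a few lines of code. The example below is illustrative only: the record structure, the titles, and the ORCID iDs are hypothetical placeholders (the iDs are not real identifiers and are not checksum-valid), and the fourth record stands for a different person who happens to share a name form with the first author.

```python
from collections import defaultdict

# Hypothetical publication records; all values are placeholders.
records = [
    {"title": "Paper A", "author_name": "Michaela Schaal", "orcid": "0000-0000-0000-0001"},
    {"title": "Paper B", "author_name": "Michaela Petsch", "orcid": "0000-0000-0000-0001"},
    {"title": "Paper C", "author_name": "M. M. Petsch", "orcid": "0000-0000-0000-0001"},
    # A different person who shares the name form "M. M. Petsch".
    {"title": "Paper D", "author_name": "M. M. Petsch", "orcid": "0000-0000-0000-0002"},
]


def group_titles(records: list, key: str) -> dict:
    """Group publication titles by the value of the given metadata field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["title"])
    return dict(groups)


# Grouping by name splits one author across three name forms and merges
# two different people under "M. M. Petsch".
by_name = group_titles(records, "author_name")

# Grouping by identifier keeps each person's works together and separate
# from everyone else's.
by_orcid = group_titles(records, "orcid")
```

In this sketch, a name-based search for "M. M. Petsch" returns Papers C and D, which belong to two different people, while a search on the first identifier returns Papers A, B, and C regardless of the name under which each was published.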
A similar effort applied to the citation and sharing of data could yield powerful results, though not perhaps without shifting away from the model of traditional citation (
A robust collection of literature profiles the problems that researchers encounter when managing research data, supplementing research on similar issues in the context of published books or articles. However, it remains difficult to attain a bird’s eye view of the subject given the variety in the ways that researchers approach metadata: metadata for research data varies based on funder requirements, data management software, data repository specifications, and research discipline and training. Likewise, author-supplied metadata for published works will vary according to the fields required by journals/publishers, the type of metadata submission form or tool, and research discipline. For both research data and publications, researchers’ level of involvement in creating metadata may vary based on the requirements of their institutions or publishers.
Studies that employ methodologies such as that of
Though research policy makers (mostly in government institutions) support the accessibility of research data (
Organizations that fund research range from governments to corporations to foundations. Common sources of funding in the United States include, among others, government agencies such as the National Science Foundation (NSF), the National Institutes of Health (NIH), the National Endowment for the Humanities (NEH), and the National Endowment for the Arts (NEA), as well as private foundations such as the Andrew W. Mellon Foundation. As research funders, these organizations have power to mandate how research data is maintained and distributed, both of which impact the researchers and institutions who generate and maintain metadata.
Government funders have strengthened requirements for researchers to make their research publicly available. For example, in 2008, NIH began to require researchers to submit the peer-reviewed, manuscript versions of their papers to PubMed Central, an open access database for medical research (
Beyond the wider benefits to the scientific community from access to open data, both private and public funders stand to benefit from more rigorous metadata requirements. For example, to measure the output of its Very Large Telescope (VLT) instruments, the European Southern Observatory (ESO) mandates that any article using data gathered from a VLT instrument include the ID of the project under which it was gathered in a standardized format (
Foremost, the literature would benefit from studies which provide a systematic comparison of funder requirements for research data metadata as well as any guidelines provided, schemas recommended, or platforms specified. Additionally, government funding agencies are better studied than corporate funders even though, at least in the United States, corporate entities were responsible for 72% of research and development funding in 2015 (
Libraries are the places where end-users with varying levels of expertise often seek out resources and thus encounter metadata. The challenges that librarians face in making their resources accessible are often the problems faced by the metadata supply chain as a whole, with each stakeholder group having an impact on metadata quality for library resources.
In “How libraries use publisher metadata,” Steve Shadle demonstrates how library discovery systems are impacted by publisher-generated metadata. (
Poor metadata quality brings into question the value of working with large numbers of publisher- and vendor-supplied metadata records. Records are generated by a variety of automatic methods, all of which are aimed at increasing efficiency when working with a large number of resources. Overall efficiency is somewhat diminished, however, if librarians are left without reliable ways to access a resource and if users of library databases see search results that are less than optimal.
Given the nature of these challenges, some see librarians as the best advocates for the end user and the health of the system overall because they “represent end-users who experience specific resources not ‘value for money’ packages” (
Librarians are also perceived as experts who can benefit other stakeholders.
A useful contribution to the literature in the confluence of librarianship and metadata would be a study that takes up the aforementioned question of
As with researchers, publishers, and funders, librarians are a large group working in many different kinds of institutions with different areas of subject knowledge. Hearkening back to the alternative metadata roles first mentioned in the Stakeholders section of this review, librarians are metadata creators, consumers, and custodians all rolled into one. While the differences between a cataloger, a health science librarian, and a reference librarian may be clear to those within the profession, the metadata needs of these various roles are not immediately apparent to those coming from the outside. Use cases and narrative examples written as far as possible in plain language would serve well in elaborating metadata problems for other stakeholders. Organizations like Metadata 2020, the Dublin Core Metadata Initiative (DCMI), and the North American Serials Interest Group (NASIG) each attempt to build our understanding of the metadata needs of different roles within the profession.
As can be seen in the sections on researchers and funders, much recent literature is focused on managing metadata for research data. As of yet, this function is not performed by one kind of institution. Though ‘data curators’ are in some cases professionals with a specific title and a particular skill set, data is curated by many who do not carry that title. As far as possible, this section attempts to describe the challenges and opportunities faced by all of those who work with metadata for research data.
The focus on research data management in scholarly communications literature is in part due to increased interest in the problems of reproducibility and accountability in research. Without access to data, the validity of a claim in an article rests on the professionalism of the authors and the reputation of the journal. Concerns about integrity have reached the general public through TEDx*
Others have expressed concern that the traditional journal article does not model the scientific process correctly or is not the most effective way to facilitate new discoveries. “Papers provide an account that excludes research aspects and outputs that are not directly relevant to the arguments at hand,” writes
A movement toward public access to data has followed and has attempted to create best practices for the infrastructure necessary to make data available. Data infrastructure is the collection of hardware and software used to support the preservation and retrieval of data, and relies heavily on good metadata. For instance, storing data with an eye toward long-term preservation requires technical metadata elements such as file inventory, file format (FITS, SPSS, HTML, Stata, Excel, tiff, mpeg, 3D, Java, CIF), file structure including “organization of the data file(s) and layout of the variables,” version information including a date/time stamp for each version, checksum values, and the software and hardware in which the data were created (
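As a rough illustration of the technical metadata elements listed above, the following sketch records a file name, file format, version label, date/time stamp, and checksum for a single data file, and then uses the stored checksum to detect a later change. It is a minimal example under stated assumptions rather than a preservation workflow: the function names, the dictionary keys, and the choice of SHA-256 are assumptions made for this example and are not prescribed by the sources reviewed here.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def technical_metadata(path: Path, version: str) -> dict:
    """Describe one data file with a few preservation-oriented elements."""
    data = path.read_bytes()
    return {
        "file_name": path.name,
        "file_format": path.suffix.lstrip(".").lower(),
        "size_bytes": len(data),
        "version": version,
        # Date/time stamp for this version, in UTC.
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "checksum_algorithm": "SHA-256",
        "checksum": hashlib.sha256(data).hexdigest(),
    }


def is_unchanged(path: Path, record: dict) -> bool:
    """Recompute the checksum and compare it with the stored value."""
    return hashlib.sha256(path.read_bytes()).hexdigest() == record["checksum"]


# Demonstration on a throwaway file with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "observations.csv"
    sample.write_bytes(b"site,temperature\nA,12.5\n")

    record = technical_metadata(sample, version="1")
    intact = is_unchanged(sample, record)

    # Simulate silent corruption or an undocumented edit.
    sample.write_bytes(b"site,temperature\nA,99.9\n")
    change_detected = not is_unchanged(sample, record)
```

The point of storing the checksum alongside the version and time stamp is that a repository can later verify that the bytes it holds are the bytes that were deposited, without relying on file names or modification dates.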
A closely related issue is the challenge of citing data. As referenced in the Researchers section, proper metadata plays a role in preserving the most important incentive for researchers to publish data, that of certainty that their data will be cited. In
Proper methods for storing and citing research data are still in development, but a number of initiatives have emerged with best practices to guide researchers and publishers when creating robust metadata for citation by others. With regard to citation, the Research Data Alliance’s Data Citation Working Group published a list of recommendations for citing data*
With regard to creating metadata for research data management, Force11 published the Findable, Accessible, Interoperable, and Re-usable (FAIR) principles in 2016 with associated metrics for evaluating machine actionable metadata and data quality. DataCite has created a metadata schema for data sets used by many repositories such as Elsevier’s Mendeley-based repository (
In “A metadata-driven approach to data repository design,” Harvey et al. outline the process for creating a repository in alignment with the FAIR principles. DOIs are first assigned to collections of data, e.g. groups of "logically related material" (
The promise of new initiatives has generated excitement but leaves questions as to their effectiveness which can only be answered as time passes. There is an opportunity for future research to quantify the adoption rates and impact of practices recommended by these initiatives. For instance, while Harvey et al.’s recommendation that DOIs be assigned to data collections, data sets, and individual data points would usher in a new level of transparency in scholarly research, a significant question remains as to whether researchers and data curators have the resources to implement it. The world of research data management would benefit from a service such as Crossref Participation Reports, which provides member organizations with an instant evaluation of their metadata quality and suggests ways to make improvements (
Moreover, it has yet to be seen whether adoption of initiatives by some researchers and organizations, but not others, will create a discordant state of practice that is itself a barrier to access for researchers who seek to study scholarly outputs in the aggregate. It may be difficult to ascertain which citation practices were in effect for whom and at what time.
The literature detailed above, in addition to profiling the issues surrounding individual stakeholders, also demonstrates the ways in which stakeholders depend on one another. It is clear from the literature on librarians, for instance, that library discovery systems and institutional repositories are impacted by publisher and vendor metadata. Publishers, in turn, derive metadata from researchers, whose actions are guided by their personal choices and requirements of funders. The whole process impacts the ability of the end user to find the data and publications they need, from which point the process begins again.
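One way to make these interdependencies concrete is to treat the supply chain as a small directed graph and ask which stakeholders sit upstream of a given point. The sketch below is a simplification offered for illustration: the edges are a reading of the relationships described in this review, the names `supplies` and `upstream_of` are invented for the example, and the feedback loop from end users back to researchers is deliberately left out to keep the graph acyclic.

```python
# Each key passes metadata (or metadata requirements) on to the listed groups.
# The edges are an interpretation of the relationships described in the text.
supplies = {
    "funders": ["researchers"],
    "researchers": ["publishers"],
    "publishers": ["vendors", "libraries"],
    "vendors": ["libraries"],
    "libraries": ["end users"],
    "end users": [],
}


def upstream_of(target: str, supplies: dict) -> set:
    """Return every stakeholder whose metadata can reach the target."""
    found = set()
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for source, receivers in supplies.items():
            if current in receivers and source not in found:
                found.add(source)
                frontier.append(source)
    return found


library_dependencies = upstream_of("libraries", supplies)
end_user_dependencies = upstream_of("end users", supplies)
```

Read this way, a library record depends on publishers, vendors, researchers, and funders, and an end user's search depends on all five groups, which is consistent with the observation that problems faced by librarians are often problems of the supply chain as a whole.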
There is a small body of literature dedicated to understanding the entirety of these complex interactions. Within the time scope of this review, attempts to diagram the scholarly communications lifecycle begin with the development of a hierarchical model, consisting of multiple diagrams, which detail the stages of research (
Diagrams provide a useful overall understanding of the scholarly communications ecosystem but, as of yet, do not include any particular focus on metadata. (One diagram does outline the integration of metadata into “search services,” (
The author would like to thank those who lent their generous support as contributors to this literature review. Christopher Erdmann, Clare Dean, and Laura Paglione provided guidance as to the overall structure and arrangement of the review and helped the author to make contact with experts in the field. Juliane Schneider and T. Scott Plutchak edited a draft of the review and provided valuable comments. Michelle Urberg provided extremely thorough comments on the final draft of the review and, with Alice Meadows, co-chairs Metadata 2020’s Researcher Communications project, which gave rise to this review. The author benefitted from a workshop graciously hosted by Henning Schoenenberger of Springer Nature. The review draws on resources created by the various project groups of Metadata 2020 and thus owes a debt to all of its participants. Finally, the author would like to thank individuals who responded to requests to lend their subject expertise, including Stacy Konkiel of Altmetric and Dimensions and Nena Moss of DOE/OSTI.
Metadata 2020
Alice Meadows and Fiona Counsell of Metadata 2020 have defined these roles, or personas, on the Metadata 2020 website: http://www.metadata2020.org/blog/2019-06-27-the-metadata-cast/