Research Ideas and Outcomes :
Research Idea
Corresponding author: Daniel Mietchen (daniel.mietchen@ronininstitute.org)
Academic editor: Lyubomir Penev
Received: 10 Sep 2022 | Accepted: 14 Dec 2022 | Published: 29 Dec 2022
© 2022 Shweata Hegde, Ayush Garg, Peter Murray-Rust, Daniel Mietchen
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Hegde SN, Garg A, Murray-Rust P, Mietchen D (2022) Mining the literature for ethics statements: A step towards standardizing research ethics. Research Ideas and Outcomes 8: e94685. https://doi.org/10.3897/rio.8.e94685
Ethical aspects of research continue to gain attention, be that in the process of proposing and planning research or performing, documenting or publishing it. One of the ways in which this trend manifests itself is the increasingly common addition of ethics statements to publications in fields like biomedicine, psychology or ethnography. Such ethics statements in publications provide the reader with a window into some of the practical yet typically hidden aspects of research ethics. As more and more publications are becoming available in full text and in machine readable formats through repositories like Europe PubMed Central, we propose to mine the literature for ethics statements and to extract information about the various aspects of research ethics that they address. The more standardized these statements are, the better the mined materials can be converted into structured and queryable information that can in turn be used to inform efforts towards higher levels of standardization in research ethics. This paper sketches out the motivation for such mining and outlines some methodological approaches that could be leveraged towards this end.
text mining, Wikidata, ethics committees, ethics process, ethical approval
Ethics is a key component of the way humans interact with each other and with their environments, including in research contexts. Research ethics provides a framework and guidance for making and evaluating decisions touching upon intellectual, social, legal, practical, cross-cultural and other dimensions of research and the context in which it is situated (e.g.
In some research fields - particularly those involving human subjects, animal experimentation, biodiversity or cultural heritage - the formalization of ethical norms and expectations has many decades of history (e.g. see
As formalization progresses, it tends to raise attention to ethical matters related to individual steps of research workflows, ranging from requesting ethical approval to documenting informed consent and providing ethics statements in funding applications or publications (e.g.
Much of the process behind ethical review of research remains hidden (e.g.
As illustrated in Fig.
Ethics Statement from
These frequently include
the legal or policy basis for handling these issues on an international level (e.g. the Declaration of Helsinki) and/or within a given jurisdiction or institution;
the procedures followed to conform with these legal requirements, along with information about the role of key stakeholders in the process (e.g. approval by an ethics committee, or informed consent by donors and participants, or protocols for anonymization, or (parts of) organizations where the research was performed);
the aspects of the research - if any - that pose ethical issues (e.g. acquisition of personally identifiable information, or animal experimentation or involvement of minors or prisoners).
This kind of information may assist others in engaging with the research that was performed, with the underlying methodology or the resulting data, with research projects of a similar nature or with education about matters related to said research.
While the majority of ethics statements refer directly to the research described in the respective publication, some such statements - particularly in certain types of reviews - refer to ethical aspects of cited publications, often summarizing the information for several of them using more generic phrases than in individual-article ethics statements. An example is given in Fig.
To ensure that ethics statements are present in publications when required by applicable policy or legislation, it is important that ethics-related information is available in a structured format to both humans and machines. This aim is in line with the FAIR principles (
Findable by everyone involved in the publishing process - authors and their co-authors as well as editors, reviewers, publishers and readers, along with any tooling that assists them in matching features of the reported research to relevant policy elements;
Accessible to the above stakeholders and their tool chains;
Interoperable across studies, institutions, journals, funders and others involved in research ethics workflows;
Reusable in another context (e.g. a different clinical, geographic or demographic focus).
At present, FAIR information about ethics is an exception rather than a rule, and we argue that this should change if ethical aspects of research are to receive proper attention.
Once the ethics statements are present where they should be, another set of considerations revolves around standardization of these statements: are all necessary pieces of information present, and are they expressed in a way that allows them to be compared, aggregated, assessed for compliance with applicable policy or otherwise used across studies?
Here, several factors come into play, for instance
Policy elements - what information is required by what part of which policy that is applicable to what aspect of the respective research;
Checklists with standardized “boilerplate” language for each policy element;
Machine actionability of these policy elements and their corresponding textual representations in the checklists.
In order to assist in the standardization of research ethics and associated documentation, we propose to do the following:
mine ethics statements from full-text articles using dictionaries (cf. Fig.
extract associated entities (e.g. subject areas, policies, authorities or research facilities) and vocabulary (e.g. terms and phrases related to handling informed consent or incidental findings);
assess the degree to which the language or other aspects of these statements - e.g. their location within a publication - are already standardized;
reconcile the extracted entities and vocabulary terms with Wikidata items and lexemes;
prototype and facilitate the creation of open infrastructure and automated workflows that allow users to look up and query information about the research ethics landscape in general as well as ethics approvals in particular, along with the corresponding processes, standards, entities and vocabulary.
An example dictionary for text mining, containing various seed terms in a structured format that can be easily expanded. Each entry consists of three parts:
Below, we will outline some use cases and practical steps towards implementing these ideas.
Ethics statements contain information about ethical aspects of the research reported in the respective manuscript. Having straightforward access to such information may assist readers in engaging with said research or with research projects of a similar nature. Possible scenarios here include researchers wanting to pool their own data with that of the reported study, or wishing to repeat the study under slightly different conditions (e.g. involving a different demographic, location, time of the year or medical procedure). Other scenarios include patients or members of their social circles trying to find out about clinical trials to potentially enrol in, funders or institutions that wish to monitor compliance with their policies (e.g. as per
If the relevant information in the ethics statements were available in a standardized fashion, this would allow for it to assist discovery in such scenarios. For instance, the terms used there or the relationships between them could be reused for parameterizing searches or for filtering search results. To achieve such standardization, communal language and ontologies or other forms of structured terminologies need to be created, and the process of creating them in turn assists in forming, strengthening or otherwise engaging such communities.
To demonstrate the feasibility of implementing the core ideas presented here, this section provides some methodological background, focusing on workflows that we used for prototyping.
The full text of many biomedical articles is available via the literature repository PubMed Central (PMC) and its partner sites like EPMC. The articles can be accessed in several formats, usually including HTML, XML and PDF. Particularly suitable for mining is the XML format, which follows the Journal Article Tag Suite (JATS) specifications. JATS formally supports a wide range of section types and includes provisions for ethics statements. Much of the PMC and EPMC content predates both the current JATS version 1.3 and the dedicated recommendations (
To ensure that key elements of ethics statements are discoverable at scale by interested people, organizations or their tools, these elements need to be integrated into a coherent environment that is aware of the communal conventions and that can be curated by relevant communities. One platform that meets these criteria is Wikidata - a sister project to Wikipedia that can be considered the edit button for the semantic web. Wikidata hosts public domain data from across multiple domains of knowledge about a wide range of entities (referred to as items, of which there currently are about 100 million). These items are semantically annotated by a global community of thousands of curators using information extracted from reliable sources, including scholarly publications and thousands of databases. Due to their breadth of coverage, their granularity, ease of use and the broad integration with other resources, Wikidata items have great potential to assist in the identification of entities encountered in text mining.
Besides items, which are defined in a largely language-agnostic way, Wikidata has begun to build a similarly annotated collection of terms and phrases (referred to as lexemes, of which there currently are about half a million) that the world’s languages use to describe the underlying concepts, and it keeps track of semantic relationships between the items and lexemes. We thus propose to make the information mined from ethics statements available via Wikidata by curating the Wikidata entries for the respective items, lexemes and named entities.
Software for accessing Europe PMC and similar repositories exists in several programming languages. We chose here to develop a Python-based pipeline that builds on a software suite originally implemented in Java a few years back and currently being developed as a tool called docanalysis (
Ethics statements mining pipeline. Works identified through a search query are retrieved in full text. The text is then searched for key terms from the ethics dictionary to identify ethics-related article sections, which are partitioned into sentences that are parsed in an attempt to identify named entities. The results of the mining can be compared to entities and terms known from Wikidata and/or the dictionaries, which can be continuously improved in an iterative process that can lead to a controlled vocabulary and eventually an ontology for ethics statements, ethics committees and related concepts.
We will discuss this process on the basis of the example use case of extracting information about ethics committees. However, the approach can be generalized to extracting other information, be that related to ethics - e.g. approval numbers, consent types, applicable policies and guidelines - or beyond, e.g. data availability (cf.
First, we use pygetpapers (
Next, we use docanalysis to decompose each article’s XML into sections that can be analyzed independently. We can split the downloaded papers into sections based on the JATS tagging. Some of the section headings are predetermined (e.g. `abstract`) but most others (such as subsections and paragraphs) are determined by the author, journal or publisher.
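The sectioning step can be illustrated with a minimal sketch that uses only the Python standard library. It is not the docanalysis implementation: the function name `split_sections` and the sample document are hypothetical, and real JATS files carry namespaces, front matter and nesting that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified JATS-like document used only for illustration.
SAMPLE_JATS = """<article>
  <front><article-meta><abstract><p>We studied X.</p></abstract></article-meta></front>
  <body>
    <sec><title>Methods</title><p>Samples were collected.</p>
      <sec><title>Ethics statement</title>
        <p>The study was approved by the Example Ethics Committee.</p>
      </sec>
    </sec>
    <sec><title>Results</title><p>It worked.</p></sec>
  </body>
</article>"""


def _clean(element):
    """Concatenate all text inside an element and normalize whitespace."""
    return " ".join("".join(element.itertext()).split())


def split_sections(jats_xml):
    """Return (title, text) pairs: the abstract first, then every <sec> in document order."""
    root = ET.fromstring(jats_xml)
    sections = [("abstract", _clean(a)) for a in root.iter("abstract")]
    for sec in root.iter("sec"):
        title_el = sec.find("title")
        title = _clean(title_el) if title_el is not None else ""
        # Only direct <p> children, so a subsection's text is not repeated in its parent.
        text = " ".join(_clean(p) for p in sec.findall("p"))
        sections.append((title, text))
    return sections


sections = split_sections(SAMPLE_JATS)
```

Each resulting pair can then be inspected independently, for example by checking whether a title such as "Ethics statement" was assigned by the author or publisher.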
Ethics statements are normally contained within a single paragraph (some with only one or two sentences). There are two main methods of retrieving these:
Sometimes, both methods are required: context to find the relevant paragraphs and content to find the relevant sentence(s).
To extract information on ethics committees from the sentences/sections we previously retrieved, docanalysis uses libraries like spaCy that provide techniques like unsupervised Named-Entity Recognition (NER - see recent review by
Sentences with phrases present in the ethics dictionary are selected, while other sentences are filtered out. The retained sentences are then parsed through spaCy, which allows strings pertaining to ethics committees to be extracted. These entities can then be added back into the ethics dictionary for more refined searches (cf. the section “Creating iterative feedback loops between the mining, curation and annotation of ethics statements” below).
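The filter-then-extract logic can be sketched as follows. Note the substitution: a hand-written regular expression stands in for the spaCy NER step, so that the example runs without a language model. The seed terms, the pattern and all function names are assumptions made for illustration, not part of docanalysis.

```python
import re

# Hypothetical seed terms; a real ethics dictionary would be larger and curated.
ETHICS_TERMS = ["ethics committee", "institutional review board",
                "informed consent", "ethical approval"]

# Regex stand-in for NER: optional capitalized words, a committee keyword,
# then optional "of (the) Capitalized Words" tails.
COMMITTEE_PATTERN = re.compile(
    r"((?:[A-Z][\w'-]*\s+)*(?:Ethics Committee|Institutional Review Board)"
    r"(?:\s+of(?:\s+the)?(?:\s+[A-Z][\w'-]*)+)*)"
)


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def filter_sentences(sentences, terms=ETHICS_TERMS):
    """Keep only sentences that contain at least one dictionary term."""
    return [s for s in sentences if any(t in s.lower() for t in terms)]


def extract_committees(sentences):
    """Collect committee-like strings, preserving first-seen order without duplicates."""
    found = []
    for sentence in sentences:
        for match in COMMITTEE_PATTERN.finditer(sentence):
            name = match.group(1).strip()
            if name not in found:
                found.append(name)
    return found


text = ("Samples were collected in 2020. "
        "The study was approved by the Ethics Committee of the University of Debrecen. "
        "Approval was also granted by the Beaumont Hospital Ethics Committee. "
        "All participants gave informed consent.")
retained = filter_sentences(split_sentences(text))
committees = extract_committees(retained)
```

The extracted strings could be appended to the dictionary for the next round, which is the feedback step described above. A trained NER model would be expected to generalize better than this pattern, which only recognizes the two keywords it was given.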
After extracting the ethics committee information through NER, we can convert it to structured data. These data can then, for instance, be overlaid on the original text (e.g. as per
Part of the result of a Wikidata query for ethics committees. committee stands for the Wikidata entry for a given ethics body, and committeeLabel for the corresponding label in English. To access the live results, use https://w.wiki/4$GC. Such queries can be refined further, e.g. to enrich the above list with examples of research approved by these committees, to get a list of publications with information about the ethics bodies that have approved the underlying research or a list of topics for which publications have reported ethical approval. Most of the current entries in the list were the result of testing our pipeline, so the information associated with them is often minimal. However, once these entries exist and are linked to other entries (e.g. for the parent organization), they become part of the community curation workflows on Wikidata, which can in turn enrich the mining efforts over time.
committeeLabel
Ethics Committee of the American Society for Reproductive Medicine
Ethics Committee of the University of Debrecen
Cambridge Local Research Ethics Committee
Institutional Review Board of Fujita Health University
Institutional Review Board of the Chulalongkorn University Faculty of Dentistry
Ethics Committee of University Hospital Hradec Kralove
People’s Hospital Ethics Committee
Research Ethics Committee of Galway University Hospitals
Beaumont Hospital Ethics Committee
Hartford Hospital Ethics Committee
Scotland A Research Ethics Committee
Biobanks Ethics Committee of the University of the Witwatersrand
Committee for Ethics in Research of the University of São Paulo
National Ethics Committee of Senegal
Human Research Ethics Committee (Non-Medical) of the University of the Witwatersrand
Saint Barnabas Medical Center Institutional Review Board
Inrae-Cirad-Ifremer-Ird joint ethics advisory committee
Institutional Review Board of Sanyo-Onoda City University
Emirates Institutional Review Board for COVID-19 Research
Local Ethics Committee of Medical University of Silesia
Entity extraction using Wikidata can be further enhanced by incorporating information from corresponding Wikipedia entries (cf.
An ontology of ethics committees and Institutional Review Boards (IRBs) can be created via Wikidata and used via the Wikidata SPARQL service. This ever-updating resource can then be used to aggregate and visualize ethics committee information extracted from the wider scientific literature. For instance, one could ask questions like which ethics committees have approved a particular study, or studies on particular subjects, involving specific demographics, using particular interventions or funding sources.
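A query of this kind might look like the sketch below. It is an assumption about how such a query could be phrased, not necessarily the query behind the short link given earlier: it asks for items that are instances of the ethics committee class (Q59057226, mentioned elsewhere in this article) or of any of its subclasses. The function names are hypothetical, and the code only builds the query and request URL without sending anything.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata SPARQL endpoint


def build_committee_query(class_qid="Q59057226", language="en", limit=20):
    """Build a SPARQL query listing instances of a class (or of its subclasses) with labels."""
    return f"""
SELECT ?committee ?committeeLabel WHERE {{
  ?committee wdt:P31/wdt:P279* wd:{class_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}" . }}
}}
LIMIT {limit}
""".strip()


def build_request_url(query):
    """Encode the query as a GET request URL that asks for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = build_committee_query()
url = build_request_url(query)
```

The resulting URL can be fetched with any HTTP client. Further triple patterns, for instance linking publications to the committees that approved the underlying research, would depend on how that relationship is modelled on Wikidata, which this sketch does not assume.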
Large search engines are usually optimised for terms and synonyms, not higher levels of concepts like “ethics”, and they often rely largely or even solely on metadata, which might well contain no information about the ethics process. In order to find statements about ethical aspects of a publication, it is hence necessary to analyze its full text.
In subsequent rounds of mining, information from Wikidata can be used to fine-tune the entity recognition, e.g. by providing terms to be included in the dictionaries used for mining, or by providing context for entity disambiguation. For instance, geoinformation can be used to distinguish between Calvin University in South Korea and Calvin University in the United States. Further synonyms can frequently be resolved in a straightforward fashion: “X University” often maps to “University of X”, though for a small group of X (Wikidata knows 7 examples), both might exist as separate entities, either in close proximity (as is the case for Hyogo or Shizuoka), at different places within the same country (e.g. Rochester, Jinan, Miami), in neighbouring countries (Ottawa) or continents apart (York).
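The “X University” / “University of X” heuristic can be sketched in a few lines. The list of exceptions is taken from the seven examples named above, while the function names and the decision to return candidates rather than a single resolved name are assumptions made for illustration.

```python
import re

# The seven X for which both name forms are described above as separate entities.
AMBIGUOUS_X = {"Hyogo", "Shizuoka", "Rochester", "Jinan", "Miami", "Ottawa", "York"}


def core_name(name):
    """Return X for 'University of X' or 'X University', else None."""
    match = re.fullmatch(r"University of (.+)", name) or re.fullmatch(r"(.+) University", name)
    return match.group(1) if match else None


def name_variants(name):
    """Return the name together with its swapped form, as candidate synonyms."""
    x = core_name(name)
    if x is None:
        return {name}
    return {name, f"{x} University", f"University of {x}"}


def needs_disambiguation(name):
    """True if the two name forms are known to denote different institutions."""
    return core_name(name) in AMBIGUOUS_X


variants = name_variants("University of Debrecen")
```

For names flagged by `needs_disambiguation`, additional context such as location would be needed before the two forms are merged.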
For common words, we may need stemming (“approved” => “approv~”) or, more generally, lexemes (“X is grateful” or “we are grateful” => “X <be> grateful”). Modern NLP tools can now identify such phrases from their context with high confidence. Wikimedia has an active lexeme project which can resolve lexical forms and map them to concepts, e.g. the English terms “ethics committee” and “informed consent form” are represented by the Wikidata lexemes L497553 and L497589, respectively. These lexeme entries in turn link information about these English nouns, their grammar and meaning to information about the underlying concepts (e.g. Q59057226 for “ethics committee” as a subclass of committee) as well as equivalent terms in other languages, which can also occasionally be found in ethics statements.
For instance, Fig.
Complementing these mono- and bilingual examples, Fig.
Taking such cross-linguistic information into account can thus facilitate entity recognition in ethics statements even in English texts and help expand the methodology to mining articles in other languages as well, e.g. to identify or distill boilerplate phrases in a given language or cultural differences across languages in terms of how ethics-related information is handled. For any language with information about such boilerplate phrases, a score could be computed that could represent the similarity between boilerplate text and phrasing from a given article. Such scores could be used, for instance, to guide community curation efforts - high similarity to known boilerplate means high potential for automation and less need for human oversight, while low similarity indicates a need for community review.
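One possible form of such a score is sketched below, using token-level sequence similarity from the Python standard library. The choice of `difflib.SequenceMatcher`, the threshold of 0.8, the example boilerplate sentence and the function names are all assumptions; other similarity measures could serve equally well.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def boilerplate_score(statement, boilerplates):
    """Highest token-sequence similarity (0.0 to 1.0) between a statement and any boilerplate."""
    statement_tokens = tokens(statement)
    return max(
        (SequenceMatcher(None, statement_tokens, tokens(b)).ratio() for b in boilerplates),
        default=0.0,
    )


def triage(statement, boilerplates, threshold=0.8):
    """Route a statement to automation or to community review; the threshold is hypothetical."""
    return "automate" if boilerplate_score(statement, boilerplates) >= threshold else "review"


# Hypothetical boilerplate sentence for illustration.
BOILERPLATES = ["The study was approved by the ethics committee of the institution."]

exact = boilerplate_score("The study was approved by the ethics committee of the institution.",
                          BOILERPLATES)
close = boilerplate_score("This study was approved by the ethics committee of our institution.",
                          BOILERPLATES)
far = boilerplate_score("The weather was pleasant.", BOILERPLATES)
```

Comparing token sequences rather than characters keeps the score insensitive to punctuation and capitalization, at the cost of treating inflected forms as different words unless stemming is applied first.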
The extraction of ethics statements is a special case of a more general requirement. Many such statements are formulaic, either because the discipline itself or the publication process requires it. Typically, these articles have paragraphs where the sentences are discrete and not part of a larger narrative flow. A simple test for this is whether the sentences
Looking beyond ethics statements, we have explored the range of syntactically similar sentences – frequently including boilerplate, named entities and perhaps identifiers like ethical approval numbers – and created a non-exhaustive list of manuscript components where they can frequently be found:
acknowledgements and thanks;
methods sections;
availability and location of data and software;
roles of authors and their contributions;
conflict of interest statements;
copyright statements.
The pipeline and the tools we are developing can extract semantic information from all such syntactically constrained sections of the scientific literature – not just ethics statements.
Irrespective of the textual representation and of JATS-style document markup, we posit that the factual elements of all ethics statements can be arranged to fit a grammar that relates the entities and is decomposable to a set of semantic triples. If true, this means that ethics statements can be formally encoded by authors as a graph and captured in a graph knowledge base. This graph would then be queryable by standard tools such as SPARQL. Typical examples might be:
The entities and the predicates linking them would be mapped to standard identifier systems, including Wikidata, which is integrated with many of the key resources in this space. For instance, ethics-related terms that have a MeSH Descriptor - e.g. ethical review, ethics committee, animal care committee, informed consent and consent form, or the Declaration of Helsinki - all have a Wikidata entry, as do related terms that do not have a MeSH Descriptor, e.g. ethical approval, ethical oversight, or the Nagoya Protocol. Good coverage of ethics-related terms can also be found in the Informed Consent Ontology.
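To make the idea of a triple-based encoding concrete, the following sketch stores a few statements as subject-predicate-object tuples and queries them with a wildcard pattern, loosely mimicking a single SPARQL triple pattern. Every identifier in it (the `ex:` prefixes, predicate names and the approval number) is hypothetical and chosen only for illustration; in practice the entities and predicates would be mapped to Wikidata or similar identifier systems.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical statements; none of these identifiers exist in a real vocabulary.
GRAPH = [
    Triple("ex:study1", "ex:approvedBy", "ex:EthicsCommitteeA"),
    Triple("ex:study1", "ex:approvalNumber", "2020/123"),
    Triple("ex:study1", "ex:conformsTo", "ex:DeclarationOfHelsinki"),
    Triple("ex:study2", "ex:approvedBy", "ex:EthicsCommitteeA"),
    Triple("ex:study2", "ex:consentType", "ex:WrittenInformedConsent"),
]


def match(graph: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Which studies were approved by committee A?
approved = [t.subject for t in match(GRAPH, predicate="ex:approvedBy", obj="ex:EthicsCommitteeA")]
```

A dedicated triple store or RDF library would replace the list and the `match` helper in any real deployment; the point here is only that each factual element of an ethics statement reduces to one such tuple.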
In the future, an increased level of curation of such information could be used to enhance ethics mining efforts. Ideally, authors could, with help from an authoring tool, submit their ethics statement as a formal graph representation. One approach would be a public site which parses manuscript snippets and assists its users in mapping them to triple-based standardized statements about ethical aspects of one or more manuscripts. Assuming a user-friendly implementation, we hypothesize that authors would be prepared to accept a standard form of language that could also be machine-parsed.
The information curated this way could also be used to search more systematically for the context in which ethics-related information occurs (cf. Information Retrieval section), i.e. the more standardized language could be used as a lexical hook to fish for similar snippets elsewhere, then regularize them and ultimately collate and analyze the bulk information.
Mapping the relevant terms creates a valuable positive feedback process between miners, corpora and open resources like the Wikimedia platforms. In some cases, Wikidata is well equipped with synonyms but at present, the entries are often stubs with very little information. The snowballing process will generate possible synonyms which can be collected together and offered in tools like Mix’n’Match for human editors to submit to Wikidata, or in tools like Drnote (
In this work, we outlined a set of core ideas for mining the literature, extracting ethics-related entities and relationships, reconciling them with a controlled vocabulary, making the information queryable and creating a positive feedback loop between the structured information and the mining workflows by iteratively using one to improve the other.
Much like in other areas of data mining, initial challenges for the mining of ethics statements include handling inconsistent approaches to the naming of relevant entities (e.g. institutions, ethics committees, laws and other relevant policy frameworks). This is compounded by inconsistency as to where in a document the ethics statements are located (e.g. in a dedicated section, or as part of the Methods or in an Annex).
If these challenges can be addressed, the mining of ethics statements can provide significant value in terms of elucidating the research ethics landscape (highlighting relevant organizations, along with policies, guidelines and other standardization efforts) as well as documenting, improving, teaching and standardizing current practices in research ethics. A systematic analysis of the ethics statements will also highlight institutional, disciplinary and other contexts in which such statements are common or well-developed, uncommon or underdeveloped, or anywhere in between.
This can form the basis for studying ethical aspects of the research process - as well as ethics review - under specific conditions and for addressing ethical aspects of research both in practice and in teaching. For instance, key elements of contexts in which well-developed ethics statements are common - such as a clear policy, readily actionable community guidelines or scalable workflows - could serve as a starting point for exploring best practices or synthesizing recommendations, while other contexts could be explored in terms of their potential for improvements.
Another point to consider is that access to the ethics-related information contained in a publication currently requires access to the full text. However, the basic ethics data - such as whether the research reported in the publication received ethical approval, what the approving bodies were and what the relevant approval numbers are - should be considered metadata and in the public domain. Ideally, they would be incorporated into the filtering mechanisms provided by individual databases or scholarly search engines and visualization tools more generally. Some databases like stem cell registries (cf.
We plan to work towards implementing the core ideas presented here, and we very much welcome collaborations in this regard.
In particular, we plan to extract information and phrasing pertaining to ethics committees and other entities commonly found in ethics statements (e.g. policies and guidelines) and to make this information available via suitably annotated Wikidata items and lexemes that can in turn be used by mining pipelines. Once the data models in this area have stabilized, it would be possible to scale up these workflows by increasing their automation and expanding the mining to auxiliary materials like approval letters, which are currently shared only very rarely, or to annotating ethical aspects of things other than formal publications, e.g. clinical trials or their consent forms, which are now increasingly being made public too.
Further, we plan to work on visualizations that present this structured information and that can be incorporated into suitable parts of the open knowledge ecosystem, particularly through Wikimedia platforms and associated visualization services like Scholia (
Beyond ethics statements, we plan to apply the ideas outlined here also to other non-traditional parts of research manuscripts, e.g. data availability or conflicts of interest. We also aim to explore how these approaches can assist with the enrichment of mining efforts targeted at less-mined aspects of manuscripts, e.g. the citation of data, software and material resources. In doing so, we will focus on resources that are openly available.
Neither the work proposed nor the work presented here has so far received funding.
The authors declare that they have no conflicts of interest pertaining to the research described here.