ORKG: Facilitating the Transfer of Research Results with the Open Research Knowledge Graph

This document is an edited version of the original funding proposal entitled 'ORKG: Facilitating the Transfer of Research Results with the Open Research Knowledge Graph' that was submitted to the European Research Council (ERC) Proof of Concept (PoC) Grant in September 2020 (https://erc.europa.eu/funding/proof-concept). The proposal was evaluated by five reviewers and has been placed after the evaluations on the reserve list. The main document of the original proposal did not contain an abstract.

Statistics 2020). Currently, however, this is not a good investment and, each year, an ever-increasing share of it is wasted. The reason is that, for representing and sharing research findings, we still use antiquated methods developed many centuries ago. Since the beginning of modern science, with the publication of the first scientific journals, the Journal des Sçavans and the Philosophical Transactions of the Royal Society, in 1665 (Mack 2015, Spinak and Packer 2015), we have used the same method for representing and sharing scholarly knowledge: the scientific article. At the time of the polymath Gottfried Wilhelm Leibniz in the 17th and 18th centuries, a single researcher could still read the entire published scientific literature.
Today, 2.5 million new research articles are produced each year; the literature from 1980 to 2012 shows an exponential growth rate of 3% annually (Bornmann and Mutz 2015). Even in a relatively narrow scientific field, it is impossible to read, comprehend and make sense of all the relevant articles.
For the genome editing method CRISPR/Cas9, for example, the research search engine Google Scholar lists a quarter-million publications available as PDF articles. If a researcher is interested in how good the method is compared to other genome editing methods, what specifics it has when applied to insects and who has applied it to butterflies, that researcher either needs years of experience or is very likely not to find what he or she is looking for. Imagine that, to order a new iPhone, you had to compare prices by checking dozens of mail-order catalogues published as PDFs, or that, to navigate to a hotel, you had to consult a PDF scan of a street map. This is exactly how the exchange of research findings works today: the previously analogue articles from scientific journals are now made available and distributed as PDF documents.
The new methods of the digital world, such as filtering large amounts of data and information, integrating information from different sources or involving users via crowdsourcing to review and help organise the information, are non-existent in scholarly communication. Researchers are drowning in a flood of millions of pseudo-digitalised PDF publications. As a result, some research is seriously flawed: many research results cannot be reproduced by other researchers, peer review struggles to cope with volume, speed and quality, and redundancy keeps increasing. Major social challenges, such as handling the COVID-19 pandemic and infodemic (WHO 2020) or implementing climate neutrality, require interdisciplinarity and putting bits and pieces from different disciplines together, which is currently extremely cumbersome and resource-intensive.

1a.2 The solution
In the ERC ScienceGRAPH project, we are researching and devising foundational concepts for organising scholarly communication in a knowledge-based way, leveraging a new formal model: cognitive knowledge graphs. According to this model, research contributions are represented in a human- and machine-readable manner in the knowledge graph. As a result, completely new forms of machine assistance, such as the semi-automatic generation of state-of-the-art overviews, visualisations or even question-answering applications, become possible. To prepare the demonstration of the ScienceGRAPH results, the ERC project partner TIB Leibniz Information Center for Science and Technology (also directed by the ERC grant holder Sören Auer) started to develop the Open Research Knowledge Graph service, available at https://orkg.org. As an example, Fig. 1 shows a state-of-the-art comparison of different studies addressing the basic reproduction number (R0) of COVID-19.
Based on such a structured, semantic and machine-readable representation, various other exploration and assistance tools become possible, for example a chart visualisation aggregating the results of the various studies. This example illustrates how the solution addresses the problems of various stakeholders:
• Researchers in the field (here epidemiologists and virologists) can get a quick overview of the state of the scholarly discourse on a particular research question, identify gaps and position their own approach to make their contributions stronger.
• Peer-reviewers can quickly assess the merits of a particular approach and view it in comparison to the current state-of-the-art.
• Publishers have a tool for assisting their editors, editorial managers, reviewers and authors in making contributions stronger and better positioned in the scientific discourse. In addition, publishers offering such semantic descriptions and comparisons will significantly increase the attractiveness of their journals.
• Equipment and instrumentation manufacturers can ensure that important configurations of materials used in research are documented and the use of their devices is properly acknowledged and visible.
• Industrial and societal stakeholders get faster and better access to the state-of-the-art and can, thus, more efficiently and effectively realise research-based products and services.
While some user groups (e.g. researchers and peer-reviewers) will not pay directly for this solution and the ORKG will in general be an open infrastructure, we see clear potential for commercial value-added services. To realise this potential, with this ERC PoC project, we aim to demonstrate some key results attained in the first two years within the ORKG.org proof-of-concept:
• Integrate the crowd- and expert-sourcing authoring and curation model for cognitive knowledge graphs, based on the knowledge graph cells concept (Vogt et al. 2020).
• Integrate persistent identifiers for scientific sensors and instruments to support the provenance and reproducibility of research results from experiment to publication.
• Develop approaches for generating comprehensive state-of-the-art overviews for a specific research question from the semantic knowledge graph representations of corresponding contributions.

1b. Demonstration of Innovation Potential
The ORKG is unique in its idea of describing scientific contributions in a knowledge graph. There are several other knowledge graph projects for scholarly communication, including from commercial players, such as SciGraph from Springer Nature or the Microsoft Academic Graph. However, these initiatives focus solely on bibliographic information and do not comprise a rich, structured representation of the actual content of the publications. Other related initiatives are text-mining projects, such as Semantic Scholar, which automatically generate relatively shallow semantic descriptions. However, due to the low precision and recall of text-mining methods (in particular for relation extraction), this does not go beyond relatively simple classification, annotation and summarisation of the content and, thus, does not suffice for creating a comprehensive knowledge graph representation and exploration services such as comparisons, visualisations and question answering.
Section 2: The Expected Impact

2a. Identification and description of any effect or benefit to the economy, society, culture, public policy/services
The results of this ERC PoC project can have a dramatic impact on the effectiveness and efficiency of research and on how research results are transferred into applications. We expect that research will become at least 10-15% more efficient, with corresponding positive effects on the effectiveness of the annual research spending of almost US$1.7 trillion worldwide. The scholarly publishing industry in particular, an annual US$10 billion market (Research and Markets 2020), would benefit significantly from the results of this project. In the following, we describe the impact on the research instrumentation industry in more detail.
Sensors and scientific instruments are important in the research cycle of several academic disciplines. Sensors, for example, are used for permanent measurements in agriculture, and scientific instruments are used in laboratories to carry out scientific measurements. There is a need to develop persistent identifiers (PIDs) for sensors and scientific instruments, and several initiatives are working towards that goal. The Digital Object Identifier (DOI) is a common example of a PID widely used for publications and research datasets; further identifiers include, for example, handles. Sensor platforms in agriculture have been assigned PIDs and see widespread use in the scientific community, but scientific instruments are usually not citable in publications. The proposed ORKG PoC will generate several benefits for the economy, and we aim to introduce the outcomes of the ERC-funded ScienceGRAPH project into the market of sensors and scientific instruments. Citing the instruments that were used to carry out the research (e.g. the measuring) in publications would contribute to more transparent communication of research results; some exceptions, such as electron microscopes or particle accelerators, are already mentioned in publications. Instrument citation could be achieved by extending the DataCite schema that is currently used for research data, amongst others. This extension could include, for example, the model number of an instrument, its date of purchase, its use in a research project, its maintenance and its calibration. The business office of DataCite is located at TIB and the R&D team has already held discussions on this topic. If the DOI suffix of a publication were extended by mentioning the related scientific instrument, this would provide several advantages. Scientific instruments could initially be registered by the manufacturer, which would require a new membership to register DOIs via DataCite.
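To make the envisioned schema extension concrete, the following sketch shows what an extended instrument record could look like as plain data. Every field name here is an illustrative assumption, not the official DataCite or PIDINST vocabulary, and the DOIs use the reserved test prefix 10.5072.

```python
# Illustrative sketch of instrument metadata that a DataCite-style schema
# extension might carry; all field names are assumptions, not the official
# DataCite or PIDINST vocabulary.
instrument_record = {
    # 10.5072 is the DataCite test prefix, used here as a placeholder
    "identifier": {"value": "10.5072/instrument-0001", "type": "DOI"},
    "name": "Electron microscope, model XYZ-200",
    "manufacturer": "Example Instruments Inc.",
    "model": "XYZ-200",
    # The extension fields suggested above: purchase, maintenance, calibration
    "dates": {
        "purchased": "2018-03-01",
        "lastMaintained": "2020-01-10",
        "lastCalibrated": "2020-06-15",
    },
    # Link the instrument to a publication whose research used it
    "relatedIdentifiers": [
        {"value": "10.5072/example-article", "relationType": "IsUsedBy"},
    ],
}
```

Such a record could be registered once by the manufacturer and then referenced from every publication that used the instrument.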
The PoC would build on the basic research carried out as part of the ERC-funded ScienceGRAPH project, but would provide an automatic connection to the Open Research Knowledge Graph (ORKG), which is also operated at TIB and focuses on applied R&D. Furthermore, we will prepare a use case in the Integrated Carbon Observation System (ICOS) research infrastructure in collaboration with LI-COR Biosciences. Further use cases would cover more academic disciplines, such as engineering at Leibniz University Hanover (LUH) and life sciences at Hanover Medical School (MHH). There is already a well-established collaboration with Collaborative Research Centres (SFBs funded by the German Research Foundation, DFG), such as the SFB "Tailored Forming" at the Hanover Centre for Production Technology at LUH.
Structured, machine-readable data will provide a competitive advantage for our industry partners, since instruments registered with a PID will have an advantage over those from other companies. Apart from manufacturers of sensors and scientific instruments, the PoC will generate benefits for academic publishers, researchers and research infrastructures. Academic knowledge is generated at different points in time, not only while publications are being written; saved instrument metadata would make these efforts visible. The results might be reused in follow-up projects, for example in laboratory information management systems (LIMS). This could be done, for example, in collaboration with the Julius Kühn Institute, a federal research centre for cultivated plants in Germany, which already collaborates closely with the R&D team at TIB through other projects. Furthermore, TIB has established contacts with the software engineering company Limsophy LIMS. The outcome of the PoC will be a prototype at TRL 7 that can be further developed by the industry partner in collaboration with researchers.
Apart from economic benefits, the ORKG PoC will also generate benefits for society. The coronavirus pandemic demonstrated once again that there is a need for transparent reporting of scientific results. The proposed project will enable FAIR (findable, accessible, interoperable and reusable) research information and research data for several stakeholders. The reproducibility crisis fuels an ongoing debate in research and research policy (Fanelli 2018); this also relates to issues with replicability, and several projects are trying to tackle this challenge (Whole Tale 2020). The project outcomes will reduce the challenges of reproducibility and replicability in certain academic disciplines. What is more, sensors are strongly promoted in public policy and services, for example with regard to digitising European industry and advancing the Internet of Things (IoT). As such, they also contribute to building a Digital Single Market, one of the key priorities of the European Commission (European Commission 2018).

2b. Outline of the value creation process
To maximise the societal benefit from the results of the ERC and this PoC project, the core ORKG service will be an open infrastructure following the Open Science, Open Access and Open Source principles. This also enables rigorous and large-scale testing and evaluation of the outcomes of the project with real user communities. TIB is prepared to sponsor and further develop, maintain and operate the ORKG service in the long term. In addition to the open strategy, we envision various commercialisation opportunities, including:
• Providing value-added services tailored for commercial scientific publishers, such as Springer Nature, Wiley and IEEE Publishing.
• Providing commercial data, analytics and question answering services for speeding up the spread and transfer of research results in industrial applications.
• Partnering with industrial stakeholders, in particular scientific instrument manufacturers, regarding sponsoring of the ORKG and integration of their instrument descriptions.
TIB has long-established R&D collaborations and customer relationships with small and large industrial stakeholders. TIB already provides commercial literature access services to more than 100 customers and aims to extend this offering with the research analytics services built on the ORKG service infrastructure.
Section 3: The proof of concept plan

3a. Project-management plan including risk and contingency measures

3a.1 Organisational structure and decision-making process
Since the ORKG project is relatively focused, we envision a lean organisational structure depicted in Fig. 2.
In addition to the PI and the ORKG development lead, the organisational structure will involve leads of the three ORKG work packages, an industrial advisory board as well as an ORKG community board.
The industrial advisory board will advise the project team in matters related to the commercialisation of the results, such as product features, product and service offerings, IPR, pricing, as well as legal matters. We will organise quarterly meetings of the board and have been in touch with several industry representatives about joining it.

Fig. 2: Organisational structure and decision-making process.
The ORKG community board will advise the development team with regard to community requirements and will comprise experts from various research fields, research data infrastructures and open-access publishers. We plan to organise quarterly webinars or workshops with the community advisory board (possibly in conjunction with larger scholarly communication events).
Decision-making and development methodology. The size of the project allows it to follow a lean, focused decision-making process, where most decisions are made in the regular weekly ORKG project meeting involving the whole team. For all developments, we follow an agile, KANBAN-inspired methodology, aiming to establish a constantly active development process by optimising the issue burn rate and fostering a proactive communication culture.

3a.2 Plan for the identification and acceptance or off-setting of possible risks
We aim at identifying, evaluating and eliminating or minimising potential risks that may jeopardise the success of the project. While some relevant project risks and ways to address them have already been identified, risk management will be conducted throughout the project. It is a continuous process in which known risks are regularly reviewed and new risks are recognised so that they can be handled and controlled adequately. Their assessment leads to the formulation of appropriate mitigation measures that should help to prevent a risk, overcome it or reduce its effects to an acceptable level. The process behind risk management can be broken down as follows:
1. Risk identification (i.e. recognise and describe risks).
2. Risk analysis (i.e. analyse the likelihood and consequences of risks).
3. Risk assessment (i.e. determine the magnitude/acceptability of risks for the project).
4. Risk response planning (i.e. create and execute an action plan to prevent or minimise risks).
5. Risk control (i.e. monitor, track and review risks and mitigation actions).
Table 1 contains some examples of risks and corresponding mitigation strategies that we have already identified.

3b.1 Team, achievements and experience
The team is led by ScienceGRAPH PI Prof. Dr. Sören Auer. He is supported by the ORKG project head Dr. Markus Stocker, who has been leading related research and development activities for almost two years. In addition, a seasoned team is already established, including experienced PostDoc researchers (e.g. Dr. Jennifer D'Souza and Dr. Lars Vogt), more than five PhD students, software developers (Manuel Prinz and Kheir Eddine Farfar) and business and technology transfer experts (especially in the TIB departments), who can be dynamically involved in the project as required.

Entrance of new competitors
We aim to gain as much competitive advantage as possible and to increase user/customer loyalty through our open science infrastructure. In addition, we aim to build an open, interoperable ORKG service ecosystem.

Lack of qualified personnel
As a research institute closely connected to a university department, we have direct access to skilled master's graduates. In addition, we have built an international reputation that makes us an attractive destination for qualified international candidates.

Lack of user and customer adoption
We align the development process as closely as possible with user/customer requirements and thus aim to maximise adoption. In addition, we follow an iterative development process with regular intermediate evaluations and community building.

Leaving of a key person
The ScienceGRAPH/ORKG team already distributes the work across several individuals, thus reducing the dependency on any single person. In addition, we aim to ensure that the skills needed for key activities are available from at least two people.

Lack of funding and investors
The ORKG service is of strategic interest to TIB, and even in the absence of further external funding, TIB is committed to sponsoring the ORKG. In addition, we will actively work on attracting further sponsors, creating awareness in politics for the open infrastructure and building a sustainable business model on top of the ORKG, based on value-added services.

3b.2 Roles of the team and main strengths and weaknesses
The role of the PI, Prof. Dr. Sören Auer, is to develop and communicate the strategic vision of the project and to devise the key development milestones and priorities. He will advise and mentor the PhD students and PostDocs on the project and work closely with the ORKG development lead, Dr. Markus Stocker. A further focus of the PI is to build strategic partnerships and to attract further funding, sponsorship or investment. The ORKG development head, Dr. Markus Stocker, will lead the day-to-day operations and developments of the project. He will lead the regular KANBAN sessions together with the development deputy, Manuel Prinz, and guide the research and development in line with the requirements and strategic priorities defined by the community and advisory boards. Alexandra Garatzogianni will lead the business development strategy and contribute to building and maintaining sustainable sponsor, partner and customer relationships for the ORKG service ecosystem throughout and beyond the project's duration. The main strengths and weaknesses of the team include the following:

Key strengths
• Successful track record of translating research excellence into large-scale applications, including successful commercialisation in a spin-off.
• A long history of industrial collaborations.
• ORKG innovation concept with an enormous value potential.
• A seasoned team including a variety of backgrounds and skills: experienced PostDoc researchers, PhD students, software developers and business experts, who can be dynamically involved in the project.

Weaknesses
• Limited resources compared to commercial entities (e.g. commercial publishers).
• Community and industrial buy-in just starting to develop.
• The transition/digitisation in scholarly communication still requires more advocacy and policy backing.
• Initially limited possibilities for automation using AI and machine learning due to the lack of training data.

3c. Plan of the Proof of Concept -Action description
Objectives: The overall objectives of the ORKG project are to:
• Mature the existing ORKG service prototype, establish interoperability with publishing platforms, prototype services for research result exploitation and devise possible business models.
• Integrate support for persistent identifiers and semantic descriptions for scientific sensors and instruments and evaluate the integration with concrete research infrastructures and vendors.
• Enable FAIR semantic descriptions and the automatic generation of state-of-the-art (SOTA) survey and review publications from the ORKG infrastructure.

Description of work:
Table 2 summarises the tasks and corresponding resources planned in the three work packages.

Allocation of resources:
The lump sum will primarily be used to fund the personnel resources of the team. Some further minor cost items, such as travel or minor equipment expenses, will be financed by TIB directly.

WP1 ORKG Service Maturation and Business Model Development
The goal of this work package is to mature the ORKG service by integrating two functions particularly important for the exploitation of the results: 1) the establishment of interoperability interfaces with the traditional journal and proceedings publishing platforms of commercial publishers and 2) the prototyping of services for research exploitation and transfer analytics, based on the current ORKG knowledge graph infrastructure. Finally, we will work on business development by outlining commercial offering options with corresponding market and pricing analyses.

T1.1 Interoperability with traditional scholarly publishing platforms
Traditional commercial scientific publishing platforms organise the submission, peer-review and publication process of scientific articles (e.g. platforms such as Clarivate's ScholarOne Manuscripts). Each of these three steps is highly relevant for integration with the ORKG:
1. In the submission process, authors can be encouraged to create an ORKG representation of their key contributions, thus facilitating the comparability of the state-of-the-art.
2. Peer-reviewers can subsequently use such comparisons, visualisations and further aggregated views to assess the merits of the scientific contribution.
3. After publication, the semantic representation in the ORKG, along with additional comparisons, explorations and visualisations, will provide further context and insights to readers of the article.
We will provide a REST API integration interface through which small user interface widgets can be embedded with minimal effort into the respective publishing management systems.
Result: Integration interface for embedding UI widgets directly into publishing management systems.
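As a minimal sketch of how a publishing platform could consume such an integration interface, the following snippet builds an embeddable widget from an article DOI. The endpoint URL, path and query parameters are hypothetical placeholders, not the actual ORKG API.

```python
from urllib.parse import quote

# Hypothetical widget endpoint; the real ORKG API paths may differ.
ORKG_WIDGET_BASE = "https://orkg.org/widget"

def build_widget_embed(doi: str, widget_type: str = "comparison") -> str:
    """Return an HTML iframe snippet that a publishing management system
    could embed next to a submitted or published article (illustrative only)."""
    # Percent-encode the DOI so the slash survives as a query parameter
    src = f"{ORKG_WIDGET_BASE}/{widget_type}?doi={quote(doi, safe='')}"
    return f'<iframe src="{src}" width="600" height="400"></iframe>'

print(build_widget_embed("10.1234/example.5678"))
```

A platform would render this snippet on the submission, review or article page, so that reviewers and readers see the ORKG comparison in context.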

T1.2 Services for research exploitation and transfer analytics
Based on the structured semantic representations in the ORKG, completely new analytical services for the exploitation of research results become possible. In this task, we will prototype such services, which can be a key pillar for commercial exploitation via an attractive service for the research, innovation and product development departments in enterprises. For example, for a particular research problem, the most promising approaches addressing this problem with regard to certain framework conditions can be identified. In addition, the impact and consequences of following particular approaches can be compared and analysed.
Result: Prototypical research exploitation and transfer analytics services.

T1.3 Business Model Development
In this task, we will develop a portfolio of possible business models based on the ORKG services developed in this PoC project. For each of the possible service offerings, we will analyse the competition, market, competitive advantage, customer profiles and pricing options along the business model canvas paradigm. We will also compile a list of possible options for further funding and investment to advance the ORKG service to the next commercialisation and exploitation level. Aspects such as impact assessment, exploitation, sustainability roadmap and implementation will be appropriately researched and implemented, thus ensuring the successful and sustainable uptake of the project's output.

Result:
Prioritised list of business model options organised along the business model canvas paradigm.

WP 2 Persistent Identifiers for Scientific Sensors and Instruments
Instruments play an essential role in creating research data. Given the importance of instruments and associated metadata for the assessment of data quality and for data reuse, globally unique, persistent and resolvable identification of instruments is crucial. The Research Data Alliance Working Group Persistent Identification of Instruments (PIDINST), chaired by Dr. Markus Stocker, developed a community-driven solution for the persistent identification of instruments. Based on an analysis of 10 use cases, PIDINST developed a metadata schema and prototyped its implementation with DataCite and ePIC as representative persistent identifier infrastructures, and with Helmholtz-Zentrum Berlin für Materialien und Energie (HZB) and the British Oceanographic Data Centre (BODC) as representative institutional instrument providers.
In this work package, we plan to implement and integrate the concept for the persistent identification and semantic description of sensors and instruments into the ORKG service infrastructure, thus greatly facilitating the reproducibility and reusability of research results.

T2.1 Integration of persistent identification and description of sensors and instruments into the ORKG
In this task, we will integrate key functionality for the persistent identification and semantic description of scientific instruments into the ORKG infrastructure. This will involve the integration of the PIDINST metadata schema, the creation and alignment of identifiers, the management of revisions, provenance tracking and the integration of interfaces for automatic import and alignment with vendor-supplied instrument and equipment descriptions. For the latter, we envision a JSON-LD REST interface, which will enable vendors to directly represent and upload their descriptions according to the PIDINST schema.
Result: Comprehensive representation and integration of scientific instrumentation in the ORKG.
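To illustrate the envisioned JSON-LD upload interface, the following sketch constructs the request body a vendor client could send. The @context URL and property names are illustrative placeholders rather than the published PIDINST schema, and the vendor URL is invented.

```python
import json

# Hypothetical JSON-LD payload an instrument vendor could POST to the
# envisioned ORKG interface; @context and property names are illustrative
# placeholders, not the published PIDINST schema.
payload = {
    "@context": "https://example.org/pidinst-context.jsonld",
    "@type": "Instrument",
    "name": "Open-path gas analyzer, model ABC-100",
    "manufacturer": "Example Instruments Inc.",
    "landingPage": "https://vendor.example.org/instruments/abc-100",
    "instrumentType": "gas analyzer",
}

# Serialise as the request body a vendor client would send over HTTP.
body = json.dumps(payload, indent=2)
print(body)
```

On the server side, such a payload would be validated against the PIDINST metadata schema before an identifier is minted and the description is aligned with existing ORKG resources.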

T2.2 Evaluation with concrete research infrastructure providers and equipment vendors
In this task, we will work with concrete research infrastructure providers and equipment vendors on testing and evaluating the integration developed in T2.1 and on creating demonstrations and showcases for attracting further research infrastructure providers and scientific instrumentation vendors. We have already identified a shortlist of infrastructures, such as ICOS, Leibniz DSMZ or the virology labs at TWINCORE and Hanover Medical School (MHH). Concerning instrument vendors, we have close ties to important players in the market, such as LI-COR Biosciences, Zeiss and Leica. In addition, we plan to reach out to Thermo Fisher Scientific, Shimadzu, Roche Diagnostics, Agilent Technologies and Danaher to scale the number of showcases and integrations.
Result: Comprehensive portfolio of research infrastructure and scientific instrument showcase integrations.

WP 3 FAIR Semantic Descriptions of Research Questions, Contributions and SOTA Surveys
The goal of this WP is to organise scholarly communication in a structured knowledge graph-based manner. We will, thus, go beyond static PDF publications and make research problems, approaches, algorithms, implementations and evaluations FAIR and first-class citizens of the scholarly discourse.
Science typically involves the definition of research problems or questions and corresponding research approaches that contribute to solving them. Examples of research problems or questions are Named Entity Recognition, Question Answering, Machine Translation, Image Recognition and Data Clustering. Contributions addressing these problems typically follow a particular approach and are evaluated using some benchmark dataset. Currently, all this information is deeply hidden in unstructured articles, often published as PDFs. In this work package, we will make research problems, questions, contributions and their descriptions first-class citizens of scholarly communication in Data Science. We will build on the already established Open Research Knowledge Graph (ORKG) platform (https://www.orkg.org) and expand it in three yearly iterations with crucial functionality for data science and AI research. Subsequently, we will further evaluate the platform, broaden its applications and scale its use.

Task 3.1 Development of templates for semantic descriptions of science contributions
In this task, we will develop a comprehensive library of semantic templates for research question and contribution descriptions. The templates will be represented in a formal way (e.g. according to the W3C SHACL standard) and, thus, facilitate interoperability between various services. In particular, we will demonstrate the applicability of the templates with the Open Research Knowledge Graph, which provides an environment for authoring, organising and curating semantic research question and contribution descriptions. We will also integrate techniques to automatically extract and represent information from articles according to the templates.
Result: Library of semantic templates for research question and contribution descriptions.
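As an illustration of such a formal template, the following minimal SHACL shape (written in Turtle and embedded here as a string) constrains a contribution description. The class and property names are illustrative assumptions, not the actual ORKG template vocabulary.

```python
# A minimal SHACL shape sketching what a contribution template could look
# like; ex: class and property names are invented for illustration only.
contribution_shape = """
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <https://example.org/orkg-template#> .

ex:ContributionShape
    a sh:NodeShape ;
    sh:targetClass ex:Contribution ;
    # Every contribution must address at least one research problem
    sh:property [
        sh:path ex:addressesResearchProblem ;
        sh:minCount 1 ;
    ] ;
    # The benchmark used for evaluation, if given, must be a string
    sh:property [
        sh:path ex:evaluatedOnBenchmark ;
        sh:datatype xsd:string ;
    ] .
"""
print(contribution_shape)
```

A SHACL validator would check each authored contribution description against such shapes, which is what makes the templates interoperable across services.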

Task 3.2 SOTA Comparisons and Leaderboards
We will use the semantic descriptions of data science and AI approaches and publications to generate comparative overviews and leaderboards of the approaches addressing a particular research question or problem. The generation of such SOTA overviews will be highly automated, but users will be able to configure and fine-tune it. We will integrate functionality to publish such comparative overviews (using DOIs) and to integrate and link them directly from traditional publications (e.g. via LaTeX/BibTeX or Word export). Leaderboards will give a comprehensive overview of the evolution of the SOTA over time with regard to concrete performance indicators (e.g. precision/recall) attained on community-defined benchmarks.
Result: Automatic comparison and leaderboard generation with a focus on the SOTA evolution.
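The leaderboard generation described above can be sketched over structured contribution records; the records, approach names and scores below are invented for illustration, and a real implementation would query them from the knowledge graph.

```python
# Sketch: derive a simple leaderboard from structured contribution records
# as they might be retrieved from the ORKG; all values are illustrative.
contributions = [
    {"paper": "A", "approach": "BiLSTM-CRF", "f1": 0.91, "year": 2018},
    {"paper": "B", "approach": "BERT-tagger", "f1": 0.93, "year": 2019},
    {"paper": "C", "approach": "Rule-based", "f1": 0.78, "year": 2015},
]

# Rank by the chosen performance indicator (here F1), best first
leaderboard = sorted(contributions, key=lambda c: c["f1"], reverse=True)

for rank, c in enumerate(leaderboard, start=1):
    print(f"{rank}. {c['approach']} (F1={c['f1']:.2f}, {c['year']})")
```

Because each record carries a year, the same data also yields the SOTA-over-time view that the leaderboards are meant to expose.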

Task 3.3 Authoring environment for cognitive knowledge-graph-based surveys and reviews
In this task, we will integrate the service elements and functionalities developed in the other tasks of this work package into a comprehensive environment for creating structured SOTA survey articles for specific Data Science and AI research questions. The structured elements will comprise a motivation of the research problem, its definition, a classification taxonomy, qualitative (functional) and quantitative characterisations of approaches, as well as problem-specific visualisations and leaderboards. The survey article will be compiled automatically and directly from the structured semantic knowledge graph representations, but represented as a self-contained article publishable as a Web resource (or PDF). We will assign DOIs and enable the publication of these surveys in traditional publication outlets, such as journals and OA repositories.

Result:
Publishing environment for structured surveys and reviews with integration with traditional publishing outlets.