A virtual “Werkstatt” for digitization in the sciences

Data is central in almost all scientific disciplines nowadays. Furthermore, intelligent systems have developed rapidly in recent years, so that in many disciplines the expectation is emerging that with the help of intelligent systems, significant challenges can be overcome and science can be done in completely new ways. In order for this to succeed, however, first, fundamental research in computer science is still required, and, second, generic tools must be developed on which specialized solutions can be built. In this paper, we introduce a recently started collaborative project funded by the Carl Zeiss Foundation, a virtual manufactory for digitization in the sciences, the “Werkstatt”, which is being established at the Michael Stifel Center Jena (MSCJ) for data-driven and simulation science to address fundamental questions in computer science and applications. The Werkstatt focuses on three key areas, which include generic tools for machine learning, knowledge generation using machine learning processes, and semantic methods for the data life cycle, as well as the application of these topics in different disciplines. Core and pilot projects address the key aspects of the topics and form the basis for sustainable work in the Werkstatt.

organizations, and backgrounds come together.Thus, an ideal framework has been created, within which all steps from the promotion of young scientists to the joint processing of challenging, fundamental research projects will take place.Besides, the "Werkstatt" will also be involved in other activities for the promotion of young talents and networking.To implement this strategy, among other things, annual retreats, joint teaching, and learning activities are planned.

Objectives
The "Werkstatt" addresses the following three main research topics in computer science (T1-T3) as well as the application fields (T4):

T1. Generic tools for machine learning
An increasingly important topic within machine learning is its automation (AutoML).First steps are the preparation of the data, e.g., by transformation into suitable data formats, the treatment of missing values, or the correction of erroneous data entries.Moreover, most algorithms are highly parameterized and require a good choice of parameter values, e.g., the architecture of a deep neural network.The algorithm must be executed and its results analyzed, with the question of choosing appropriate metrics and measures of success.The analysis may require additional data to be collected later.These diverse, manual optimization steps usually require methodological expertise.Therefore, one goal is to adopt novel methods of AutoML design by using a systematic search.Another challenge for the successful use of machine learning is when the question to be answered by the data is not fixed a priori.One way to deal with this is to learn a flexible, generic model that a human analyst can examine with a suitable visual interface.For this purpose, due to their generality, multivariate probabilistic models are very suitable.A third important area for the development of generic tools for machine learning is the provision of automatic differentiation methods.Frequently, derivatives of an objective function to be optimized are required during learning.Automatic differentiation enables the automated generation of computer programs that calculate first or higher order derivatives of such target functions.However, typical target functions in machine learning have special structures that are not used in standard automatic differentiation methods.A goal that should, therefore, be pursued in this area is, in general, techniques of automatically distinguishing specific structures from the machine to adapt to learning.Such targeted program transformation could result in programs that require significantly less space or whose execution time is significantly shorter than general automatic differentiation techniques.

T2. Knowledge generation using machine learning processes
Machine learning methods, in particular supervised learning methods, have been able to learn input/output relationships from data for years and thus solve problems such as classification or regression tasks with high accuracy.In the field of image and scene analysis, the achieved accuracy reaches or surpasses the performance of humans (Taigman et al. 2014).
With all the advances in generating the correct input/output relationship, challenging issues still exist in data-driven solutions.Scientists who want to generate knowledge from data, would like to make a model refinement via data-model integration and if necessary, switch between hypothesis-and data-driven procedures.Despite the tremendous power of deep neural networks, the following questions are still open and need extensive research efforts: a) Can the machine learning process be understood and reconstructed?b) Is it possible to detect cause-effect relationships by data-driven methods and thus to identify causal relationships for example in multivariate time series?c) How to integrate knowledge about, e.g., physical laws, with a data-driven approach to improve the approximation accuracy of the data estimated model and at the same time reduce the effort required for training (data collection, data annotation, the number of learning steps, etc.)? d) Can a large amount of annotated training data, which are still required for many approaches, be reduced in order to open up new fields of applications, in which such data collection is currently too expensive or completely impossible?In this Werkstatt, we aim to add new contributions to the line of tackling these issues.The main focus will be on the development of unsupervised learning methods as well as data-driven solutions for causal inference in dynamical systems.

T3. Semantic methods for the data life cycle
The added value of digitizing the research processes arises not only when the individual steps are improved by intelligent systems, but also when integrated support exists throughout the entire scientific (or data) life cycle.Semantic methods offer promising approaches here.An essential component of digitization of research processes is a precise and machine-understandable description of the experimental data, which helps in the retrieval and re-use of data, the reproducibility of the results, and more generally, scientific progress.However, scientists are reluctant to provide the descriptions because this requires high effort and time.One goal of this topic is to investigate how experiment data can be semi-automatically annotated with semantic metadata with the support of artificial intelligence.Another aspect is the role of data curation and data quality control in this environment.The quality of the research results depends heavily on the quality of the data.However, in the face of a rapidly growing amount of data, the major dilemma is that a high curation level for all the data is not realistic and not cost-effective.In this Werkstatt, we aim to develop novel methods for determining the data value and integrate the methods for the control of curation processes into systems.The main focus will be on the construction and maintenance of knowledge graphs.

T4. Applications
The methods developed in T1-T3 are intended to be used, evaluated and optimized directly in fields of application and enable discipline-specific breakthroughs.The development of fundamentals in machine learning and data management in the sciences can only be successful in close cooperation between computer science and the respective disciplines.

Impact
Digitization and data-driven sciences have very high priority in the strategic planning of the FSU Jena.Research on intelligent systems is of importance for all three research foci, called "profile lines'" of the FSU "Light, Life, Liberty".As early as 2015, the MSCJ was established as a cross-sectional structure.The MSCJ consists of scientists from seven faculties and three non-university research institutes in Jena (Max Planck Institute for Biogeochemistry, Max Planck Institute for Human History, DLR Institute for Data Science).This project aims to contribute to the strengthening of this structure through the development of the Werkstatt, which will facilitate cooperation both between different areas within computer science and specialist sciences, across organizations and across all career stages.
The structural goal of the project is the networking and expansion of the relevant expertise, the establishment of know-how on intelligent systems in as many specialist disciplines as possible, and the establishment of stable cooperations, thereby opening up the possibility of initiating further third-party funded projects.

Implementation
The Werkstatt comprises three core projects and ten pilot projects based on the objective areas.It also comprises activities for international networking and local communications as well as promotion of young talents.In each of the three subject areas, a five years core project is being conducted, each with one postdoctoral researcher.The three core projects shall ensure methodological continuity and also support the pilot projects with expertise and tools.The pilot projects run for three years; each is equipped with one PhD student.There will be a second, somewhat smaller set of pilot projects starting in year three of the Werkstatt.Table 1 provides an overview of the core projects (C1-C3) and the current pilot projects (P1-P10) and their coverage of the four themes.The core projects are: Core Project C1: Automatic machine learning

C1: Automatic machine learning
In order to use today's powerful machine learning systems even without detailed machine learning knowledge, e.g., the meaning and effects of the involved hyperparameters, tools for automating the entire process of data analysis and machine learning are necessary.At the very least, one would like to assist a human analyst with the help of machine tools to process the individual steps.The aim of this core project is the further development of Table 1.
Project overview and coverage of the four main themes: T1.Generic tools for machine learning, T2.Knowledge generation using machine learning processes, T3.Semantic methods for the data life cycle, and T4.Applications (black/empty square indicates primary/secondary contribution).
methods from the area of AutoML.AutoML is looking for automatic or semi-automatic methods for (1.) data preparation and transformation, (2.) selection of suitable analysis algorithms, (3.) execution of the algorithm, (4.) determination of good hyperparameters of the machine learning algorithms, (5.) selection of measures of success, (6.) data collection of missing data, and (7.) report generation.

C2: Generative models
Deep learning methods achieve their remarkable performance from a very large amount of annotated training data.However, in many applications in the sciences, e.g. in medicine, the annotation by the user/expert is very expensive or completely impossible.This core project therefore deals with modern, unsupervised learning methods and their application in the sciences.Specifically, methods for learning generative models, such as e.g.Generative Adversarial Networks (Goodfellow et al. 2014), are examined and expanded so that, for example, unsupervised methods for anomaly detection can be developed.In addition, such models might also be used for data augmentation, i. e. to generate additional training data, or be the basis for estimating and subsequently analysing the underlying data distribution.This core project supports two pilot projects, P2: Detection of causal relationships through deep learning, and P3: Data-driven virus diagnostics at multiple levels I (Methods).

C3: Integrated provenance management
One of the essential requirements of scientific experiments is that they are reproducible.
For the reproducibility of results, it is important that the experiments are described in a structured way which contains information about the experimental setup and procedure.In recent years, a large number of domain ontologies have been created in numerous application disciplines, which formally model sections of the respective field (see, for example, the NCBI BioPortal with almost 700 ontologies).The expectation is that with the help of these ontologies, finding and linking data can be significantly improved (Walls et al. 2014, Klan et al. 2017).Analogous to the development of domain ontologies for the formal description of data, different approaches to the formal process description have emerged.This includes the provenance ontology PROV-O (Lebo et al. 2013).An extension, REPRODUCE-ME ontology (Samuel et al. 2018) is intended to describe end-to-end provenance of scientific experiments.Independent of these approaches for modeling scientific workflows, first suggestions were published on how such a description for experimental steps using machine learning could look like (Schelter et al. 2018).However, there is no integration of the descriptions of different types of experiment steps, as well as procedures for the further automation of provenance recording and for the use of these descriptions.This gap should be addressed here.The goal of this project is to develop an integrated semi-automatic approach for the management of provenance with human and machine-understandable descriptions which can be used for different purposes.
The tools developed in these three core projects will be applied in the following ten pilot projects:

P1: Probabilistic modeling
In this project, new generic algorithms for learning multivariate probabilistic models are to be developed.To this end, the convex programming approach of Chandrasekaran et al. (2012a) and Chandrasekaran et al. (2012b) is to be extended to mixed distributions.On the one hand, it should be examined whether the theoretical consistency analysis of the optimization approach also applies in this more general framework, on the other hand, this approach should be implemented.A scalable implementation is a non-trivial endeavor since the modeling will likely include group norms above the cone of positive semi-definite matrices, for which there is no standard software.In addition, for an effective implementation an appropriate search strategy based on the regularization parameters of the problem is required.For this purpose, methods from vector optimization are to be developed further.The possible applications of the learned, mixed distributions are diverse.They range from the direct modeling of data from mixed feature spaces to the support of an explorative, interactive analysis of multivariate data sets.

P2: Detection of causal relationships using deep learning
Understanding causal effects in dynamic systems plays an important role in numerous disciplines such as economy (Granger 1969), neuroscience (Seth et al. 2015), climate and ecosystems (Runge et al. 2019, Shadaydeh et al. 2019, Trifunov et al. 2019) among many others.The inference of unknown causal effect relationships between the variables of a complex dynamical system from large amounts of data allows for gaining further knowledge on the system's behaviour.Machine learning methods often attempt to model joint distributions of sets of variables, with the objective of estimating the likely values of certain variables given observations of others.However, modelling joint distributions is not sufficient to make inferences about the values of some variables when other variables are manipulated.The later problem requires an understanding of the cause-effect relationships between the variables (Peters et al. 2017, Pearl 2009).The aim of this project is to investigate to what extent causal relationships in multivariate nonlinear dynamical systems can be derived from learned, deep networks.This will also allow for better insights into the functioning of deep networks and integrates causal relationships in a constructive way.The developed methods will be used and implemented in important and generic tools of the Werkstatt.

P3: Data-driven virus diagnostics at multiple levels I (Methods)
This project aims to develop new data analysis and learning methods to improve the diagnosis of known and unknown viruses and their interaction with other pathogens.We start from three complex data streams: sequences (genome and transcriptome of virus and host), images (virus, through TERS technology newly developed in Jena), and molecular spectra (mass spectra of virus and host), which are always cheaper and available in large quantities.We will develop new methods of machine learning and stochastic inference for the metabolomics of viral infections (molecular level).The development is based on our methods (SIRIUS, CSI: FingerID) for gas chromatography and electron ionization.We will develop learning and classification methods that use knowledge of the biological system (virus/ host) in the form of dynamic reaction networks (network level).The learning should take place both by adapting kinetic parameters and by adapting (learning) the network structure through a new form of genetic programming (CMAES-GEGP).At the molecular level and network level, knowledge is generated in the form of learned models, which should then be used for diagnostics.

P4: Data value oriented curation processes
The creation and continuous update of large knowledge bases (mainly as knowledge graphs or factual databases) is a common scenario in various disciplines, like Digital Humanities (Haslhofer et al. 2018, Maicher et al. 2009), Citizen Science (Bonney et al. 2014), biodiversity research (Nadrowski et al. 2012) or even economics and managerial sciences (Dalle et al. 2017).In most cases, these knowledge graphs are a curated collection of observational data.Usually, the creation of the knowledge graphs is based on various heterogeneous data sources, integrated and augmented through automatic and semi-automatic approaches, usually combined with manual inputs, curations and amendments.One of the main challenges is the continuous update of the knowledge graphs, irrespective of whether one dozen facts are updated a day, or hundreds of thousands.Data quality aspects like incompleteness, inconsistency, and scalability of curation are central issues hereby (Lukyanenko et al. 2016).Existing rigorous data quality assurance approaches are limited in scenarios where contributions to the knowledge graph are more accidental, fragmented, and voluntary, as it can be expected in citizen science or other crowdsourcing scenarios.Within the Werkstatt, we will experiment with a new approach for data quality assurance: data-value oriented curation processes.Our approach assumes and accepts that the knowledge graph is always imperfect.But defines the requirements on the knowledge graph, to be fulfilled for realizing the functionality (hence: the value) based on this data.In a second step, we implement a system, which continuously observes the level of fulfillment of these data quality requirements (Prilop 2014).In a third step, these measurements on the fulfillment of the requirements automatically inform a workflow system, which guides and priorities the work of all authors and curators of the knowledge graph.By applying this approach, we expect with the given human resources the highest value for research or applications can be extracted from continuously updated knowledge graphs.

P5: Learning of data annotations
In many scientific disciplines, sharing data and reusing it in different contexts becomes more and more important.In order for such re-use to be truly successful, a good, ideally machine-readable description of the data is essential.However, today, creating such descriptions requires considerable manual effort.This project aims to utilize and extend machine learning techniques in creating the semantic annotations for existing datasets thus considerably reducing effort and enabling the reuse of untapped resources.Different data sources from the Biodiversity domain will be processed, with the ultimate goal to expose them as a semantic knowledge graph.To build this graph, we aim to combine different sources of information about any given dataset including the (tabular) dataset itself, existing metadata and abstracts or full texts of associated publications.

P6: Deep learning for data analysis in gravitational wave astronomy
The so-called Phenom-models (Ajith et al. 2011), one of the two standard methods for the analysis of the first gravitational wave signal (Abbott and et al. 2016), were established by hand.In this project, these models will be significantly improved by self-learning algorithms.Essential for methods of this sort are analytical and numerical gravitational wave templates, being computed in Jena for different sources such as black holes and neutron stars.
The goal is to combine the current wave templates for concrete physical problems with new machine learning methods for data analysis.Since this field of research has only just started to attract international attention (George andHuerta 2018, Shen et al. 2017), the integration into the large framework of the MSCJ, involving both physicists and computer scientists, will play an important and fruitful role.

P7: Combined analysis of image data of head and neck cancer
Large multimodal image datasets accumulate before, during, and after therapy of head and neck cancer.The data mainly arise from endoscopic and microscopic images gained during surgery, histopathological images from tissue samples, proteomic data from MALDI imaging, as well as pre-therapeutical and post-therapeutical radiological imaging data used for tumor staging and therapy control (computed tomography, positron emission tomography, magnetic resonance imaging).So far, these different image data sources are not linked to each other, although it has been shown that the combined analysis of such data allows a much better prediction of tumor control and outcome (von Eggeling et al. 2012, Bogowicz et al. 2017b, Bogowicz et al. 2017a, Oetjen et al. 2013).Aim of pilot project P7 therefore is the aggregation of all data collected during the patient's therapy on a common platform.The data will be analyzed with machine learning tools.The aim is the breakthrough for the development of more precise diagnostics and better therapy for head and neck cancer.

P8: Use and reuse of MRI data in biomedical research environment
Assessing the data quality of biomedical imaging data, and particularly MR imaging data, is an essential prerequisite for the preparation and application of big data methods in clinical science.Conventional approaches for evaluating the data quality of magnetic resonance imaging (MRI) data are largely limited to certain sub-aspects and are often contrastspecific (Vogelbacher et al. 2019).Quantitative MR contrasts, however, such as, e.g., quantitative susceptibility mapping (QSM) (Reichenbach et al. 2015) or quantitative diffusion imaging (Güllmar et al. 2017), require careful quality assessment before mapping.Holistic, contrast-independent procedures are currently missing.Against this background, the objectives of this project are as follows: (1) Identification of high quality, artifact-free data sets using image-based approaches; (2) Identification of potential bias factors in the datasets; (3) Consideration and/or correction of these factors; (4) Implementation of socalled image-derived phenotypes (IDPs).A future main area of application of such data could be the assessment of disease time courses based on previous knowledge, whereby the retrospective comparison data should be identified by using the IDPs.

P9: Development, digitization and establishment of sensor-based phenological observations
Monitoring biodiversity is one of the major tasks in the light of global change and was identified as such in the coalition agreement of the German government.Changes in phenology which are also referred to as "fingerprint of climate change" (Menzel and Fabian 1999, Parmesan and Yohe 2003, Root et al. 2003) lead to changes in biodiversity (Miller-Rushing and Primack 2008) and impact ecosystem services (e.g.provisioning of flowers for pollinators) on a landscape scale (Senapathi et al. 2015).In order to mitigate effects of climate change, an objective monitoring of ecosystems is important.
The development of digitized, species-specific monitoring of phenological data in different ecosystems using sensors linked to automated image-analysis can be seen as a next step to objectively collect extensive data on plant phenology (BigData).This is not only important in the light of global change, but also offers the possibility of species-specific, automated monitoring in areas such as remote sensing and applied ecology.

P10: Data-driven virus diagnostics at multiple levels II (application)
In this project, modern analysis and learning methods are applied to achieve a breakthrough in virus diagnostics.It forms a unit with P3 (methods).On the experimental side, a fluidic chip system (artificial blood vessel) is used to infect host cells with influenza A viruses and secondarily with a pathogenic bacterium (MRSA) in a controlled manner.Furthermore, patient samples (urine, sputum, blood, BAL, CSF) from the University Hospital Jena are available and we have access to the CAPNETZ biobank with about 8000 serum samples.In the AG Marz new methods of real-time sequencing will be developed and applied (sequence level).
Real-time detection will be achieved by the MinION system (Oxford Nanopore Technologies) recently established in Jena.The system has the dimensions of a USB stick and can also be used directly on the patient.The sequencing of host DNA/RNA will be automatically stopped in real-time and another DNA/RNA fragment will be sequenced instead.The sequencing platform has already been established and tested.The tipenhanced Raman scattering (TERS) will be used for pathogen detection in complex samples (image plane).The group in Jena has already successfully detected viruses using TERS under defined conditions.Here, protocols are being developed that allow the detection of pathogens in complex patient samples.A combination of dark-field and AFM methods ensures a fast pre-characterization of the sample, which simplifies the actual TERS characterization.As a result, we expect that the new sequence level (P10), image level (P10), molecular level (P3) and network level (P3) methods will contribute to a significant improvement in individualized real-time diagnostics.

Figure 1 .
Figure 1.Integration of the "Werkstatt" in the Friedrich Schiller University and the research center of Jena.