Research Ideas and Outcomes :
Research Article
Corresponding author: Samuel Bentum (sbentum@indiana.edu)
Academic editor: Editorial Secretary
Received: 19 Apr 2023 | Accepted: 20 Jun 2023 | Published: 09 Aug 2023
© 2023 Samuel Bentum, David Wild
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Bentum S, Wild DJ (2023) Digital transformation strategies for applied science domains. Research Ideas and Outcomes 9: e105197. https://doi.org/10.3897/rio.9.e105197
The hallmark of a digitally minded organisation today is seen in its rapid advancement, globalisation, innovation and resilience to change. Companies that wish to thrive must be prepared to adapt to the new digital reality. Being digitally minded does not simply mean implementing new technology, investing in tools and upgrading current systems. These stages are critical, but they are not the entire picture. If a company wants to remain competitive, it must not only adapt to changes, but also anticipate and drive innovation. Companies must plan ahead and be proactive architects of their future in order to achieve this vision. This is where a digital transformation strategy is crucial. A digital transformation strategy assists organisational leadership in addressing questions about their business, such as the present level of digitisation and a digital maturity roadmap. Although diverse data capturing technologies and data-generating assets exist, material/chemical science domains, such as R&D and Manufacturing groups, struggle to harness the full power of their data. A typical industry will have significant data sources generating large amounts of data stored in siloed databases with minimal to non-existent cross-talk. As a result, researchers can perform a deep dive into one set of data, but cannot combine datasets or exploit the interdependencies and relationships amongst them. This paper seeks to define, distinguish, aggregate and propose an integrative approach to utilising the various types of disparate data sources commonly encountered by researchers in material science research. The main focus here is defining strategies to harness insights across integrated data to aid efficient research in R&D organisations as these industries seek to embrace the power of digital transformation.
Although the principles described here relate to industries in the applied science domain, the general strategies proposed can be applied to other industries on a case-by-case basis.
digital transformation, digitisation strategy, applied science, data management, systems integration, KNIME
Digital transformation is occurring across organisations, from the pharmaceutical industry to the service industry, bringing such benefits as better decision-making and faster processing as information is shared instantly. A decade ago, prospects for data-driven progress were hardly imaginable, but now they are becoming more and more possible because of falling computation costs and greater accessibility of cloud-based analytics. Like the majority of large-scale industrial facilities, chemical plants create enormous volumes of data, much of which has the potential to be used to improve efficiency and raise yields. For instance, according to recent research, chemical producers might boost their return on investment by as much as 5% by merely digitising their product processes (
Curbing the above challenges requires an integrative data management approach which allows researchers to accurately analyse and understand their results as they try to optimise the properties of a new material. If simulation techniques or machine-learning are being used to reduce the number of experiments needed, then integrative data management becomes even more important, as algorithms need structured information to work.
According to the MGI programme, developing new materials for next-generation applications can be an arduous, time-consuming and very expensive process. A typical development cycle spans 5 to 10 years (
A fully-integrated stream of R&D data sources enables researchers to ask questions that can accelerate their product development. A typical objective of a research group might be exploration of virtual product development geared towards faster material development and better understanding of certain composition and performance yields. Another area of interest resides in the in-silico synthesis of materials or chemicals, whereby, in new product or process development, researchers are able to determine upfront which process, composition or catalyst is needed to produce products with the desired properties and yield. In most cases, the data available do not cover the entire research life cycle (process, chemical structure, property and performance). Hence, it is essential to include advanced calculation constraints when attempting to optimise inputs and outputs in a predictive synthesis model. An example of such a case would be studying structure-property relationships in-silico, where structure is defined as an input as opposed to process data, which may not be complete or easily accessible. Full utilisation of advanced analytics and modelling tools requires transparent data access for researchers working in this industry.
COVID-19 changed R&D work across many scientific areas. Labs were compelled to handle operations remotely and scientists relied on digital technologies to keep vital research moving forward. Some jobs were amenable to conversion to the remote world. Instead of meeting in rooms, researchers collaborated through Zoom®. They planned work schedules and carried out data analysis from their homes. Tasks that needed physical presence in the facilities to prepare and load samples, as well as study analogue data in real time, were much more complex. Researchers struggled to ensure that the right process parameters and reagents were fed into their synthesis tools remotely. One of the greatest challenges faced by scientists in the chemical industry was remote, real-time access to data generated in their labs. This issue was magnified during the pandemic, though it has always been a challenge for non-digitalised R&D organisations. The pandemic’s work-from-home rules in 2020 and 2021 gave many companies the chance to re-evaluate the amount of technological debt they are carrying with legacy informatics systems and how outdated methodologies hinder the digitisation of their business. This has enabled organisational leaders (including those in the chemical sciences domain) to speed up comprehensive modernisation plans for their research laboratories and process operations.
The commercialisation of new products and chemistries is a very complex and resource-intensive process. The journey from conceptualisation of research ideas to final product delivery to customers is expensive and filled with multiple iterations of development failures and small successes. Take, for example, the process required to come up with a new material for the industry as depicted in the chart in Fig.
As discussed in the previous section, the process to fully develop and commercialise a new product on the market requires extensive time commitment and resources. From Fig.
In an effort to truly harness the power of the huge trove of data generated in today’s research labs, organisations require a digital strategy covering data management and lab connectivity protocols. To be clear, implementing a digital strategy requires more than merely upgrading the technology in laboratories; rather, the entire corporation must be considered. Such an initiative re-imagines how people work and engage with one another, focusing as much on people as on technology, if not more. The key to a successful transformation is a deliberate focus on talent and skill, which enables employees to integrate their scientific duties with new technology, which, in turn, connects the many laboratories to one another across the company. However, it is crucial to pay great attention to each individual lab type and the difficulties particular to the people and procedures in each environment. Although each type of lab is distinct and has its own special skills, procedures and technology, they are all interrelated and mutually beneficial. As a result, all laboratories should be evaluated and considered as equally capable of creating increased value.
Gaining access to all of your data sources is the first step in doing end-to-end data science. The process of gathering and shaping data from any source within an organisation is the core definition of integration. There are a number of unique technologies in the market now that address the different types and levels of integration (
The data entry and management of an organisation’s research mark the dawn of its digital transformation journey, which shows how far chemical and material science domains must go to attain AI-ready datasets. A fundamental step for any industry on this journey is to ensure the availability of its data in suitable formats logged electronically into a database system. However, as previously shown, chemical and material science R&D is behind on data connectivity; even now, electronic lab notebooks have not been broadly adopted. The daunting task of overcoming data silos has not gained much traction either. The good news is that a number of solutions have come onto the market (
Over the last few years, data have become what is known as the next “oil” for organisations (
Machine-learning algorithms can quickly comprehend structured data, which is often classified as quantitative data. Structured query language (SQL) is one of the most widely-used computer languages for managing structured data (
However, unstructured data, often known as qualitative data, cannot be handled or evaluated using standard data tools and procedures. Unstructured data are best maintained in non-relational (NoSQL) databases since they lack a specified data model. Another option for managing unstructured data is to store them in raw form in data lakes.
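The contrast between the two kinds of data can be illustrated with a minimal Python sketch. It is not taken from the original study: the table name `test_results`, the sample identifiers and the note texts are all hypothetical. Structured measurements sit in a fixed-schema table and are summarised with one SQL aggregate, whereas free-text notes, held here as schemaless JSON documents of the kind a document store or data lake might contain, can only be scanned.

```python
import json
import sqlite3


def mean_property(conn, prop):
    """Average a structured measurement with a plain SQL aggregate."""
    row = conn.execute(
        "SELECT AVG(value) FROM test_results WHERE property = ?", (prop,)
    ).fetchone()
    return row[0]


def notes_mentioning(documents, keyword):
    """Scan free-text notes; there is no fixed schema to query against."""
    keyword = keyword.lower()
    return [
        doc["sample_id"]
        for doc in documents
        if keyword in doc.get("note", "").lower()
    ]


# Structured data: a hypothetical table with a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_results "
    "(sample_id TEXT, property TEXT, value REAL, unit TEXT)"
)
conn.executemany(
    "INSERT INTO test_results VALUES (?, ?, ?, ?)",
    [
        ("S-001", "tensile_strength", 41.2, "MPa"),
        ("S-001", "melt_temperature", 165.0, "degC"),
        ("S-002", "tensile_strength", 38.7, "MPa"),
    ],
)

# Unstructured data: hypothetical documents whose fields vary per record.
raw_documents = json.loads(
    """
    [
      {"sample_id": "S-001", "note": "Slight yellowing observed after extrusion."},
      {"sample_id": "S-002", "note": "No visible defects.", "operator": "A"},
      {"sample_id": "S-003"}
    ]
    """
)

print(mean_property(conn, "tensile_strength"))
print(notes_mentioning(raw_documents, "yellowing"))
```

The structured query works only because every row shares the same columns; the note search has to tolerate missing fields, which is the practical cost of leaving data unstructured.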
Most recently, companies have hailed the advantages of AI in chemical and material science, with a key focus on the speed of discovery, as well as the much simplified material compatibility assessment. The hidden truth most people do not hear is that, with poorly managed, unstructured data, none of these possibilities is conceivable. If organisations go into AI assuming that decades of stitched-together Excel files will be the magic wand, then, unfortunately, they are on a far longer (and more expensive) road than they ought to be. It is not uncommon for industries to start delving into their datasets only to find that significant portions of the data captured are unstructured and not as complete as initially thought. Not surprisingly, in 2019, researchers at Deloitte (
In a typical material development workflow, researchers begin with formulation and synthesis, followed by extrusion/moulding, analytical characterisation and, finally, quality assurance release testing. This sequence highlights the different departments and data sources impacted in a development cycle. Each of these departments produces varying data types, which are then housed in their specific domains and are usually read by specific software modules. As an example, the synthesis group will store data around pressure, temperature, viscosity and time during the reaction process. The formats of these data are naturally different from the results of an analytical test using gas chromatography or infrared spectroscopy. Coupled with these differences is the fact that data are, most of the time, stored separately in the different departments (silo system) and there is limited to no cross-talk amongst these systems. The different formats and types of data generated through the development process in themselves pose a major hindrance to researchers as they attempt to draw insights from historical data. A good practice for chemical and material science organisations to adopt is systems integration. In the subsequent section, we will highlight the “as-is” situation in a typical R&D lab and propose ways of smartly integrating the different data silos in research environments. This is the foundation for building an end-to-end data pipeline for all the disparate data sources needed to harness valuable insights for an organisation.
The proper integration of data and technologies is critical for all downstream operations, such as information exploration and knowledge management (
As an example, researchers working on developing new or improved materials or chemicals are typically bombarded with a host of datasets coming from all facets of the organisation about their particular project (Fig.
As a case study, let us consider an organisational research need of exploring material property predictions, based on formulation, process and analytical characterisation of a polymer material. As discussed previously, the different processes outlined above generate different formats of data, which tend to be siloed in different business units in an organisation. Hence, it is important to understand what types of data formats are at play in this scenario.
Formulations are chemical and/or material compositions that are homogenous. To manufacture polymer composites, formulations can be mixes of solid or solvent-based components and can comprise oligomers, fillers, pigments and other additive materials. They can be basic unreactive chemical blends, reactive mixes where the sequence of mixing matters or blends of blends where the ultimate molar composition is determined by the original blends' composition. All of this means that there is a plethora of conceivable components, each with its own set of limitations governing its kind, number, molar percent and mixing order. New formulations are typically created in response to a customer request that includes several component and process limits as well as property goals. Winning or losing major contracts depends on swiftly recognising what you already have and what can be quickly altered or expanded to meet new business prospects. Buying, storing and processing hundreds of different components is expensive and there are apparent cost reductions to be obtained by rationalising the process. Ingredient costs vary and formulations that meet goal qualities while incorporating lower-cost constituents increase profits. As global events have an influence on supply chains, the capacity to swiftly reformulate utilising materials from a new supplier has become increasingly important. It is also important to remain adaptable when new regulations on prohibited chemicals take effect. For data management and machine-learning, the formulation space provides a number of unique issues. A typical material manufacturer will have hundreds, if not thousands, of different formulations for their products. Each formulation accounts for a specific material property that is tied to a business need or sales order. Thus, it is crucial for manufacturers to maintain the specific ingredients and quantities that produce the unique property of the material specified for the market need.
These formulations are typically recorded as ingredient names and weights; hence, the data are mostly in a flat file format in a database that can be extracted as CSV or any tabular form. However, these data by themselves are not enough to harness and improve rapid re-formulation and new material development without the right process parameters captured. Organisations can generate huge economic value by leveraging a framework that properly captures complicated process flows, understands molecular structures and utilises the deep subject knowledge of corporate specialists.
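As an illustration of how such a flat file can be made comparable across formulations, the following minimal Python sketch converts ingredient weights into weight fractions per formulation. The column names (`formulation_id`, `ingredient`, `weight_g`) and the example rows are assumptions made for illustration, not a format prescribed by any particular system.

```python
import csv
import io
from collections import defaultdict

# Hypothetical flat-file export: one row per ingredient per formulation.
FORMULATION_CSV = """formulation_id,ingredient,weight_g
F-100,oligomer_A,70.0
F-100,filler_B,25.0
F-100,pigment_C,5.0
F-200,oligomer_A,60.0
F-200,filler_B,40.0
"""


def weight_fractions(csv_text):
    """Return {formulation_id: {ingredient: weight fraction}}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # First pass: total batch weight per formulation.
    totals = defaultdict(float)
    for row in rows:
        totals[row["formulation_id"]] += float(row["weight_g"])

    # Second pass: normalise each ingredient by its formulation total.
    fractions = defaultdict(dict)
    for row in rows:
        fid = row["formulation_id"]
        fractions[fid][row["ingredient"]] = float(row["weight_g"]) / totals[fid]
    return dict(fractions)


for fid, recipe in weight_fractions(FORMULATION_CSV).items():
    print(fid, recipe)
```

Normalising to fractions removes batch size from the comparison, so that recipes made at different scales can be placed side by side or fed to a model as composition features.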
Imagine watching a recorded cooking show on TV, but for some reason, the network cuts out after the chef shows the ingredients and amounts needed to make the meal. The network returns after the chef has completed the meal and it is served on the plates, essentially cutting out all of the cooking steps. The question then is, can the viewer make exactly the same food as shown by the chef with the given ingredients without knowing the steps? The obvious answer is no. Similarly, as discussed above, the formulation of materials is key to an organisation's success in delivering its products; however, the process that converts the ingredients into the product is as important as the formulation itself. For each material or chemical made by companies, there are additional tonnes of (meta-)data optimised and collected for the process. These process data points are dependent on the synthesis approach or equipment used in making the materials. During synthesis of compounds or materials, data such as temperature, feed rate, pressure, flow rate, screw dimensions and others are routinely recorded in the native software systems of the instruments used. Historically in the manufacturing world, these data were used to monitor and optimise the production process, identify potential production issues and troubleshoot systems. Engineers and scientists also used these process data to help make data-driven decisions to improve the overall material synthesis and production performance. However, with the advancement of machine-learning and visualisation tools, these forms of data can be utilised for more than detecting process upsets if the data are collected and stored properly.
Process data in R&D organisations or manufacturing, in general, can be formatted in a variety of ways, depending on the specific application and the type of data being collected. Some common data formats include:
The data format used will be determined by the application's unique needs and the type of data being gathered. It is also critical to ensure that the process data are kept in a manner in which the appropriate stakeholders can readily evaluate and comprehend. Data files, such as numerical, time-series and log files, can be saved as CSV or TXT-based flat files.
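As a hedged example of how a time-series process log saved as a CSV flat file might be condensed into run-level features, consider the sketch below. The column names (`timestamp_s`, `temperature_C`, `pressure_bar`) and the values are hypothetical; a real export from an instrument or control system will differ.

```python
import csv
import io
from statistics import mean

# Hypothetical process log exported as a CSV flat file.
PROCESS_LOG = """timestamp_s,temperature_C,pressure_bar
0,180.0,5.0
10,185.0,5.2
20,190.0,5.1
30,187.0,5.3
"""


def summarise_run(csv_text, time_column="timestamp_s"):
    """Reduce a time-series log to min/max/mean per signal plus duration."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for column in rows[0]:
        if column == time_column:
            continue
        values = [float(row[column]) for row in rows]
        summary[column] = {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
    # Duration assumes the rows are already in chronological order.
    summary["duration_s"] = float(rows[-1][time_column]) - float(rows[0][time_column])
    return summary


print(summarise_run(PROCESS_LOG))
```

Run-level summaries of this kind are one simple way to attach process conditions to a formulation record; richer features (ramp rates, time above a threshold) would follow the same pattern.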
Analytical characterisation of a material is the process of identifying the material's chemical and physical characteristics using a variety of analytical techniques. These techniques can range from Gas Chromatography (GC) to Differential Scanning Calorimetry (DSC), Transmission Electron Microscopy (TEM), InfraRed Spectroscopy (IR) and many others. The main goal of using these techniques is to provide an understanding of a material's structure and composition at the microscopic level and potentially also to aid in troubleshooting impurities or performance defects of products. Analytical characterisation may be applied in a variety of ways, including the creation of new materials, maintaining the quality of existing materials and analysing failures in materials already in use. It is also important in the realm of materials science, where researchers utilise it to explore the characteristics and behaviour of many types of materials. Data from material characterisation come in various formats depending on the type of analysis being carried out. Some examples of material properties measured during product development include: composition of material, crystalline structure, morphology of blends, thermal properties, mechanical properties, rheological properties and a host of other measurement types. The data from analytical techniques are usually the most challenging form of data to deal with from the outset. An analytical department would have access to multiple types of instrumentation, each run by its own software. It is important to highlight here that the different data generated from test instruments take on formats driven primarily by the software output settings. As an example, output files from a Gas Chromatography Mass Spectroscopy (GCMS) instrument are mostly saved in either mzML (an XML-based format) or JCAMP-DX (a text-based exchange format).
The output data for DSC instruments, in turn, are saved either as a plain-text .DSC file or in CSV format. Hence, if a researcher were to ask the question “how do I compare the mass spectroscopy m/z distribution pattern to the heat profile generated from the thermal analysis of a tested material?”, one would have to analyse each dataset manually and independently of the other because the file formats are distinctly different. Today, there is no software in the market that can automatically read and combine all the different formats of data files from lab instruments together. However, in the example above, if organisations have a truly digital integration pipeline, researchers would be able to ingest all formats of data, albeit manually, and read them in a unified output format that they can analyse. In this case, therefore, the researcher can parse and convert the GCMS mzML data into the CSV file format using, for example, Python nodes in a pre-built integration pipeline. With both sets of data structured in the same format, the user can then evaluate the results, based on a time-stamp, as both sets of data can effectively be treated as a time-series for ease of visualisation.
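Once both instrument outputs have been converted to a common tabular form, the time-stamp-based comparison described above can be sketched as a nearest-time-stamp join. The Python example below is a minimal illustration under stated assumptions: both series are already parsed into `(time_s, value)` pairs sorted by time, the function name `align_nearest` and the tolerance are hypothetical, and the mzML-to-CSV conversion step itself is not shown.

```python
from bisect import bisect_left


def align_nearest(reference, other, tolerance_s):
    """Attach to each reference point the value of `other` nearest in time.

    Both inputs are lists of (time_s, value) pairs sorted by time. Points in
    `reference` with no partner within `tolerance_s` are paired with None.
    """
    other_times = [t for t, _ in other]
    aligned = []
    for t, value in reference:
        i = bisect_left(other_times, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other)]
        matched = None
        if candidates:
            best = min(candidates, key=lambda j: abs(other_times[j] - t))
            if abs(other_times[best] - t) <= tolerance_s:
                matched = other[best][1]
        aligned.append((t, value, matched))
    return aligned


# Hypothetical series: total ion current from GCMS and heat flow from DSC.
gcms = [(0.0, 100.0), (60.0, 250.0), (120.0, 90.0)]
dsc = [(2.0, 0.5), (58.0, 1.2), (200.0, 0.1)]

for row in align_nearest(gcms, dsc, tolerance_s=5.0):
    print(row)
```

A tolerance is used because two instruments rarely sample at identical instants; leaving unmatched points as `None` keeps gaps visible instead of silently pairing measurements taken far apart.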
In general, analytical characterisation of materials is a key step in deciphering the microscopic characteristics and behaviour of materials, which may be utilised to enhance the performance and dependability of materials in a variety of applications. Of course, no single data point from one instrument tells the entire story; hence the need to have a comprehensive set from all measurement data types.
It is also worth mentioning here that, to truly harness the power of the datasets described above in performing predictions of new materials, another key data source to consider is first principle modelling of chemical reactions (
Another important aspect of dealing with the data from analytical instrumentation is ensuring data are stored in a common database, such as MySQL, PostgreSQL, MongoDB and others. Having data stored in the right system formats makes for a smooth data extraction process. The process to extract the data from the instruments' data sources can be programmatically encoded through scripts in languages such as Python, R or MATLAB (a common language amongst research engineers in the industry). For ease of use, a no-code/low-code solution is highly preferred in the chemical/material science industry segment, as programming knowledge is not yet a mature part of the research scientist skillset. If coding can be avoided in the initial phase, it will help with gradually bringing key lab scientists up to speed with what an integrated dataset can bring to their research. Numerous data integration tools exist today that can be run with no coding experience to extract and combine data from different instrument API sources. Examples include Talend, Informatica or MuleSoft. Certainly, organisations can also build their own data integration pipelines using the many configurable options, such as KNIME and SciTegic Pipeline Pilot tools.
Traditionally, ELN systems are meant to replace paper notebooks with analogous digital documentation platforms. They are typically used in a wide array of sectors, such as the chemical, pharmaceutical and food/beverage industries (
In practice, the primary application of ELN in R&D projects is to replace paper notes, ease information flow and comply with intellectual property restrictions. Having said that, the ancillary data created through ELN serves as a rich metadata source to augment formulation, synthesis and process data transformation to machine learning (ML)-ready datasets.
LIMS is a sample workflow-driven tool which is used to monitor and record all the data generated by a process. It is widely known for its sample test data management and consolidation (
A similar shortcoming of a LIMS by itself is that test data are not inherently tied to formulations or business data. Hence, test data of samples synthesised from specific formulations documented in an ELN are not integrated comprehensively for a true end-to-end research workflow. This is an inefficient system set-up for companies preparing to be digitally transformed.
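The gap described here can be made concrete with a small Python sketch that links ELN entries to LIMS test results through a shared sample identifier. The record layouts and field names (`sample_id`, `formulation_id`, `test`, `value`) are hypothetical and stand in for whatever the actual ELN and LIMS exports provide; the sketch also assumes that both systems use the same sample identifiers, which in practice is often the hardest part to guarantee.

```python
# Hypothetical ELN export: which formulation each sample was made from.
ELN_ENTRIES = [
    {"sample_id": "S-001", "formulation_id": "F-100"},
    {"sample_id": "S-003", "formulation_id": "F-200"},
]

# Hypothetical LIMS export: one record per test result.
LIMS_RESULTS = [
    {"sample_id": "S-001", "test": "tensile_strength_MPa", "value": 41.2},
    {"sample_id": "S-001", "test": "melt_temperature_C", "value": 165.0},
    {"sample_id": "S-002", "test": "tensile_strength_MPa", "value": 38.7},
]


def link_records(eln_entries, lims_results):
    """Join ELN entries to LIMS results on sample_id (left join on the ELN)."""
    # Index the LIMS results by sample so each ELN entry is a single lookup.
    by_sample = {}
    for result in lims_results:
        by_sample.setdefault(result["sample_id"], []).append(result)

    linked = []
    for entry in eln_entries:
        tests = by_sample.get(entry["sample_id"], [])
        linked.append(
            {
                "sample_id": entry["sample_id"],
                "formulation_id": entry["formulation_id"],
                "tests": {t["test"]: t["value"] for t in tests},
            }
        )
    return linked


for record in link_records(ELN_ENTRIES, LIMS_RESULTS):
    print(record)
```

Samples without test results keep an empty `tests` mapping, which makes missing characterisation visible in the integrated dataset.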
From the data stream formats discussed above, the next logical step is to define a workflow-based integration approach that ingests and transforms data into ML-ready assets using a pipeline tool.
Today, a significant majority of cheminformatics specialists and data scientists are increasingly using web servers for data processing and automation. The use of these web-based technologies lowers the barrier of computational requirements needed to process large chunks of data. In fact, server-side scripting languages like Ruby, PHP and ASP can be utilised to automate data processing, file manipulation and database communication even via APIs (
Data integration is a critical step in the data mining process, in which data from multiple sources is integrated into a single, unified data repository. There are many tools available to facilitate data integration, including ETL (extract, transform and load) tools, database integration tools and data warehouse tools. As discussed earlier, the open source data integration platform KNIME is one of the most popular tools for data integration.
KNIME (Konstanz Information Miner) is a free and open-source data analytics, reporting and integration platform. It is used in a wide variety of data-driven applications, including data mining, machine-learning, data visualisation and predictive analytics (
From the above workflow, the key steps to consider are listed below.
These are just examples of the various nodes that could be used in a KNIME workflow for data integration. The actual workflow will depend on the specific requirements and data sources. There are over 13,000 workflows already developed by KNIME contributors on the community platform and this could serve as a point of reference for users. Further, there are thousands of nodes with descriptors from which a user can build a workflow. KNIME's use of scalable machine-learning is an intriguing feature. Some of these algorithms use naive Bayesian models or similarity searches to do virtual screening, with most options being predefined. Nonetheless, integrating scripts from programming languages with machine-learning libraries (such as R and Python) is one approach for increasing flexibility in KNIME operations.
We have designed a KNIME workflow for data integration that could serve as an example of multi-conversion nodes and data aggregator in Fig.
To develop procedures that examine various facets of chemical space, a wide variety of chemoinformatic resources are accessible. These resources are being used in custom workflows or open web servers. These technologies serve not just cheminformaticians, but also members of interdisciplinary teams inside businesses who are either non-experts or do not have the time to create their own code or procedures from scratch. Below, we provide a non-exhaustive list of tools/vendors on the market today that can help organisations in the inception of their digital awareness and transformation journey (Fig.
We classify three (3) key domain users in any organisation, based on their digital expertise level and experience. The core domain user-group will be the full-time data scientists or cheminformaticians hired and fully dedicated to programming and data governance structuring. This group will have the capacity to utilise a deep programming language like R and Python to extract, prepare, explore and build predictive models on datasets. The next related group is the citizen-data scientist, who happens to be fairly knowledgeable in data science tools as well as possessing domain knowledge of the business needs. This group can utilise the low-code/no-code platforms to build insight and push data-driven decision-making across the organisation. Merkelbach et al. provide a nicely documented approach to enabling internal organisational domain experts to become citizen-data scientists (
These tools are expected to evolve and improve in the future. It is important to avoid having the user-friendly web server apps turn into unusable black boxes. To completely optimise the interpretation of the findings, it is critical that the user fully understands the computations that are performed. The user also has to be aware of the approximation and potential constraints of the application or workflow. Moreover, organisations should not shy away from approaching technical experts in the field of this and many other data tools available to them in their unique situations. The majority of these vendors offer small sand-box exercises to generate excitement and value for a use case that will be beneficial to both parties. Therefore, if an organisation is not well-versed with citizen-data scientists or data scientists in the field of AI and other ML programming languages, the key recommendation is to engage with select domain specific vendors in a sandbox proof-of-concept to create a successful use-case story. However, of course, success is dependent on the data availability; hence, the data integration step is always going to be the first step for a truly digitalised organisation.
Positive change required in data management is frequently hampered by silos, whether they be operational or informational. This is especially true for data from the material and chemical science industry. Integrated data sources are critical in the research lab because they provide researchers with a comprehensive and centralised view of their research. This aids in decreasing data duplication and discrepancies, facilitating data analysis and enabling effective data administration. Furthermore, by giving access to up-to-date and correct information, connected data sources promote team communication and enable better decision-making. Thus, the usage of linked data sources can lead to enhanced research outputs, higher productivity and overall lab efficiency.
As organisations embark on a digital transformation journey, having an integrated data lake from research labs is critical to the successful application of ML algorithms for new formulation and material improvement predictions. Integrated lab data sources are necessary to train and validate ML models. To make reliable predictions, ML models require a vast amount of high-quality, diversified and consistent data. Combining data sources can make the data used for training and validation more thorough, accurate and up to date. Furthermore, an integrated data source makes it simple for researchers to contribute fresh data to the model, allowing it to continuously improve its predictions over time. Consequently, integrated data sources are essential for the success of ML in the lab since they lay the groundwork for developing new and better predictions.
Breaking down large, monolithic lab programmes that have grown into sources of technical debt and transformational roadblocks is a key step in digital transformation. An organisation may begin to recognise and appreciate the advantages of digital transformation by removing obstacles to information exchange amongst the various lab types and facilitating better flow and access to data. All laboratories, despite the fact that they may not be constructed equally, should be viewed as equally significant components of a system that can offer long-term operational and business benefits through quicker, more integrated data and processes.
Summary of data integration steps for the lab environment:
AI – Artificial Intelligence
API – Application Programming Interface
R&D – Research and Development
ML – Machine Learning
NIST – National Institute of Standards and Technology
CSV – Comma Separated Value
mzML – XML-based format for Mass Spectroscopy output files
XML – eXtensible Markup Language
JSON – JavaScript Object Notation
KNIME – Konstanz Information Miner
MGI – Materials Genome Initiative
IP – Intellectual Property
SQL – Structured Query Language
LIMS – Laboratory Information Management System
ELN – Electronic Lab Notebook
PLC – Programmable Logic Controller
DCS – Distributed Control System
SCADA – Supervisory Control And Data Acquisition
IOT – Internet of Things
TXT – Text
GC – Gas Chromatography
DSC – Differential Scanning Calorimetry
TEM – Transmission Electron Microscopy
IR – InfraRed Spectroscopy
GCMS – Gas Chromatography Mass Spectroscopy
JCAMP-DX – Joint Committee on Atomic and Molecular Physical Data
m/z – Mass to Charge ratio
RDP – Remote Desktop Protocol
IT – Information Technology