Research Ideas and Outcomes
Grant Proposal
Corresponding author: Egon Willighagen (egon.willighagen@maastrichtuniversity.nl)
Received: 07 Apr 2024 | Published: 17 Apr 2024
© 2024 Egon Willighagen, Marc Teunis, Alyanne De Haan
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Willighagen E, Teunis MAT, De Haan A (2024) The Chemistry Development Kit in 2024: improving cheminformatics research. Research Ideas and Outcomes 10: e124884. https://doi.org/10.3897/rio.10.e124884
Cheminformatics is the research field that deals with information about chemical systems. This includes the chemical structure, which is used in computational chemistry where quantum chemistry is too complex. The Chemistry Development Kit (CDK) was one of the first Open Science libraries in chemistry, co-founded in The Netherlands. The source code goes as far back as 1997 and has been maintained for more than 25 years. The CDK is used by many tools in drug discovery, computational toxicology, and bioinformatics. This project will develop improvements to the core library and update tools that use the CDK to the latest release.
cheminformatics, Chemistry Development Kit, Java, open science, OSF23.2.097
The Chemistry Development Kit (CDK, research-software-directory.org/software/cdk) (
The most recent CDK release is CDK 2.9 of August 2023 (
The vision of this project is two-fold. For the first part (see WP1), this project will improve the CDK library by introducing newer Java programming language features and making the library compatible with them. While developing these, the coverage of JavaDoc documentation and JUnit unit tests will be extended. Moreover, OSGi support will be improved: a current limitation is that multiple OSGi bundles contain the same Java package, which causes problems with OSGi-based software like OpenChrom (
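The OSGi limitation mentioned above, where the same Java package appears in more than one bundle, can be detected mechanically. The following is a minimal sketch, not the CDK's or OSGi's own tooling: it assumes the exported packages of each bundle have already been collected into a map (for example from each bundle's manifest), and the class name, method name, bundle names, and package names are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SplitPackageCheck {

    /**
     * Returns every package that is exported by more than one bundle,
     * mapped to the sorted set of bundles that export it.
     */
    static Map<String, Set<String>> findSplitPackages(Map<String, List<String>> exportsByBundle) {
        // package name -> bundles exporting it (sorted so the result is deterministic)
        Map<String, Set<String>> bundlesByPackage = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : exportsByBundle.entrySet()) {
            for (String pkg : entry.getValue()) {
                bundlesByPackage.computeIfAbsent(pkg, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        // keep only the packages that have more than one exporting bundle
        bundlesByPackage.values().removeIf(bundles -> bundles.size() < 2);
        return bundlesByPackage;
    }

    public static void main(String[] args) {
        // Hypothetical bundle and package names, for illustration only.
        Map<String, List<String>> exports = Map.of(
            "bundle-a", List.of("org.example.chem", "org.example.chem.io"),
            "bundle-b", List.of("org.example.chem", "org.example.chem.graph"),
            "bundle-c", List.of("org.example.chem.smiles"));
        findSplitPackages(exports).forEach((pkg, bundles) ->
            System.out.println(pkg + " is exported by " + bundles));
    }
}
```

With the hypothetical input in `main`, only `org.example.chem` is reported, because it is the only package listed by two bundles. A check of this kind could run as part of a build to flag newly introduced split packages.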
The second part of this project will focus on updating tools using the CDK to the latest CDK version (currently 2.9, but later versions when they are released; see WP2). Tools of particular interest for updating are JChemPaint (
Finally, we will present the results of this project at an open, international user group meeting (WP3). This meeting will be open to presentations from any project around CDK-based tools, both about the tools and about scientific research using those tools.
The project plan is organized in three work packages (WP1, WP2, WP3, see also Table
Gantt diagram of project work timeline. In the months M3 and M5, two two-day hackathons (H) will be organized.
Months shown: M1, M3, M5, M7, M8, M11
WP1: H
WP2: H
WP3: UGM
Work package 2 (WP2) focuses on the tools using the CDK, particularly JChemPaint, AMBIT, ToxTree, rcdk, OpenChrom, DECIMER, KNIME, the ChemViz plugin of Cytoscape, and PaDEL-descriptor. JChemPaint is an open source chemical structure editor that is not currently based on the latest CDK version. KNIME is one of the tools using JChemPaint, and its CDK extension cannot be fully updated until JChemPaint is. AMBIT is an open source chemistry database frequently used in the field of toxicology and will benefit from newer CDK functionality too. ToxTree, OpenChrom, rcdk, and DECIMER are other tools that will benefit from the improvements in recent CDK versions. Updating the PaDEL-descriptor software will be the most challenging, but given its many users and citations, it will have a significant impact on molecular property prediction research.
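The ordering constraint described here (the KNIME extension can only be fully updated after JChemPaint, which in turn needs the latest CDK) is a dependency ordering problem. As a rough sketch of how such an update order could be derived, the following uses a depth-first topological sort. The class and method names are hypothetical, and the dependency map in `main` is an assumption based only on the relations stated in the text, not a complete picture of the actual tool dependencies.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class UpdateOrder {

    /** Returns the tools ordered so that each one comes after everything it depends on. */
    static List<String> updateOrder(Map<String, List<String>> dependsOn) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String tool : dependsOn.keySet()) {
            visit(tool, dependsOn, done, inProgress, order);
        }
        return order;
    }

    private static void visit(String tool, Map<String, List<String>> dependsOn,
                              Set<String> done, Set<String> inProgress, List<String> order) {
        if (done.contains(tool)) {
            return; // already placed in the order
        }
        if (!inProgress.add(tool)) {
            throw new IllegalStateException("Circular dependency involving " + tool);
        }
        // place all dependencies first; tools without an entry have no dependencies
        for (String dependency : dependsOn.getOrDefault(tool, List.of())) {
            visit(dependency, dependsOn, done, inProgress, order);
        }
        inProgress.remove(tool);
        done.add(tool);
        order.add(tool);
    }

    public static void main(String[] args) {
        // Assumed dependencies, taken from the relations described in the text.
        Map<String, List<String>> dependsOn = new LinkedHashMap<>();
        dependsOn.put("KNIME CDK extension", List.of("JChemPaint", "CDK"));
        dependsOn.put("JChemPaint", List.of("CDK"));
        dependsOn.put("rcdk", List.of("CDK"));
        // prints [CDK, JChemPaint, KNIME CDK extension, rcdk]
        System.out.println(updateOrder(dependsOn));
    }
}
```

The sketch places the CDK first, then JChemPaint, and only then the KNIME extension, which matches the order of work implied above. Tools that depend only on the CDK, such as rcdk in this example, can be updated independently of each other.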
WP3 will organize a final user group meeting (UGM) where the project results will be presented and where users of the CDK will be invited. The open UGM will consist of a session with presentations about projects using the CDK and a workshop where we will collect needed future CDK functionality and improvements.
The funding will be used to fund the work of three researchers: Alyanne De Haan (orcid:0000-0003-0896-0906), Marc Teunis (orcid:0000-0002-3496-6669), and Egon Willighagen (orcid:0000-0001-7542-0286). Willighagen is co-founder of the Chemistry Development Kit and a researcher at Maastricht University. He has been involved in many other Open Science projects and in the Dutch Open Science Community, is a former editor of the Journal of Cheminformatics, and is the leader of various open scientific software projects. Teunis and De Haan have extensive knowledge of open educational resources, version control, and open science approaches to support research in general.
Dr Egon Willighagen has been active in Open Science for over 20 years. For example, he has contributed to projects like JChemPaint (since 1998; doi:10.3390/50100093) and WikiPathways (since 2011; doi:10.1093/NAR/GKV1024), (temporarily) led projects like Jmol, coordinated the science in the EU FP7 project eNanoMapper (doi:10.3762/BJNANO.6.165), and co-founded the Chemistry Development Kit (in 2000; doi:10.1021/ci025584y). He has been recognized for his work with the international Blue Obelisk Award (2007) and a national runner-up Open Initiative Trophy (2021), and received funding from the NWO Open Science Fund in 2022. For five years he was one of two Editors-in-Chief of the fully CC-BY, highly ranked Journal of Cheminformatics (issn:1758-2946), which promotes Open Science in chemistry. At various National Plan Open Science events and meetings, Willighagen has provided input from a researcher’s perspective, and he is co-founder of the Open Science Community Maastricht. A more complete list of his Open Science work can be found in his publication list: orcid.org/0000-0001-7542-0286.
This project will focus on software, but if existing data is used, it will be open, to ensure compatibility with the open license of the CDK.
This is not expected.
This is not expected, but if this is the case, it will be shared via Zenodo.
No. All the necessary resources (financial and time) to store and prepare data for sharing/preservation are or will be available at no extra cost.
The Chemistry Development Kit has a wide user base that uses the CDK directly or indirectly via one of the CDK-based tools. The Java package “org.openscience.cdk” is mentioned almost 50 thousand times on GitHub, the software using the CDK has been cited many times, and searching for “chemistry development kit” on Google Scholar returns more than two thousand hits. The potential for reuse is therefore very significant.
The CDK is released under the GNU Lesser General Public License (LGPL), which is similar to the GPL but allows use in proprietary software: only changes to the CDK itself need to be made available under the same open license. Our work on the CDK library will use the same LGPL license as the CDK.
Similarly, our updates in WP2 will be made available under the same license as the tool they improve. An overview of licenses of these tools is given in Table
Licenses of source code repositories of various tools using the Chemistry Development Kit library.
Name | Source Code License | Code repository
KNIME (CDK Nodes) | LGPL |
rcdk | LGPL |
AMBIT | LGPL |
JChemPaint | LGPL |
ToxTree | GPL v2 |
DECIMER | MIT |
ChemViz (Cytoscape) | LGPL |
OpenChrom | EPL 1.0 |
PaDEL-descriptor | public domain |
CDK releases are made on GitHub (github.com/cdk/cdk/releases), archived on Zenodo (doi:10.5281/zenodo.592588), and distributed via Maven Central. Tools using the CDK have their own release and distribution solutions (see the previous section). For example, rcdk is released via the CRAN network for R software, and KNIME has a custom release and distribution network. Patches to update these tools will be made available via GitHub pull requests and archived on Zenodo.
Expertise is needed in version control, continuous building, code refactoring, software testing, software build systems, packaging standards, and other standards for software development. The applicants have this expertise, as demonstrated in their research output.
Although this is not a matter of expertise, we will depend on the release managers of the respective packages for formal software releases. Some tools, like KNIME, have a fixed release schedule, but many do not. For the CDK itself, the release manager currently is Dr. John Mayfield. Deployments of CDK ‘snapshot’ releases to Maven Central are done by the main applicant. We know most developers of the tools we seek to improve personally (see co-authored articles in the publication lists of the applicants).
The Chemistry Development Kit in 2024: improving cheminformatics research
Maastricht University