BridgeDb and Wikidata: a powerful combination generating interoperable open research (BridgeDb)

Like humans have a unique social security number and different phone numbers from various providers, so do proteins and metabolites have a unique structure but different identifiers from various databases. BridgeDb is an interoperability platform that allows combining these databases, by matching database-specific identifiers. These matches are called identifier mappings, and they are indispensable when combining experimental (omics) data with knowledge in reference databases. BridgeDb takes care of this interoperability between gene, protein, metabolite, and other databases, thus enabling seamless integration of many knowledge bases and wet-lab results. Since databases get updated continuously, so should the Open Science BridgeDb project.


Project proposal
The vision for your project
Linking any two or more databases always requires linking identical entities described in those databases. Unfortunately, the identifier used for the same entity in one database is often different from the identifiers for the same entity in the other database. BridgeDb was created to make the bridge between databases by providing uniform access to mappings between different database identifiers for the same entities. This is why BridgeDb is a Recommended Interoperability Resource (RIR) of ELIXIR, a collaboration of leading life science organisations, and has been supporting projects like the ELIXIR-NL WikiPathways resource (Slenter et al. 2017).
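The idea of an identifier mapping can be illustrated with a minimal Java sketch. This is not the BridgeDb API: the class `IdMappingSketch`, the record `Xref`, and the database names and identifiers used below are hypothetical placeholders, chosen only to show how identifiers from different databases can be grouped per entity and then translated from one database to another.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; this is NOT the BridgeDb API. */
final class IdMappingSketch {

    /** A database-specific identifier: the source database plus its local ID. */
    record Xref(String dataSource, String id) {}

    // Every identifier points to the shared set of identifiers for the same entity.
    private final Map<Xref, Set<Xref>> groups = new HashMap<>();

    /** Records that two identifiers denote the same entity (symmetric and transitive). */
    void addMapping(Xref a, Xref b) {
        Set<Xref> merged = new LinkedHashSet<>();
        merged.addAll(groups.getOrDefault(a, Set.of(a)));
        merged.addAll(groups.getOrDefault(b, Set.of(b)));
        for (Xref x : merged) {
            groups.put(x, merged);
        }
    }

    /** Returns the IDs in the target database that denote the same entity as the source. */
    Set<String> mapTo(Xref source, String targetDataSource) {
        Set<String> result = new LinkedHashSet<>();
        for (Xref x : groups.getOrDefault(source, Set.of())) {
            if (x.dataSource().equals(targetDataSource) && !x.equals(source)) {
                result.add(x.id());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical databases and identifiers, for illustration only.
        IdMappingSketch mapper = new IdMappingSketch();
        Xref a = new Xref("DatabaseA", "A-001");
        Xref b = new Xref("DatabaseB", "B-42");
        mapper.addMapping(a, b);
        System.out.println(mapper.mapTo(a, "DatabaseB"));
    }
}
```

In this sketch, mappings are treated as symmetric and transitive, which is a simplifying assumption; real mapping databases may record the direction and source of each mapping.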
The vision of this project is to improve the foundation of BridgeDb, to allow us to widen the scope in the future and enhance the support of currently unsupported but important data sources. This will open up the road to wide adoption in the European Open Science Cloud (EOSC). To reach this vision, we aim to (1) modernize the project by updating the library and accompanying build system; (2) extend the functionality of the webservice to deploy identifier (ID) mapping databases effortlessly, by extending the support for creating ID mapping databases from Wikidata (Waagmeester et al. 2021, Waagmeester et al. 2020); and (3) update the tools to create ID mapping databases, along with new archived and citable releases for the genes, proteins, protein complexes, metabolites, nanomaterials, adverse outcome pathways, and journal articles ID mapping databases.
The first output of this project is an improved BridgeDb Java library (Batchelor et al. 2014, van Iersel et al. 2010), using the stable build system Apache Maven and following its practices, with higher test coverage, including automated testing of the MySQL backend, and higher coverage of JavaDoc (see WP1 below). Second, the project will produce a new version of the live BridgeDb Webservice (

Project plan
The project plan is organized in three work packages (WP1, WP2, WP3), following the three output themes. Work package 1 (WP1) intends to upgrade the BridgeDb Java library. The main Java library is already built with Apache Maven, but the build system should also be applied to related tools, and we will extensively use GitHub Actions for automation. Second, only a subset of library modules is currently available as OSGi bundles, which are essential for reuse in various third-party tools, like PathVisio and Cytoscape (Kutmon et al. 2013, Shannon et al. 2003). Therefore, all modules will be extended to support OSGi bundles, something that is already done for five core BridgeDb modules. Furthermore, to improve maintainability, WP1 will continue extending the unit tests and integration tests. In particular, the testing of the database backends that hold the ID mapping data (Apache Derby and MySQL) needs to become more comprehensive.
Work package 2 (WP2) focuses on the BridgeDb Webservice. This continuously running service is an ELIXIR RIR and supports projects like WikiPathways and Cytoscape on a daily basis, assisting the data analysis of omics datasets (transcriptomics, proteomics, metabolomics, etc.). The Webservice will be extended to support Compact Identifiers (Wimalaratne et al. 2018) as a new input and output format, in order to support persistent, machine-resolvable citation of research data in written material. In addition, we will introduce support for JavaScript Object Notation (JSON) as a serialization format for multiple application programming interface (API) calls. The OpenAPI (Swagger) interactive documentation will be updated accordingly. Finally, the Webservice itself will become even more FAIR, by adopting the DataCite standard and providing provenance following the HCLS Community Profile for Dataset descriptions.
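To make the planned input and output formats more concrete, the following Java sketch parses a Compact Identifier of the form prefix:accession and serializes a mapping result as JSON. It is a sketch under stated assumptions, not the Webservice implementation: the class `CompactIdSketch`, the JSON field names `source` and `mappings`, and the prefixes and accessions in the example are hypothetical, and the final response format of the Webservice may differ.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative sketch only; names and JSON layout are assumptions, not the Webservice API. */
final class CompactIdSketch {

    /** A Compact Identifier: a namespace prefix and a local accession, written prefix:accession. */
    record CompactId(String prefix, String accession) {

        /** Splits at the first colon, since the accession itself may contain colons. */
        static CompactId parse(String text) {
            int colon = text.indexOf(':');
            if (colon <= 0 || colon == text.length() - 1) {
                throw new IllegalArgumentException("Expected prefix:accession, got: " + text);
            }
            return new CompactId(text.substring(0, colon), text.substring(colon + 1));
        }

        @Override
        public String toString() {
            return prefix + ":" + accession;
        }
    }

    /** Serializes one mapping result as a JSON object (hypothetical field names). */
    static String toJson(CompactId source, List<CompactId> targets) {
        String items = targets.stream()
                .map(t -> quote(t.toString()))
                .collect(Collectors.joining(","));
        return "{\"source\":" + quote(source.toString()) + ",\"mappings\":[" + items + "]}";
    }

    // Minimal escaping of backslashes and quotes only; a real service would use a JSON library.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        // Hypothetical prefixes and accessions, for illustration only.
        CompactId source = CompactId.parse("dba:1");
        List<CompactId> targets = List.of(CompactId.parse("dbb:2"), CompactId.parse("dbc:3"));
        System.out.println(toJson(source, targets));
    }
}
```

Using the same prefix:accession form for both input and output keeps the identifiers in API responses directly citable and resolvable.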
The last work package (WP3) translates the new functionalities to practical use cases. In this WP, existing ID mapping databases will be updated using the new releases of the BridgeDb Java library, and tested in applications using the new BridgeDb version. We intend to widen the scope of ELIXIR resources supported in the ID mapping databases, to make more resources interoperable (and therefore more FAIR). Here, we will increasingly use Wikidata and its international scientific collaborations (Waagmeester et al. 2021, Waagmeester et al. 2020). These mapping databases will continue to be released via public archives (e.g. Figshare, Zenodo) under open licenses, and indexed on the BridgeDb website at bridgedb.github.io/data/gene_database/. To do so, WP3 will develop a tool that takes DOIs of the mapping databases as input, extracts metadata from the respective repositories, and generates this indexing website. WP3 will test the resulting mapping databases with downstream tools (PathVisio, WikiPathways, Cytoscape, etc.). Docker Images of the various tools will be developed to simplify dissemination and reuse. Practically, this work will include two hackathons with the senior scientific employees (Slenter, Kutmon, Martens) and the full-time non-scientific personnel (see the Section Team members and Table 1).
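The final step of the indexing tool described above, turning release metadata into an index page, could look roughly like the Java sketch below. It is an illustration under assumptions rather than the planned tool: the class `MappingIndexSketch`, the record `Release` and its fields, the HTML output format, and the placeholder DOI are all hypothetical, and the step that retrieves metadata from the repositories is left out.

```java
import java.util.List;

/** Illustrative sketch only; the record fields and HTML layout are assumptions. */
final class MappingIndexSketch {

    /** Metadata for one archived mapping database release (hypothetical fields). */
    record Release(String doi, String title, String license) {}

    /** Builds the doi.org resolver URL after a basic shape check ("10." prefix and a slash). */
    static String resolverUrl(String doi) {
        if (!doi.startsWith("10.") || doi.indexOf('/') < 0) {
            throw new IllegalArgumentException("Not a DOI: " + doi);
        }
        // Simplification: special characters in the DOI are not URL-encoded here.
        return "https://doi.org/" + doi;
    }

    /** Renders the releases as an HTML list; titles are assumed to need no HTML escaping. */
    static String renderIndex(List<Release> releases) {
        StringBuilder sb = new StringBuilder("<ul>\n");
        for (Release r : releases) {
            sb.append("  <li><a href=\"").append(resolverUrl(r.doi())).append("\">")
              .append(r.title()).append("</a> (").append(r.license()).append(")</li>\n");
        }
        return sb.append("</ul>\n").toString();
    }

    public static void main(String[] args) {
        // Placeholder DOI and metadata, for illustration only.
        List<Release> releases = List.of(
                new Release("10.0000/placeholder.1", "Example mapping database", "CC0"));
        System.out.print(renderIndex(releases));
    }
}
```

Keeping the rendering step separate from metadata retrieval would allow the index to be regenerated automatically whenever a new release DOI is added to the input list.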

Team members
The funding will be used to employ a scientific programmer. Additionally, the following people from the Dept of Bioinformatics (BiGCaT) will be involved in WP3, testing the upgraded BridgeDb library by creating updated ID mapping databases. Denise Slenter (orcid:0000-0001-8449-1318) will work on the metabolite, disease and interaction ID mapping databases, Dr Martina Kutmon (orcid:0000-0002-7699-8191; assistant professor) on the gene and protein ID mapping database (with Ensembl as source), and Marvin Martens (orcid:0000-0003-2230-0840) on gene and protein mapping databases for Daphnia magna and Daphnia pulex (relevant model species for toxicology, but currently not in Ensembl). Slenter, Kutmon, and Martens have all previously been involved in BridgeDb through their research projects (e.g. creating the Docker Image for BridgeDb and using Wikidata as a source of ID mappings), and are experts in the fields relevant for these mapping databases: chemistry and metabolism (Slenter); systems biology and data analysis (Kutmon); toxicology and Adverse Outcome Pathways (Martens).

Data management
Will this project involve re-using existing research data?
Yes. Where existing data are reused, they will have an open license or a public domain waiver (like the American public domain or the international CCZero waiver). Any license, including an open license, constrains reuse. License information will therefore be clearly provided, following the FAIR principles.

Will data be collected or generated that are suitable for reuse?
Yes, reuse is the aim of the BridgeDb project, where downstream users are, for example, WikiPathways, PathVisio, and Cytoscape.
After the project has been completed, how will the data be stored for the long-term and made available for the use by third parties? Are there possible restrictions to data sharing or embargo reasons?
Data will be archived during the project in public repositories, like Figshare and Zenodo, which have committed themselves to availability of 20 years or more. The open licenses allow other repositories to archive a copy of the data.
No restrictions (other than the open license terms) and no embargoes are anticipated.

Will any costs (financial and time) related to data management and sharing/ preservation be incurred?
No: All the necessary resources (financial and time) to store and prepare data for sharing/ preservation are or will be available at no extra cost.

Software sustainability
Will software be generated during the project? Yes.
How will the software be licensed and be made available for re-use?
All BridgeDb software is available under an OSI-approved license on GitHub. This includes the Apache License 2.0-licensed BridgeDb library as well as the existing source code to generate ID mapping databases, available under other open licenses (see Table 2).
What measures are needed to make the software appropriate for long-term (re-)use by third parties?
WP1 will improve the maintainability and portability of the software. The main BridgeDb Java library is developed on GitHub and disseminated via Zenodo (using the GitHub-Zenodo integration) and via Maven Central (search.maven.org/search?q=g:org.bridgedb).
How large do you expect the community that will potentially use the software to be, and do you expect outside contributors to the software?
The size of the community is hard to estimate accurately, but with the highly cited WikiPathways (15,000 unique website users per month) and Cytoscape projects as daily users, and BridgeDb being an ELIXIR Recommended Interoperability Resource, we estimate a few thousand daily users. The gene/protein ID mapping database is downloaded more than 14 BridgeDb has been used in EU projects like OpenPHACTS, OpenRiskNet, and NanoSolveIT. A full list of past contributors can be found on GitHub for each of the subprojects, e.g. at github.com/bridgedb/BridgeDb/graphs/contributors.
What expertise do you expect to be needed to make the software appropriate for long-term re-use by third parties? Is this expertise available?
The main applicant has more than 20 years of experience in the development of open data, open-source, and open standards projects, and the BridgeDb project has existed for over 10 years. The main applicant is also Editor-in-Chief of a journal that has reuse and Open Science as strong editorial standards. The required expertise is therefore available.

Other grant applications with overlapping content
No overlapping grant applications.