Research Ideas and Outcomes: Grant Proposal
Corresponding authors: Daniel Mietchen (daniel.mietchen@fiz-karlsruhe.de), Sheeba Samuel (sheeba.samuel@informatik.tu-chemnitz.de)
Received: 10 Sep 2025 | Published: 15 Oct 2025
© 2025 Daniel Mietchen, Sheeba Samuel
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Mietchen D, Samuel S (2025) FAIRJupyter4AI: A Corpus of Computational Notebooks for AI. Research Ideas and Outcomes 11: e171656. https://doi.org/10.3897/rio.11.e171656
Computational notebooks like Jupyter have transformed scientific and educational workflows in computational fields by combining code, text, and visualizations. They have also become a popular mechanism to share computational workflows. However, ensuring their reproducibility remains a persistent challenge due to often insufficiently documented direct and indirect dependencies, missing data, and inconsistencies in execution environments. Existing datasets lack the multimodal, fine-grained structure needed for AI applications. FAIRJupyter4AI aims to address this gap by creating a large-scale, AI-ready corpus of Jupyter notebooks enriched with executable code, markdown, outputs, and structured annotations. The project integrates these into a hybrid knowledge graph (KG) that incorporates symbolic, statistical, and execution-based representations. Key objectives include: curating diverse notebooks (initially Python, later R, with provisions for additional languages); automating reproducibility testing; building a KG for cross-notebook queries; training AI models for tasks like error repair and notebook generation; and fostering community use via APIs and integration with community platforms like NFDI or Hugging Face.
The project will be implemented using the infrastructure established by the NFDI Basic Service Jupyter4NFDI, in whose upcoming Integration Phase (October 2025–September 2027) the applicants are actively involved. Its central JupyterHub provides cross-consortial and cross-institutional access to scalable computing and data resources and associated software stacks for both research and training purposes.
The FAIRJupyter4AI work programme is structured around five interlinked work packages: (1) Data Collection & Curation, (2) Reproducibility Assessment, (3) Knowledge Graph Development, (4) AI Model Training, and (5) Communication, Community & Sustainability. Key innovations include continuous updates and enrichment pipelines (avoiding static snapshots), unifying multimodal content for AI, and bridging reproducibility with AI. Building on prior work involving 27,000+ notebooks and the FAIR Jupyter Knowledge Graph, FAIRJupyter4AI will curate, annotate and release over 20,000 notebooks that are research-related and openly licensed. In addition, we will share a metadata corpus for 50,000 research-related notebooks, along with open-source tools, models, and associated documentation. By making Jupyter notebook metadata FAIR, reusable, and machine-understandable, this project will set a new standard for reproducible and AI-enhanced computational science, and it will open up new opportunities for learning and teaching about computational reproducibility across multiple domains of research.
Jupyter Notebook, Computational Reproducibility, AI, Jupyter4NFDI, Dataset
Computational notebooks like Jupyter (
FAIRJupyter4AI addresses these gaps by developing a large-scale, high-quality, AI-ready corpus of Jupyter notebooks that aggregates the full spectrum of notebook content - executable code, markdown narratives, rich outputs (e.g., plots, tables) - with associated data files and auxiliary information like error messages. This corpus will be carefully cleansed and semantically annotated at a detailed level (linking code to explanations, tracing data provenance, and highlighting reproducibility-critical components). A knowledge graph integrating symbolic, statistical (embeddings), and sequence-based representations (execution traces) will index and interconnect notebooks and their internal elements, enabling powerful structured queries and cross-notebook inferences while maintaining access to raw, annotated content.
The project encompasses the full lifecycle of corpus creation and utilization: identifying use cases driven by AI models and community priorities, defining an extensible and customizable corpus structure, and performing continuous data ingestion with strict quality control, FAIR (
FAIRJupyter4AI will deploy an automated, scalable reproducibility assessment pipeline that continuously tests notebook execution, environment dependencies, and resource use. Visualization tools will cater to both humans—through workflow diagrams illustrating key reproducibility steps—and machines—via structured JSON matrices. A nanopublication-based “news feed” mechanism will transparently report on reproducibility assessments, fostering community engagement.
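As an illustration of the machine-facing reporting format mentioned above, the following sketch assembles a structured JSON reproducibility matrix; the field names and schema version are hypothetical, not a fixed schema of the project:

```python
import json

def reproducibility_matrix(notebook_id, checks):
    """Assemble a machine-readable reproducibility report.

    `checks` maps step names (illustrative, e.g. "dependencies_declared")
    to booleans indicating whether the step succeeded.
    """
    report = {
        "notebook": notebook_id,
        "schema_version": "0.1",  # illustrative version tag
        "checks": checks,
        "passed": sum(1 for v in checks.values() if v is True),
        "total": len(checks),
    }
    return json.dumps(report, indent=2, sort_keys=True)

example = reproducibility_matrix(
    "nb-000123",  # hypothetical notebook identifier
    {
        "dependencies_declared": True,
        "environment_built": True,
        "executed_without_error": False,
        "outputs_match_original": False,
    },
)
```

A matrix like this can be consumed directly by downstream tools, while the same information feeds the human-facing workflow diagrams.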
A range of AI models, including large language models and code analyzers, will be trained and benchmarked on this richly annotated multimodal corpus. These models will support automated notebook generation and completion, fine-grained reproducibility diagnostics, intelligent error detection and debugging, and workflow optimization, pushing beyond conventional metadata querying to generation and validation aligned with best practices. Real-world benchmarking on imperfect datasets will improve AI robustness, reduce hallucinations, and enhance trustworthiness. To support extensibility and augment training resources, FAIRJupyter4AI will generate synthetic notebooks and cell sequences based on learned statistical patterns, provide APIs for community-driven data and functionality extensions, and align corpus standards with initiatives like NFDI, Wikimedia, and The Carpentries. By creating a continuously enriched, hybrid-structured corpus with integrated benchmarking, visualization, and community tools, FAIRJupyter4AI will become a foundational resource for researchers, educators, and infrastructure providers.
State-of-the-art
Computational reproducibility has emerged as a critical concern in modern scientific research, particularly with the increasing reliance on computational notebooks such as Jupyter for analysis, data processing, and result communication. Multiple studies have tackled various aspects of computational reproducibility, each highlighting unique challenges in methodology, tooling, and data accessibility.
In the life sciences,
More targeted investigations, such as that by
Despite the breadth and scale of these efforts, existing reproducibility datasets suffer from several key limitations. First, they are typically static snapshots, lacking mechanisms for continuous updates or dynamic enrichment. Second, their structure and metadata are often insufficiently standardized or semantically rich to support advanced queries, making integration with federated sources like Wikidata or DBpedia difficult. Third, most datasets are not optimized for AI applications, failing to offer the multimodal annotations or granularity needed for training or evaluating models on reproducibility tasks.
While no prior efforts have produced comprehensive knowledge graphs specifically focused on Jupyter notebooks or their computational reproducibility, there has been substantial progress in adjacent areas (
One of the most relevant contributions is the Open Research Knowledge Graph (ORKG) (
Some technical foundations for structuring code exist in projects like Graph4Code (
Despite these advancements, a significant gap persists: no existing work has constructed a semantically rich, large-scale KG or AI-ready corpus derived specifically from reproducibility assessments of Jupyter notebooks. Such a resource would uniquely bridge symbolic representations (e.g., entities, dependencies, provenance) with statistical views (e.g., embeddings) and potentially enrich this further with execution traces of computational workflows.
Several works have also explored datasets centered on Jupyter notebooks, primarily targeting code generation or comprehension tasks, thereby limiting their applicability to maintenance-focused scenarios (
In contrast, FAIRJupyter4AI focuses specifically on reproducibility-related repairs, creating a dedicated benchmark for error detection and correction in research notebooks. Unlike prior work on general maintenance, our benchmark will target issues such as missing dependencies, broken paths, or environment drift, supporting the development and evaluation of AI models that restore and preserve computational reproducibility in scientific workflows.
Preliminary work
This proposal builds upon our prior research (
Our pipeline tested the reproducibility of these notebooks at scale, employing Conda-based environment reconstruction followed by notebook execution. Out of the total corpus, 10,388 notebooks were executed, with only 1,203 completing without errors, and a mere 879 producing output identical to the original. This highlighted critical issues such as undocumented dependencies, missing data files, and environment drift. We categorized and analyzed the most common failure modes—including ModuleNotFoundError, ImportError, and FileNotFoundError—and observed patterns across journals, research fields, and article types. These findings underscored the fragility and variance of reproducibility across published computational artifacts. The dataset is publicly available (
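The failure-mode categorization described above can be sketched as a simple classifier over the final line of a cell's traceback; the helper below is illustrative and not the pipeline's actual implementation:

```python
import re

# Failure categories discussed in the text; anything else is bucketed
# as "OtherError" for aggregate statistics.
KNOWN_ERRORS = ("ModuleNotFoundError", "ImportError", "FileNotFoundError")

def classify_failure(traceback_text):
    """Map the last line of a traceback to a coarse failure category."""
    last = traceback_text.strip().splitlines()[-1]
    match = re.match(r"(\w+Error)\b", last)
    name = match.group(1) if match else None
    return name if name in KNOWN_ERRORS else "OtherError"

tb = (
    "Traceback (most recent call last):\n"
    "  ...\n"
    "ModuleNotFoundError: No module named 'scipy'"
)
```

Counting such categories across journals, fields and article types yields the failure-mode distributions reported in the prior work.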
To provide structured access to these insights, we developed the FAIR Jupyter Knowledge Graph (
FAIRJupyter4AI builds directly upon this groundwork to deliver a sustainable, extensible, and impactful platform for reproducible and AI-enhanced computational science.
The primary objective of FAIRJupyter4AI is to develop a comprehensive, large-scale, and AI-ready corpus of computational notebooks that transcends existing reproducibility knowledge graphs by integrating the full multimodal content of Jupyter notebooks. This corpus will include executable code, narrative markdown, outputs (plots, tables), and associated data files, all meticulously cleansed, semantically annotated, and enriched to enable advanced AI-driven analysis and interaction. With an initial focus on Python, the corpus will be gradually expanded to include notebooks coded in other languages.
Key goals include:
The proposed project is planned for a total duration of 24 months. This time frame is appropriate to achieve the scientific objectives and complete the planned work packages given the requested resources and the use of Jupyter4NFDI infrastructure.
Fig.
The modular workflow for the construction of the FAIRJupyter4AI dataset will be based on that originally used for the construction of the FAIR Jupyter dataset (
The FAIRJupyter4AI work programme is organized into the following interconnected work packages (WPs) (Fig.
WP1: Data Collection and Curation (Lead: FIZ)
T1.1: Collect 50k+ research-related Jupyter notebooks (3 PM). The existing FAIR Jupyter dataset (
T1.2: Clean notebooks & annotate with licensing/ provenance metadata (3 PM). In this task, we will process the notebooks to add (cases 1 and 2) and remove (cases 3 and 4) some kinds of information as follows: (1) In order to build a corpus of Jupyter notebooks, the provenance and licensing terms need to be tracked (or extracted from README or CFF files and similar contexts) and compliance with licensing terms needs to be ensured. (2) In order to assess reproducibility, dependencies need to be tracked and taken into account, ideally in a recursive fashion (
T1.3: Extend the FAIR Jupyter Ontology to suit this broader corpus (1 PM). The existing FAIR Jupyter knowledge graph uses the FAIR Jupyter Ontology (
T1.4: Assess legal and ethical issues of (re)sharing Jupyter notebooks (2 PM). In the process of collecting, analyzing and annotating the notebooks, we will come across issues that stand in the way of including the notebooks (individually or at scale, whether redacted or not) in corpora for AI. Likewise, questions around the enrichment of the notebooks with related information - e.g. about associated papers, authors, or conferences - will be explored, as well as the sharing of use case-specific subsets, or requests by third parties for inclusion or non-inclusion of specific notebooks or reproducibility attempts. In this task, such issues will be categorized, quantified and analyzed such that the analysis can inform the cleaning workflows in T1.2 in a way that is compliant with the FAIR and the CARE principles. In doing so, we will build on past activities regarding license tagging (
M1.1 (Month 6): Pipeline successfully deployed on Jupyter4NFDI infrastructure. Deploying ReproduceMeGit - the core component of our reproducibility pipeline - to analyze the reproducibility of individual notebooks from NFDI workflows is a Deliverable (by Q3/ 2026, overseen by FIZ) in the Integration Phase of Jupyter4NFDI. We will expand on that and use the Jupyter4NFDI infrastructure to deploy our full ingestion and reproducibility pipelines (which are themselves a set of Jupyter notebooks) and further develop them there for FAIRJupyter4AI purposes.
M1.2 (Month 12): Licensing information available for 20k+ notebooks. Sharing notebooks without clear licensing information essentially prevents their inclusion in AI corpora. Having 20k+ notebooks with clear licensing information thus represents an important milestone in corpus construction, particularly if the licensing terms are compatible with inclusion in such a corpus.
D1.1 (Month 3): Public metadata schema and ingestion pipeline. One prerequisite to make the corpus useful is to have a clear and standards-aligned metadata schema. We will share this early on in order to invite community feedback. Alongside, we will share the pipeline that ingests notebooks and their metadata.
D1.2 (Month 11): Curated metadata corpus of 50k+ notebooks. Besides licensing, other metadata (e.g. programming languages, dependencies) are essential for reproducibility-focused use cases.
D1.3 (Month 14): Report on legal and ethical issues related to (re)sharing Jupyter notebooks at scale. This will summarize the findings from T1.4 and provide recommendations on what notebooks and metadata to include in our corpus, and how (technically as well as in terms of policies/ governance).
D1.4 (Month 24): Public full-text corpus of 20k+ research-related Jupyter notebooks. The notebooks included in this full-text corpus will be a subset of those included in the metadata corpus of D1.2, filtered for availability of licensing information and for compatibility of the license with inclusion in a full-text corpus. Whenever possible, we will also archive these notebooks via Software Heritage.
WP2: Reproducibility Assessment (Lead: TUC)
T2.1: Adapt the Python testing pipeline to work with Docker (3 PM). The existing FAIR Jupyter pipeline did not analyze containerized notebooks. Here, we will adapt the workflow such that (1) notebooks shared with Docker containers can be handled and (2) Python notebooks assessed as fully reproducible can be containerized via Docker. In addition, running the pipeline again for notebooks we already assessed in the past allows us to assess reproducibility decay. A likely starting point for these Docker efforts will be Repo2Docker (
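One way the pipeline might drive Repo2Docker is by constructing its command-line invocation; the sketch below only builds the command (actually running it requires Docker), and the repository URL and image name are hypothetical examples:

```python
# Sketch of shelling out to repo2docker to build (but not run) a
# container image for a notebook repository. Flags such as --ref,
# --image-name and --no-run are part of the repo2docker CLI.
def repo2docker_build_command(repo_url, ref="HEAD", image_name=None):
    cmd = ["jupyter-repo2docker", "--no-run", "--ref", ref]
    if image_name:
        cmd += ["--image-name", image_name]
    cmd.append(repo_url)
    return cmd

cmd = repo2docker_build_command(
    "https://github.com/example/notebook-repo",  # hypothetical repo
    ref="main",
    image_name="fairjupyter/nb-000123",          # hypothetical tag
)
```

The resulting command list could then be passed to `subprocess.run` inside the assessment workflow.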
T2.2: Adapt the testing pipeline to work with R notebooks (3 PM). The existing pipeline covered notebooks in a number of languages other than Python but did not analyze their reproducibility. In this task, we will demo the use of the pipeline for non-Python notebooks by adapting it to notebooks written in R. To this end, we will review and leverage prior efforts regarding the reproducibility of R-based workflows, such as the R package flowR and the project SocEnRep.
T2.3: Develop a non-binary "reproducibility score" metric (1 PM). The existing pipeline kept track of whether required dependencies were indicated, whether they could be successfully installed, whether the notebooks ran through (and if so, whether their results matched the original ones) or whether they caused an exception, and in what cell. These kinds of information - along with relevant changes over time - could be combined into a more holistic reproducibility score, in line with the idea (
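To make the idea of a non-binary score concrete, a minimal sketch of combining the binary signals the pipeline already records into a value in [0, 1]; the weights here are chosen purely for demonstration and are not the project's final metric:

```python
# Illustrative weights (hypothetical, for demonstration only): each
# recorded binary signal contributes a fixed share of the score.
WEIGHTS = {
    "dependencies_declared": 0.2,
    "dependencies_installed": 0.2,
    "executed_without_error": 0.3,
    "outputs_match_original": 0.3,
}

def reproducibility_score(signals):
    """Sum the weights of all signals that are truthy."""
    return round(sum(w for k, w in WEIGHTS.items() if signals.get(k)), 2)

score = reproducibility_score({
    "dependencies_declared": True,
    "dependencies_installed": True,
    "executed_without_error": False,
    "outputs_match_original": False,
})
```

A real metric would likely also factor in changes over time (reproducibility decay) and partial execution up to the failing cell.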
T2.4: Create visualization tools (1 PM). Here, we will build on the existing static visualizations of our initial corpus and on queries for the existing knowledge graph to assist human users of our pipeline and the corpus to visualize aspects relevant to them in a more interactive fashion, including at runtime. We will also explore visualization-like approaches for machines (e.g. structured JSON or matrix formats) and mechanisms to automatically signal information related to reproducibility attempts and results.
M2.1 (Month 9): ReproduceMeGit supports Docker. ReproduceMeGit (
M2.2 (Month 18): ReproduceMeGit with R support and public dashboard. Likewise, providing R support and dashboard functionality for our reproducibility pipeline requires adjustments to the core component, ReproduceMeGit.
D2.1 (Month 10): Dataset of 10k+ scored Python notebooks (e.g. execution logs, error types). The existing pipeline (
D2.2 (Month 11): Dataset of 1k+ scored non-Python notebooks (e.g. execution logs, error types). The existing pipeline did not assess non-Python notebooks, and their scoring might pose some additional challenges, so a smaller corpus seems to be an appropriate start here.
D2.3 (Month 21): Interactive dashboard (hosted demo). The variety of facets that potential users of our corpus might want to explore is large, so we will highlight some key facets in a hosted demo. This will include runtime characteristics not covered so far.
WP3: Knowledge Graph (KG) Development (Lead: FIZ)
T3.1: Extend FAIR Jupyter KG with additional information (4 PM). In line with the changes to the ingestion and assessment pipelines and the FAIR Jupyter Ontology as well as due to the application of the pipelines to new notebooks, the FAIR Jupyter KG would need to be adapted accordingly. For instance, the existing workflows have only partially taken into account notebook execution traces, keeping track of cell execution order and of outputs but not of kernel states during runtime, which we would now like to include. Likewise, decisions would have to be made about whether and how fixed versions of low-reproducibility notebooks would be included in the graph.
T3.2: Integrate embeddings (e.g., code2vec) for cross-notebook search (1 PM). The existing KG supports SPARQL queries, which can yield good results for searches aligned with the ontology. For similarity-based searches, SPARQL is usually not a good choice, but embeddings often are. Here, it likely makes sense to compute similarity separately for code, Markdown, outputs and errors. In addition to standard text-based approaches, code-aware embeddings like code2vec (
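A toy sketch of the similarity-based search this task targets; in practice the vectors would come from a code-aware model such as code2vec rather than the tiny hand-written examples used here:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def most_similar(query_vec, corpus):
    """corpus: mapping from notebook id to embedding vector."""
    return max(corpus, key=lambda nb: cosine(query_vec, corpus[nb]))

# Hypothetical 2-dimensional embeddings for three notebooks.
corpus = {"nb-a": [1.0, 0.0], "nb-b": [0.9, 0.1], "nb-c": [0.0, 1.0]}
best = most_similar([1.0, 0.05], corpus)
```

Computing such similarities separately for code, Markdown, outputs and errors would allow facet-specific nearest-neighbour queries alongside SPARQL.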
T3.3: Link to Wikidata/ORKG (publications, libraries) (1 PM). The FAIR Jupyter KG has some overlap with other KGs. For instance, Wikidata and/or ORKG might have information about the paper associated with a notebook, and federated queries based on the paper’s DOI can in principle find out whether that is the case, but this approach is not very user-friendly. The same goes for other kinds of information, e.g. the libraries or datasets used in a notebook might also have Wikidata entries. In this task, we thus plan to enrich entries in the FAIR Jupyter KG with links to corresponding entries in other KGs, particularly Wikidata and ORKG.
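To illustrate the DOI-based lookup idea, the sketch below builds (but does not execute) a SPARQL query for the public Wikidata endpoint (https://query.wikidata.org/sparql), using the DOI property P356; Wikidata conventionally stores DOIs in upper case, hence the normalization:

```python
def wikidata_doi_query(doi):
    """Build a SPARQL query finding the Wikidata item for a paper by DOI."""
    return f'''
SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P356 "{doi.upper()}" .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}'''

q = wikidata_doi_query("10.3897/rio.11.e171656")
```

The resulting item identifiers can then be stored in the FAIR Jupyter KG as owl:sameAs-style links, avoiding repeated federated queries at search time.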
T3.4: Release APIs for corpus/KG access (1 PM). The existing FAIR Jupyter KG can already be accessed programmatically but the planned changes to the KG imply changes to programmatic access and its documentation. Likewise, the corpus itself should be accessible programmatically, which this task will ensure.
M3.1 (Month 12): KG v1.0 (metadata only). The FAIR Jupyter KG has all the features deemed necessary for metadata corpus construction, along with a useful amount of content and associated documentation.
M3.2 (Month 21): Full multimodal KG release. At this point, the full text of suitably licensed notebooks is also represented in the KG, at least by way of embeddings.
D3.1 (Month 16): Hybrid KG (RDF + embeddings) with SPARQL endpoint. We provide at least one way in which RDF and embeddings can be used together to search by both semantic and similarity criteria.
D3.2 (Month 19): Public REST API. Both the KG and the corpus are programmatically accessible via RESTful APIs that cater to use cases identified through community engagement.
WP4: AI Model Training (Lead: TUC)
T4.1: Train LLMs (e.g., CodeLlama) on corpus for code generation (5 PM). Once the corpus reaches certain thresholds in terms of size, quality and feature richness, it can be used for experimentation with AI models. While general-purpose models like Llama might be useful for the Markdown part of notebooks (e.g. to improve textual documentation), more specialized ones like CodeLlama (
T4.2: Benchmark models on key tasks like error repair (5 PM). Some tasks that we expect to be popular amongst users of our corpus include (1) error repair, debugging or other improvements to a notebook (or its environment) to increase its reproducibility score, (2) cross-language refactoring of code, (3) code suggestions or code completion, (4) assessment of the correspondence between notebook content and the relevant sections of associated publications. For a set of tasks like these and for a range of models, we will set up benchmarking workflows. These workflows will include tests on robustness against hallucinations (e.g. as per
T4.3: Prototype synthetic notebook generator (privacy-safe) (3 PM). When notebooks and/or associated data are not readily shared (for whatever reason), synthetic notebooks and/or data might provide a useful avenue to explore some key aspects nonetheless. For instance, if the methods section of a paper claims to have processed certain types of data in a certain way to yield certain results, then that text can serve as the basis for prompts to create notebooks (perhaps along with suitable data files) that make it possible to explore computational and data aspects of that methodology. In this task, we will explore several scenarios for synthetic notebooks, and how our corpus and suitable models can assist with them. Suitably designed synthetic notebooks can also be used to augment training data and reduce biases.
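A minimal sketch of assembling a synthetic notebook in the nbformat v4 JSON layout, using only the standard library (in practice the nbformat package would be used to create and validate notebooks); the cell contents are placeholders:

```python
import json

def make_synthetic_notebook(code_cells):
    """Build a minimal nbformat-v4-style notebook dict from code strings."""
    return {
        "nbformat": 4,
        "nbformat_minor": 5,
        "metadata": {"language_info": {"name": "python"}},
        "cells": [
            {
                "cell_type": "code",
                "metadata": {},
                "execution_count": None,
                "outputs": [],
                "source": src,
            }
            for src in code_cells
        ],
    }

# Placeholder cells, e.g. derived from a methods-section prompt.
nb = make_synthetic_notebook(["import pandas as pd", "df = pd.DataFrame()"])
serialized = json.dumps(nb)
```

Generated notebooks of this shape can be fed straight into the same ingestion and reproducibility pipelines as harvested ones.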
M4.1 (Month 15): Initial model for code completion. Based on the model exploration activities, we will share a model that performs well on common code completion tasks in research-related Jupyter contexts, and invite community feedback.
M4.2 (Month 24): Full benchmarking suite. For a range of tasks and benchmarking workflows, we provide results and enable others to submit theirs.
D4.1 (Month 17): Hugging Face repository with five models. As Hugging Face is a common, popular and robust platform for sharing AI models and corpora, we will use it as one way to reach out to the community of potential FAIRJupyter4AI users.
D4.2 (Month 22): Benchmarking toolkit (leaderboard for reproducibility tasks). Here, we will share the setup we are using for running our reproducibility benchmarks. The setup is expected to be readily adaptable to other tasks and other data corpora (e.g. subsets of our corpus).
WP5: Communication, Community & Sustainability (Lead: FIZ)
T5.1: Project-internal communication (4 PM). The project involves complex technical work in a rapidly changing landscape, to be performed by individuals yet to be hired into two institutions, under the supervision of two PIs experienced in this space. In such a setting, good and regular internal communication is paramount. To this end, we will share workspaces, progress and problems on an ongoing basis whenever possible.
T5.2: Monitor resource usage (e.g. CPU, memory, cores) (2 PM). This task has several facets: (1) execution traces also depend on the underlying hardware, and the use of resources like memory, CPUs and cores might have direct implications for reproducibility, which we want to track and include in the reproducibility assessment and its representation in the ontology, KG and corpus, (2) for certain types of resources (e.g. TPUs or large RAM), high reproducibility scores might only be achievable on specialized infrastructure, the merits of which would need to be explored via community engagement, especially since (3) the environmental footprints of AI models (both during training and inference) as well as of large datasets or large-scale reproducibility assessments are considerable. Furthermore, (4) the scalability of our own workflows relates to a balance between resource use and resource availability.
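As a rough sketch of per-run resource accounting (facet 1 above), the helper below records wall time and peak memory using the standard library only (Unix-specific; on Linux, `ru_maxrss` is reported in kilobytes); a production version would likely track CPU cores and energy estimates as well:

```python
import resource
import time

def measure(fn, *args):
    """Run fn(*args), returning its result plus simple usage statistics."""
    start = time.perf_counter()
    result = fn(*args)
    wall = time.perf_counter() - start
    # Peak resident set size of this process so far (kilobytes on Linux).
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, {"wall_seconds": wall, "peak_rss_kb": peak_kb}

value, usage = measure(sum, range(1_000_000))
```

Statistics like these can be attached to each reproducibility assessment and represented in the ontology and KG.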
T5.3: Outreach (incl. 3 workshops, nanopublications) (2 PM). Our ingestion and reproducibility pipelines have been developed in the open from the beginning with lots of community feedback through channels like GitHub, JupyterCon or review requests. We want to continue working in this open fashion - e.g. publishing this proposal is another step in this direction - and expect that this approach will allow us to leverage community insights whenever we are not sure how to proceed. In addition to that, we plan to run a series of workshops for the project that are inspired by the ReproHack, Carpentry and NFDI communities and aimed at (1) assisting in the design of the corpus and (2) promoting its use by tool developers, educators and other stakeholders. Last but not least, we intend to experiment with nanopublications as a way to communicate the results of reproducibility assessments and generate additional interest in our workflows.
T5.4: Dissemination (e.g. publish best practice guide) (2 PM). With our methodology, code and data being open already, what remains to be communicated are essentially the insights we get in the process. We have done this for the initial pipeline construction (
M5.1 (Month 8): First workshop. At that point, we expect to have running infrastructure, a good amount of data and some useful ideas about what could realistically be done with the corpus using contemporary models. This seems like a good set of circumstances to engage with the community of potential users in a workshop. The feedback we expect to get there can help us finetune some key aspects of the project like features of the corpus or of the associated ingestion, assessment or benchmarking workflows.
D5.1 (Month 9): Workshop materials are published. Most of these materials will actually be made public beforehand but we want to incorporate some community feedback into these materials, so as to have a better basis for the other workshops.
D5.2 (Month 23): Preprint on the FAIRJupyter4AI corpus. A narrative summary of our approach, with pointers to data, code and other associated materials, as well as with highlights of what we learned, how others are using the corpus or our workflows, and what the next steps might look like.
This structured approach ensures FAIRJupyter4AI delivers a continuously evolving, high-quality, and ethically sound resource that advances AI-driven reproducibility and research transparency at scale.
Fig. FAIRJupyter4AI Project Plan: Work Packages, Tasks, Deliverables, Milestones, and Timeline. This Gantt chart provides a 24-month project schedule, detailing the progression of work packages and individual tasks. The person-month allocation for Doctoral Researcher 1 (DR1) and Doctoral Researcher 2 (DR2) is shown on the right. Tasks marked with an asterisk (*) highlight involvement and contributions from student assistants. Key deliverables (Dx.y) and milestones (Mx.y) are indicated at their respective completion points.
The FAIRJupyter4AI project faces several risks across technical, legal, community, and resource domains. Technical challenges include scalability issues in the reproducibility pipeline and its integration into the knowledge graph, mitigated through modular design and adherence to existing standards. AI model underperformance is addressed by leveraging pretrained models and hybrid approaches. Legal risks, such as licensing conflicts and data bias, are managed through automated filtering and stratified sampling, while privacy concerns are minimized with scrubbing tools. Community engagement risks, including low adoption or tool complexity, are countered with partnerships and user-friendly interfaces. Resource risks like personnel turnover or infrastructure delays are mitigated by workflow documentation and written commitments from providers. A contingency budget is reserved for potential problem areas like emergency cloud costs, legal reviews or community bounty initiatives, ensuring adaptability.
No facet of the proposed research is expected to trigger a requirement for formal ethical review. Nonetheless, we will carefully consider relevant ethical aspects throughout the project. Our dataset, FAIRJupyter4AI, is based on publicly available content from PubMed Central and other research repositories, which may contain personally identifiable information or other sensitive information, as well as different types of biases that would impact AI use. In WP1, we will evaluate these issues (T1.4) and address them in our processing pipeline (T1.2). We also plan to include information on the environmental footprint of our analyses to promote transparency and accountability. These measures are intended to ensure that our study of reproducibility in AI respects ethical principles and contributes constructively to open science.
Generative AI tools (e.g., Deepseek and Gemini) were used for language editing and refinement of certain sections of this proposal. The scientific content remains the responsibility of the authors.
We recognize the environmental implications of large-scale computational experiments and have therefore incorporated metadata related to the environmental footprint of notebook execution—such as estimated energy consumption and hardware utilization—to support the documentation of resource use as a step towards more sustainable digital research practices.
We will implement comprehensive measures for the management, storage, and dissemination of all results and data generated during the project. All research data, including the FAIRJupyter4AI corpus, metadata, and analysis outputs, will be stored using institutional repositories that support long-term preservation and persistent identifiers (e.g., DOIs). Workflows developed in this project will not only be used to generate the FAIRJupyter4AI corpus but will also be incorporated into the Jupyter4NFDI service portfolio and thereby be accessible to NFDI consortia and through NFDI to the wider research community. Where legally and ethically permissible, we plan to publish our metadata corpus under CC0 and our full-text corpus such that only openly licensed notebooks are included, to ensure accessibility and reusability by the scientific community. In addition, we will make all code and workflows available via GitHub and archive them via Zenodo. Documentation and metadata will follow the FAIR principles (
Publications resulting from the project and any relevant documentation will be available via open access, making them widely accessible for use by third parties. The source code for the software developed under the project will be documented in accordance with established standards, licensed with an open-source license, and made available for use by third parties free of charge.
Mietchen, Daniel - Postdoctoral Researcher (permanent)
Samuel, Sheeba - Postdoctoral Researcher (permanent)
Fodstad, Franziska - Secretary at the Department of Mathematics, FIZ Karlsruhe: She will provide administrative support for this project.
Herklotz, Dajana - Secretary at the Institute of Computer Science, Chemnitz University of Technology: She will provide administrative support for this project.
This project is currently not funded and no funding proposal has been previously submitted for it.
TUC and FIZ Karlsruhe will provide essential workplace equipment and infrastructure, including IT support, computing facilities, and collaborative tools. In addition, Jupyter4NFDI is providing infrastructure for hosting and managing the FAIRJupyter4AI workflows, including storage, computing, and long-term availability mechanisms. Dr. Mietchen and Dr. Samuel, funded by FIZ and TUC, respectively, will each contribute approximately five hours per week to the project, providing expert supervision to the doctoral researchers and support staff, and ensuring close coordination across partners.
Requested funding for Staff:
Doctoral Researcher (Wissenschaftliche Mitarbeiter; Table
| Name | Staff | Quantity | Months | Sum (€) | Applicant |
|------|-------|----------|--------|---------|-----------|
| N.N. | Doctoral Researcher or Comparable (100%) | 1 | 24 | 163,200 | Mietchen, Daniel |
| N.N. | Doctoral Researcher or Comparable (100%) | 1 | 24 | 163,200 | Samuel, Sheeba |
| Total | | | | 326,400 | |
Support Staff (Studentische/Wissenschaftliche Hilfskräfte; Table
| Name | Staff | Quantity | Hours per Month | Sum (€) | Applicant |
|------|-------|----------|-----------------|---------|-----------|
| N.N. | Support Staff | 1 | 80 | 20,000 | Mietchen, Daniel |
| N.N. | Support Staff | 1 | 80 | 20,000 | Samuel, Sheeba |
| Total | | | | 40,000 | |
Job descriptions of staff (requested funds):
N.N. DR1 (Doctoral Researcher 1, 100%) will primarily work on WP3: Knowledge Graph (KG) Development and WP4: AI Model Training. In WP1, they will be involved in collecting and cleaning notebooks. For WP3, their main focus will be on extending the FAIR Jupyter KG with additional information (T3.1), integrating embeddings for cross-notebook search (T3.2), and linking to external knowledge bases like Wikidata/ORKG (T3.3). This researcher will also contribute to the WP2 reproducibility pipeline (T2.1, T2.2, and T2.4), to project-internal communication (T5.1), and to outreach (T5.3). For WP4, they will lead the training of LLMs (T4.1), benchmarking models on error repair (T4.2), and prototyping the synthetic notebook generator (T4.3).
N.N. DR2 (Doctoral Researcher 2, 100%) will primarily focus on WP1: Data Collection & Curation and WP2: Reproducibility Assessment. In WP1, they will be heavily involved in collecting and cleaning notebooks and, critically, in extending the FAIR Jupyter Ontology (T1.3). In WP2, their core responsibilities include adapting the testing pipelines to work with Docker (T2.1) and with R notebooks (T2.2), and developing the non-binary "reproducibility score" metric (T2.3).
Support staff will primarily focus on providing crucial assistance across all work packages, particularly in data-intensive and foundational tasks. Their main contributions will be in WP1: Data Collection & Curation, specifically in the collection (T1.1), cleaning, and annotation of notebooks (T1.2). They will also support WP2: Reproducibility Assessment by assisting with the adaptation of the Python and R testing pipelines (T2.1, T2.2) and with WP4: AI Model Training in tasks like LLM training (T4.1) and benchmarking (T4.2). They will be actively involved in WP5: Communication, Community & Sustainability through project-internal communication (T5.1) and outreach activities including workshops (T5.3).
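As an illustration of the non-binary reproducibility score mentioned under T2.3, the sketch below combines environment setup, cell execution, and output matching into a single value in [0, 1]. The components, weights, and function name are illustrative assumptions, not the metric the project will define.

```python
# Hedged sketch of a non-binary reproducibility score (cf. T2.3).
# Components and weights are illustrative assumptions only.

def reproducibility_score(installed_ok: bool, executed_cells: int,
                          total_cells: int, matching_outputs: int) -> float:
    """Combine dependency installation, execution rate, and
    output-match rate into a single score in [0, 1]."""
    if total_cells == 0:
        return 0.0
    execution_rate = executed_cells / total_cells
    match_rate = matching_outputs / total_cells
    # Illustrative weighting: environment 0.2, execution 0.4, outputs 0.4.
    return 0.2 * float(installed_ok) + 0.4 * execution_rate + 0.4 * match_rate


# A notebook whose environment installs, whose cells all run, but where
# only half of the outputs match the originals:
score = reproducibility_score(True, 10, 10, 5)
```

Unlike a binary reproducible/non-reproducible label, such a score would let corpus users rank notebooks by partial reproducibility and track improvements after automated repair.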
Requested funding for direct costs:
Travel Costs:
We request travel funds to support dissemination, outreach, and coordination activities that are essential to the success of the FAIRJupyter4AI project (Table
| Funding for | Sum (€) | Applicant |
|-------------|---------|-----------|
| Travel | 10,000 | Mietchen, Daniel |
| Travel | 7,000 | Samuel, Sheeba |
Table
| Funding for | Sum (€) | Applicant |
|-------------|---------|-----------|
| Publications | 1,500 | Mietchen, Daniel |
| Publications | 1,500 | Samuel, Sheeba |
Other Modules:
Project Specific Workshops
We request a total of 10,000 EUR to support the organization of three project-specific workshops over the two-year funding period (Table
| Funding for | Sum (€) | Applicant |
|-------------|---------|-----------|
| Project Specific Workshops | 5,000 | Mietchen, Daniel |
| Project Specific Workshops | 5,000 | Samuel, Sheeba |
Requested funding for instrumentation
None
Total funding requested
Table
LIS Funding Programme or Call: Data Corpora for Artificial Intelligence (
FIZ Karlsruhe Leibniz Institute for Information Infrastructure (FIZ Karlsruhe)
Chemnitz University of Technology, Chemnitz
Daniel Mietchen and Sheeba Samuel contributed equally to this proposal.