GSoC 2026

Genome Assembly and Annotation — Project Ideas

EMBL-EBI is a global leader in biological data. We develop and maintain open data resources and open-source software that support life science research worldwide. Our teams work at the intersection of biology, data science, and software engineering, building tools used daily by researchers across academia, healthcare, and industry.

Through Google Summer of Code (GSoC), EMBL-EBI mentors contributors to work on real-world open-source projects, helping them develop technical skills, domain knowledge, and experience contributing to widely used scientific software. The project ideas listed below reflect the breadth of work across EMBL-EBI and are designed to support contributors at different experience levels.

How to apply

Google Summer of Code contributors apply directly through the GSoC platform, but we strongly encourage you to engage with EMBL-EBI before submitting your application.

Step 1: Explore the project ideas. Review the project ideas listed below and identify one (or more) that match your interests and skills. Each project includes information about expected outcomes, required skills, and difficulty level.

Step 2: Do some background reading. You are not expected to be an expert in the domain, but spending some time familiarising yourself with the relevant technologies, data resources, or scientific context will help you prepare a stronger application.

Step 3: Get in touch with us early. Once you have a project in mind, contact our GSoC helpdesk (

  • a short CV or link to relevant experience (e.g. GitHub, portfolio)
  • the project you are interested in
  • a brief explanation of why you are interested
  • any specific questions you may have

If you are interested in proposing your own project idea, include a short description and the technologies you expect to use so we can assess mentor availability.

Step 4: Draft your application. Prepare your application well ahead of the deadline. A strong proposal clearly explains:

  • what you plan to build
  • how you will approach the work
  • a realistic timeline with milestones
  • what you hope to learn during the project

Mentors can provide guidance and feedback, but they will not write the application for you.

Step 5: Incorporate feedback and submit. Use any feedback provided to refine your proposal, then submit your final application via the official GSoC website before the deadline.

For more detailed advice, please see our Contributor Guide, which outlines what we look for in successful applications and how contributors engage with EMBL-EBI teams during the programme.

  • Expose a subset of ENA REST Services as MCP
  • Creating a knowledge graph from a subset of ENA and BioSamples data
  • Annotation metrics reporting & analysis modules for the Ensembl Assembly/Annotation tracking app
  • Development of a refinement tool to identify selenoproteins in Ensembl genesets
  • Expanding a pipeline for small non-coding RNA (sncRNA) identification in Ensembl Genomes
  • Expand genome metadata in Ensembl with AI tools
  • BUSCO-Missing Investigator (BMI): a reproducible pipeline to explain “missing/fragmented BUSCOs”
  • Ask VEPai. Trained chatbot interface for Ensembl VEP web.
  • nf-core/vep: Extending and standardising Ensembl VEP workflow for nf-core
  • Building a perturbation-aware LLM for multimodal in-silico perturbation modelling
  • Designing an open access Ensembl GraphQL Workshop
  • Standardised evaluation for microbiome dataset classifiers
  • Sequence similarity networks for the visualisation and exploration of MGnify Proteins
  • A genomic feature database in the browser
  • Design and API-aware UI generation using MCP servers and Figma APIs
  • Expose BioSamples Submission and Search Capabilities as MCP Tools for AI-Assisted Metadata Interaction

Brief Explanation

The European Nucleotide Archive (ENA) provides a rich set of REST APIs that allow users to query genomic metadata, sequence records, and submission information. While these APIs are powerful, they are not directly accessible to modern AI agents or LLM-based tools that rely on standardized interaction protocols.

This project aims to expose a carefully selected subset of ENA REST services through the Model Context Protocol (MCP), enabling AI agents to interact with ENA programmatically in a safe, structured, and reproducible way. MCP acts as a bridge between large language models and external tools by defining explicit schemas, inputs, and outputs, which helps reduce hallucinations and supports reliable access to authoritative data sources.

The student will design and implement an MCP server that wraps ENA REST endpoints (e.g. study metadata lookup, run/sample queries, accession search) and exposes them as well-defined MCP tools. The project focuses on correctness, usability, and extensibility rather than deep bioinformatics analysis.

This project is intentionally scoped to be beginner-friendly, with limited bioinformatics background required, and emphasizes software engineering, API design, and AI-tool integration.

Expected results

  • A production-ready MCP server exposing a broad subset of ENA REST APIs

  • Demonstrations of:

  • LLM or agent-based querying of ENA via MCP

  • Deterministic, traceable responses backed by ENA data

  • A scalable foundation that can be extended to other EMBL-EBI data resources

  • Comprehensive documentation, including:

  • Architecture and design decisions

  • MCP tool specifications

  • Examples of AI agent workflows using ENA MCP tools

  • Support for advanced capabilities such as:

  • Composing multiple ENA queries into a single logical operation

  • Normalizing heterogeneous ENA responses into consistent formats

  • Caching and rate-limit-aware request handling

  • Well-designed MCP tool schemas with:

  • Strong input validation, explicit output contracts, clear error semantics
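
As a rough illustration of "strong input validation, explicit output contracts, clear error semantics", the request-building half of one tool might look like the sketch below. It is not an MCP server and sends no request. The endpoint URL, the query field names, the accession pattern, and the limit cap are assumptions to be checked against the ENA documentation, and `build_run_lookup`, `ToolRequest`, and `ToolInputError` are hypothetical names.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlencode

# Assumption: base URL of the ENA Portal API search endpoint (check the ENA docs).
ENA_SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"

# Assumption: study accessions look like PRJEB1234 or ERP001234.
STUDY_ACCESSION = re.compile(r"^(PRJ[EDN][A-Z]\d+|[EDS]RP\d{6,})$")


class ToolInputError(ValueError):
    """Raised when a tool call fails input validation (clear error semantics)."""


@dataclass(frozen=True)
class ToolRequest:
    """Explicit output contract of the request-building step."""
    url: str
    params: dict


def build_run_lookup(study_accession: str, limit: int = 10) -> ToolRequest:
    """Validate the input and build the request that lists runs for one study."""
    accession = study_accession.strip().upper()
    if not STUDY_ACCESSION.match(accession):
        raise ToolInputError(f"not a study accession: {study_accession!r}")
    if not 1 <= limit <= 1000:  # the cap of 1000 is an arbitrary illustrative choice
        raise ToolInputError("limit must be between 1 and 1000")
    params = {
        "result": "read_run",
        "query": f'study_accession="{accession}"',
        "fields": "run_accession,sample_accession",
        "format": "json",
        "limit": str(limit),
    }
    return ToolRequest(url=f"{ENA_SEARCH_URL}?{urlencode(params)}", params=params)
```

Rejecting malformed accessions before any HTTP call keeps errors attributable to the caller rather than to ENA, which is one way an agent could receive an actionable error message.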

Required knowledge

  • Strong Python programming skills
  • REST APIs, HTTP, and JSON
  • Software architecture and modular design
  • Schema definition and validation

Desirable knowledge

  • Familiarity with MCP or LLM tool/function calling
  • API performance optimization and caching strategies
  • Experience with containerization (Docker) or CI/CD pipelines

Difficulty

Medium

Length

Medium – 175h

Mentors

Brief explanation

The European Nucleotide Archive (ENA), one of the three major nucleotide databases in the world, hosts over 70 PB of genomics data. LLMs are well suited to parsing unstructured data, but less so to working with structured data.

This project will create a prototype knowledge graph (KG) that makes the database directly accessible to AI tools. A graph engine will be integrated with the existing structured data store to avoid duplicating data into a graph database. An AI-friendly Graph Query Language (GQL) will be used to interact with the KG, which the graph engine backs dynamically with the relational data model. High-profile LLMs will be evaluated on their ability to generate GQL statements. The final output will be one or more AI agents that support Graph RAG over a subset of the structured genomics data in ENA, with the following characteristics:

  • A working prototype capable of querying a small subset of ENA data (e.g. pathogen, AMR or data deposition analytics).
  • A clear path to scale up the prototype to expand to all structured data in ENA.
  • AI agent(s) with absolutely no hallucination.
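
One simple way to compare LLMs on generated GQL statements is execution accuracy: run the generated query and a hand-written reference query, then check whether the two result sets agree. The sketch below assumes the results have already been fetched as lists of rows. `execution_accuracy` is a hypothetical helper, not part of any graph engine or benchmarking library.

```python
from collections import Counter


def execution_accuracy(cases):
    """Fraction of cases where the generated query returned the expected rows.

    `cases` is an iterable of (expected_rows, actual_rows) pairs. Rows are
    compared as multisets, so row order is ignored but duplicates count.
    """
    cases = list(cases)
    if not cases:
        raise ValueError("at least one benchmark case is required")
    hits = sum(
        Counter(map(tuple, expected)) == Counter(map(tuple, actual))
        for expected, actual in cases
    )
    return hits / len(cases)
```

Comparing results rather than query text avoids penalising a model for producing a differently worded but equivalent traversal.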

Expected results

  • Students would learn how AI components are used to construct agent workflows.
  • Students would gain first-hand experience of creating working prototypes beyond “hello-world” toys.
  • Students would be able to create standalone AI agents capable of interacting with ENA data with minimal dependencies.
  • Students would be able to apply the knowledge gained over the summer to create Graph RAGs on any structured data.

Required knowledge

  • AI-friendly GQL (e.g. Gremlin)
  • Graph engine (e.g. PuppyGraph)
  • Python and libraries for AI-agent construction (e.g. LangChain)
  • Methodology for benchmarking Graph RAG and GQL

Desirable knowledge

  • ENA schema and tagging mechanism
  • Kubernetes and its scaling
  • Scalable local deployment of LLM models (e.g. Ollama)

Difficulty: High

Length: 350h

Mentors: David Yuan (davidyuan@ebi.ac.uk)

Brief explanation

Ensembl maintains an internal web application to track genome assembly status (e.g. candidates for annotation), Ensembl annotation status, and associated quality/completeness metrics. While the app stores important annotation completeness scores and other quantitative measures, it currently lacks richer reporting and comparative views that help users quickly interpret genome annotation quality across many species.

This project focuses on designing and implementing Python-based analysis modules for genome annotation metrics and comparative analysis, with a strong emphasis on clean workflows, test coverage, and maintainability. These modules will generate per-genome reports and perform comparative analyses across taxonomic groupings, enabling annotators and production teams to identify unusual annotations, trends, and priorities in a reproducible and testable way.

There is an opportunity to integrate the resulting modules into the existing tracking web application, but web/UI integration is not essential to the core project. The primary goal is to produce robust, well-tested backend analysis components that can later be surfaced via the app or reused in other contexts (e.g. batch reporting, pipelines).

Expected results

Deliverable 1: Per-genome annotation metrics report module

  • Develop a Python module that generates a structured metrics report for a single genome, based on the existing data model.

  • The report should include (as available in the stored data model), for example:

  • annotation completeness scores (and/or component sub-scores)

  • number of protein-coding genes/transcripts

  • exon counts and distribution summaries (e.g. exons per transcript)

  • gene/transcript length distributions or summary stats

  • any other tracked QC/production metrics already stored

  • Include clear tables and a small set of “at-a-glance” visuals (e.g. sparklines, histograms, boxplots, score badges).

  • Unit tests for individual metric calculations

  • Integration tests for full per-genome report generation

  • Clear separation between data access, computation, and presentation layers
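
A minimal sketch of the separation between computation and presentation for one metric (exons per transcript) is shown below. The input is assumed to be a plain list of exon counts already loaded from the data model. `ExonSummary`, `summarise_exons`, and `render_row` are hypothetical names, not part of the tracking app.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass(frozen=True)
class ExonSummary:
    transcripts: int
    min_exons: int
    median_exons: float
    mean_exons: float
    max_exons: int


def summarise_exons(exons_per_transcript: list[int]) -> ExonSummary:
    """Computation layer: summary statistics for exons per transcript."""
    if not exons_per_transcript:
        raise ValueError("no transcripts supplied")
    return ExonSummary(
        transcripts=len(exons_per_transcript),
        min_exons=min(exons_per_transcript),
        median_exons=median(exons_per_transcript),
        mean_exons=mean(exons_per_transcript),
        max_exons=max(exons_per_transcript),
    )


def render_row(genome: str, summary: ExonSummary) -> str:
    """Presentation layer: one tab-separated row of a report table."""
    return (
        f"{genome}\t{summary.transcripts}\t{summary.min_exons}\t"
        f"{summary.median_exons:g}\t{summary.mean_exons:.2f}\t{summary.max_exons}"
    )
```

Because `summarise_exons` has no knowledge of where the counts came from or how they are displayed, it can be unit-tested on its own, while an integration test would cover loading, computing, and rendering together.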

Deliverable 2: Taxonomy grouping + comparative analysis module

  • Add functionality to group genomes by taxonomic classification at multiple ranks (e.g. species/genus/family/order/class/phylum).

  • Provide controls to select:

  • taxonomic rank

  • comparison set (e.g., “all annotated in Ensembl release X”)

  • metrics to include in analysis

  • Implement multivariate analysis (MVA) to identify trends and outliers within each group, e.g.:

  • PCA (or similar dimensionality reduction)

  • clustering (optional, depending on scope and usefulness)

  • outlier detection heuristics (distance-based, robust z-scores, etc.)

  • Produce “nice visuals” suitable for production/QC workflows, such as:

  • PCA scatter with interactive point details (genome name, key metrics)

  • heatmaps/correlation views

  • rank-based comparison plots (e.g. boxplots per clade)

  • outlier summary list linking back to the genome report page
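
The robust z-score heuristic mentioned above can be sketched with the median and the median absolute deviation (MAD). The scaling constant 0.6745 and the cut-off of 3.5 follow a commonly used convention for MAD-based outlier detection. The function names and the example metric are illustrative assumptions.

```python
from statistics import median


def robust_z_scores(values: dict[str, float]) -> dict[str, float]:
    """Modified z-score per genome: 0.6745 * (x - median) / MAD."""
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        raise ValueError("MAD is zero; robust z-scores are undefined")
    return {name: 0.6745 * (v - med) / mad for name, v in values.items()}


def flag_outliers(values: dict[str, float], threshold: float = 3.5) -> list[str]:
    """Names of genomes whose absolute robust z-score exceeds the threshold."""
    scores = robust_z_scores(values)
    return sorted(name for name, z in scores.items() if abs(z) > threshold)
```

For example, applied to protein-coding gene counts within one clade, a genome with roughly three times the typical count would be flagged, whereas a mean-based z-score could be masked by that same extreme value.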

Final project output

  • A set of well-documented, reusable Python modules for:

  • per-genome annotation metric reporting

  • taxonomy-based comparative analysis

  • Comprehensive test suite covering core logic and analysis workflows

  • Clear documentation describing:

  • module data flow

  • how metrics are computed

  • how to extend the system with new metrics or analyses

  • Optional (stretch goal): example integration points or lightweight endpoints demonstrating how the modules could be plugged into the existing tracking web application

Required knowledge

  • Python for data handling and statistical analysis.
  • Basic understanding of genome annotations and common metrics (genes/transcripts/exons, completeness/QC measures).

Desirable knowledge

  • Experience with multivariate analysis (PCA, clustering) and practical outlier detection.
  • Familiarity with taxonomy sources/identifiers and rank-based grouping (NCBI taxonomy, etc.).
  • Experience producing clear scientific dashboards/visualisations.
  • Experience working with existing codebases and adding features in a maintainable way (tests, docs, code quality).
  • Experience designing clean APIs and modular analysis code.
  • Familiarity with web application backends or data visualisation, for optional integration work (FastAPI).
  • Familiarity with web application frontends, for optional integration work (React/Node.js).
  • Interest in product design and design tools (e.g. Figma).

Difficulty: Beginner

Length: 175h

Mentors: Anna Lazar, Simarpreet Kaur Bhurji, Leanne Haggerty

Brief explanation

Selenocysteine-containing proteins (selenoproteins) play crucial biological roles, but their annotation remains challenging due to the unique incorporation of selenocysteine (Sec, U) at UGA codons. Currently, Ensembl uses Exonerate to align known selenoproteins to genomes and manually verifies models based on sequence identity and coverage. However, the existing approach is inefficient and outdated, requiring a more scalable and automated solution.

This project will develop a Nextflow pipeline to efficiently annotate selenoproteins that can be applied to Ensembl gene sets by:

  • Optimising the search for selenoprotein homologs

  • Aligning known selenoproteins against the genome using more efficient tools like MMseqs2, DIAMOND, or TBLASTN.

  • Filtering candidate regions based on sequence similarity, focusing on high-identity and high-coverage matches.

  • Improving selenocysteine validation

  • Detecting UGA codons in aligned models and verifying the presence of SECIS elements (selenocysteine insertion sequences) in downstream regions.

  • Ensuring selenocysteine positions match the reference protein sequences.

  • Automated filtering and quality control

  • Retaining only models with expected coverage and sequence identity to known selenoproteins.

  • Benchmarking against accurate but computationally intensive dedicated selenoprotein annotation tools

  • Removing false positives by integrating BUSCO-like completeness scoring with clade-specific selenoprotein sets.

  • Generating quality assessment reports.

  • Deployability and scalability

  • Implementing the pipeline in Nextflow to improve reproducibility and scalability across multiple genomes.

  • Providing Docker/Singularity containers for easy deployment in HPC and cloud environments.

The final pipeline will be integrated within Ensembl’s genome annotation pipeline to streamline selenoprotein identification, thus improving accuracy, efficiency, and automation.
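
The selenocysteine validation step described above (finding in-frame UGA codons and checking them against the reference protein) could be prototyped as follows. This sketch assumes an ungapped, one-to-one correspondence between codon i of the model and residue i of the reference, which a real pipeline would replace with positions taken from the alignment. Both function names are hypothetical.

```python
def in_frame_tga_positions(cds: str) -> list[int]:
    """0-based codon indices of in-frame TGA/UGA codons, excluding the last codon.

    The final codon is skipped because a terminal TGA may be a genuine stop.
    """
    cds = cds.upper().replace("U", "T")
    n_codons = len(cds) // 3
    return [i for i in range(n_codons - 1) if cds[3 * i:3 * i + 3] == "TGA"]


def sec_positions_match(cds: str, reference_protein: str) -> bool:
    """True if every internal TGA codon lines up with a 'U' in the reference.

    Assumes codon i corresponds to reference residue i (no gaps).
    """
    reference_sec = [i for i, aa in enumerate(reference_protein.upper()) if aa == "U"]
    return in_frame_tga_positions(cds) == reference_sec
```

A model that passes this check would still need SECIS element verification downstream, which is not covered here.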

Expected results

  • A Nextflow-based selenoprotein annotation pipeline that aligns known selenoproteins and predicts valid selenocysteine-containing models.
  • Automated verification of UGA codons and SECIS elements.
  • Integration-ready outputs for Ensembl gene sets.
  • Containerised workflow for deployment on multiple computing environments.

Required knowledge

  • Nextflow or similar workflow automation tools.
  • Sequence alignment tools (DIAMOND, MMseqs2, TBLASTN, Exonerate).
  • Genome annotation formats (FASTA, GFF3).
  • Basic knowledge of selenoprotein biology and SECIS elements.

Desirable knowledge

  • Experience with gene annotation pipelines (e.g., AUGUSTUS, BRAKER, HELIXER).
  • RNA structure analysis tools for SECIS detection (e.g., SECISearch, Infernal).
  • BUSCO or other completeness assessment tools.
  • Containerisation technologies (Docker, Singularity).

Difficulty: Medium

Length: 175h

Mentors: Jack Tierney

Brief explanation

This project aims to enhance an existing pipeline for identifying small non-coding RNAs (sncRNAs) in Ensembl genomes. Building on the current MirMachine modules, the pipeline will be expanded to incorporate additional analyses using RFAM and miRBase databases.

Further improvements will include running sequence similarity searches with NCBI-BLAST and generating structural models using the Infernal software suite. The final pipeline will be optimised for flexibility, supporting various input sources, and containerised using Docker/Singularity to ensure reproducibility and shareability.

Expected results

  • Integration of RFAM and miRBase data for improved sncRNA annotation.
  • Incorporation of NCBI-BLAST for sequence similarity searches.
  • Implementation of Infernal for RNA structural model generation.
  • Optimisation of pipeline scalability and flexibility for different input sources.
  • Containerisation of the pipeline using Docker/Singularity for easy deployment.
  • Documentation and testing to ensure usability and reproducibility.
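
Whatever tool produces a hit (MirMachine, BLAST, or Infernal), the pipeline will likely need to emit a common annotation format. The sketch below converts a generic hit record into one GFF3 line. The `Hit` fields, the default `source` value, and the use of the `ncRNA` feature type are illustrative assumptions, not the pipeline's actual output schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    seqid: str    # sequence (e.g. chromosome or contig) the hit lies on
    family: str   # RNA family or model name reported by the search tool
    start: int    # 1-based coordinate; may be greater than `end` on the minus strand
    end: int
    score: float
    strand: str   # "+" or "-"


def hit_to_gff3(hit: Hit, source: str = "sncRNA_pipeline") -> str:
    """Format one hit as a nine-column GFF3 line (start <= end, phase '.')."""
    start, end = sorted((hit.start, hit.end))
    attributes = f"ID={hit.family}_{hit.seqid}_{start};Name={hit.family}"
    columns = [
        hit.seqid, source, "ncRNA", str(start), str(end),
        f"{hit.score:.1f}", hit.strand, ".", attributes,
    ]
    return "\t".join(columns)
```

Sorting the coordinates matters because some search tools report minus-strand hits with the start greater than the end, while GFF3 requires start to be less than or equal to end.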

Required knowledge

  • Nextflow or other workflow management tools.
  • Python and/or Bash for pipeline scripting.
  • Basic RNA bioinformatics (FASTA, GFF3 formats, RNA databases).

Desirable knowledge

  • Experience with RFAM, miRBase, and NCBI-BLAST.
  • Familiarity with Infernal for RNA secondary structure modeling.
  • Knowledge of Docker/Singularity for workflow containerisation.
  • Experience in workflow optimisation for large-scale genomic data.

Difficulty: Medium

Length: 175h

Mentors: Jose Perez-Silva, Vianey Paola Barrera Enriquez

Brief explanation

Ensembl Plants and Ensembl Metazoa import publicly available genome assemblies and their annotations from community contributors. Whilst assemblies are submitted to INSDC sequence archives, these submissions often lack key information that can usually be found in the publication corresponding to that assembly (most frequently because those metadata fields are not available in the submission process). This metadata may not be directly useful to our users, but Ensembl can benefit from it: for example, polyploid genomes require different processing parameters and tools than diploid genomes when it comes to comparative genomics. Current AI tools make fetching such metadata from research papers much easier, so we would like to build a standalone module that performs this task, with the ultimate goal of incorporating it into our genome loading pipeline.

Expected results

  • A standalone module (preferably written in Python, but any current bioinformatics programming language would be acceptable) that can fetch the required genome metadata from current (publicly available) literature
  • The code will include documentation as well as type hints (if the selected programming language allows it) and unit testing
  • Capacity to retrain/expand as new research papers are published
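
Since output from AI tools can be free-form, one plausible design is to have the extraction step return JSON and to validate it into a typed record before it reaches the loading pipeline. The sketch below assumes that design. The field names, the allowed ploidy values, and `parse_extraction` are hypothetical, and no AI tool is called.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Assumption: a small controlled vocabulary for ploidy.
ALLOWED_PLOIDY = {"haploid", "diploid", "polyploid"}


@dataclass(frozen=True)
class AssemblyMetadata:
    assembly_accession: str
    ploidy: Optional[str]
    source_doi: Optional[str]


def parse_extraction(raw_json: str) -> AssemblyMetadata:
    """Validate JSON returned by a (hypothetical) literature-extraction step."""
    data = json.loads(raw_json)
    ploidy = data.get("ploidy")
    if ploidy is not None:
        ploidy = str(ploidy).strip().lower()
        if ploidy not in ALLOWED_PLOIDY:
            raise ValueError(f"unexpected ploidy value: {ploidy!r}")
    return AssemblyMetadata(
        assembly_accession=str(data["assembly_accession"]),
        ploidy=ploidy,
        source_doi=data.get("source_doi"),
    )
```

Keeping the source DOI alongside each extracted value would let a curator trace a value back to the paper it came from.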

Required knowledge

  • AI tools for fetching/querying a database of research papers
  • Building a module that can later be included in a production pipeline written in Nextflow

Desirable knowledge

  • Familiarity with the metadata associated with invertebrate and/or plant assemblies

Difficulty: Medium

Length: 175h

Mentors: Jorge Alvarez, Disha Lodha

Brief description

BUSCO is widely used to measure assembly completeness, but after seeing “Missing” (and often “Fragmented”) BUSCOs, users still need to answer: why are these genes missing and what should I do next?

This project proposes a reproducible, best-practice pipeline/tool that takes BUSCO outputs (and optionally assemblies/annotations/reads) and automatically gathers evidence to generate interpretable, ranked explanations per BUSCO along with a clean, actionable report.

Expected results

  • Summary table: BUSCO ID → status, top reason code(s), confidence scores
  • Report (HTML/Markdown): overview plots + top actionable recommendations (e.g., “try a closer lineage dataset”, “investigate contig ends”, “annotation rerun suggested”, “coverage drop suggests gap”)
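
A first step towards the summary table is collecting the BUSCO IDs for each status. The sketch below assumes the tab-separated layout of BUSCO's full table output (comment lines starting with `#`, then BUSCO ID and status in the first two columns), which should be verified against the BUSCO version in use. `busco_ids_by_status` is a hypothetical helper, and reason codes are out of scope here.

```python
from collections import defaultdict


def busco_ids_by_status(full_table_text: str) -> dict[str, list[str]]:
    """Group BUSCO IDs by status (e.g. Complete, Duplicated, Fragmented, Missing)."""
    by_status = defaultdict(list)
    for line in full_table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # malformed row: no status column
        busco_id, status = fields[0], fields[1]
        # Duplicated BUSCOs appear once per copy; record each ID only once.
        if busco_id not in by_status[status]:
            by_status[status].append(busco_id)
    return dict(by_status)
```

The `Missing` and `Fragmented` lists returned here would be the starting point for gathering evidence for each BUSCO.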

Required skills

  • Command line + Linux basics
  • Python (or similar) for parsing, feature extraction, and reporting
  • Genomics basics (assemblies, gene models, alignments)
  • Reproducibility practices

Desirable skills

  • Familiarity with alignment outputs and scoring
  • Workflow engineering (Nextflow)
  • Experience packaging bioinformatics tools and writing robust docs/tests

Difficulty: Medium

Length: 175 hours

Mentors: Swati Sinha, Jitender Jit Singh Cheema

Brief explanation

Ensembl VEP is a widely used tool (10+ million Docker Hub pulls alone) that enables the annotation and prioritisation of genetic variants and is used extensively in academic research and clinical assessments.

This project would be to prototype an AI chatbot configuration interface for the version of Ensembl VEP run from the new Ensembl website. The current selection of options for running the web version of Ensembl VEP is extensive, and requires users to be experienced or willing to read lots of tooltips and help documentation. A better option would be for users to describe their data and what they’re trying to achieve, then receive a set of suggested options with justifications. They could then click to apply these to the configuration before Ensembl VEP runs.

Each Ensembl VEP option would be assessed and labelled and weighted appropriately. We would then identify an appropriate base chat-bot model and assemble a corpus of training data, from a mixture of our responses to users and specific constructed examples. These would be divided into training and test sets for first training the model and then assessing responses.

If this is completed, an optional extension of the project would be to produce a simple API wrapper for IO.
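
The division of the labelled corpus into training and test sets can be made reproducible with a fixed seed, as in the sketch below. The 80/20 ratio, the default seed, and `split_examples` are illustrative assumptions rather than project decisions.

```python
import random


def split_examples(examples, test_fraction: float = 0.2, seed: int = 0):
    """Shuffle deterministically and return (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global state is untouched
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

Fixing the seed means that the same held-out examples are used every time a new prototype model is assessed, so scores remain comparable across training runs.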

Expected results

  • Assemble and label training data
  • Train a prototype model
  • (Optional) API wrapper

Required knowledge

  • Python (and ML libraries)
  • Data annotation

Desirable knowledge

  • HPC interaction
  • Model training experience
  • Prompt engineering
  • Understanding / interest in genetic variant annotation

Learning outcomes

Gain experience with data annotation and agent model training and testing, supporting a globally utilised genetic variant annotation tool.

Difficulty: Medium

Length: 175 hours

Mentors: Likhitha Surapaneni

Brief explanation

The goal of this project is to design, develop, and document an nf-core pipeline for the Ensembl Variant Effect Predictor (VEP) that follows nf-core best practices and fully modularizes the existing Nextflow VEP workflow from the Ensembl repository, with the required testing and continuous integration. This project will bring the Nextflow VEP workflow in line with nf-core standards, providing greater usability for the community.

Ensembl VEP is a widely used variant annotation tool capable of producing rich functional annotations for genomic variants, and it already forms part of a range of bioinformatics workflows. A Nextflow workflow already exists that leverages Nextflow's parallel processing capabilities (e.g., splitting VCFs, parallelizing chromosome analysis, and merging results), but it is not packaged as an nf-core pipeline with the community standards around modularity, container support, automated testing, documentation, and configuration profiles.
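
The split, annotate-in-parallel, and merge pattern mentioned above can be illustrated outside Nextflow. The Python sketch below shows only the scatter and gather logic on in-memory VCF lines. It is not the existing workflow's code, and both function names are hypothetical.

```python
from collections import defaultdict


def split_by_chromosome(vcf_lines):
    """Scatter: return (header_lines, {chromosome: [record_lines]})."""
    header, records = [], defaultdict(list)
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)
        else:
            chrom = line.split("\t", 1)[0]  # CHROM is the first VCF column
            records[chrom].append(line)
    return header, dict(records)


def merge_chunks(header, chunks):
    """Gather: header once, then each chromosome's records in first-seen order."""
    merged = list(header)
    for chrom in chunks:
        merged.extend(chunks[chrom])
    return merged
```

In a workflow, each per-chromosome chunk would be annotated as an independent task between these two steps. For a coordinate-sorted input, merging without annotation reproduces the original record order.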

Expected results

  • A repository containing the workflow following nf-core guidelines, which needs to include:

  • Required modules and workflows

  • Nextflow configurations profile for different executor environments

  • Easy-to-follow and standard documentation

  • Publishing the workflow in nf-core

Required knowledge

  • Strong Python programming skills
  • Nextflow core concepts

Desirable knowledge

  • Basic understanding of HPC environments
  • Experience working with Ensembl VEP
  • Familiarity with unit testing and CI/CD
  • Familiarity with Groovy and scripting languages such as bash

Learning outcomes

Enhanced understanding of the structure and workflows required for production pipelines. Appreciation of community standards implementation and the generation of reliable, repeatable and reusable workflows.

Difficulty: Medium

Length: 175 hours

Mentors: Syed Hossain

Brief explanation

Recent advances in single-cell foundation models and perturbation-driven datasets are bringing the concept of a “virtual cell” closer to reality. However, most current models remain siloed by modality (CRISPR screens, MAVE, scPerturb-seq) and lack a unifying layer that can integrate causal perturbation knowledge across data types.

In this project, the student will build a prototype perturbation-aware large language model (LLM) by fine-tuning an existing open-source model on curated perturbation datasets from the Perturbation Catalogue. The goal is not to train a foundation model from scratch, but to explore how LLMs can act as a knowledge-integration layer that connects genetic perturbations, variants, and single-cell responses.

The project directly supports the emerging “lab-in-the-loop” and scPerturb-seq Atlas concepts, where models guide experimental design and hypothesis generation by predicting cellular responses to unseen perturbations. The student will prototype workflows for:

  • Encoding perturbation experiments into LLM-friendly representations
  • Integrating multiple modalities (CRISPR screens, MAVE, scPerturb-seq)
  • Evaluating how well an LLM can support reasoning over causal perturbation data

This will position the Perturbation Catalogue as a core resource for next-generation in silico perturbation modelling and virtual cell development.
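
As one possible reading of "LLM-friendly representations", a perturbation record could be serialised into a prompt and completion pair in JSON Lines form for fine-tuning. The record fields and prompt wording below are assumptions for illustration and do not reflect the Perturbation Catalogue's actual schema.

```python
import json


def encode_example(record: dict) -> str:
    """Turn one perturbation record into a JSON line with prompt and completion."""
    prompt = (
        f"Modality: {record['modality']}\n"
        f"Perturbation: {record['perturbation']} of {record['gene']}\n"
        f"Cell type: {record['cell_type']}\n"
        "Question: What is the observed effect?"
    )
    return json.dumps({"prompt": prompt, "completion": record["effect"]}, sort_keys=True)
```

Using the same template for CRISPR screen, MAVE, and scPerturb-seq records is one simple way to let a single model see all modalities in a shared format.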

Expected results

By the end of the project, the student will deliver:

  • A curated multimodal training corpus derived from the Perturbation Catalogue, including:

  • CRISPR screen summaries

  • MAVE variant–effect annotations

  • scPerturb-seq perturbation–response profiles

  • A fine-tuned perturbation-aware LLM prototype capable of:

  • Answering structured questions about perturbation effects

  • Reasoning across modalities (e.g. linking variant effects to transcriptional responses)

  • Supporting simple in silico perturbation queries (e.g. “What happens if gene X is knocked out in cell type Y?”)

  • Benchmarking and evaluation framework, comparing:

  • LLM-based reasoning vs simple baselines

  • Performance across perturbation regimes (seen vs unseen genes, cell types, variants)

  • A reproducible open-source pipeline, including:

  • Data preprocessing scripts

  • Fine-tuning notebooks/workflows

  • Documentation for future contributors

  • A short technical report and blog post describing how LLMs can support the “virtual cell” and lab-in-the-loop paradigms in perturbation biology.

Required knowledge

  • Strong Python programming skills
  • Basic machine-learning concepts (training, validation, overfitting)
  • Familiarity with deep-learning frameworks (PyTorch preferred)
  • Experience working with structured biological data (e.g. CSV/TSV, JSON, HDF5)
  • Background in computational biology or bioinformatics
  • Familiarity with single-cell data (scRNA-seq, perturb-seq concepts)
  • Experience with large language models and fine-tuning (e.g. HuggingFace ecosystem)

Desirable knowledge

  • Knowledge of causal inference or perturbation biology
  • Basic understanding of cloud or HPC environments

Difficulty: High

Length: 350h

Mentors: Alexey Sokolov, Kirill Tsukanov, Aleksandr Zakirov

Brief explanation

The Ensembl GraphQL service can be used to access information about genes, transcripts, assemblies and associated metadata held by Ensembl. This project will be conducted in collaboration with the Ensembl Outreach and Platform team to develop a freely available, hands-on workshop teaching participants how to query Ensembl data using GraphQL. The workshop will include modules covering an introduction to GraphQL, schema exploration, query building, and techniques for error handling and debugging. As part of the project, the participant will create documentation with example prompts that can be used with AI assistants (e.g. Gemini) to help generate valid GraphQL queries or assist in debugging scripts. The workshop will be designed to be reproducible and easily extendable, enabling integration of future Ensembl GraphQL resources.

This experience will provide a mentored learning pathway focusing on practical software and data science skills, preparing the contributor for future open-source work.
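
A workshop notebook might start with a helper like the one below, which packages a GraphQL query and its variables into an HTTP POST request using only the standard library. The endpoint URL is a placeholder, and the query's field and argument names are a hypothetical schema, so both must be replaced with the real Ensembl GraphQL endpoint and schema.

```python
import json
from urllib.request import Request

# Placeholder endpoint: substitute the real Ensembl GraphQL URL.
GRAPHQL_URL = "https://example.org/graphql"

# Hypothetical schema: the field and argument names are for illustration only.
GENE_QUERY = """
query Gene($stableId: String!) {
  gene(stable_id: $stableId) {
    symbol
  }
}
"""


def build_graphql_request(query: str, variables: dict, url: str = GRAPHQL_URL) -> Request:
    """Package a GraphQL query and its variables as a JSON POST request."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request can then be sent with `urllib.request.urlopen(request)` and the response decoded with `json.load`. Passing values through `variables` rather than formatting them into the query string avoids quoting errors, a common debugging topic.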

Learning objectives

  • Understand the structure and functionality of the Ensembl core GraphQL API, including its schema and queries.
  • Workshop design and educational resource development.
  • Gain foundational understanding of key Ensembl data entities such as genes, transcripts, assemblies and species.

Aims

  • Develop a teaching kit, including presentation slides and Jupyter notebooks with real-world examples of exporting genomic data via Ensembl GraphQL.
  • Document all components comprehensively so that another trainer can run the workshop with minimal setup or additional development.
  • Design structured AI prompts that help participants use an AI assistant to construct accurate and efficient GraphQL queries.

Expected results

  • A robust and interactive Ensembl GraphQL training resource featuring example code, helper functions, and debugging documentation.
  • A transferable design adaptable to other Ensembl GraphQL resources in the future.

Required knowledge

  • Intermediate programming skills (preferably in Python), including HTTP requests, JSON handling, and basic packaging or testing workflows.
  • Experience in interacting with AI models (e.g. Gemini) including prompt design.

Desirable skills

  • Core genomics knowledge, such as genes, transcripts, variants, and species identifiers, sufficient to interpret Ensembl data.
  • Basic understanding of GraphQL, including schemas, queries, arguments, nesting, and executing GraphQL endpoints with POST requests

Difficulty: Medium

Length: 175 hours

Mentors: Aleena Mushtaq, Bilal El Houdaigui

Brief explanation

Accurate metadata is essential for interpreting and comparing microbiome datasets. Despite its importance, it often remains incomplete or inconsistent in life-science public repositories. Trapiche is a metadata classification tool for microbiome datasets that combines microbial composition (taxonomic) profiles with free text from project and sample descriptions. The base models can be repurposed for other classification tasks, but users currently lack a simple, standardised way to evaluate model quality and interpret results.

This project will develop an evaluation and reporting toolkit for Trapiche that automatically produces standardised metrics and human-readable reports. A key focus will be to monitor and compare the contribution of both input components: the taxonomic profiles and the text features. This will allow users to understand not only how well models perform, but also how each input type influences the predictions.

The resulting module will shorten development cycles for new microbiome classification tasks and support more reliable, comparable, and reusable life-science datasets.

Expected results

  • A Python evaluation module that computes standard classification metrics (accuracy, precision, recall, F1-score, confusion matrix).
  • Support for component-aware evaluation, reporting performance for text-only, taxonomy-only, and combined inputs.
  • An automated report generator producing HTML or PDF summaries with metrics and plots.
  • Documentation covering installation, usage, and interpretation of results.
  • A walk-through Jupyter notebook demonstrating the use of the module.
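
The component-aware evaluation listed above could be sketched roughly as follows. This is a minimal illustration in plain Python, not Trapiche's actual API: the function names `evaluate` and `compare_components`, the biome labels, and the toy predictions are all hypothetical, and a real module would more likely build on Scikit-learn's metrics.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Return accuracy, macro precision/recall/F1 and a confusion matrix."""
    labels = sorted(set(y_true) | set(y_pred))
    # confusion[(t, p)] counts samples with true label t predicted as p
    confusion = Counter(zip(y_true, y_pred))
    per_label = []
    for label in labels:
        tp = confusion[(label, label)]
        fp = sum(confusion[(t, label)] for t in labels if t != label)
        fn = sum(confusion[(label, p)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_label.append((precision, recall, f1))
    n = len(labels)
    return {
        "labels": labels,
        "accuracy": sum(confusion[(l, l)] for l in labels) / len(y_true),
        "macro_precision": sum(p for p, _, _ in per_label) / n,
        "macro_recall": sum(r for _, r, _ in per_label) / n,
        "macro_f1": sum(f for _, _, f in per_label) / n,
        # rows are true labels, columns are predicted labels
        "confusion_matrix": [[confusion[(t, p)] for p in labels] for t in labels],
    }


def compare_components(y_true, predictions_by_component):
    """Evaluate each input configuration against the same ground truth."""
    return {name: evaluate(y_true, y_pred)
            for name, y_pred in predictions_by_component.items()}


# Toy data: hypothetical predictions from three input configurations.
y_true = ["soil", "soil", "marine", "gut"]
predictions = {
    "text_only": ["soil", "marine", "marine", "gut"],
    "taxonomy_only": ["soil", "soil", "gut", "gut"],
    "combined": ["soil", "soil", "marine", "gut"],
}
report = compare_components(y_true, predictions)
for name, metrics in report.items():
    print(f"{name}: accuracy={metrics['accuracy']:.2f} "
          f"macro_f1={metrics['macro_f1']:.2f}")
```

Evaluating every configuration against the same ground truth keeps the text-only, taxonomy-only, and combined results directly comparable in a single report.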

Required knowledge

  • Proficiency in Bash and Python.
  • Experience with data processing using Pandas and NumPy.
  • Familiarity with machine learning evaluation concepts and Scikit-learn metrics.
  • Experience with data visualisation tools such as Matplotlib or similar.

Desirable knowledge

  • Familiarity with version control (Git) and collaborative coding workflows
  • Fundamentals of metagenomics and its applications
  • Experience with natural language processing (NLP) methods

Difficulty: Medium

Length: 175 hours

Mentors: Santiago Fragoso, Mahfouz Shehu

Brief explanation

In this project, we aim to develop a prototype method for generating sequence similarity networks (SSNs) for the MGnify Proteins database, to help enable graph-based analyses of its sequence space. Using tools like MMseqs2 to compute pairwise sequence similarities, and Python graph libraries like NetworkX, a collection of representative SSNs will be generated for a small subset of the database (~10 million proteins). The nodes of these networks will then be annotated with relevant MGnify metadata, starting with biome of origin. Finally, we will generate visualisations of these annotated SSNs to be displayed on the MGnify Proteins website using modern graph rendering tools like Cytoscape and Cosmograph.

The latest release of the MGnify Proteins Database contains over 2.4 billion non-redundant protein records including relevant metagenomics metadata. The visualisation of sets of protein sequences using SSNs is a common approach for extracting novel insights about protein-protein relationships, including functional, structural, and evolutionary hypotheses. Facilitating the generation of SSNs for the MGnify Proteins database would therefore be a significant contribution to open metagenomics science.

Expected results

  • Develop a prototype workflow for generating SSNs for a given set of protein sequences using MMseqs2
  • Apply this workflow to a workable subset of the MGnify Proteins representative clusters to generate SSN representations
  • Annotate generated SSNs with biome of origin
  • Generate visualisations for all biome-annotated SSNs
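
As a rough sketch of the workflow above, the snippet below turns tab-separated pairwise hits into an SSN and annotates its nodes with a biome. It makes several assumptions: the first three columns are taken to be query, target, and fractional sequence identity (as in a BLAST-style tabular output), the identity threshold is arbitrary, and the protein IDs and biome labels are invented. It uses a plain adjacency dictionary so it runs without dependencies; a real prototype would more likely use NetworkX.

```python
import csv
import io


def build_ssn(tsv_lines, min_identity=0.5):
    """Build an undirected similarity network as {node: {neighbour: identity}}."""
    graph = {}
    for row in csv.reader(tsv_lines, delimiter="\t"):
        query, target, identity = row[0], row[1], float(row[2])
        # register both proteins so singletons remain visible as nodes
        graph.setdefault(query, {})
        graph.setdefault(target, {})
        if query == target or identity < min_identity:
            continue  # skip self-hits and weak similarities
        # keep the best identity seen for each unordered pair
        best = max(identity, graph[query].get(target, 0.0))
        graph[query][target] = best
        graph[target][query] = best
    return graph


def connected_components(graph):
    """Group nodes into clusters linked by at least one retained edge."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node].keys() - component)
        seen |= component
        components.append(component)
    return components


# Invented hits: query, target, fractional identity.
hits = (
    "P1\tP1\t1.000\n"
    "P1\tP2\t0.820\n"
    "P2\tP3\t0.640\n"
    "P4\tP5\t0.910\n"
    "P3\tP4\t0.300\n"
    "P6\tP6\t1.000\n"
)
graph = build_ssn(io.StringIO(hits), min_identity=0.5)
components = connected_components(graph)

# Annotate nodes with biome of origin (hypothetical metadata lookup).
biomes = {"P1": "marine", "P2": "marine", "P3": "soil", "P4": "gut", "P5": "gut"}
node_attributes = {node: {"biome": biomes.get(node, "unknown")} for node in graph}
print(sorted(sorted(c) for c in components))
```

The identity threshold controls how finely the network splits into clusters, so it is worth exposing as a workflow parameter rather than hard-coding it.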

Required knowledge

  • Proficiency in Bash and Python
  • Comfortable with using a Unix shell
  • Basic git skills for version-control of work

Desirable knowledge

  • Familiar with graph theory and network analysis concepts
  • Experience with Python graph libraries like NetworkX
  • Experience with workflow design and implementation

Difficulty: Beginner

Length: 350 hours

Mentors: Christian Atallah

Brief explanation

Interactive data science web applications often need to support efficient search over large structured datasets, while keeping latency low and avoiding heavy server-side infrastructure. At large scale, this can be done in several ways, for example: 1) precompute an index file to accompany the dataset; 2) load records into a server-side indexed database behind a REST API; or 3) index and query data in the browser (e.g. using IndexedDB).

In bioinformatics, annotating (meta)genomes involves tagging regions of genomic sequences with feature details (like the location of a gene and its function). Computational pipelines produce these annotations and output standardised formats like GFF (General Feature Format) – effectively a TSV file for genomics. There are various ways to interrogate and visualise these annotations, including genome browsers like JBrowse. A frequent use case is to search the annotations by a query such as a function category label, and then browse to the matching locations in the sequences. Like any database, this becomes challenging for large datasets – in particular the metagenomes we analyse in MGnify become very large.

The objective of this project is to try a mixed approach: convert GFF (and other) files into a SQLite database using gffutils, creating extra database indexes at the same time. We would like to distribute this feature SQLite to the browser, and query it using the sqlite3 WASM in-browser capabilities to both display a feature search interface and pass data to JBrowse (perhaps via a new plugin).

Expected results

  • A Python script that uses gffutils to produce a suitably indexed SQLite database from a large metagenome GFF genomic feature file.
  • Unit tests for the script.
  • A React JavaScript component that queries the SQLite file client-side, using sqlite3 WASM’s JavaScript API
  • Ideally: the ability to partially read the SQLite file from a remote server, using HTTP Range requests
  • Demonstrated integration with the JBrowse viewer, e.g. via a plugin
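
To illustrate the indexing idea, here is a minimal sketch that loads a few GFF3 lines into SQLite and adds indexes to support searching by function label. It deliberately uses only Python's built-in `sqlite3` module so it stays self-contained; the project itself proposes gffutils, which creates its own schema. The table name, column names, index names, and the tiny GFF sample are all hypothetical.

```python
import sqlite3

# Invented GFF3 sample; columns are tab-separated.
GFF_TEXT = (
    "##gff-version 3\n"
    "contig_1\tProdigal\tCDS\t100\t900\t.\t+\t0\tID=gene_1;product=ABC transporter\n"
    "contig_1\tProdigal\tCDS\t1200\t2000\t.\t-\t0\tID=gene_2;product=DNA polymerase\n"
    "contig_2\tProdigal\tCDS\t50\t650\t.\t+\t0\tID=gene_3;product=ABC transporter\n"
)


def parse_attributes(column):
    """Turn 'key=value;key=value' from GFF column 9 into a dict."""
    pairs = (item.split("=", 1) for item in column.split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def gff_to_sqlite(gff_text, connection):
    """Load GFF features into a table and index the searchable columns."""
    connection.execute(
        "CREATE TABLE features (id TEXT PRIMARY KEY, seqid TEXT, type TEXT, "
        "start_pos INTEGER, end_pos INTEGER, strand TEXT, product TEXT)"
    )
    rows = []
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = line.split("\t")
        attributes = parse_attributes(attrs)
        rows.append((attributes.get("ID"), seqid, ftype, int(start), int(end),
                     strand, attributes.get("product")))
    connection.executemany("INSERT INTO features VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    # Extra indexes: one for label search, one for location lookups.
    connection.execute("CREATE INDEX idx_features_product ON features (product)")
    connection.execute(
        "CREATE INDEX idx_features_location ON features (seqid, start_pos, end_pos)"
    )
    connection.commit()


connection = sqlite3.connect(":memory:")
gff_to_sqlite(GFF_TEXT, connection)
matches = connection.execute(
    "SELECT seqid, start_pos, end_pos FROM features "
    "WHERE product = ? ORDER BY seqid, start_pos",
    ("ABC transporter",),
).fetchall()
print(matches)
```

The same query could in principle be issued from the browser against the distributed SQLite file, with the returned coordinates passed on to the genome browser.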

Required knowledge

  • Python
  • Python testing frameworks e.g. pytest
  • JavaScript: React.js
  • Relational database concepts (e.g. SQL)
  • Version control

Desirable knowledge

  • Use of WASM (WebAssembly)
  • Database indexing
  • Bioinformatics file formats

Difficulty: Medium

Length: 175 hours

Mentor: Vikas Gupta

Brief explanation

MGnify’s web interfaces are built on a large and evolving API surface, with complex data relationships and established frontend patterns. Translating Figma designs into production-ready UI code currently requires significant manual effort, particularly when wiring components to backend endpoints and maintaining consistency across the application.

MGnify already has a prototype MCP (Model Context Protocol) server that exposes tools backed by existing API endpoints. However, coverage is currently partial and focused on selected workflows.

This project proposes extending and integrating an existing MCP server prototype with the Figma API, enabling a design- and API-aware pipeline that assists developers in generating frontend components that are:

  • Grounded in authoritative Figma design artifacts (Visual Framework Assets)
  • Aware of MGnify’s API surface and response schemas
  • Aligned with existing frontend conventions, dependencies, and coding patterns

The system will act as developer-assist infrastructure, reducing repetitive boilerplate work and accelerating the design-to-implementation cycle while preserving full human control.

Expected results

By the end of the project, the student will deliver:

End-to-end proof of concept

  • A demonstration of generating a new or updated MGnify UI page (e.g. a detail or results page) from a Figma design and MCP-exposed API context
  • An integrated MCP server usable from a client-side LLM chat interface, e.g. Claude Desktop

Learning outcomes

Through this project, the student will gain experience in designing and implementing production-grade developer tooling that integrates design systems, APIs, and modern web frameworks. Specifically, the student will learn to:

  • Work with large, real-world APIs
  • Understand and extend an existing MCP server exposing a complex, evolving API surface
  • Reason about API schemas, relationships, pagination, and error handling
  • Design abstractions that remain stable as backend APIs evolve
  • Integrate with external APIs
  • Build AI-assisted developer tools responsibly

Required knowledge

  • JavaScript/TypeScript, Python
  • Experience with modern frontend frameworks (React preferred)
  • REST APIs and JSON schema interpretation
  • Git and collaborative software development workflows

Desirable knowledge

  • Familiarity with design systems and component libraries
  • Experience with Figma or design-to-code tooling
  • Backend development experience (Node.js or Python)
  • Interest in developer tooling and automation
  • Exposure to scientific or data-heavy web platforms

Difficulty: Medium

The project involves real-world system integration and design decisions, but is well-scoped and suitable for a student with solid web development fundamentals.

Length: 175 hours

Mentors: Mahfouz Shehu

Brief explanation

The BioSamples database at EMBL-EBI provides a central repository for the storage, validation, and retrieval of biological sample metadata across a wide range of life science domains. BioSamples plays a critical role in ensuring that sample descriptions are structured, standards-compliant, and reusable across downstream archives such as ENA, ArrayExpress, and others.

Despite the availability of REST APIs for sample submission, validation, and search, these interfaces are not directly accessible to modern AI agents or large language model (LLM)–based systems, which require explicit schemas, deterministic interactions, and well-defined tool boundaries. As a result, the use of AI for assisting users in preparing high-quality BioSamples submissions or performing structured sample discovery remains limited.

This project aims to design and implement a BioSamples MCP server that exposes a carefully selected subset of BioSamples submission and search functionality through the Model Context Protocol (MCP). The system will enable AI agents to interact with BioSamples in a safe, structured, and reproducible manner, reducing metadata errors while improving usability for submitters and data consumers.

The project focuses on two complementary capabilities:

  • AI-assisted sample submission from plain-text descriptions, with interactive clarification and validation against BioSamples checklists.
  • Natural-language-driven sample search, converting free-text queries into structured BioSamples search requests.

By leveraging MCP’s explicit tool schemas, input/output contracts, and error semantics, the project prevents hallucinations, enforces metadata correctness, and ensures that all responses are traceable to authoritative BioSamples data sources.
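
To make the idea of an explicit tool contract concrete, the sketch below describes one tool with a name, a description, and a JSON Schema for its input, then checks a call against that contract before anything reaches BioSamples. The tool name `validate_sample`, its arguments, the checklist ID, and the helper `check_arguments` are hypothetical, and the check covers only required fields and basic types rather than full JSON Schema validation.

```python
# Hypothetical tool definition: a name, a description and a JSON Schema
# describing the arguments an agent is allowed to send.
VALIDATE_SAMPLE_TOOL = {
    "name": "validate_sample",
    "description": "Validate sample metadata against a BioSamples checklist.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "checklist_id": {"type": "string"},
            "attributes": {"type": "object"},
        },
        "required": ["checklist_id", "attributes"],
    },
}

# Only the JSON types used by the schema above are mapped here.
JSON_TYPES = {"string": str, "object": dict}


def check_arguments(tool, arguments):
    """Return a list of contract violations; an empty list means the call is acceptable."""
    schema = tool["inputSchema"]
    errors = [f"missing required argument: {name}"
              for name in schema["required"] if name not in arguments]
    for name, value in arguments.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unexpected argument: {name}")
        elif not isinstance(value, JSON_TYPES[spec["type"]]):
            errors.append(f"argument {name} must be of type {spec['type']}")
    return errors


good_call = {"checklist_id": "EXAMPLE-CHECKLIST", "attributes": {"organism": "Homo sapiens"}}
bad_call = {"attributes": "liver biopsy"}
print(check_arguments(VALIDATE_SAMPLE_TOOL, good_call))
print(check_arguments(VALIDATE_SAMPLE_TOOL, bad_call))
```

Rejecting malformed calls with explicit, machine-readable errors gives the agent something concrete to correct instead of guessing.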

The project is intentionally scoped to be beginner-friendly, requiring limited domain-specific bioinformatics knowledge, and emphasizes software engineering, API design, schema validation, and AI-tool integration rather than biological interpretation.

Project objectives

The primary objectives of the project are to:

  • Expose BioSamples submission and search functionality as MCP-compatible tools
  • Enable AI agents to assist users in creating valid, checklist-compliant BioSamples metadata
  • Provide deterministic, explainable responses grounded in BioSamples APIs
  • Demonstrate interactive clarification workflows for incomplete or invalid metadata
  • Establish a reusable MCP-based foundation that can be extended to other EMBL-EBI data resources

Scope and functionality

  1. AI-Assisted BioSamples Submission

The system will allow users to describe a biological sample using plain natural language, for example:

“Human liver biopsy collected in London in 2023 from a patient with cirrhosis.”

The MCP server, in combination with an LLM-based agent, will:

  • Extract candidate metadata fields from the text
  • Map extracted information to BioSamples attributes
  • Validate the resulting sample against a selected BioSamples checklist
  • Detect missing mandatory attributes (e.g. organism, material, collection date)
  • Detect missing or incomplete spatiotemporal metadata
  • Prompt the user with explicit clarification questions when required information is missing
  • Produce a fully structured BioSamples sample representation once all requirements are satisfied

The final output will be a validated, submission-ready BioSamples sample object, with all validation decisions and user interactions explicitly traceable.
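
The validate-and-clarify loop described above might look roughly like this. It is a sketch under stated assumptions: the mandatory attribute list stands in for a real BioSamples checklist, the extracted attributes are hard-coded where an LLM-based agent would supply them, and `validate_sample` and `submit_with_clarification` are hypothetical names.

```python
# Hypothetical stand-in for the mandatory fields of a BioSamples checklist.
MANDATORY = ("organism", "material", "collection date")


def validate_sample(attributes, mandatory=MANDATORY):
    """Report missing mandatory attributes and a question for each one."""
    missing = [name for name in mandatory if not attributes.get(name)]
    questions = [f"Please provide a value for '{name}'." for name in missing]
    return {"valid": not missing, "missing": missing, "questions": questions}


def submit_with_clarification(attributes, ask):
    """Ask about each missing attribute once, logging every interaction."""
    sample = dict(attributes)
    audit_log = []
    first_pass = validate_sample(sample)
    for name, question in zip(first_pass["missing"], first_pass["questions"]):
        answer = ask(question)
        audit_log.append({"attribute": name, "question": question, "answer": answer})
        if answer:
            sample[name] = answer
    # Re-validate so the caller knows whether the sample is submission-ready.
    return sample, validate_sample(sample), audit_log


# Attributes as they might be extracted from a shorter description that
# omits the collection date: "Human liver biopsy from a patient with cirrhosis."
extracted = {
    "organism": "Homo sapiens",
    "material": "liver biopsy",
    "disease": "cirrhosis",
}
# Canned answers stand in for an interactive user.
canned_answers = {"Please provide a value for 'collection date'.": "2023"}
sample, result, audit_log = submit_with_clarification(extracted, canned_answers.get)
print(result["valid"], sample["collection date"], len(audit_log))
```

Keeping the questions and answers in an audit log is one simple way to make validation decisions and user interactions traceable.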

  2. Natural-Language BioSamples Search

The project will also support plain-text search queries, such as:

“Human blood samples collected in Europe after 2020 related to diabetes.”

The system will:

  • Parse the free-text query into structured search criteria
  • Translate these criteria into BioSamples-compatible search filters
  • Execute the search via BioSamples APIs
  • Normalize and summarize the returned results into a consistent, human-readable format
  • Return accession identifiers and key metadata fields for downstream exploration

This enables AI agents to act as structured discovery interfaces while preserving the determinism and correctness of the underlying BioSamples queries.
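
One way to picture the translation step is a small structured object sitting between the free-text query and the HTTP request. In the sketch below, the `SearchCriteria` fields, the `filter` parameter layout, and the date interpretation of "after 2020" are placeholders chosen for illustration; they are not the BioSamples API's actual filter syntax, which the project would need to follow.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qs, urlencode


@dataclass
class SearchCriteria:
    """Structured form of a free-text query (hypothetical field set)."""
    text: Optional[str] = None
    organism: Optional[str] = None
    material: Optional[str] = None
    region: Optional[str] = None
    collected_after: Optional[str] = None


def to_query_string(criteria):
    """Serialise criteria into URL parameters using a placeholder filter format."""
    params = []
    if criteria.text:
        params.append(("text", criteria.text))
    for field in ("organism", "material", "region", "collected_after"):
        value = getattr(criteria, field)
        if value:
            params.append(("filter", f"{field}:{value}"))
    return urlencode(params)


# Criteria an agent might derive from:
# "Human blood samples collected in Europe after 2020 related to diabetes."
criteria = SearchCriteria(
    text="diabetes",
    organism="Homo sapiens",
    material="blood",
    region="Europe",
    collected_after="2020-12-31",
)
query = to_query_string(criteria)
print(query)
print(parse_qs(query))
```

Because the request is built from the structured object rather than directly from the user's wording, the same criteria always produce the same query.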

Expected results

By the end of the project, the following deliverables are expected:

  • A near production-ready MCP server exposing a subset of BioSamples submission and search APIs
  • Demonstrations of AI-assisted, checklist-aware BioSamples submissions; interactive clarification workflows for incomplete metadata; and natural-language-driven BioSamples search via MCP
  • Deterministic, auditable responses fully backed by BioSamples APIs
  • A scalable MCP-based architecture that can be extended to additional checklists or EMBL-EBI resources

Advanced capabilities (optional / stretch goals)

Depending on time and interest, the project may additionally explore:

  • Composing multiple BioSamples operations into a single logical workflow (e.g. validate → clarify → submit)
  • Normalizing heterogeneous BioSamples responses into a unified internal representation
  • Rate-limit-aware request handling and response caching
  • Clear error categorization (validation errors vs. system errors vs. user input issues)
  • Multi-turn conversational state management for submissions spanning multiple interactions

Documentation requirements

The project will include comprehensive documentation covering:

  • System architecture and design decisions
  • MCP tool definitions and schemas
  • Validation and clarification workflows
  • Example AI agent interactions for submission and search
  • Limitations, assumptions, and potential extensions

Required knowledge

  • Strong Java or Python programming skills
  • REST APIs, HTTP, and JSON
  • Software architecture and modular design
  • Schema definition and validation
  • Basic understanding of metadata modeling

Desirable knowledge

  • Familiarity with BioSamples or similar metadata repositories
  • Experience with MCP or LLM tool/function calling
  • API performance optimization and caching strategies
  • Experience with containerization (Docker) or CI/CD pipelines

Difficulty: Medium

Length: 175 hours

Mentor: Dipayan Gupta

To be successful with your application, it is important to demonstrate the following:

1. An understanding of the major aims of the project

We do not expect contributors to have expert domain knowledge at the outset. However, some light background reading on the proposed technologies and underlying science will help you better understand the project context and goals.

2. An ability to build on the project idea

We provide a set of project ideas as starting points. Strong applications go beyond simply restating the description and instead bring new ideas, questions, or alternative approaches that build on the initial outline.

3. Clear and appropriate communication with mentors

Engaging with potential mentors ahead of submitting an application is key to success. Mentors are available to answer questions and provide guidance, but they will not write your application for you.

If you need clarification or additional background, communicate this clearly and in good time. Be concise and specific in your questions. Last-minute requests for substantial feedback are generally a sign of poor planning.

4. A realistic and well-structured timeline

Although GSoC timelines are flexible and can sometimes be extended, the programme is still relatively short. A good application includes:

  • clearly defined milestones
  • deliverables aligned with GSoC evaluation periods
  • a workload that is realistic given your other commitments

We value sustainable working practices and do not expect contributors to work excessive hours. Availability and constraints should be clearly stated and discussed with mentors.

5. Genuine enthusiasm and engagement

Demonstrated interest in the project, the technologies involved, and working with EMBL-EBI teams goes a long way. Enthusiastic and engaged contributors tend to have more productive mentor relationships and more successful projects.

The steps below provide a general guide to submitting a strong application. While we publish a list of suggested projects, contributor-proposed ideas are also welcome.

Step 1: Review the project ideas. Review our GSoC project ideas page to explore available projects and their associated technologies.

Step 2: Select a project of interest. Read the description carefully and do some light background research if needed.

Step 3: Get in touch with us. Contact our GSoC helpdesk (helpdesk@ensembl.org) with the subject line “GSoC”. Please include:

  • a short CV or link to relevant experience
  • a brief explanation of your interest in the project
  • any specific questions you may have

If you are proposing your own project idea, include a short description and the technologies you expect to use so we can assess mentor availability.

Step 4: Draft your application early. Prepare a first draft well ahead of the deadline and share it with your mentor(s) or via the helpdesk for feedback.

Step 5: Incorporate feedback and finalise. Use the feedback provided to refine your proposal, then submit the final version once it has been reviewed.

GSoC contributors at EMBL-EBI are treated as members of their project teams. Contributors typically engage through:

  • Slack and mailing lists
  • GitHub issues and pull requests
  • regular meetings with mentors

Where time zones permit, contributors may also attend team or section-wide meetings and may be invited to present their work during the programme.

Good luck with your application. GSoC has consistently been a rewarding experience for both contributors and mentors at EMBL-EBI, and we look forward to supporting contributors in developing skills, gaining domain knowledge, and contributing to open scientific software.

GSoC Resources

EMBL-EBI resources and services

We develop and maintain a wide range of open biological data resources, including:

  • Ensembl and the beta Ensembl website
  • European Nucleotide Archive (ENA)
  • BioStudies
  • MGnify (Metagenomics)

Code repositories

Date: 16 - 31 Mar 2026

Location: Virtual

Venue: Online
