> Stabilizing, Completing, and Upstreaming the Hindi DBpedia IE Pipeline
DESCRIPTION
During GSoC 2025, a comprehensive Hindi Information Extraction (IE) pipeline was developed for DBpedia, covering SLM-guided triplet extraction, IndIE enhancement, link prediction, predicate linking to the DBpedia ontology, and SPARQL endpoint deployment. While the core functionality has been implemented and evaluated, several critical components remain unfinished or unmerged, preventing the pipeline from being production-ready and fully integrated into DBpedia’s main infrastructure.
This project focuses on completing, stabilizing, and upstreaming the Hindi IE pipeline developed in GSoC 2025. The work emphasizes finalizing pending technical components (notably type/isA predicate linking, mappings integration, and finetuning), improving robustness and reproducibility, cleaning overlaps across modules, and preparing the existing pull requests and infrastructure for merge into the main DBpedia repositories.
Rather than introducing an entirely new pipeline, the project consolidates and strengthens a substantial existing contribution, ensuring long-term usability, maintainability, and extensibility of Hindi DBpedia.
GOAL
Complete pending GSoC 2025 tasks, including:
Predicate linking for type (isA / rdf:type) relations, where objects are ontology classes rather than properties
Finalizing Hindi mappings via the DBpedia mappings UI
Optional finetuning of Gemma-3 using the existing filtered synthetic dataset
Improving error handling and logging across stages (IndIE, LLM-IE, predicate linking)
Ensuring reproducible runs using clean configuration, caching, and setup instructions
Reducing code duplication across IndIE, ReAct, and llm_IE modules
Prepare and upstream existing work by:
Cleaning and finalizing pending PRs (e.g., neural-extraction-framework#20, extraction-framework#776)
Aligning the GSoC25_H fork with DBpedia’s main repository structure
Enable production readiness by:
Preparing deployment-ready SPARQL setup for a permanent Hindi DBpedia endpoint
Documenting deployment, usage, and troubleshooting for maintainers and contributors
Lower the entry barrier for new contributors through clear documentation and well-defined extension points.
IMPACT
Immediate impact
Delivers a stable, merge-ready Hindi IE pipeline for DBpedia
Enables consistent extraction, ontology linking, and querying over Hindi Wikipedia
Community impact
Reduces technical debt accumulated across multiple GSoC cycles
Makes Hindi DBpedia easier to maintain and extend by future contributors
Sustainability
Improves reproducibility and robustness for low-resource language IE pipelines
Establishes a cleaner foundation for multilingual expansion beyond Hindi
Research & practice
Strengthens reproducible evaluation for SLM-based IE and hybrid rule/LLM pipelines
Improves ontology alignment for non-English knowledge graphs
WARM-UP TASKS
Clone and run the complete Hindi IE pipeline (GSoC25_H) end-to-end
Reproduce published results on the Hindi-BenchIE dataset and report deviations
Execute predicate linking for both property and type (rdf:type) relations and analyze failure cases
Identify and fix at least one concrete issue (bug, logging gap, or documentation flaw)
Submit a small but meaningful PR (documentation, error handling, or predicate linking improvement)
PROJECT SIZE
175 hours (Medium)
KEYWORDS
Information Extraction, Knowledge Graphs, Hindi NLP, DBpedia, Predicate Linking,
Ontology Alignment, Low-Resource Languages, Pipeline Stabilization, SPARQL, Reproducibility