
Google Summer of Code 2026: Internet Archive

Call for Proposals

Students: Learn more about submitting an effective Internet Archive GSoC application

Submit here: 2026 Program Internet Archive

Form: How to express interest & connect

The Internet Archive is a 501(c)(3) non-profit digital library which creates backups of the world’s cultural heritage, spanning websites, government documents, scholarly research, books, music, television news, software, and more. Like a physical library, the Internet Archive provides free access to researchers, historians, scholars, persons with print disabilities, and the general public. Our motto is Universal Access to All Knowledge. The Internet Archive runs several initiatives including the Wayback Machine, OpenLibrary.org, Archive-It, and Archive.org. We’ve participated in GSoC for several years: 2025, 2024, 2023, 2021, 2020, 2019, 2018.

If you are interested in learning more about a proposed GSoC project, getting feedback* about your proposal, and/or connecting with a mentor, please use this form or click any of the pre-filled project links below to get in touch. To submit your final proposal, you will need to use the Google Summer of Code website.

*We will not provide feedback on proposals that appear to have been predominantly generated by LLMs.

Mentors: Mark Graham, Director, the Wayback Machine

Dr. Sawood Alam, Research Lead, Wayback Machine

Will Howes, Engineer, Wayback Machine

Project 1: Wayback PDF Changes

Size: 350 hours
Difficulty: Hard
Description: Wayback Changes is a tool that identifies and displays changes in the content of archived captures of a URL. Example: https://web.archive.org/web/diff/20260110134549/20260112035545/en.wikipedia.org. It currently supports only HTML pages. We would like to extend it to support PDF documents.
Outcome: The back-end that calculates the differences between two captures of the same URL is web-monitoring-diff ( https://github.com/edgi-govdata-archiving/web-monitoring-diff ). We need to either extend it to support PDF or create new software for PDF that exposes the same web API. The front-end that displays the differences is wayback-diff ( https://github.com/internetarchive/wayback-diff ). We need to extend it to support PDF diff rendering.
Skills: Python, JavaScript
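To make the PDF back-end task more concrete, here is a minimal sketch of a page-by-page text diff between two captures. It is not how web-monitoring-diff works and it does not use that project's API. The function name `diff_pdf_pages` and the input shape (one string of extracted text per page) are assumptions for illustration, and extracting text from the PDF is left to whichever PDF library a proposal chooses. Only the Python standard library is used.

```python
import difflib
from typing import List, Tuple

# Hypothetical helper: each input is a list with one string of extracted
# text per PDF page. Text extraction itself is out of scope for this sketch.
def diff_pdf_pages(old_pages: List[str],
                   new_pages: List[str]) -> List[Tuple[int, str, str]]:
    """Return (page_number, change_type, line) for every changed line."""
    changes: List[Tuple[int, str, str]] = []
    page_count = max(len(old_pages), len(new_pages))
    for index in range(page_count):
        # A page missing from one capture is treated as an empty page.
        old_lines = old_pages[index].splitlines() if index < len(old_pages) else []
        new_lines = new_pages[index].splitlines() if index < len(new_pages) else []
        matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag in ("delete", "replace"):
                for line in old_lines[i1:i2]:
                    changes.append((index + 1, "removed", line))
            if tag in ("insert", "replace"):
                for line in new_lines[j1:j2]:
                    changes.append((index + 1, "added", line))
    return changes


if __name__ == "__main__":
    old = ["Annual Report\nBudget: 10\nEnd"]
    new = ["Annual Report\nBudget: 12\nEnd", "Appendix"]
    for page, change, line in diff_pdf_pages(old, new):
        print(page, change, line)
```

Comparing pages by position is the simplest possible alignment. A page inserted in the middle of a document would make every later page look changed, so a real implementation would probably need to align pages or diff the full text first.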

Project 2: Wayback Machine Extension With Client-side Built-in AI Prompt API Capabilities

Size: 350 hours
Difficulty: Hard
Description: Google Chrome (version 138) has made the built-in Gemini Nano AI model generally available on the client side to extensions on supported devices ( https://developer.chrome.com/docs/ai/prompt-api). We would like to explore how these emerging capabilities can be leveraged in the context of the Wayback Machine to help our patrons use our services more effectively. Potential use cases include assessing the temporal consistency of the playback of an archived capture, quality assurance of the rendered playback, soft-404 detection, summarization, context linking, etc. These are just a few examples; we would like potential GSoC contributors to come up with their own creative ideas that they think would be helpful and plausible.
Outcome: The initial outcome would be an independent experimental add-on, with the eventual goal of incorporating selected functionality into the official Wayback Machine extension ( https://chrome.google.com/webstore/detail/wayback-machine/fpnmgdkabkmnadcjpehmlllkndpkmiak).
Requirements: Contributors need access to a computer that meets the hardware and software requirements to run Chrome with built-in AI, as described in https://developer.chrome.com/docs/ai/prompt-api
Skills: Chrome Extensions, JavaScript, GenAI, HTML, HTTP

Project 3: Tapestry AI Enhancement

Size: 350 hours
Difficulty: Medium
Description: The Tapestries project ( https://tapestries.archive.org, GitHub) currently allows users to connect their Google Gemini account and use a chat window to construct or analyze content in Tapestries. We’d like to enhance this capability by allowing the use of other services (including examining the possibility of self-hosted services) and by deepening the reach into included documents of different types as well as different sources of data, particularly in the Internet Archive. Other ideas involving Tapestries are more than welcome!
Outcome: AI is more accessible and useful to Tapestries users, allowing semi-automated creation of more interesting documents.
Skills: Node, TypeScript, some experience with MCP helpful

The Internet Archive's Open Library (https://openlibrary.org) is a non-profit book catalog that helps patrons across the globe discover and access millions of digital library books for free. In 2026, Open Library is focusing on the basics: fixing the platform’s core virtuous loop by improving book record quality scores to increase retention and driving participation to reinforce the platform’s value. In service of tuning this loop, Open Library proposes 2 opportunities. Other proposals will also be considered.

Mentor(s): Mek Karpeles, Program Lead & Drini Cami, Senior Engineer

Size: 350 hours
Difficulty: Medium
Description: One of our goals in 2026 is to improve the usefulness/value score of our book records. One way to do this is to take our millions of unstructured, overloaded subject tags and clean them up + organize them into specific, well-defined categories/types (such as genre, moods, places, etc). This will require collaboratively building specs for different tag types, project management, processing millions of subject tag records, and writing tools to safely and efficiently update book records without losing data.
Outcome: Hundreds of thousands of books become more discoverable and searchable by genre (at minimum) and it’s much easier for patrons to glean what a book is about and whether it’s of interest to them.
Skills: Python, Big Data processing
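As a rough illustration of the tag cleanup described above, the sketch below sorts one record's free-form subject tags into typed buckets without dropping anything it cannot classify. The lookup sets `GENRES` and `PLACES`, the bucket names, and the function `classify_subjects` are all hypothetical. The real tag-type specs would be built collaboratively as part of the project, and nothing here reflects Open Library's actual schema or tooling.

```python
from typing import Dict, List

# Hypothetical vocabularies, invented for this example only.
GENRES = {"science fiction", "fantasy", "mystery"}
PLACES = {"france", "new york"}


def classify_subjects(subjects: List[str]) -> Dict[str, List[str]]:
    """Sort raw subject tags into typed buckets.

    Tags that match no vocabulary stay in "subjects", so no distinct tag
    is lost. Only blank tags and case/whitespace duplicates are dropped.
    """
    typed: Dict[str, List[str]] = {"genres": [], "places": [], "subjects": []}
    seen = set()
    for raw in subjects:
        cleaned = " ".join(raw.split())   # collapse repeated whitespace
        key = cleaned.casefold()          # case-insensitive comparison key
        if not key or key in seen:
            continue
        seen.add(key)
        if key in GENRES:
            typed["genres"].append(cleaned)
        elif key in PLACES:
            typed["places"].append(cleaned)
        else:
            typed["subjects"].append(cleaned)
    return typed


if __name__ == "__main__":
    tags = ["Science Fiction", "science  fiction", "France", "Accessible book", ""]
    print(classify_subjects(tags))
```

Leaving unmatched tags in place, rather than deleting them, is one simple way to meet the "without losing data" requirement. At the scale of millions of records, a real tool would also need batching, dry runs, and a way to revert changes.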

Size: 350 hours
Difficulty: Medium
Description: Currently, patrons who register for Open Library often have a lonely and confusing experience. There’s no opportunity to import their books and easily leverage their past work, there isn’t an easy way to tell the platform which books or genres they enjoy, and their personalized homepage looks like an empty ghost town.
Outcome: Improve retention and participation by helping patrons learn how to use Open Library, easily import their books from elsewhere, see recent activity around the library, follow people with complementary tastes, and explore book recommendations.
Skills: Python, JavaScript, Vue, Lit, Front-end engineering, UX, Product

Size: 350 hours
Difficulty: Medium
Description: The search page is outdated, doesn’t work well on mobile, and is not optimized for the way patrons want to find books. This effort explores how we can re-invent our facet bar, redesign search results, and incorporate full-text search to provide a better, more useful experience for readers, researchers, and those seeking book recommendations. There are also opportunities to tune the search algorithm itself to improve performance and accuracy for our common cases.
Outcome: A search experience that is useful to more patrons on more platforms
Skills: JavaScript, HTML, Solr, Python, UX, front-end, search ranking, data-driven engineering

Extracting New Value from Books

Size: 350 hours
Difficulty: Medium
Description: The Internet Archive has a collection of more than 8M texts that are only available for preview. How might we extract insights from these books to make book records and previews more useful to Open Library patrons?
Outcome: Propose and design a system that improves the usefulness and value of high-traffic book records on Open Library using standard off-the-shelf tools and services, on a limited budget. Proposals should provide clear cost bounds, backed up by actual reproducible examples/experiments.
Skills: LLMs, Agents, AI, Text-Mining
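One way a proposal could state its cost bounds is as a small, reproducible calculation whose inputs come from the applicant's own experiments. The sketch below only shows the arithmetic. Every field name and number in it is a made-up placeholder, not a real price, a measured token count, or a figure from the Internet Archive.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CostAssumptions:
    """Hypothetical inputs; replace each with a value measured in an experiment."""
    books: int                     # number of book records to process
    tokens_per_book_low: int       # optimistic tokens processed per book
    tokens_per_book_high: int      # pessimistic tokens processed per book
    usd_per_million_tokens: float  # price of the chosen service (placeholder)


def cost_bounds(a: CostAssumptions) -> Tuple[float, float]:
    """Return (lower, upper) total cost in USD under the given assumptions."""
    low = a.books * a.tokens_per_book_low / 1_000_000 * a.usd_per_million_tokens
    high = a.books * a.tokens_per_book_high / 1_000_000 * a.usd_per_million_tokens
    return low, high


if __name__ == "__main__":
    # Placeholder numbers chosen only to make the arithmetic easy to follow.
    pilot = CostAssumptions(books=1_000, tokens_per_book_low=50_000,
                            tokens_per_book_high=200_000,
                            usd_per_million_tokens=0.5)
    print(cost_bounds(pilot))
```

Stating the assumptions as named fields makes it easy for reviewers to rerun the estimate with their own values.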

Given the fact the Internet Archive and the Wayback Machine are fairly well known projects, with broad missions, deep history, extensive information, a range of public services, and a staff of dozens of engineers who collaborate in a Slack workspace and other mediums…

What project would YOU like to suggest to us that might best fit your skills, experience, and interests, and our mission of Universal Access to All Knowledge?

Mentors: Mark Graham, Director, the Wayback Machine

Dr. Sawood Alam, Research Lead, Wayback Machine

Skills required/preferred: Come with what you have!

Size: 350 hours.

Difficulty: Could be hard… up to you!
