Building the Model Behind PDF.js Alt Text
2025-09-01
This was the very first end-to-end machine learning project with modern LLMs in Firefox: building alt text generation for our PDF.js viewer. It became the foundation for many of the AI features we are now bringing into the browser.
In this post, I will not talk about the runtime inside Firefox, but instead focus on the model itself. I will share the journey step by step: from fine-tuning the model and addressing biased data, to building a validation app and a feedback loop that grew into a full pipeline for continuous improvement.
Starting from an Existing Model
I began with a pre-trained ViT + GPT-2 model, compact enough to run locally yet effective at generating image descriptions. I picked this model because transformers.js, the runtime we initially chose when we first started experimenting with AI in Firefox, came with an example image-to-text project that used a very similar architecture, so it was a natural starting point.
My first idea was to make the model even smaller so it could run more easily in the browser. I swapped the GPT-2 text decoder for a distilled version and then retrained the model. For training, I used popular datasets like COCO and Flickr-30k, and at first the results looked promising: the model worked surprisingly well given its compact size.
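To give a concrete picture of the architecture, here is a minimal sketch of how a ViT encoder can be paired with a distilled GPT-2 decoder using Hugging Face transformers. The checkpoint names are illustrative, not necessarily the ones used in the project; the actual training code is linked at the end of this post.

```python
from transformers import (
    VisionEncoderDecoderModel,
    ViTImageProcessor,
    GPT2TokenizerFast,
)

# Pair a ViT vision encoder with DistilGPT-2 as the text decoder.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",
    "distilgpt2",
)
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2TokenizerFast.from_pretrained("distilgpt2")

# GPT-2 has no pad token; reuse EOS so batched caption training works.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```

From here, fine-tuning is standard sequence-to-sequence training: pixel values from the image processor go in, and tokenized captions serve as the labels.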
Wrestling with Biased Data
But soon I hit some important limitations. The human annotations in those datasets often carried biases: gender stereotypes (“a man riding a skateboard” even when the subject was not clearly a man), or descriptions that leaned into cultural assumptions. In some cases, I even noticed fatphobic language creeping into the labels. On top of that, the smaller architecture meant the model sometimes produced clunky or imprecise captions.
Rebuilding Better Data
To address this, I focused on creating a cleaner dataset derived from COCO and Flickr-30k. I wanted to preserve the diversity of images but replace the biased human labels with synthetic, one-sentence descriptions generated by a large language model.
At first, we experimented with doing this manually, but it quickly became clear that large models, when prompted carefully, could produce captions that were reliable, inexpensive, and incredibly fast to generate. The resulting annotations were shorter, more neutral, and far better suited for a smaller architecture. This new dataset helped reduce bias and improve accuracy, giving me a strong foundation for the next rounds of fine-tuning.
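The paragraph above describes the idea rather than the exact tooling, so here is a hedged sketch of what careful prompting can look like. The provider and model (an OpenAI-style client with gpt-4o) and the prompt wording are my assumptions for illustration, not necessarily what we used; any capable vision LLM would fit.

```python
import base64
from openai import OpenAI  # assumed provider; any vision LLM works here

client = OpenAI()

# Hypothetical prompt: one short sentence, no guesses about identity.
PROMPT = (
    "Describe this image in one short, factual sentence suitable as alt text. "
    "Do not guess gender, ethnicity, or body type, and avoid value judgments."
)

def synthetic_caption(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip()
```

Running something like this over every image in COCO and Flickr-30k yields the shorter, more neutral captions described above.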
Building a Human-in-the-Loop App
A major milestone was building a small app for humans to validate model output and feed corrections back into retraining. In the app, users see the generated alt text for many random images and make edits, and those edits become new or corrected training examples. The validation workflow was simple but transformative: it automated dataset improvement, guiding each round of fine-tuning.
This setup might sound similar to RLHF (Reinforcement Learning from Human Feedback), but it is not quite the same. RLHF uses human preferences to train a reward model that guides reinforcement learning. What I built is simpler: a human-in-the-loop supervised pipeline. People correct the outputs directly, and those corrections go straight into the next training dataset. It is less complex than RLHF but highly effective for improving the model in a practical way.
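The mechanics of that loop are deliberately simple. As a sketch, assuming the app exports its edits as JSONL (the file layout and field names here are illustrative, not the app's actual schema), folding corrections into the next training set is just an overwrite by image id:

```python
import json

def merge_corrections(dataset_path: str, corrections_path: str, out_path: str) -> None:
    """Overwrite model-generated captions with human edits for the next fine-tune."""
    # image_id -> human-corrected caption
    corrections = {}
    with open(corrections_path) as f:
        for line in f:
            rec = json.loads(line)
            corrections[rec["image_id"]] = rec["corrected_caption"]

    with open(dataset_path) as src, open(out_path, "w") as dst:
        for line in src:
            rec = json.loads(line)
            # Prefer the human edit whenever one exists.
            rec["caption"] = corrections.get(rec["image_id"], rec["caption"])
            dst.write(json.dumps(rec) + "\n")
```

No reward model, no policy optimization: the human edit simply becomes the label.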
Iterative Fine-Tuning
Armed with better data and validation feedback, I ran multiple rounds of fine-tuning. The improvements were obvious: the alt-text became more accurate, more neutral, and overall more useful. Still, I had to keep iterating to deal with class imbalance, the problem where some categories of images were overrepresented in the dataset, leading the model to favor them more than others. Balancing this was key to making the model more reliable across all kinds of content.
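There are several standard remedies for class imbalance; one I can sketch is weighted sampling, which draws rare categories as often as common ones during training. The category labels below are illustrative, since the exact bucketing of images is not covered here.

```python
from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler

def balanced_loader(dataset, labels, batch_size=32):
    # Weight each example inversely to its category's frequency so that
    # overrepresented categories stop dominating each training batch.
    counts = Counter(labels)
    weights = [1.0 / counts[label] for label in labels]
    sampler = WeightedRandomSampler(weights, num_samples=len(weights),
                                    replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```

Other options include capping per-category counts or augmenting rare categories; the goal is the same either way: keep any single slice of the data from dominating the gradient.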
What I Learned
- Fine-tuning is just the start: the quality of training data matters more than model size.
- Bias creeps in quickly: annotation norms reflect cultural assumptions, and that affects models.
- Human feedback is gold: validating and correcting output closes the loop, improving the model with real-world input.
- Building the full pipeline is powerful: from fine-tuning to validation to retraining, the full cycle lets me evolve the model with each iteration.
Where It Lives Now
The outcome of this journey now lives in PDF.js. Firefox can generate alt-text for PDFs locally and privately, with the model running directly on-device so your data never leaves your machine. There is still room to grow: further fine-tuning, tackling class imbalance more systematically, and extending the system to multiple languages. But as a first complete end-to-end project, this was a strong foundation and a clear proof of what is possible.
On a personal note, this project showed me how much of AI work is really about iteration, patience, and learning from mistakes. It was a reminder that real progress comes not from one perfect model, but from building the right process to keep improving it.
Useful links
- Hugging Face: The trained model
- Hugging Face: The Flickr30k dataset with cleaner alt text
- GitHub: The code to train the model
- GitHub: The app to validate model output
- GitHub: The PDF.js project
- transformers.js
- Mozilla Blog: Help us improve our alt-text generation model
- Mozilla Hacks: Experimenting with Local Alt-Text Generation in Firefox Nightly