All I Want for Christmas is a Better Alt Text – Part 1
2025-12-15
Context: Improving Alt Text for Firefox
Earlier this year, I built the backend for the local alt text generation feature in Firefox. Nearly half of the images on the web still lack alternative text, creating a major accessibility barrier for screen reader users. The goal of this work is straightforward but ambitious: generate high-quality alt text entirely on device, preserving user privacy while improving access to visual content.
The first implementation focused on PDF.js, primarily as a controlled environment to validate the approach. Now that the runtime performance is good enough, the next step is to generalize this capability across the entire browser so that all web images can benefit from meaningful descriptions. Before that generalization, however, improving accuracy is essential.
From a modeling perspective, the system pairs a Vision Transformer (ViT) image encoder with a DistilGPT-2 text decoder, a combination of roughly 182 million parameters that fits under 200 MB once quantized. Improving this system involves multiple, often competing dimensions: bias reduction, description accuracy, and inference speed. This post focuses on the data side of the problem, specifically dataset quality and bias. Part 2 will look at model-level improvements for accuracy and performance.
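For illustration, here is a minimal sketch of running such a captioner locally with the Hugging Face transformers image-to-text pipeline. The checkpoint ID (Mozilla/distilvit) and the image path are assumptions; check the model card for the exact identifier.

# Minimal sketch: running a ViT + DistilGPT-2 captioner locally.
# The checkpoint ID and image path are assumptions, not confirmed values.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Mozilla/distilvit")
print(captioner("example.jpg")[0]["generated_text"])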
First Round: Removing Bias with GPT-4o
The original image captions contained several recurring issues:
- Gender bias: skateboarders described as “men”, nurses as “women”
- Age stereotyping: unnecessary or reductive age descriptors
- Offensive or outdated language: culturally insensitive terms that no longer belong in a modern dataset
To address this, I used GPT-4o to systematically transform captions from Flickr30k and COCO, removing demographic descriptors that were not visually required. The resulting datasets are available on Hugging Face (for example, Mozilla/flickr30k-transformed-captions-gpt4o) and were used to train the current Firefox local alt text model.
For more background on this initial effort, see the Mozilla Hacks post and the Firefox blog announcement. The model trained on these transformed captions is the one currently shipping in Firefox.
Second Round: Measuring What Actually Improved
Qualitative panel testing showed that the transformed captions were generally better received by humans, but that only answered part of the question. What exactly improved, by how much, and what problems remained hidden in the data?
This post documents the second round of work, which focused on building systematic measurement tools to:
- Quantify how much bias was actually removed
- Verify that transformed captions still describe the images accurately
- Identify class imbalance and other structural issues
- Lay the groundwork for targeted fixes, including synthetic data generation
When training vision-language models, dataset quality is often treated as a secondary concern compared to architecture or training tricks. In practice, the data is the foundation. If the dataset is biased, noisy, or unbalanced, no amount of fine-tuning will fully compensate.
The Problem Space
After the GPT-4o transformation, several open questions remained:
- Did bias removal actually work in a measurable way?
- Was semantic meaning preserved during transformation?
- Did image–text alignment degrade or improve?
- Are some visual concepts severely underrepresented?
- Can these checks be repeated reliably for future dataset versions?
Answering these questions requires more than a single score or benchmark.
A Multi-Metric Quality Analysis
I built a dataset quality analysis tool that evaluates four complementary dimensions. The emphasis is on improving the training data itself, rather than compensating for data issues later at the model level.
1. Image–Text Alignment (CLIP Score)
CLIP provides a convenient proxy for how well a caption matches its corresponding image. By embedding both modalities and computing cosine similarity, I obtain a rough but useful alignment score.
A key improvement in this round was upgrading from CLIP ViT-B/32 to ViT-L/14 @ 336 px. The larger model produces lower absolute scores, but it is significantly more discriminative, making it easier to separate strong alignments from weak ones.
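As a rough sketch of how this alignment score can be computed with the transformers CLIP implementation (using the standard openai/clip-vit-large-patch14-336 checkpoint; the image path and caption below are placeholders):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-large-patch14-336"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

def clip_score(image_path: str, caption: str) -> float:
    # Embed the image and the caption, then take the cosine similarity.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

print(clip_score("example.jpg", "A child plays with a dog in a park."))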
Interpretation guidelines:
- Excellent: ≥ 0.35
- Good: 0.30–0.35
- Fair: 0.25–0.30
- Poor: < 0.25
On the transformed dataset, I observe scores of 0.311 with ViT-B/32 (Good) and 0.284 with ViT-L/14 @ 336 px (Fair but more informative).
2. Caption Fidelity (BERTScore)
Removing bias should not come at the cost of semantic drift. To verify this, I used BERTScore with a RoBERTa-large backbone to compare original and transformed captions.
Scores above 0.90 generally indicate that the core meaning is preserved. The transformed dataset achieves 0.904, which falls in the “excellent” range.
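For illustration, a minimal fidelity check along these lines with the bert-score package looks like this (the caption pair is invented):

from bert_score import score

originals = ["Two young men play chess in the park."]
transformed = ["Two people play chess in the park."]

# F1 close to 1.0 means the rewritten caption keeps the original meaning.
P, R, F1 = score(transformed, originals, model_type="roberta-large", lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")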
3. Bias Detection Before and After
Bias reduction is only meaningful if it can be measured. I tracked mentions of protected attributes across seven categories: gender, race or ethnicity, nationality, age, religion, sexual orientation, and disability.
By comparing original and transformed captions on the same samples, I can directly quantify the effect of the transformation. On a 1 000-sample evaluation set, gender mentions dropped from 70 percent to zero, race and ethnicity mentions dropped by 97 percent, and nationality mentions were completely eliminated. Age-related terms remain more common, largely because they are often visually relevant, for example when describing children.
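The detection itself can be as simple as per-category keyword matching. Here is a toy version of the idea; the term lists and captions are illustrative, not the lists the actual tool uses.

import re

# Illustrative term lists; the real tool covers seven categories.
BIAS_TERMS = {
    "gender": ["man", "men", "woman", "women", "boy", "girl"],
    "age": ["young", "old", "elderly"],
    "nationality": ["american", "french", "japanese"],
}

original_captions = ["A young woman rides a skateboard.", "Two men play chess."]
transformed_captions = ["A person rides a skateboard.", "Two people play chess."]

def mention_rate(captions, terms):
    # Fraction of captions containing at least one term from the category.
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    return sum(bool(pattern.search(c)) for c in captions) / len(captions)

for category, terms in BIAS_TERMS.items():
    before = mention_rate(original_captions, terms)
    after = mention_rate(transformed_captions, terms)
    print(f"{category}: {before:.0%} -> {after:.0%}")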
4. Object Distribution and Imbalance
Finally, I analyzed object frequency to identify long-tail problems. Using metrics such as the Gini coefficient and Shannon entropy, the tool highlights severe imbalance: thousands of objects appear only a handful of times.
This analysis automatically produces lists of rare objects and sampling weights that can be used for rebalancing during training.
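Both statistics are cheap to compute from raw object counts; a small sketch with made-up frequencies:

import numpy as np

# Toy per-object frequencies; the real analysis runs over thousands of objects.
counts = np.array([5000, 1200, 300, 40, 7, 3, 1], dtype=float)

def gini(x):
    # Standard cumulative-share formula: 0 = perfectly even, 1 = maximally unequal.
    x = np.sort(x)
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

probs = counts / counts.sum()
entropy = -np.sum(probs * np.log2(probs))

# Inverse-frequency weights usable for rebalanced sampling during training.
weights = 1.0 / counts
weights /= weights.sum()

print(f"Gini: {gini(counts):.3f}, Shannon entropy: {entropy:.2f} bits")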
Using CLIP as a Training Signal
Beyond evaluation, CLIP can also be used to guide training directly. I experimented with a combined loss that adds a CLIP-based alignment term to the standard cross-entropy loss for caption generation.
The intuition is simple: encourage the model to generate captions that are not only fluent, but also visually grounded. Early results suggest modest but consistent gains in CLIP score, at the cost of slower training and higher memory usage.
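A schematic version of this combined objective; the function name, inputs, and the lambda_clip weight are illustrative rather than the project’s actual API.

import torch.nn.functional as F

def combined_loss(logits, target_ids, image_embeds, caption_embeds, lambda_clip=0.1):
    # Standard token-level cross-entropy for caption generation.
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))
    # CLIP-style alignment: cosine similarity of normalized embeddings.
    img = F.normalize(image_embeds, dim=-1)
    txt = F.normalize(caption_embeds, dim=-1)
    alignment = (img * txt).sum(dim=-1).mean()
    # Reward alignment: higher similarity lowers the total loss.
    return ce - lambda_clip * alignment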
Running the Quality Analysis
The quality analysis tool integrates directly into the project’s Makefile:
# Quick test (100 samples)
make quality-report-quick
# Full analysis on test split
make quality-report SPLIT=test
# Custom analysis
make quality-report SPLIT=train MAX_SAMPLES=1000 OUTPUT_DIR=./my_reports
Example Dataset Quality Report
Below is an excerpt from the generated quality report for the full Flickr30k transformed dataset. It illustrates how the metrics come together in practice.
================================================================================
DATASET QUALITY REPORT
================================================================================
Dataset: Mozilla/flickr30k-transformed-captions-gpt4o
Samples: 31 014
IMAGE–TEXT ALIGNMENT (CLIP)
Score: 0.274 ± 0.036 Assessment: FAIR
CAPTION FIDELITY (BERTScore)
Score: 0.899 ± 0.023 Assessment: GOOD
BIAS DETECTION (Original → Transformed)
Gender: 67% → 0%
Race/Ethnicity: 27% → 1%
Nationality: 1% → 0%
Age: 19% → 17%
OBJECT DISTRIBUTION
Gini coefficient: 0.889
Rare classes (<50 samples): 6 210
================================================================================
The report confirms that the GPT-4o transformation is highly effective at removing demographic bias while preserving meaning. At the same time, it surfaces two remaining issues: only fair image–text alignment and severe class imbalance.
Output Files
The analysis produces the following artifacts:
Directory: quality_reports/
• summary.json - Aggregate metrics in JSON format
• quality_report.txt - Human-readable summary report
• per_example_scores.csv - Per-sample CLIP, BERT, and bias scores
• ranked_by_combined.csv - Samples ranked by combined quality score
• object_counts.csv - Object frequency distribution
• objects_below_50.csv - Rare / underrepresented objects (<50 samples)
• reweighting_probs.csv - Sampling probabilities for balanced training
• lorenz_curve.png - Object distribution inequality visualization
• top_failures/ - Top failure cases with images and captions
These artifacts make it easy to audit dataset quality, compare runs, and target specific weaknesses.
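Because the artifacts are plain CSV, they are easy to slice with pandas. For example, pulling the weakest samples out of the combined ranking might look like this; the column layout and sort order are assumptions about the report schema.

import pandas as pd

# Assumed layout: one row per sample, sorted from best to worst combined score.
ranked = pd.read_csv("quality_reports/ranked_by_combined.csv")
worst = ranked.tail(20)
print(worst.to_string(index=False))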
Key Takeaways
- Dataset quality cannot be captured by a single metric
- Bias removal can be measured and verified quantitatively
- Larger CLIP models are more useful for discrimination, even if absolute scores are lower
- Alignment-aware training objectives show promise
- Class imbalance remains a major, and solvable, issue
What Comes Next
None of these improvements are shipping yet. They are preparatory steps that make future work safer and more predictable. With solid metrics in place, the next phase is to train improved models, validate gains rigorously, and continue reducing long-tail failures.
The long-term goal remains unchanged: provide high-quality, privacy-preserving alt text for the large fraction of web images that still lack it, and do so in a way that is fair, transparent, and measurable.
References and Resources
Background
- Experimenting with local alt text generation in Firefox Nightly
- Help us improve our alt text generation model
Datasets
- Mozilla/flickr30k-transformed-captions-gpt4o on Hugging Face
Metrics
- CLIP: Learning Transferable Visual Models From Natural Language Supervision
- BERTScore: Evaluating Text Generation with BERT
- Gini coefficient
- Shannon entropy