<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://tarekziade.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://tarekziade.github.io/" rel="alternate" type="text/html" /><updated>2026-02-18T21:11:38+00:00</updated><id>https://tarekziade.github.io/feed.xml</id><title type="html">Tarek Ziadé</title><subtitle>Notes on Building Software</subtitle><entry><title type="html">Running SmolLM-135M in rustnn with flexible inputs</title><link href="https://tarekziade.github.io/2026/02/18/running-smollm-135m-in-rustnn/" rel="alternate" type="text/html" title="Running SmolLM-135M in rustnn with flexible inputs" /><published>2026-02-18T00:00:00+00:00</published><updated>2026-02-18T00:00:00+00:00</updated><id>https://tarekziade.github.io/2026/02/18/running-smollm-135m-in-rustnn</id><content type="html" xml:base="https://tarekziade.github.io/2026/02/18/running-smollm-135m-in-rustnn/"><![CDATA[<p>WebNN is emerging as a portable, browser-friendly inference API.
But LLMs hit a hard wall: <strong>dynamic inputs</strong>.</p>

<p>Autoregressive transformers fundamentally mutate state at runtime. KV cache
tensors evolve at every step, sequence lengths vary with prompts, and shape
expressions flow through
operators like <code class="language-plaintext highlighter-rouge">Shape</code>, <code class="language-plaintext highlighter-rouge">Gather</code>, <code class="language-plaintext highlighter-rouge">Concat</code>, <code class="language-plaintext highlighter-rouge">Reshape</code>, and <code class="language-plaintext highlighter-rouge">Expand</code>.</p>

<p>Today, this does not map cleanly to WebNN’s static-graph constraints.</p>

<p>At step 1, KV cache length is 1. At step 512, KV cache length is 512.
That is not a static graph.</p>
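
<p>To make that concrete, here is a minimal NumPy sketch (with made-up batch and head sizes, not SmolLM’s real configuration) of how a KV-cache tensor changes shape at every decode step:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Illustrative sizes only; the real model defines its own layout.
batch, num_heads, head_dim = 1, 8, 64

# Before the first step the cache is empty along the sequence axis.
past_keys = np.zeros((batch, num_heads, 0, head_dim), dtype=np.float32)

for step in range(512):
    # Each decode step produces keys for exactly one new token...
    new_keys = np.random.rand(batch, num_heads, 1, head_dim).astype(np.float32)
    # ...and the cache grows by one along the sequence dimension.
    past_keys = np.concatenate([past_keys, new_keys], axis=2)

print(past_keys.shape)  # (1, 8, 512, 64): a different shape at every step
</code></pre></div></div>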

<h2 id="why-this-matters">Why this matters</h2>

<p>If this is not solved, WebNN stays limited to fixed-shape demos and out of
reach for many LLM use cases.</p>

<p>For real product workloads, people need variable prompt lengths, efficient
token-by-token decode, and stable latency as context grows. Without that, local
browser LLM UX degrades quickly and teams default back to heavier alternatives.</p>

<h2 id="dynamic-inputs-in-practice">Dynamic inputs in practice</h2>

<p>This problem is now well documented in the WebNN WG in
<a href="https://github.com/webmachinelearning/webnn/issues/883">Issue #883: Support flexible input sizes</a>.</p>

<p>The issue captures exactly what we see in practice:</p>

<ul>
  <li>Vision models with runtime-determined resolution</li>
  <li>Speech/decoder models where KV cache grows by one token per step</li>
  <li>LLMs with arbitrary prompt lengths and dynamic cache shapes</li>
</ul>

<p>In other words: modern inference workloads.</p>

<h2 id="the-onnx-runtime-workaround-and-why-it-hurts">The ONNX Runtime workaround (and why it hurts)</h2>

<p>ONNX Runtime WebNN has had to work around this limitation by routing dynamic-shape
parts away from WebNN and into WASM execution paths.</p>

<p>It works, but performance is terrible for autoregressive generation because you
bounce between fast backend code and slow fallback code in the hottest part of
the loop.</p>

<p>This architecture can make demos pass, but it creates significant performance
penalties in real autoregressive workloads.</p>

<p>In preliminary runs, keeping decode on one backend avoids the repeated fallback
round-trips that dominate token latency.</p>

<p>So instead of accepting that, we decided to push WebNN support further.</p>

<h2 id="i-started-this-in-malta">I started this in Malta</h2>

<p>I started prototyping this while on vacation in Malta.</p>


<p>What began as a small experiment quickly turned into deep changes across three
repositories: converter internals, runtime shape validation, and KV-cache
plumbing.</p>

<p>The work happened across:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">webnn-graph</code>: ONNX lowering and dynamic-input metadata support</li>
  <li><code class="language-plaintext highlighter-rouge">rustnn</code> (<code class="language-plaintext highlighter-rouge">tarek-flexible-input</code>): dynamic dimensions in graph/runtime + checked execution</li>
  <li><code class="language-plaintext highlighter-rouge">pywebnn</code>: Python demos and loading-from-Hub workflows</li>
</ul>

<p>I also made sure to surface <code class="language-plaintext highlighter-rouge">rustnn</code> through Python (<code class="language-plaintext highlighter-rouge">pywebnn</code>) very early.</p>

<p>Most ML engineers live in Python land, and I wanted this work to be reachable by
that ecosystem immediately: easier model validation, easier parity checks against
<code class="language-plaintext highlighter-rouge">transformers</code>, and faster feedback from people who already ship models every day.</p>

<h2 id="what-changed-in-practice">What changed in practice</h2>

<p>The key was to support <strong>bounded dynamic dimensions</strong> end to end.</p>

<p>Why bounded and not fully dynamic? Because many backends still need strong
compile-time guarantees for validation, allocation, and kernel planning.
Fully unbounded shapes are hard to optimize and hard to validate safely.</p>

<p>Bounded dynamic dimensions are the practical compromise: keep symbolic runtime
flexibility, but define a maximum range so memory and execution planning remain
deterministic.</p>

<p>This allows the full autoregressive decode loop to stay inside the WebNN
backend, without bouncing into slower fallback paths.</p>

<p>This is also better than common alternatives:</p>

<ul>
  <li>Padding everything to worst-case shapes wastes memory and compute</li>
  <li>Re-exporting one graph per shape explodes complexity</li>
  <li>Falling back dynamic parts to WASM in hot decode loops kills throughput</li>
</ul>

<p>For example, you can bound a sequence dimension to 2048 tokens: large enough
for real prompts, still finite for backend planning and allocation.</p>

<p>In <code class="language-plaintext highlighter-rouge">rustnn</code>, tensor dimensions can now be static values or dynamic descriptors
with a name and a max size, then checked at runtime:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"inputs"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"x"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"dataType"</span><span class="p">:</span><span class="w"> </span><span class="s2">"float32"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"shape"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"batch"</span><span class="p">,</span><span class="w"> </span><span class="nl">"maxSize"</span><span class="p">:</span><span class="w"> </span><span class="mi">16</span><span class="w"> </span><span class="p">},</span><span class="w">
        </span><span class="mi">128</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
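
<p>Conceptually, the runtime check for such a descriptor is small. Here is a sketch in Python (not rustnn’s actual code) of what “checked at runtime” means for one dimension:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def check_dim(actual: int, declared) -&gt; None:
    """Validate one input dimension against its graph declaration.

    `declared` is either a fixed int, or a dict such as
    {"name": "batch", "maxSize": 16} for a bounded dynamic dimension.
    """
    if isinstance(declared, int):
        if actual != declared:
            raise ValueError(f"expected {declared}, got {actual}")
    elif actual &lt; 1 or actual &gt; declared["maxSize"]:
        raise ValueError(
            f"dimension {declared['name']!r} out of bounds: "
            f"got {actual}, max is {declared['maxSize']}"
        )
</code></pre></div></div>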

<p>In <code class="language-plaintext highlighter-rouge">webnn-graph</code>, ONNX conversion can preserve unresolved input dynamics while
still lowering shape-driving expressions needed by WebNN.</p>

<p>That lets us keep flexibility where possible while still emitting valid WebNN
graphs.</p>

<h2 id="smollm-135m-converted-and-running">SmolLM-135M converted and running</h2>

<p>With flexible inputs supported end to end, SmolLM-135M converts cleanly: no
shape rewriting hacks, no per-length exports, no WASM fallback in the decode
loop. The artifacts are published here:</p>

<ul>
  <li><a href="https://huggingface.co/tarekziade/SmolLM-135M-webnn">tarekziade/SmolLM-135M-webnn</a></li>
</ul>

<p>Then I built a Python demo in
<a href="https://github.com/rustnn/pywebnn/blob/tarekziade-flexible-input/examples/smollm_from_hub.py"><code class="language-plaintext highlighter-rouge">pywebnn/examples/smollm_from_hub.py</code></a>
that:</p>

<ul>
  <li>downloads <code class="language-plaintext highlighter-rouge">model.webnn</code>, <code class="language-plaintext highlighter-rouge">model.weights</code>, and manifest from the Hub</li>
  <li>downloads <code class="language-plaintext highlighter-rouge">tokenizer.json</code></li>
  <li>runs token-by-token generation with dynamic KV cache growth</li>
  <li>optionally compares output against <code class="language-plaintext highlighter-rouge">transformers</code></li>
</ul>

<p>A few extracts from that demo:</p>

<p>The demo defaults to the Hub-hosted WebNN artifacts:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DEFAULT_MODEL_ID</span> <span class="o">=</span> <span class="s">"tarekziade/SmolLM-135M-webnn"</span>

<span class="n">model_files</span> <span class="o">=</span> <span class="n">resolve_model_files</span><span class="p">(</span><span class="n">args</span><span class="p">.</span><span class="n">model_id</span><span class="p">,</span> <span class="n">force</span><span class="o">=</span><span class="n">args</span><span class="p">.</span><span class="n">force_download</span><span class="p">)</span>
<span class="n">graph</span> <span class="o">=</span> <span class="n">webnn</span><span class="p">.</span><span class="n">MLGraph</span><span class="p">.</span><span class="n">load</span><span class="p">(</span>
    <span class="n">model_files</span><span class="p">[</span><span class="s">"graph"</span><span class="p">],</span>
    <span class="n">manifest_path</span><span class="o">=</span><span class="n">model_files</span><span class="p">[</span><span class="s">"manifest"</span><span class="p">],</span>
    <span class="n">weights_path</span><span class="o">=</span><span class="n">model_files</span><span class="p">[</span><span class="s">"weights"</span><span class="p">],</span>
<span class="p">)</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">past_key_values</code> holds the growing KV-cache tensors returned by the previous
step. The decode loop feeds them back on every token:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">run_step</span><span class="p">(</span><span class="n">token_id</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">position</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">:</span>
    <span class="n">inputs</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"input_ids"</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="n">token_id</span><span class="p">]],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">int64</span><span class="p">),</span>
        <span class="s">"position_ids"</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="n">position</span><span class="p">]],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">int64</span><span class="p">),</span>
        <span class="s">"attention_mask"</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="n">position</span> <span class="o">+</span> <span class="mi">1</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">int64</span><span class="p">),</span>
        <span class="o">**</span><span class="n">past_key_values</span><span class="p">,</span>
    <span class="p">}</span>
    <span class="n">outputs</span> <span class="o">=</span> <span class="n">context</span><span class="p">.</span><span class="n">compute</span><span class="p">(</span><span class="n">graph</span><span class="p">,</span> <span class="n">inputs</span><span class="p">)</span>
    <span class="p">...</span>
</code></pre></div></div>
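
<p>After each compute call, the loop copies the cache tensors returned by this step back into <code class="language-plaintext highlighter-rouge">past_key_values</code> for the next one. Schematically (the real input and output names come from the exported graph; the <code class="language-plaintext highlighter-rouge">present.*</code> / <code class="language-plaintext highlighter-rouge">past_key_values.*</code> names below are only illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative cache update; actual tensor names depend on the exported graph.
def update_past(outputs: dict, num_layers: int) -&gt; dict:
    past = {}
    for i in range(num_layers):
        past[f"past_key_values.{i}.key"] = outputs[f"present.{i}.key"]
        past[f"past_key_values.{i}.value"] = outputs[f"present.{i}.value"]
    return past
</code></pre></div></div>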

<p>The demo can also run a <code class="language-plaintext highlighter-rouge">transformers</code> baseline and fail fast on divergence:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">args</span><span class="p">.</span><span class="n">compare_transformers</span><span class="p">:</span>
    <span class="n">hf_generated</span><span class="p">,</span> <span class="n">hf_text</span><span class="p">,</span> <span class="n">hf_prompt_ids</span> <span class="o">=</span> <span class="n">run_transformers_baseline</span><span class="p">(...)</span>
    <span class="p">...</span>
    <span class="k">if</span> <span class="n">generated_text</span> <span class="o">!=</span> <span class="n">hf_text</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[ERROR] WebNN and transformers generated different text output"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>

<p>Correctness checks against <code class="language-plaintext highlighter-rouge">transformers</code> are critical. Performance
improvements mean nothing if generation diverges.</p>

<h2 id="lessons-learned">Lessons learned</h2>

<ul>
  <li>Fully unbounded dynamic shapes are rarely necessary for practical decode loops</li>
  <li>Bounded flexibility captures most real workloads while keeping backends sane</li>
  <li>Python exposure (<code class="language-plaintext highlighter-rouge">pywebnn</code>) accelerates model validation and ecosystem feedback</li>
</ul>

<h2 id="what-is-next">What is next</h2>

<p>Flexible inputs will likely be important if WebNN is to support real LLM workloads.</p>

<p>Static graphs alone are not enough for modern inference. Bounded flexibility is
the pragmatic bridge.</p>

<p>And while this work pushes WebNN forward, we are also giving a lot of love to
the TensorRT backend these days, because high-performance local inference matters
just as much as API design.</p>

<h2 id="links">Links</h2>

<ul>
  <li>rustnn docs: <a href="https://rustnn.github.io/rustnn/">https://rustnn.github.io/rustnn/</a></li>
  <li>pywebnn docs: <a href="https://rustnn.github.io/pywebnn/">https://rustnn.github.io/pywebnn/</a></li>
  <li>rustnn WPT conformance dashboard: <a href="https://rustnn.github.io/rustnnpt/">https://rustnn.github.io/rustnnpt/</a></li>
</ul>]]></content><author><name>Tarek Ziade</name></author><category term="WebNN" /><category term="Rust" /><category term="LLM" /><category term="SmolLM" /><category term="rustnn" /><summary type="html"><![CDATA[WebNN is emerging as a portable, browser-friendly inference API. But LLMs hit a hard wall: dynamic inputs.]]></summary></entry><entry><title type="html">Catching Code Complexity with a Local LLM</title><link href="https://tarekziade.github.io/2026/01/31/catching-quadratic-complexity/" rel="alternate" type="text/html" title="Catching Code Complexity with a Local LLM" /><published>2026-01-31T00:00:00+00:00</published><updated>2026-01-31T00:00:00+00:00</updated><id>https://tarekziade.github.io/2026/01/31/catching-quadratic-complexity</id><content type="html" xml:base="https://tarekziade.github.io/2026/01/31/catching-quadratic-complexity/"><![CDATA[<p>Performance issues in Python often don’t look like bugs.</p>

<p>They don’t crash, they don’t fail tests, and they don’t stand out in code review.
They just quietly turn into cliffs when the input size grows.</p>

<p>This post is about one such performance fix in <code class="language-plaintext highlighter-rouge">transformers</code>, what it revealed,
and a small experiment that came out of it: <strong>LoopSleuth</strong>, a local LLM-powered
complexity scanner.</p>

<h2 id="it-started-with-a-tokenizer-converter">It Started With a Tokenizer Converter</h2>

<p>While working on <code class="language-plaintext highlighter-rouge">transformers</code>, I fixed a performance issue in
<a href="https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py"><code class="language-plaintext highlighter-rouge">convert_slow_tokenizer.py</code></a>
that took a tokenizer conversion step from <strong>4 minutes</strong> down to <strong>~1 second</strong>
when running on very large vocabularies (100k+ tokens).</p>

<h3 id="the-test-that-surfaced-it">The Test That Surfaced It</h3>

<p>This started when CI flagged <code class="language-plaintext highlighter-rouge">test_voxtral_tokenizer_converts_from_tekken</code> as
the slowest test in the suite.</p>

<p>The test loads <code class="language-plaintext highlighter-rouge">mistralai/Voxtral-Mini-3B-2507</code> and forces the fallback path to
<code class="language-plaintext highlighter-rouge">TokenizersBackend</code>.</p>

<p>That fallback triggers the slow→fast tokenizer conversion step — and that
conversion was doing repeated <code class="language-plaintext highlighter-rouge">.index()</code> lookups inside a sort key, turning
large vocabularies into a performance cliff.</p>

<p>The root cause was a classic scaling trap.</p>

<h3 id="the-original-pattern">The Original Pattern</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># BEFORE (simplified excerpt)
</span><span class="k">for</span> <span class="n">rank</span><span class="p">,</span> <span class="n">token</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">bpe_ranks</span><span class="p">):</span>
    <span class="n">local</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span>
        <span class="n">local</span><span class="p">,</span>
        <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">(</span>
            <span class="n">bpe_ranks</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span>
            <span class="n">bpe_ranks</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span>
        <span class="p">),</span>
    <span class="p">)</span>
</code></pre></div></div>

<p>(Simplified excerpt — the key issue is the repeated <code class="language-plaintext highlighter-rouge">.index()</code> inside the sort
key.)</p>

<p>At first glance this looks harmless.</p>

<p>But <code class="language-plaintext highlighter-rouge">list.index()</code> is <strong>O(n)</strong>.</p>

<p>And the real killer is that it happens inside a <code class="language-plaintext highlighter-rouge">sorted()</code> key function.</p>

<p>Sorting <code class="language-plaintext highlighter-rouge">local</code> means computing the key for every element, and each key performs
two linear searches through <code class="language-plaintext highlighter-rouge">bpe_ranks</code>: <code class="language-plaintext highlighter-rouge">sorted()</code> calls the key function once
per element (O(m)), and each key calls <code class="language-plaintext highlighter-rouge">.index()</code> twice (O(n)), so the total
becomes O(m·n) — often a scaling trap when m and n are both large.</p>

<h3 id="the-fix">The Fix</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># AFTER (reduces key computation from O(n) to O(1))
</span><span class="n">token_to_rank</span> <span class="o">=</span> <span class="p">{</span><span class="n">token</span><span class="p">:</span> <span class="n">rank</span> <span class="k">for</span> <span class="n">rank</span><span class="p">,</span> <span class="n">token</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">bpe_ranks</span><span class="p">)}</span>

<span class="k">for</span> <span class="n">rank</span><span class="p">,</span> <span class="n">token</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">bpe_ranks</span><span class="p">):</span>
    <span class="n">local</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span>
        <span class="n">local</span><span class="p">,</span>
        <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">(</span>
            <span class="n">token_to_rank</span><span class="p">[</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]],</span>
            <span class="n">token_to_rank</span><span class="p">[</span><span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]],</span>
        <span class="p">),</span>
    <span class="p">)</span>
</code></pre></div></div>

<p>The optimization is simple:</p>

<ul>
  <li>replace repeated linear searches with constant-time dictionary lookups</li>
</ul>

<p>This doesn’t eliminate all sorting work (the outer loop still sorts repeatedly),
but it removes the quadratic lookup cost that was dominating runtime.</p>
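
<p>A tiny, self-contained benchmark (illustrative sizes, not the real converter) makes the difference visible:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random
import time

# Reproduce the pattern in isolation: .index() inside a sort key
# versus a precomputed token-to-rank dictionary.
n = 50_000
bpe_ranks = [f"tok{i}" for i in range(n)]
random.shuffle(bpe_ranks)
pairs = [(random.choice(bpe_ranks), random.choice(bpe_ranks)) for _ in range(5_000)]

start = time.perf_counter()
sorted(pairs, key=lambda x: (bpe_ranks.index(x[0]), bpe_ranks.index(x[1])))
print("list.index key:", time.perf_counter() - start)   # slow

token_to_rank = {token: rank for rank, token in enumerate(bpe_ranks)}
start = time.perf_counter()
sorted(pairs, key=lambda x: (token_to_rank[x[0]], token_to_rank[x[1]]))
print("dict lookup key:", time.perf_counter() - start)  # fast
</code></pre></div></div>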

<p>The takeaway wasn’t just “use dicts” — it was that asymptotic traps often hide
in perfectly valid Python idioms.</p>

<h2 id="could-this-have-been-caught-automatically">Could This Have Been Caught Automatically?</h2>

<p>After landing that fix, I kept wondering:</p>

<blockquote>
  <p>How many other places in the codebase have the exact same pattern?</p>
</blockquote>

<p>This wasn’t a correctness issue:</p>

<ul>
  <li>everything worked</li>
  <li>tests passed</li>
  <li>the slowdown only appeared at scale</li>
</ul>

<p>And none of the linting tools I normally rely on flagged it.</p>

<p>Ruff’s PERF rules catch obvious constructs like unnecessary list copies, but
they don’t reason about <code class="language-plaintext highlighter-rouge">.index()</code> inside a sort key.</p>

<p>In theory, a linter <em>could</em> detect patterns like:</p>

<ul>
  <li>repeated <code class="language-plaintext highlighter-rouge">.index()</code> inside loops</li>
  <li><code class="language-plaintext highlighter-rouge">.index()</code> inside sort keys</li>
  <li>nested iteration over growing structures</li>
</ul>

<p>But most rule-based linters avoid making strong claims about asymptotic
complexity.</p>

<p>That’s a reasonable trade-off: linters are fast, deterministic, and low-noise —
but they often miss scaling issues unless you add very specific custom rules.</p>

<p>This is where I started wondering whether an LLM could help fill the gap.</p>

<h2 id="scanning-transformers-with-claude">Scanning Transformers With Claude</h2>

<p>As an experiment, I ran Claude Code over the repository with one question:</p>

<blockquote>
  <p>Find quadratic complexity patterns similar to the tokenizer converter bug.</p>
</blockquote>

<p>The result was surprisingly useful.</p>

<p>It scanned ~3,000 Python functions across the codebase in a few minutes and
flagged ~20 instances of the same anti-pattern:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">.index()</code> inside loops</li>
  <li><code class="language-plaintext highlighter-rouge">.index()</code> inside sort keys</li>
  <li>nested iteration patterns with superlinear blow-up at scale</li>
</ul>

<p>About half were genuine hot-path candidates; others were technically quadratic
but not performance-critical in practice.</p>

<p>For example:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ymls</span><span class="p">.</span><span class="n">sort</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">results</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="n">x</span><span class="p">[:</span><span class="o">-</span><span class="mi">4</span><span class="p">]))</span>
</code></pre></div></div>

<p>Or:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">aspect_ratios_ids</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">supported_aspect_ratios</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="n">ratio</span><span class="p">)</span>
</code></pre></div></div>

<p>All fixable with the same technique:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lookup</span> <span class="o">=</span> <span class="p">{</span><span class="n">val</span><span class="p">:</span> <span class="n">idx</span> <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">val</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">reference</span><span class="p">)}</span>
</code></pre></div></div>

<p>This report was a great proof of concept.</p>

<p>But it relied on a large hosted model.</p>

<h2 id="the-question-became-can-this-work-locally">The Question Became: Can This Work Locally?</h2>

<p>Instead of running a massive model in the cloud, I wanted to know:</p>

<ul>
  <li>could a small local model catch these patterns?</li>
  <li>could we build something closer to a linter?</li>
  <li>could we automate complexity review?</li>
</ul>

<p>That’s how I ended up hacking together a small prototype I called <strong>LoopSleuth</strong>.</p>

<h2 id="why-rust--llamacpp">Why Rust + llama.cpp?</h2>

<p>My first instinct was to build this as a Python script on top of
<code class="language-plaintext highlighter-rouge">transformers</code> itself.</p>

<p>But I wanted this experiment to be:</p>

<ul>
  <li>fast to start</li>
  <li>easy to distribute as a single binary in CI</li>
  <li>free of any Python runtime dependency</li>
  <li>easy to integrate into tooling</li>
</ul>

<p>A single static binary makes it easy to drop into CI, like Ruff.</p>

<p>And honestly, I also wanted an excuse to explore the Rust ecosystem that powers
tools like <strong>Ruff</strong> and <strong>Ty</strong>.</p>

<p>So LoopSleuth is written in Rust and uses:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">rustpython-parser</code> to extract functions</li>
  <li><code class="language-plaintext highlighter-rouge">llama.cpp</code> bindings for local inference</li>
</ul>

<p>In practice, a small model like <strong>Qwen2.5-Coder 3B (Q4)</strong> already gives
surprisingly good results for this narrow task.</p>

<h2 id="loopsleuth-a-small-complexity-scanner">LoopSleuth: A Small Complexity Scanner</h2>

<p>LoopSleuth is a CLI tool that:</p>

<ol>
  <li>parses Python modules</li>
  <li>extracts functions (each function is analyzed in isolation: signature + body, without full module context)</li>
  <li>sends each function to a local LLM</li>
  <li>asks a focused question:</li>
</ol>

<blockquote>
  <p>Does this contain patterns that may scale quadratically?</p>
</blockquote>

<p>If the model answers “QUADRATIC”, it also asks for an optimization suggestion.</p>

<p>This framing treats complexity as a heuristic warning (like a linter) rather
than a mathematical proof.</p>

<h3 id="how-it-works">How It Works</h3>

<p>The prompt is deliberately simple and constrained:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Classify this function as OK or QUADRATIC.
Look for list.index(), nested loops, or linear operations inside loops.
Return only one word: OK or QUADRATIC.
</code></pre></div></div>

<p>This makes the model focus on structural patterns rather than trying to perform
full dataflow analysis, and the constrained output format makes parsing reliable.</p>
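
<p>Put together, the whole pipeline fits in a few lines. Here is a sketch of the idea in Python (LoopSleuth itself does this in Rust with llama.cpp bindings; <code class="language-plaintext highlighter-rouge">classify_with_local_llm</code> stands in for whatever local inference call you use):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import ast

PROMPT = (
    "Classify this function as OK or QUADRATIC.\n"
    "Look for list.index(), nested loops, or linear operations inside loops.\n"
    "Return only one word: OK or QUADRATIC.\n\n{code}"
)

def extract_functions(source: str):
    """Yield (name, source) pairs for every function in a module."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            yield node.name, ast.get_source_segment(source, node)

def scan(source: str, classify_with_local_llm) -&gt; list:
    """Return the names of functions the model flags as QUADRATIC."""
    flagged = []
    for name, code in extract_functions(source):
        verdict = classify_with_local_llm(PROMPT.format(code=code)).strip().upper()
        if verdict.startswith("QUADRATIC"):
            flagged.append(name)
    return flagged
</code></pre></div></div>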

<p>Example output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>⚠️  Functions with O(n²) complexity: 5

🔴 QUADRATIC COMPLEXITY DETECTED IN:
  • bubble_sort
  • find_duplicates
  • remove_elements
  • string_concatenation
  • nested_comparison
</code></pre></div></div>

<p>Because it’s a CLI, it can be used in a few practical ways:</p>

<ul>
  <li>as a local complexity scanner during development</li>
  <li>as a lightweight pre-pass before calling a large cloud model (reducing token usage)</li>
  <li>as a GitHub Action on pull requests to catch patches that introduce quadratic behavior</li>
</ul>

<h2 id="why-not-just-use-existing-linters">Why Not Just Use Existing Linters?</h2>

<p>Before building anything, I tried the usual suspects.</p>

<p>Tools like <strong>Ruff</strong>, <strong>Pylint</strong>, and performance-focused plugins can catch a lot:</p>

<ul>
  <li>Pylint warns about string concatenation in loops (<code class="language-plaintext highlighter-rouge">consider-using-join</code>)</li>
  <li>Ruff has <code class="language-plaintext highlighter-rouge">PERF</code> rules inspired by Perflint</li>
</ul>

<p>But none of the linters I tried really caught the pattern that triggered this
whole experiment:</p>

<ul>
  <li>repeated <code class="language-plaintext highlighter-rouge">.index()</code> lookups inside loops</li>
  <li><code class="language-plaintext highlighter-rouge">.index()</code> inside sort key functions</li>
  <li>nested iteration patterns that only become problematic at scale</li>
</ul>

<p>These tools are excellent at enforcing specific rules, but they generally don’t
try to answer:</p>

<blockquote>
  <p>“Does this function scale quadratically with input size?”</p>
</blockquote>

<p>That gap is what made the LLM approach interesting to explore.</p>

<h3 id="a-quick-comparison">A Quick Comparison</h3>

<p>One thing I wanted to sanity-check early was whether existing linters would
catch the same issues.</p>

<p>So I built a small test file with a handful of intentionally quadratic
functions (nested loops, <code class="language-plaintext highlighter-rouge">.remove()</code> in loops, string concatenation, etc.) and
ran:</p>

<ul>
  <li>LoopSleuth</li>
  <li>Ruff (with <code class="language-plaintext highlighter-rouge">--select ALL</code>)</li>
  <li>Pylint</li>
</ul>

<p>The results were pretty stark:</p>

<table>
  <thead>
    <tr>
      <th>Tool</th>
      <th>Detects <code class="language-plaintext highlighter-rouge">.index()</code> in loop?</th>
      <th>Reports complexity?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Ruff</td>
      <td>❌</td>
      <td>❌</td>
    </tr>
    <tr>
      <td>Pylint</td>
      <td>❌</td>
      <td>❌</td>
    </tr>
    <tr>
      <td>LoopSleuth</td>
      <td>✅</td>
      <td>✅ (heuristic)</td>
    </tr>
  </tbody>
</table>

<p>LoopSleuth flagged all 5 quadratic functions. Ruff and Pylint flagged plenty
of style and quality issues, but neither directly reported algorithmic
complexity problems.</p>

<p>This isn’t really a criticism of those tools — they’re simply not designed for
that job.</p>

<p>I wrote up the full comparison here:</p>

<ul>
  <li><a href="https://github.com/tarekziade/loopsleuth/blob/main/docs/COMPARISON.md">LoopSleuth vs Linters Comparison</a></li>
</ul>

<p>To be clear, there may be ways to approximate some of these checks with custom
rules or plugins, and linters remain the first line of defense for code quality.</p>

<p>LoopSleuth is just exploring a different axis: scaling behavior.</p>

<h2 id="still-an-experiment">Still an Experiment</h2>

<p>LoopSleuth is not a replacement for linters.</p>

<p>It’s a small experiment.</p>

<p>Traditional linters like Ruff or Pylint excel at catching specific code smells.
But most scaling bugs don’t come from a single construct.
They come from composition:</p>

<ul>
  <li>nested iteration</li>
  <li>repeated membership checks</li>
  <li>linear operations inside loops</li>
</ul>

<p>Rule-based linters struggle to infer:</p>

<ul>
  <li>“this <code class="language-plaintext highlighter-rouge">.index()</code> is inside a hot path”</li>
  <li>“this loop is over the same input size”</li>
  <li>“this becomes O(n²) at scale”</li>
</ul>

<p>LLMs, even small ones, can often reason about these patterns more directly.</p>

<p>That said, LoopSleuth runs against isolated Python functions one by one, which
means it doesn’t yet understand:</p>

<ul>
  <li>cross-function context</li>
  <li>runtime sizes</li>
  <li>whether a loop is actually hot in practice</li>
</ul>

<h3 id="limitations">Limitations</h3>

<p>Like any heuristic tool, LoopSleuth has trade-offs:</p>

<p><strong>False positives:</strong></p>
<ul>
  <li>small fixed-size loops that never scale</li>
  <li>code in non-hot paths</li>
  <li>patterns that look quadratic but have early exits</li>
</ul>

<p><strong>False negatives:</strong></p>
<ul>
  <li>complexity hidden across function calls</li>
  <li>indirect iteration patterns</li>
  <li>subtle algorithm choices</li>
</ul>

<p>The accuracy depends heavily on prompt design and model choice.</p>

<p><strong>Important:</strong> LoopSleuth is a screening tool, not a replacement for profiling
or benchmarking. It flags patterns that may cause issues, but only real
measurements can confirm actual performance problems.</p>

<p>More broadly, I’m interested in whether this approach can extend beyond
complexity analysis to other classes of performance issues.</p>

<p>One direction would be to build a small library of prompts for:</p>

<ul>
  <li>repeated tensor conversions</li>
  <li>hidden CPU/GPU sync points</li>
  <li>accidental re-tokenization</li>
</ul>

<p>And in an ideal world, we could fine-tune a small model (like Qwen2.5-Coder 3B)
to specialize on this kind of performance reasoning.</p>

<h3 id="whats-next">What’s Next</h3>

<p>If this experiment proves useful, here are some directions worth exploring:</p>

<ul>
  <li><strong>AST-based prefiltering</strong> to skip obviously safe functions</li>
  <li><strong>Caching inference results</strong> to avoid re-analyzing unchanged code</li>
  <li><strong>Training on real perf bugs</strong> from issue trackers and PRs</li>
  <li><strong>GitHub Actions integration</strong> to catch regressions in CI</li>
</ul>

<p>Right now LoopSleuth is a proof of concept, but these extensions could make it
practical for real codebases.</p>
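
<p>As an illustration of the first item, even a crude prefilter helps: a function with no loops and no <code class="language-plaintext highlighter-rouge">.index()</code> calls is unlikely to be interesting, so it never needs to reach the model. A minimal sketch (not something LoopSleuth ships today):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import ast

def worth_sending_to_llm(func: ast.FunctionDef) -&gt; bool:
    """Cheap AST prefilter: only functions with loops or .index() calls
    are sent to the model; everything else is skipped."""
    nodes = list(ast.walk(func))
    has_loop = any(isinstance(n, (ast.For, ast.While)) for n in nodes)
    calls_index = any(
        isinstance(n, ast.Call)
        and isinstance(n.func, ast.Attribute)
        and n.func.attr == "index"
        for n in nodes
    )
    return has_loop or calls_index
</code></pre></div></div>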

<h2 id="conclusion">Conclusion</h2>

<p>LoopSleuth started as a simple question:</p>

<blockquote>
  <p>Could we catch quadratic complexity bugs automatically?</p>
</blockquote>

<p>The answer is: not perfectly.</p>

<p>But even a small local model can spot surprising amounts of hidden O(n²)
behavior.</p>

<p>And as codebases grow — especially ones like <code class="language-plaintext highlighter-rouge">transformers</code> — performance traps
scale with them.</p>

<p>LoopSleuth is a small experiment toward making complexity visible earlier.</p>

<p>If you’re interested, the project is here:</p>

<ul>
  <li><a href="https://github.com/tarekziade/loopsleuth">LoopSleuth on GitHub</a></li>
</ul>

<p>If you have examples of hidden scaling bugs or want to contribute detection
patterns, I’d love to collect them as test cases. Feel free to try it locally or
open an issue.</p>]]></content><author><name>Tarek Ziade</name></author><category term="llm" /><category term="performance" /><category term="transformers" /><category term="tooling" /><category term="rust" /><summary type="html"><![CDATA[Performance issues in Python often don’t look like bugs.]]></summary></entry><entry><title type="html">The Economics of AI Coding: A Real-World Analysis</title><link href="https://tarekziade.github.io/2026/01/14/the-economics-of-ai-coding/" rel="alternate" type="text/html" title="The Economics of AI Coding: A Real-World Analysis" /><published>2026-01-14T00:00:00+00:00</published><updated>2026-01-14T00:00:00+00:00</updated><id>https://tarekziade.github.io/2026/01/14/the-economics-of-ai-coding</id><content type="html" xml:base="https://tarekziade.github.io/2026/01/14/the-economics-of-ai-coding/"><![CDATA[<p>My whole stream in the past months has been about AI coding. From skeptical
engineers who say it creates unmaintainable code, to enthusiastic (or scared)
engineers who say it will replace us all, the discourse is polarized. But I’ve
been more interested in a different question: what does AI coding actually cost,
and what does it actually save?</p>

<p>I recently had Claude help me with a substantial refactoring task: splitting a
monolithic Rust project into multiple workspace repositories with proper
dependency management. The kind of task that’s tedious, error-prone, and
requires sustained attention to detail across hundreds of files. When it was
done, I asked Claude to analyze the session: how much it cost, how long it took,
and how long a human developer would have taken.</p>

<p>The answer surprised me. Not because AI was faster or cheaper (that’s expected),
but because of how much faster and cheaper.</p>

<h2 id="the-task-repository-split-and-workspace-setup">The Task: Repository Split and Workspace Setup</h2>

<p>The work involved:</p>
<ul>
  <li>Planning and researching the codebase structure</li>
  <li>Migrating code between three repositories</li>
  <li>Updating thousands of import statements</li>
  <li>Configuring Cargo workspaces and dependencies</li>
  <li>Writing Makefiles and build system configuration</li>
  <li>Setting up CI/CD workflows with GitHub Actions</li>
  <li>Updating five different documentation files</li>
  <li>Running and verifying 2300+ tests</li>
  <li>Creating branches and writing detailed commit messages</li>
</ul>

<p>This is real work. Not a toy problem, not a contrived benchmark. The kind of multi-day slog that every engineer has faced: important but tedious, requiring precision but not creativity.</p>

<h2 id="the-numbers">The Numbers</h2>

<h3 id="ai-execution-time">AI Execution Time</h3>
<p>Total: approximately 3.5 hours across two sessions</p>
<ul>
  <li>First session (2-3 hours): Initial setup, file operations, dependency configuration, build testing, CI/CD setup</li>
  <li>Second session (15-20 minutes): Documentation updates, branch creation, final commits, todo tracking</li>
</ul>

<h3 id="ai-cost">AI Cost</h3>
<p>Total tokens: 72,146 tokens</p>
<ul>
  <li>Input tokens: ~45,000 (context, file reads, system prompts)</li>
  <li>Output tokens: ~27,000 (tool calls, code generation, documentation)</li>
</ul>

<p>Estimated marginal cost: approximately $4.95</p>
<ul>
  <li>Input: ~$0.90 (at ~$3/M tokens for Sonnet 4.5)</li>
  <li>Output: ~$4.05 (at ~$15/M tokens for Sonnet 4.5)</li>
</ul>

<p>This is the marginal execution cost for this specific task. It doesn’t include my Claude subscription, the time I spent iterating on prompts and reviewing output, or the risk of having to revise or fix AI-generated changes. For a complete accounting, you’d also need to consider those factors, though for this task they were minimal.</p>

<h3 id="human-developer-time-estimate">Human Developer Time Estimate</h3>
<p>Conservative estimate: 2-3 days (16-24 hours)</p>

<p>This is my best guess based on experience with similar tasks, but it comes with uncertainty. A senior engineer deeply familiar with this specific codebase might work faster. Someone encountering similar patterns for the first time might work slower. Some tasks could be partially templated or parallelized across a team.</p>

<p>Breaking down the work:</p>
<ol>
  <li>Planning and research (2-4 hours): Understanding codebase structure, planning dependency strategy, reading PyO3/Maturin documentation</li>
  <li>Code migration (4-6 hours): Copying files, updating all import statements, fixing compilation errors, resolving workspace conflicts</li>
  <li>Build system setup (2-3 hours): Writing Makefile, configuring Cargo.toml, setting up pyproject.toml, testing builds</li>
  <li>CI/CD configuration (2-4 hours): Writing GitHub Actions workflows, testing syntax, debugging failures, setting up matrix builds</li>
  <li>Documentation updates (2-3 hours): Updating multiple documentation files, ensuring consistency, writing migration guides</li>
  <li>Testing and debugging (3-5 hours): Running test suites, fixing unexpected failures, verifying tests pass, testing on different platforms</li>
  <li>Git operations and cleanup (1-2 hours): Creating branches, writing commit messages, final verification</li>
</ol>

<p>Even if we’re generous and assume a very experienced developer could complete this in 8 hours of focused work, the time and cost advantages remain substantial. The economics don’t depend on the precise estimate.</p>

<h3 id="the-bottom-line">The Bottom Line</h3>
<ul>
  <li>AI: ~3.5 hours, ~$5 marginal cost</li>
  <li>Human: ~16-24 hours, ~$800-$2,400 (at $50-100/hr developer rate)</li>
  <li>Savings: approximately 85-90% time reduction, approximately 99% marginal cost reduction</li>
</ul>

<p>These numbers compare execution time and per-task marginal costs. They don’t capture everything (platform costs, review time, long-term maintenance implications), but they illustrate the scale of the difference for this type of systematic refactoring work.</p>

<h2 id="why-ai-was-faster">Why AI Was Faster</h2>

<p>The efficiency gains weren’t magic. They came from specific characteristics of how AI approaches systematic work:</p>

<p><strong>No context switching fatigue.</strong> Claude maintained focus across three repositories simultaneously without the cognitive load that would exhaust a human developer. No mental overhead from jumping between files, no “where was I?” moments after a break.</p>

<p><strong>Instant file operations.</strong> Reading and writing files happens without the delays of IDE loading, navigation, or search. What takes a human seconds per file took Claude milliseconds.</p>

<p><strong>Pattern matching without mistakes.</strong> Updating thousands of import statements consistently, without typos, without missing edge cases. No ctrl-H mistakes, no regex errors that you catch three files later.</p>

<p><strong>Parallel mental processing.</strong> Tracking multiple files at once without the working memory constraints that force humans to focus narrowly.</p>

<p><strong>Documentation without overhead.</strong> Generating comprehensive, well-structured documentation in one pass. No switching to a different mindset, no “I’ll document this later” debt.</p>

<p><strong>Error recovery.</strong> When workspace conflicts or dependency issues appeared, Claude fixed them immediately without the frustration spiral that can derail a human’s momentum.</p>

<p><strong>Commit message quality.</strong> Detailed, well-structured commit messages generated instantly. No wrestling with how to summarize six hours of work into three bullet points.</p>

<h2 id="what-took-longer">What Took Longer</h2>

<p>AI wasn’t universally faster. Two areas stood out:</p>

<p><strong>Initial codebase exploration.</strong> Claude spent time systematically understanding the structure before implementing. A human developer might have jumped in faster with assumptions (though possibly paying for it later with rework).</p>

<p><strong>User preference clarification.</strong> Some back-and-forth on git dependencies versus crates.io, version numbering conventions. A human working alone would just make these decisions implicitly based on their experience.</p>

<p>These delays were minimal compared to the overall time savings, but they’re worth noting. AI coding isn’t instantaneous magic. It’s a different kind of work with different bottlenecks.</p>

<h2 id="the-economics-of-coding">The Economics of Coding</h2>

<p>Let me restate those numbers because they still feel surreal:</p>
<ul>
  <li>85-90% time reduction</li>
  <li>99% marginal cost reduction</li>
</ul>

<p>For this type of task, these are order-of-magnitude improvements over solo human execution. And they weren’t achieved through cutting corners or sacrificing immediate quality. The tests passed, the documentation was comprehensive, the commits were well-structured, the code compiled cleanly.</p>

<p>That said, tests passing and documentation existing are necessary but not sufficient signals of quality. Long-term maintainability, latent bugs that only surface later, or future refactoring friction are harder to measure immediately. The code is working, but it’s too soon to know if there are subtle issues that will emerge over time.</p>

<p>This creates strange economics for a specific class of work: systematic, pattern-based refactoring with clear success criteria. For these tasks, the time and cost reductions change how we value engineering effort and prioritize maintenance work.</p>

<p>I used to avoid certain refactorings because the payoff didn’t justify the time investment. Clean up import statements across 50 files? Update documentation after a restructure? Write comprehensive commit messages? These felt like luxuries when there was always more pressing work.</p>

<p>But at $5 marginal cost and 3.5 hours for this type of systematic task, suddenly they’re not trade-offs anymore. They’re obvious wins. The economics shift from “is this worth doing?” to “why haven’t we done this yet?”</p>

<h2 id="what-this-doesnt-mean">What This Doesn’t Mean</h2>

<p>Before the “AI will replace developers” crowd gets too excited, let me be clear about what this data doesn’t show:</p>

<p>This was a perfect task for AI. Systematic, pattern-based, well-scoped, with clear success criteria. The kind of work where following existing patterns and executing consistently matters more than creative problem-solving or domain expertise.</p>

<p>AI did not:</p>
<ul>
  <li>Design the architecture (I did)</li>
  <li>Decide on the repository structure (I did)</li>
  <li>Choose the dependency strategy (we decided together)</li>
  <li>Understand the business context (I provided it)</li>
  <li>Know whether the tests passing meant the code was correct (I validated)</li>
</ul>

<p>The task was pure execution. Important execution, skilled execution, but execution nonetheless. A human developer would have brought the same capabilities to the table, just slower and at higher cost.</p>

<h2 id="where-this-goes">Where This Goes</h2>

<p>I keep thinking about that 85-90% time reduction for this specific type of task. Not simple one-liners where AI already shines, but systematic maintenance work with high regularity, strong compiler or test feedback, and clear end states.</p>

<p>Tasks with similar characteristics might include:</p>
<ul>
  <li>Updating deprecated APIs across a large codebase</li>
  <li>Migrating from one framework to another with clear patterns</li>
  <li>Standardizing code style and patterns</li>
  <li>Refactoring for testability where tests guide correctness</li>
  <li>Adding comprehensive logging and monitoring</li>
  <li>Writing and updating documentation</li>
  <li>Creating detailed migration guides</li>
</ul>

<p>Many maintenance tasks are messier: ambiguous semantics, partial test coverage, undocumented invariants, organizational constraints. The economics I observed here don’t generalize to all refactoring work. But for the subset that is systematic and well-scoped, the shift is significant.</p>

<p>All the work that we know we should do but often defer because it doesn’t feel like progress. What if the economics shifted enough for these specific tasks that deferring became the irrational choice?</p>

<p>I’m not suggesting AI replaces human judgment. Someone still needs to decide what “good” looks like, validate the results, understand the business context. But if the execution of systematic work becomes 10x cheaper and faster, maybe we stop treating certain categories of technical debt like unavoidable burdens and start treating them like things we can actually manage.</p>

<h2 id="the-real-cost">The Real Cost</h2>

<p>There’s one cost the analysis didn’t capture: my time. I wasn’t passive during those 3.5 hours. I was reading Claude’s updates, reviewing file changes, answering questions, validating decisions, checking test results.</p>

<p>I don’t know exactly how much time I spent, but it was less than the 3.5 hours Claude was working. Maybe 2 hours of active engagement? The rest was Claude working autonomously while I did other things.</p>

<p>So the real comparison isn’t 3.5 AI hours versus 16-24 human hours. It’s 2 hours of human guidance plus 3.5 hours of AI execution versus 16-24 hours of human solo work. Still a massive win, but different from pure automation.</p>

<p>This feels like the right model: AI as an extremely capable assistant that amplifies human direction rather than replacing human judgment. The economics work because you’re multiplying effectiveness, not substituting one for the other.</p>

<h2 id="final-thoughts">Final Thoughts</h2>

<p>Five dollars marginal cost. Three and a half hours. For systematic refactoring work that would have taken me days and cost hundreds or thousands of dollars in my time.</p>

<p>These numbers make me think differently about certain kinds of work. About how we prioritize technical debt in the systematic, pattern-based category. About what “too expensive to fix” really means for these specific tasks. About whether we’re approaching some software maintenance decisions with outdated economic assumptions.</p>

<p>I’m still suspicious of broad claims that AI fundamentally changes how we work. But I’m less suspicious than I was. When the economics shift this dramatically for a meaningful class of tasks, some things that felt like pragmatic trade-offs start to look different.</p>

<p>The tests pass. The documentation is up to date. And I paid less than the cost of a fancy coffee drink.</p>

<p>Maybe the skeptics and the enthusiasts are both right. Maybe AI doesn’t replace developers and maybe it does change some things meaningfully. Maybe it just makes certain kinds of systematic work cheap enough that we can finally afford to do them right.</p>

<h2 id="what-about-model-and-pricing-changes">What About Model and Pricing Changes?</h2>

<p>One caveat worth noting: these economics depend on Claude Sonnet 4.5 at January 2026 pricing. Model pricing can change, model performance can regress or improve with updates, tool availability can shift, and organizational data governance constraints might limit what models you can use or what tasks you can delegate to them.</p>

<p>For individuals and small teams, this might not matter much in the short term. For larger organizations making long-term planning decisions, these factors matter. The specific numbers here are a snapshot, not a guarantee.</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://claude.com/claude-code">Claude Code</a> - The AI coding assistant used for this project</li>
  <li><a href="https://github.com/rustnn/rustnn">rustnn project</a> - The repository that was split</li>
  <li>Token pricing based on <a href="https://www.anthropic.com/api">Claude API pricing</a> as of January 2026</li>
</ul>]]></content><author><name>Tarek Ziade</name></author><category term="ai" /><category term="claude" /><category term="productivity" /><category term="software-engineering" /><summary type="html"><![CDATA[My whole stream in the past months has been about AI coding. From skeptical engineers who say it creates unmaintainable code, to enthusiastic (or scared) engineers who say it will replace us all, the discourse is polarized. But I’ve been more interested in a different question: what does AI coding actually cost, and what does it actually save?]]></summary></entry><entry><title type="html">all the code are belong to claude*</title><link href="https://tarekziade.github.io/ai/2025/12/20/all-the-code-are-belong-to-claude/" rel="alternate" type="text/html" title="all the code are belong to claude*" /><published>2025-12-20T00:00:00+00:00</published><updated>2025-12-20T00:00:00+00:00</updated><id>https://tarekziade.github.io/ai/2025/12/20/all-the-code-are-belong-to-claude</id><content type="html" xml:base="https://tarekziade.github.io/ai/2025/12/20/all-the-code-are-belong-to-claude/"><![CDATA[<p>I have been writing code for a long time, long enough to be suspicious of tools
that claim to fundamentally change how I work. And yet, here we are.</p>

<p>The latest iterations of Claude Code are genuinely impressive. Not in a flashy
demo way, but in the quiet, dangerous way where you suddenly realize you have
delegated large parts of your thinking to it. This post is about that
experience, how Claude helped me build <code class="language-plaintext highlighter-rouge">rustnn</code>, what worked remarkably well,
and where I had to consciously pull myself back.</p>

<h2 id="claude-as-a-serious-coding-partner">Claude as a serious coding partner</h2>

<p>For rustnn, I leaned heavily on Claude Code. The quality of the generated Rust
was consistently high. Beyond producing correct syntax, it reasoned about what
the code was supposed to do. It was context-aware in a way that made iterative
design feel natural. I could ask for refactors, architectural changes, or
alternative approaches, and get answers that actually respected the existing
codebase and long-running tests.</p>

<p>This mirrors what many developers have been reporting toward the end of 2025.
Claude Code’s agent-oriented design and large-context reasoning make it
particularly strong for repository-wide work: multi-file refactors, non-trivial
debugging sessions, and architectural changes that need to fit an existing
mental model. Compared to Codex-style systems, which still shine for fast edits
and local completions, Claude tends to perform better when the task requires
sustained reasoning and understanding of project-wide constraints.</p>

<p>Anthropic’s recent Claude releases have reinforced that positioning.
Improvements in long-context handling, reasoning depth, and agentic workflows
make it easier to treat Claude as something closer to a collaborator than an
autocomplete engine.</p>

<p>The turning point for me was when I stopped treating Claude like a chat bot and
started treating it like a constrained agent.</p>

<p>That is where CLAUDE.md comes in.</p>

<h2 id="tuning-claudemd">Tuning CLAUDE.md</h2>

<p>I stumbled upon an excellent LangChain article on how to turn Claude Code into a
domain-specific coding agent.</p>

<p>It clicked immediately. Instead of repeatedly explaining the same constraints,
goals, and conventions, I encoded them once. Rust style rules. Project intent.
Explicit boundaries. How to react to test failures.</p>

<p>The effect was immediate. Output quality improved, and the amount of
back-and-forth dropped significantly. Claude stopped proposing things that were
clearly out of scope and started behaving like someone who had actually read and
understood the project.</p>

<p>For rustnn, I went one step further and anchored development around WPT
conformance tests. That gave both Claude and me a shared, objective target.
Tests either pass or they do not. No bikeshedding.</p>

<p>Tweaking CLAUDE.md quickly turned out to be a never-ending process. There are
plenty of articles describing different approaches, and none of them are
definitive. The current direction seems to be layering information across
multiple files, structuring project documentation so it is optimized for agent
consumption while remaining readable for humans, and doing so without
duplicating the same knowledge in multiple places.</p>

<p>That balance turns out to be just as important as the model itself.</p>

<h2 id="the-slippery-slope">The slippery slope</h2>

<p>There is a trap though, and it is a subtle one.</p>

<p>Once Claude is good enough, you start routing <em>everything</em> through it.</p>

<ul>
  <li>Re-running tests.</li>
  <li>Interpreting obvious build errors.</li>
  <li>Copying and pasting logs that you already understand.</li>
</ul>

<p>It feels efficient, but it is not free. Each interaction has a cost, and when
you are in a tight edit-build-test loop, those costs add up fast. Worse, you
start outsourcing mechanical thinking that you should probably still be doing
yourself.</p>

<p>I definitely fell into that trap.</p>

<h2 id="reducing-costs">Reducing costs</h2>

<p>The solution, for me, was to drastically reduce how much I talk to Claude, and
to stop using its prompt environment as a catch-all interface to the project.</p>

<p>Claude became an extra terminal. One I open for very specific tasks, then close.
It is not a substitute for my own brain, nor for the normal edit–build–test
loop.</p>

<p>Reducing what goes into the context window is also critical. A concrete
example is Python tracebacks. They are verbose, repetitive, and largely
machine-generated noise. Sending full tracebacks back to the model is almost
always wasteful.</p>

<p>That is why I added a hook to rewrite them on the fly into a compact form.</p>

<p>The idea is simple: keep the signal, drop the boilerplate. Same information, far
fewer tokens. In practice, this not only lowers costs, it often produces better
answers because the model is no longer drowning in irrelevant frames and runtime
noise. On Python-heavy codebases, this change alone reduced my usage costs by
roughly 20%.</p>
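
<p>To give a sense of what such a hook does, here is a minimal sketch of the idea
in plain Python. It is not the actual implementation, just the general recipe:
keep the project frames and the final error line, drop the library noise.</p>

<pre><code class="language-python"># Sketch of a traceback compactor (illustrative, not the real hook).
import re

NOISE = re.compile(r"site-packages|dist-packages")

def compact_traceback(tb_text: str) -&gt; str:
    """Keep project frames and the final error line, drop library frames."""
    lines = tb_text.splitlines()
    out = []
    i = 0
    while i &lt; len(lines):
        if lines[i].lstrip().startswith('File "'):
            # A frame header is normally followed by its source line.
            code = lines[i + 1].strip() if i + 1 &lt; len(lines) else ""
            if not NOISE.search(lines[i]):
                out.append(f"{lines[i].strip()} :: {code}")
            i += 2
        else:
            i += 1
    if lines:
        out.append(lines[-1].strip())  # exception type and message
    return "\n".join(out)
</code></pre>

<p>A real hook would also want to deduplicate repeated frames and cap very long
lines, but even a naive version like this strips most of the volume of a
typical traceback.</p>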

<p>Pre-compacting inputs turned out to be one of the most effective cost-control
strategies I have found so far, especially when combined with a more deliberate,
intentional way of interacting with the model.</p>

<h2 id="memory-across-sessions-actually-matters">Memory across sessions actually matters</h2>

<p>Another pain point is session amnesia. You carefully explain design decisions,
trade-offs, and long-term goals, only to repeat them again tomorrow.</p>

<p>A well-crafted <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> mitigates part of this problem. It works well for
static knowledge: coding style, project constraints, architectural boundaries,
and things that rarely change. It gives Claude a stable baseline and avoids a
lot of repetitive explanations.</p>

<p>But it does not capture evolving context.</p>

<p>It does not remember why a specific workaround exists, which approach you
rejected last week, or what subtle behavior a particular test exposed yesterday.
As soon as the session ends, that knowledge is gone, and you are back to
re-teaching the same mental model.</p>

<p>This is where cross-session, cross-project memory becomes interesting.</p>

<p>I am currently experimenting with <code class="language-plaintext highlighter-rouge">claude-mem</code>.</p>

<p>The idea is simple but powerful: maintain a centralized, persistent memory that
is automatically updated based on interactions. Instead of manually curating
context, relevant facts, decisions, and preferences are summarized and carried
forward. Over time, this builds a lightweight but durable understanding of how
<em>you</em> work and how your projects evolve.</p>

<p>Compared to <code class="language-plaintext highlighter-rouge">CLAUDE.md</code>, this kind of memory is dynamic rather than declarative.
It captures intent, not just rules. It also scales across projects, which
matters when you jump between repositories that share design philosophy,
tooling, or constraints.</p>

<p>It is still early, and it is not magic. You need to be careful about what gets
remembered and how summaries are formed. But the direction feels right.
Persistent memory reduces cognitive reset costs, shortens warm-up time, and
makes the interaction feel less like starting over and more like continuing a
conversation you paused yesterday.</p>

<p>That difference adds up.</p>

<h2 id="final-thoughts">Final thoughts</h2>

<p>Claude Code is good. Very good. Good enough that you need discipline to use it
well.</p>

<p>With a tuned <code class="language-plaintext highlighter-rouge">CLAUDE.md</code>, clear test-driven goals like WPT conformance, and some
tooling to reduce noise and cost, it becomes a powerful accelerator. Without
that discipline, it is easy to overuse it and slowly burn budget on things you
already know how to do.</p>

<p>I do not think this replaces engineering skill. If anything, it amplifies both
good and bad habits. The trick is to make sure it is amplifying the right ones.</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://github.com/tarekziade/claude-tools">My Claude tools</a></li>
  <li><a href="https://blog.langchain.com/how-to-turn-claude-code-into-a-domain-specific-coding-agent/">How to Turn Claude Code into a Domain-Specific Coding Agent</a></li>
  <li><a href="https://www.allaboutai.com/ai-agents/open-ai-codex-vs-github-copilot-vs-claude/">OpenAI Codex vs GitHub Copilot vs Claude</a></li>
  <li><a href="https://www.reuters.com/business/retail-consumer/anthropic-bolsters-ai-model-claudes-coding-agentic-abilities-with-opus-45-2025-11-24/">Anthropic bolsters AI model Claude’s coding and agentic abilities with Opus 4.5</a></li>
  <li><a href="https://github.com/thedotmack/claude-mem">claude-mem</a></li>
</ul>

<p>*The title is a deliberate reference to “All your base are belong to us.” The
grammar is broken on purpose. It is a joke, but also a reminder that when tools
like Claude get this good, it is easy to give them more control than you
intended</p>]]></content><author><name></name></author><category term="ai" /><category term="ai" /><category term="claude" /><category term="rust" /><category term="tooling" /><summary type="html"><![CDATA[I have been writing code for a long time, long enough to be suspicious of tools that claim to fundamentally change how I work. And yet, here we are.]]></summary></entry><entry><title type="html">Why Open Source Is Fundamental in AI (Essay)</title><link href="https://tarekziade.github.io/ai/open-source/web/2025/12/18/open-source-ai/" rel="alternate" type="text/html" title="Why Open Source Is Fundamental in AI (Essay)" /><published>2025-12-18T00:00:00+00:00</published><updated>2025-12-18T00:00:00+00:00</updated><id>https://tarekziade.github.io/ai/open-source/web/2025/12/18/open-source-ai</id><content type="html" xml:base="https://tarekziade.github.io/ai/open-source/web/2025/12/18/open-source-ai/"><![CDATA[<p>Artificial intelligence is becoming a foundational layer of modern software. It is
no longer confined to research labs, but embedded directly in everyday tools and
user experiences.</p>

<p>As AI moves closer to users, openness becomes a question of power. Who can inspect
these systems? Who can adapt them? And who ultimately controls how they evolve?</p>

<p>The web offers a useful reference point. Open source software and open standards
turned the World Wide Web into shared infrastructure rather than a proprietary
stack owned by a single company or government, even if many tried to enclose parts
of it. That openness was not accidental. It shaped who could participate, compete,
and be held accountable.</p>

<h2 id="what-open-source-ai-enables">What Open Source AI Enables</h2>

<p>Open source AI is often reduced to code availability. In practice, and as the Open
Source Initiative (OSI) emphasizes in its Open Source AI Definition, it is about
concrete freedoms.</p>

<p>An open source AI system can be used, studied, modified, and shared. Studying means
inspecting behavior, limits, and failure modes. Modifying means adapting models to
new domains, languages, or constraints. Sharing means deploying systems without
being locked to a single vendor or API.</p>

<p>These freedoms must apply not only to code, but also to models, weights, and the
tooling required to run them. Without that access, reuse is brittle and
understanding remains shallow.</p>

<p>Open source enables verification, reproducibility, and portability. It allows
systems to be audited, adapted, and redeployed independently. In a field defined
by cost, scale, and complexity, these are not luxuries. They are prerequisites for
agency.</p>

<p>Open access does not eliminate power imbalances. Compute, data, and expertise still
matter. But it preserves the possibility of independent action, which is often the
difference between participation and dependency.</p>

<h2 id="open-standards-and-shared-infrastructure">Open Standards and Shared Infrastructure</h2>

<p>Open source alone is not enough. Open standards define shared interfaces that allow
independently built systems to work together.</p>

<p>The web proved this model at global scale. By separating interfaces from
implementations, standards enabled competition without fragmentation. In AI,
standards around model formats, inference interfaces, evaluation, and data
documentation can lower switching costs and prevent ecosystems from hardening into
silos controlled by a few gatekeepers.</p>

<p>Without standards, “openness” risks collapsing into a collection of incompatible
artifacts, each tied to its own platform or service.</p>

<h2 id="looking-ahead">Looking Ahead</h2>

<p>Some infrastructure works best when treated as a common good. The web’s resilience
came from the fact that no single actor owned its foundations.</p>

<p>AI is on track to become similar infrastructure. The question is not whether it
will be powerful, but whether it will be governable.</p>

<p>If core models, datasets, and interfaces are only accessible through proprietary
APIs and cloud platforms, then “AI adoption” will mostly mean dependency. Choice
will be limited to pricing tiers, usage caps, and terms of service.</p>

<p>Not everything should be owned and monetized by a small number of companies.
Projects like Mozilla’s Common Voice show that shared assets can be built and
maintained in the open, at meaningful scale.</p>

<p>Shared infrastructure also depends on shared spaces. Platforms like Hugging Face
play a critical role by enabling collaboration around models, datasets, and tools,
and by lowering the barrier to participation in open AI ecosystems.</p>

<p>Open source and open standards are not about nostalgia or ideology. They are about
keeping the option to walk away. To inspect. To fork. To rebuild.</p>

<p>Once that option is gone, it is rarely recovered.</p>

<h2 id="references">References</h2>

<ul>
  <li>Open Source Initiative, <a href="https://opensource.org/ai/open-source-ai-definition">Open Source AI Definition</a></li>
  <li><a href="https://www.w3.org/groups/wg/ml">W3C Machine Learning Working Group</a></li>
  <li><a href="https://commonvoice.mozilla.org">Mozilla Common Voice</a></li>
  <li><a href="https://huggingface.co">Hugging Face</a></li>
</ul>]]></content><author><name></name></author><category term="ai" /><category term="open-source" /><category term="web" /><summary type="html"><![CDATA[Artificial intelligence is becoming a foundational layer of modern software. It is no longer confined to research labs, but embedded directly in everyday tools and user experiences.]]></summary></entry><entry><title type="html">rustnn - a Python and Rust Implementation of W3C WebNN aimed at Firefox</title><link href="https://tarekziade.github.io/2025/12/17/building-rustnn-webnn-implementation-rust/" rel="alternate" type="text/html" title="rustnn - a Python and Rust Implementation of W3C WebNN aimed at Firefox" /><published>2025-12-17T00:00:00+00:00</published><updated>2025-12-17T00:00:00+00:00</updated><id>https://tarekziade.github.io/2025/12/17/building-rustnn-webnn-implementation-rust</id><content type="html" xml:base="https://tarekziade.github.io/2025/12/17/building-rustnn-webnn-implementation-rust/"><![CDATA[<p>Over the past few weeks, I’ve been working on rustnn, a Rust implementation of
the W3C WebNN specification.</p>

<p>What started as an experiment to gain a deeper understanding of WebNN quickly
grew into something more substantial: a working implementation that is now very
close to being a usable library.</p>

<p>I began this project after returning from TPAC, convinced that WebNN is the
future of AI in the browser, and that Firefox needs to catch up with the work
that has already been done in Chromium.</p>

<p>We are likely still months away from matching the level of maturity Chromium has
achieved over several years of development. However, in just a few weeks I was
able to make significant progress thanks to a few key factors:</p>

<ul>
  <li>The WebNN specification is clear and well written</li>
  <li>The WPT conformance and validation tests are comprehensive</li>
  <li>End-to-end JavaScript demos exercise WebNN in realistic scenarios</li>
  <li>Chromium’s implementation has already surfaced many of the hard problems</li>
</ul>

<h1 id="claude-code">Claude Code</h1>

<p>All of these factors made it surprisingly easy to build the library quickly
using Claude Code. Once I had enumerated the 95 operators that needed to be
implemented, the workflow for each one was essentially the same:</p>

<ul>
  <li>use the specification to understand the operator</li>
  <li>grab the relevant WPT tests</li>
  <li>implement the operator in the CoreML and ONNX converters</li>
  <li>validate it against the ONNX and CoreML executors</li>
  <li>move on to the next operator</li>
</ul>

<p>Claude consistently performed well. I was able to build a library that would
normally have taken me months to write on my own. When something failed,
narrowing down the problem was straightforward by iterating between the
specification and the tests.</p>

<p>Because most of the work revolves around graph conversion and orchestrating
existing inference libraries, the code generated by Claude is generally clean
and easy to reason about.</p>

<p>The Chromium implementation was also a huge help when I started to get into
weird corner cases, especially around CoreML. That codebase has been developed
over the years by people directly involved in the spec.</p>

<p>I have started adding performance tests, and there will likely be some manual
follow-up work, but reaching a functional implementation so quickly is
already a major milestone.</p>

<h1 id="why-rust">Why Rust?</h1>

<p>These days, adding a new API to Firefox usually means creating a Rust library
that is vendored into the tree and bound to Gecko using cbindgen, unless there
is an existing C++ library that already fits the need.</p>

<p>This gradual “oxidation” of Firefox started years ago, and major features such
as WebGPU have followed this model. Gecko is still a large C++ codebase, and
integrating a Rust library is not trivial, but implementing something like WebNN
outside the browser engine has a major advantage: it allows a much broader
community to contribute. We are already seeing the benefits of this approach
with wgpu.</p>

<p>I am not going to rehash the Rust vs. C++ debate. There is no shortage of
material on why Rust has become an attractive choice for systems programming.</p>

<p>My first instinct was to see whether Chromium’s WebNN implementation could be
reused. In practice, that turned out to be impractical. The code is deeply
intertwined with Blink and its IPC layers, making it very difficult to extract
reusable components in a clean way.</p>

<p>We also evaluated webnn-native, a C++ implementation developed within the Web
Machine Learning community. While promising, the project had been effectively
stalled for about two years and lacked support for the most recent inference
backends. Extending it was an option, but it quickly became clear that a fresh
Rust implementation would be both faster to iterate on and a better
architectural fit for Gecko.</p>

<p>In the end, this is good news for the Web and for WebNN. An independent
implementation helps validate the specification, exposes ambiguities earlier,
and ultimately makes the standard stronger.</p>

<p>Finally, building the core in Rust makes it trivial to expose a Python API on
top of it, which opens the door to experimentation and adoption by the broader
ML community.</p>

<h1 id="the-architecture">The architecture</h1>

<p>rustnn follows a key principle: <strong>graph compilation creates a
platform-independent representation; backend conversion happens at execution
time</strong>.</p>

<p>This follows the same logic as Chromium and is a great way to make sure
we can add more backends in the future.</p>

<pre><code class="language-mermaid">flowchart TD
  RustNN["RustNN"]

  WebNNGraph["WebNN Graph"]
  Executors["Executors"]

  Converter["Converter"]

  ONNXRuntime["onnx runtime"]
  CoreMLExec["CoreML"]
  TensorRT["TensorRT"]

  ONNXGraph["ONNX Graph"]
  CoreMLGraph["CoreML Graph"]

  RustNN --&gt; WebNNGraph
  RustNN --&gt; Executors

  WebNNGraph --&gt; Converter
  Converter --&gt; ONNXGraph
  Converter --&gt; CoreMLGraph

  Executors --&gt; ONNXRuntime
  Executors --&gt; CoreMLExec
  Executors --&gt; TensorRT

</code></pre>

<p>In the library, we do an initial pass on the WebNN graph to produce an intermediate
representation, then we pick a <em>converter</em> to turn that graph into another graph 
that can run with an AI library. And there are <em>executors</em> that can run the graph 
using those external libraries.</p>

<p>This is a very powerful design. For instance, I am playing with the TensorRT-RTX 
library that can be used to efficiently run AI on NVIDIA GPUs and that library 
has full support for ONNX graphs. This means we can run networks in rustnn using
the ONNX converter combined with the TensorRT executor.</p>
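
<p>To make that separation concrete, here is a hypothetical Python sketch of the
design. The names are illustrative and do not match the actual rustnn types; the
point is only the shape of the contract between converters and executors.</p>

<pre><code class="language-python"># Hypothetical sketch of the converter/executor split (not the rustnn API).
from typing import Protocol


class Converter(Protocol):
    def convert(self, webnn_graph: dict) -&gt; bytes:
        """Turn the platform-independent WebNN graph into a backend format."""
        ...


class Executor(Protocol):
    def run(self, compiled_graph: bytes, inputs: dict) -&gt; dict:
        """Run a backend-specific graph and return named output tensors."""
        ...


def compute(graph: dict, converter: Converter, executor: Executor, inputs: dict) -&gt; dict:
    # Conversion happens at execution time, so one converter output can be
    # paired with any executor that understands it, e.g. the ONNX converter
    # with either the onnxruntime executor or the TensorRT executor.
    return executor.run(converter.convert(graph), inputs)
</code></pre>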

<h1 id="coreml-onnx-and-tensorrt">CoreML, ONNX and TensorRT</h1>

<p>I picked CoreML and ONNX as my first target runtimes because I work on a MacBook,
and because they are both implemented in Chromium.</p>

<p>Chromium uses ONNX on Windows because that library now ships with the latest
Windows 11, and it falls back to DirectML. It also has a CoreML implementation
on macOS.</p>

<p>So I went ahead and built both CoreML and ONNX as converters and executors until
I could make the image classification demo work with the Python binding.</p>

<p>Next, I started to add TensorRT as an executor for Windows with NVIDIA 
GPUs. That one is a work in progress because I have to work on another 
Windows computer and I am slower in that environment. But it’s technically 
already working. I started the <code class="language-plaintext highlighter-rouge">trtx-rs</code> Rust library to bind TensorRT, since the
existing Rust binding was 5 years old.</p>

<h1 id="pywebnn">PyWebNN</h1>

<p>rustnn exposes a Python binding (PyWebNN) that implements the W3C WebNN API on
top of the Rust core. You can use it for graph validation, conversion
(ONNX/CoreML) and execution of neural networks.</p>

<p>Installation:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Install from PyPI with bundled ONNX Runtime (v0.4.0+)</span>
pip <span class="nb">install </span>pywebnn

<span class="c"># Or build from source with all backends (ONNX + CoreML)</span>
git clone https://github.com/tarekziade/rustnn.git
<span class="nb">cd </span>rustnn
make python-dev
<span class="nb">source</span> .venv-webnn/bin/activate
</code></pre></div></div>

<p>Version 0.4.0+ includes bundled ONNX Runtime for immediate execution support.
No additional dependencies needed!</p>

<p>This is a very small example adapted from examples/python_matmul.py in the repository.</p>

<p>It shows the minimal flow:</p>

<ol>
  <li>create an ML instance and context</li>
  <li>create a graph builder</li>
  <li>define two constant tensors</li>
  <li>build a matmul node</li>
  <li>compile the graph</li>
  <li>run it</li>
</ol>

<p>Note: Use <code class="language-plaintext highlighter-rouge">accelerated=False</code> for CPU-only execution, or <code class="language-plaintext highlighter-rouge">accelerated=True</code> with <code class="language-plaintext highlighter-rouge">power_preference="high-performance"</code> for GPU acceleration.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># PyWebNN — tiny matmul example
</span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">webnn</span>

<span class="c1"># 1) create ML instance and context (CPU execution here)
</span><span class="n">ml</span> <span class="o">=</span> <span class="n">webnn</span><span class="p">.</span><span class="n">ML</span><span class="p">()</span>
<span class="n">ctx</span> <span class="o">=</span> <span class="n">ml</span><span class="p">.</span><span class="n">create_context</span><span class="p">(</span><span class="n">power_preference</span><span class="o">=</span><span class="s">"default"</span><span class="p">,</span> <span class="n">accelerated</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

<span class="c1"># 2) build a simple graph: Y = A @ B
</span><span class="n">builder</span> <span class="o">=</span> <span class="n">ctx</span><span class="p">.</span><span class="n">create_graph_builder</span><span class="p">()</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">],</span> <span class="p">[</span><span class="mf">3.</span><span class="p">,</span> <span class="mf">4.</span><span class="p">]],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">B</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mf">5.</span><span class="p">,</span> <span class="mf">6.</span><span class="p">],</span> <span class="p">[</span><span class="mf">7.</span><span class="p">,</span> <span class="mf">8.</span><span class="p">]],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>

<span class="n">a</span> <span class="o">=</span> <span class="n">builder</span><span class="p">.</span><span class="n">constant</span><span class="p">(</span><span class="n">A</span><span class="p">)</span>    <span class="c1"># constant input A
</span><span class="n">b</span> <span class="o">=</span> <span class="n">builder</span><span class="p">.</span><span class="n">constant</span><span class="p">(</span><span class="n">B</span><span class="p">)</span>    <span class="c1"># constant input B
</span><span class="n">y</span> <span class="o">=</span> <span class="n">builder</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>   <span class="c1"># matmul node
</span>
<span class="n">graph</span> <span class="o">=</span> <span class="n">builder</span><span class="p">.</span><span class="n">build</span><span class="p">({</span><span class="s">"output"</span><span class="p">:</span> <span class="n">y</span><span class="p">})</span>  <span class="c1"># compile the graph
</span>
<span class="c1"># 3) run the graph and print result
</span><span class="n">result</span> <span class="o">=</span> <span class="n">ctx</span><span class="p">.</span><span class="n">compute</span><span class="p">(</span><span class="n">graph</span><span class="p">,</span> <span class="p">{})</span>  <span class="c1"># returns dict of outputs
</span><span class="k">print</span><span class="p">(</span><span class="s">"Y ="</span><span class="p">,</span> <span class="n">result</span><span class="p">[</span><span class="s">"output"</span><span class="p">])</span>
</code></pre></div></div>

<p>What happens:</p>

<ul>
  <li>ML() creates the entry point following the W3C WebNN spec</li>
  <li>create_context() creates a runtime context (choose CPU/GPU/NPU where supported)</li>
  <li>create_graph_builder() constructs the WebNN graph using familiar ops (constant, matmul, etc.)</li>
  <li>build() compiles the graph with named outputs (dict format)</li>
  <li>compute() runs it and returns the outputs as a dict</li>
</ul>

<h1 id="firefox">Firefox</h1>

<p>Paul Adenot is currently extending the Firefox AI Runtime platform with a
new specialized process that can access GPUs for the WebSpeech implementation,
and the WebNN API will use that process once it lands in the browser.</p>

<p>In the meantime I have built a patch that adds the WebNN JS API in Firefox
and executes it directly in the content process, which is a big security hole.</p>

<p>But it was a good way to start figuring out all the pieces, in particular how
to bind the Rust library into the C++ layer using <code class="language-plaintext highlighter-rouge">cbindgen</code>, and how to create
the WebIDL interface to provide the JS API.</p>

<p>The current series of patches is just a proof of concept, but I already have
a fully functional demo of all basic operators, and a clone of the WebNN JS
MobileNetV2 image classifier demo — see the video.</p>

<video controls="" style="max-width: 100%; height: auto;" poster="/assets/panda.png" title="Classifying a panda">
  <source src="https://cotedorclassicjuniors.fr/webnn-demo.mov" type="video/quicktime" />
  Your browser does not support the video tag.
  <a href="https://cotedorclassicjuniors.fr/webnn-demo.mov">Download the video</a> instead.
</video>

<p>The WebNN implementation spans six distinct layers:</p>

<ol>
  <li><strong>JavaScript API Layer</strong> — Web-facing API (<code class="language-plaintext highlighter-rouge">navigator.ml</code>, <code class="language-plaintext highlighter-rouge">MLContext</code>, <code class="language-plaintext highlighter-rouge">MLGraphBuilder</code>, <code class="language-plaintext highlighter-rouge">MLGraph</code>, <code class="language-plaintext highlighter-rouge">MLOperand</code>, <code class="language-plaintext highlighter-rouge">MLTensor</code>)</li>
  <li><strong>WebIDL Layer</strong> — Interface definition language defining the JavaScript API surface</li>
  <li><strong>C++ DOM Implementation</strong> — Core implementation in <code class="language-plaintext highlighter-rouge">dom/webnn/</code></li>
  <li><strong>Rust FFI Bridge</strong> — Foreign Function Interface in <code class="language-plaintext highlighter-rouge">dom/webnn/rustnn_bridge/</code></li>
  <li><strong>rustnn Library</strong> — Rust implementation in <code class="language-plaintext highlighter-rouge">third_party/rust/rustnn/</code></li>
  <li><strong>Backend</strong> — Platform-specific backend (ONNX Runtime, CoreML, etc.) for neural network execution with hardware acceleration</li>
</ol>

<p>… and has the following flow:</p>

<p>Graph Building Phase:</p>

<ul>
  <li>Web content calls navigator.ml.createContext()</li>
  <li>C++ creates backend context via Rust FFI (ONNX Runtime or CoreML depending on platform)</li>
  <li>Web content creates MLGraphBuilder and defines operations</li>
  <li>Each operation creates an MLOperand representing the result</li>
  <li>Web content calls builder.build() with output operands</li>
  <li>C++ serializes operations to JSON and calls Rust FFI</li>
  <li>Rustnn converts the graph to backend-specific format (ONNX or CoreML)</li>
  <li>Backend creates an optimized execution session</li>
  <li>Graph ID is returned to web content as MLGraph</li>
</ul>

<pre><code class="language-mermaid">sequenceDiagram
  participant JS as "JavaScript"
  participant CPP as "C++ (MLGraphBuilder)"
  participant FFI as "Rust FFI Bridge"
  participant RUST as "Rustnn Library"

  JS-&gt;&gt;CPP: createContext()
  CPP-&gt;&gt;FFI: rustnn_context_create()
  FFI-&gt;&gt;RUST: Context::new()
  RUST--&gt;&gt;FFI: Context handle
  FFI--&gt;&gt;CPP: context_id
  CPP--&gt;&gt;JS: MLContext

  JS-&gt;&gt;CPP: new MLGraphBuilder(context)
  CPP--&gt;&gt;JS: MLGraphBuilder

  JS-&gt;&gt;CPP: input("x", shape, dataType)
  CPP--&gt;&gt;JS: MLOperand

  JS-&gt;&gt;CPP: add(a, b)
  CPP--&gt;&gt;JS: MLOperand

  JS-&gt;&gt;CPP: build({ output: operand })
  CPP-&gt;&gt;FFI: rustnn_graph_build(ops_json)
  FFI-&gt;&gt;RUST: GraphBuilder::build()

  Note right of RUST: Convert to backend format
  Note right of RUST: Create backend session

  RUST--&gt;&gt;FFI: Graph handle
  FFI--&gt;&gt;CPP: graph_id
  CPP--&gt;&gt;JS: MLGraph
</code></pre>

<p>Inference Phase:</p>

<ul>
  <li>Web content calls context.compute(graph, inputs, outputs)</li>
  <li>C++ marshals input data and calls Rust FFI with graph ID</li>
  <li>Rustnn retrieves the backend session and prepares input tensors</li>
  <li>Backend (ONNX Runtime or CoreML) executes the computational graph</li>
  <li>Hardware acceleration is automatically utilized when available</li>
  <li>Output tensors are returned through Rust FFI</li>
  <li>C++ copies output data to JavaScript-provided buffers</li>
  <li>Promise resolves, indicating inference completion</li>
</ul>

<pre><code class="language-mermaid">sequenceDiagram
  participant JS as "JavaScript"
  participant CPP as "C++ (MLContext)"
  participant FFI as "Rust FFI Bridge"
  participant RUST as "Rustnn Library"
  participant BE as "Backend (ONNX/CoreML)"

  JS-&gt;&gt;CPP: compute(graph, inputs, outputs)
  CPP-&gt;&gt;FFI: rustnn_graph_compute(graph_id, inputs, outputs)
  FFI-&gt;&gt;RUST: Graph::compute()
  RUST-&gt;&gt;BE: session.run()

  Note right of BE: Execute operations

  BE--&gt;&gt;RUST: Output tensors
  RUST--&gt;&gt;FFI: Results
  FFI--&gt;&gt;CPP: Output data
  CPP--&gt;&gt;JS: Promise resolves

</code></pre>

<p>Again, this is not the final design since we need to run inference in a
separate process and have an IPC layer between the C++ code and the Rust bridge.</p>

<h1 id="conclusion">Conclusion</h1>

<p>rustnn started as a way for me to really understand WebNN, but it quickly turned
into a convincing proof that the specification is solid, implementable, and
ready to grow beyond a single browser engine. Having an independent
implementation is healthy for the Web, and rustnn shows that WebNN can be built
as a reusable, backend-agnostic library rather than something deeply tied to a
single browser architecture.</p>

<p>This project is also my first substantial experience with Claude Code, and it
fundamentally changed the pace at which I could work. Implementing nearly a
hundred operators, wiring multiple backends, and validating everything against
WPT would normally be a multi-month effort. With a strong spec, good tests, and
a capable AI agent, it became an iterative and surprisingly enjoyable process.
The result is not throwaway code, but a clean foundation that can be extended,
optimized, and reviewed by others.</p>

<p>I am very optimistic about WebNN’s future in Firefox and on the Web in general.
With rustnn and pywebnn, my hope is to make it easier for browser engineers, ML
practitioners, and researchers to experiment, contribute, and push the ecosystem
forward. There is still a lot to do, especially around performance, security,
and process isolation, but the path forward is now much clearer.</p>

<h1 id="resources">Resources</h1>

<ul>
  <li><a href="https://github.com/tarekziade/rustnn">GitHub Repository</a></li>
  <li><a href="https://pypi.org/project/pywebnn/">PyPI Package</a></li>
  <li><a href="https://www.w3.org/TR/webnn/">W3C WebNN Specification</a></li>
  <li><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=2005145">Firefox Integration Bug</a></li>
  <li><a href="https://tarekziade.github.io/rustnn/">rustnn Documentation</a></li>
  <li><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=2005145">Firefox POC patch</a></li>
  <li><a href="https://github.com/tarekziade/trtx-rs">TensorRT Rust lib</a></li>
  <li><a href="https://cotedorclassicjuniors.fr/webnn-demo.mov">Firefox Video Demo</a></li>
  <li><a href="https://github.com/webmachinelearning/webnn-native">Legacy WebNN native</a></li>
</ul>]]></content><author><name>Tarek Ziade</name></author><category term="WebNN" /><category term="Rust" /><category term="Firefox" /><category term="AI" /><category term="Machine Learning" /><summary type="html"><![CDATA[Over the past few weeks, I’ve been working on rustnn, a Rust implementation of the W3C WebNN specification.]]></summary></entry><entry><title type="html">All I Want for Christmas is a Better Alt Text – Part 2</title><link href="https://tarekziade.github.io/2025/12/16/better-alt-text-part-2/" rel="alternate" type="text/html" title="All I Want for Christmas is a Better Alt Text – Part 2" /><published>2025-12-16T00:00:00+00:00</published><updated>2025-12-16T00:00:00+00:00</updated><id>https://tarekziade.github.io/2025/12/16/better-alt-text-part-2</id><content type="html" xml:base="https://tarekziade.github.io/2025/12/16/better-alt-text-part-2/"><![CDATA[<p>In Part 1, I explained why high-quality alt text matters, how modern
vision–language models can help, and why balanced, carefully curated datasets
are essential for training.</p>

<p>In this second part, I focus on architecture. I explain why I decided to move
away from my initial design, what I learned from that first implementation, and
why I ultimately settled on a prefix-conditioning + LoRA approach.</p>

<p>This choice is driven by practical constraints. For alt-text generation, the
goal is not exhaustive visual understanding, but a short, reliable sentence that
conveys the essence of an image to visually impaired users. Within that scope,
prefix conditioning offers a much simpler model that is easier to train, easier
to deploy, and better aligned with accessibility requirements.</p>

<p>More broadly, the PDF.js alt-text project aims to explore how far we can push
small, efficient vision–language models for accessibility use cases. Rather than
optimizing for peak benchmark scores, the focus is on reliability, fast
iteration cycles, limited compute, and deployable models.</p>

<p>DistilViT is intentionally constrained. Smaller models, fewer trainable
parameters, and simpler architectures make it possible to experiment rapidly,
control bias more carefully through dataset curation, and realistically target
on-device or near-device inference scenarios.</p>

<h1 id="what-i-started-with-a-classic-encoderdecoder-model">What I started with: a classic encoder–decoder model</h1>

<p>My first implementation relied on Hugging Face’s <code class="language-plaintext highlighter-rouge">VisionEncoderDecoderModel</code>. Concretely, it paired:</p>

<ul>
  <li>a ViT-based vision encoder, and</li>
  <li>a GPT-2–style decoder (<code class="language-plaintext highlighter-rouge">distilgpt2</code>),</li>
</ul>

<p>trained end-to-end using <code class="language-plaintext highlighter-rouge">Seq2SeqTrainer</code>.</p>

<p>Conceptually, the architecture looked like this:</p>

<pre><code class="language-mermaid">flowchart TD
    A[Image] --&gt; B["Vision Encoder (ViT)"]
    B --&gt; C[Encoder hidden states]
    C --&gt;|Cross-attention| D["Decoder (GPT-2)"]
    D --&gt; E[Caption]
</code></pre>

<p>This worked. GPT-2 generated captions, and the system was usable. I was inspired by
<a href="https://ankur3107.github.io/blogs/the-illustrated-image-captioning-using-transformers">The Illustrated Image Captioning Using Transformers</a>,
followed that recipe, and reduced the decoder size by using a distilled version of GPT-2.</p>

<p>What I did not fully appreciate at the time was what choosing GPT-2 implied under the hood.</p>

<p>Unlike T5 or BART, GPT-2 is a <strong>decoder-only</strong> language model. In its original architecture, it does not support cross-attention or encoder hidden states.</p>

<p>So why did this setup work?</p>

<p>Because <code class="language-plaintext highlighter-rouge">VisionEncoderDecoderModel.from_encoder_decoder_pretrained()</code> wraps GPT-2 and injects <strong>cross-attention layers</strong>. This effectively converts GPT-2 into a seq2seq-style decoder by adding encoder–decoder attention blocks and routing the vision encoder outputs through them.</p>
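
<p>For reference, this is roughly what the original construction looks like with
Hugging Face (a minimal sketch; the exact ViT checkpoint may differ from the one
used in DistilViT):</p>

<pre><code class="language-python"># Sketch of the original pairing: ViT encoder + distilgpt2 decoder.
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",   # ViT vision encoder (checkpoint may differ)
    "distilgpt2",                          # decoder-only GPT-2
)

# The wrapper flips GPT-2 into decoder mode and injects cross-attention
# blocks; those blocks start from random weights and must be trained.
print(model.decoder.config.is_decoder)           # True
print(model.decoder.config.add_cross_attention)  # True
</code></pre>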

<p>That distinction matters. These cross-attention layers are initialized from scratch, require substantial training signal, and introduce additional state to manage at inference time. Exporting the model and handling caching also become more complex.</p>

<p>This approach is valid, but it turned out to be architecturally heavier than expected for the scale and goals of this project. Training was slower, GPU memory usage was higher, and deployment friction increased.</p>

<p>Models like T5 or BART avoid this injection step because they already contain pretrained cross-attention blocks. However, those blocks were trained to attend to text encoder states and still require fine-tuning to adapt properly to vision features.</p>

<p>At that point, I started looking for an alternative and came across <strong>prefix conditioning</strong>.</p>

<h1 id="cross-attention-vs-prefix-conditioning">Cross-attention vs prefix conditioning</h1>

<p>It is worth stepping back and comparing these two approaches without framing one
as universally superior.</p>

<p><strong>Cross-attention</strong> gives the decoder continuous access to visual features at
every generation step. This is extremely powerful for tasks that require
fine-grained spatial grounding, OCR, counting, or reasoning over multiple
regions in an image.</p>

<p><strong>Prefix conditioning</strong>, by contrast, injects visual information once, as a
sequence of projected vision tokens prepended to the text embeddings. After
that, the model relies entirely on standard self-attention.</p>

<p>This leads to clear trade-offs:</p>

<ul>
  <li>Cross-attention provides stronger and more precise grounding.</li>
  <li>Prefix conditioning trades some of that precision for architectural simplicity.</li>
</ul>

<p>For my use case, this trade-off is appropriate. The goal of alt text here is not to enumerate details or perform spatial reasoning, but to produce a <strong>single, concise sentence</strong> that conveys the overall content of an image to visually impaired users. Captions are short, factual, and descriptive, and they primarily require global visual context rather than continuous visual querying.</p>

<p>Under these conditions, prefix conditioning is often sufficient, while being far easier to train, debug, and deploy than a full encoder–decoder setup.</p>

<h1 id="prefix-conditioning-with-lora">Prefix conditioning with LoRA</h1>

<p>The architecture I use now looks like this:</p>

<pre><code class="language-mermaid">flowchart TD
    A[Image] --&gt; B["SigLIP Vision Encoder (frozen)"]
    B --&gt; C["Projection Head (Linear / MLP)"]
    C --&gt; D[Vision embeddings as prefix tokens]
    D --&gt; E[Decoder-only LM with LoRA]
    E --&gt; F["Caption (25–30 tokens)"]
</code></pre>

<p>Instead of asking the decoder to attend to an encoder, I inject the visual information directly into the decoder’s input space as prefix tokens.</p>

<ul>
  <li>No cross-attention</li>
  <li>No encoder–decoder coupling</li>
  <li>Just conditioning</li>
</ul>

<p>The language model only needs standard causal self-attention. Any decoder-only LLM works out of the box, without architectural changes or special forward signatures.</p>

<p>This restores flexibility. I can swap language models freely without touching the vision side.</p>
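
<p>A minimal sketch of the forward pass makes this concrete. The checkpoints are
the ones linked at the end of this post; the dimensions and the plain linear
projection are illustrative rather than the exact DistilViT2 configuration.</p>

<pre><code class="language-python"># Prefix conditioning sketch: frozen SigLIP patch features are projected
# into the LM embedding space and prepended to the text embeddings.
import torch
from transformers import AutoModelForCausalLM, SiglipVisionModel

vision = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224")
lm = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")

for p in vision.parameters():
    p.requires_grad = False   # vision encoder stays frozen
for p in lm.parameters():
    p.requires_grad = False   # base LM stays frozen (LoRA adapters come later)

# The only fully trained module: vision hidden size -&gt; LM hidden size.
proj = torch.nn.Linear(vision.config.hidden_size, lm.config.hidden_size)

def forward(pixel_values, input_ids):
    feats = vision(pixel_values=pixel_values).last_hidden_state   # (B, patches, v_dim)
    prefix = proj(feats)                                          # (B, patches, lm_dim)
    text = lm.get_input_embeddings()(input_ids)                   # (B, seq, lm_dim)
    inputs_embeds = torch.cat([prefix, text], dim=1)              # prefix tokens first
    return lm(inputs_embeds=inputs_embeds).logits
</code></pre>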

<p>I apply <strong>LoRA adapters</strong> to the language model’s attention projection matrices.</p>

<ul>
  <li>The base language model remains frozen</li>
  <li>The vision encoder remains frozen</li>
  <li>Only the projection head and LoRA adapters are trained</li>
</ul>

<p>In practice, this means:</p>

<ul>
  <li>~221M total parameters</li>
  <li>~2.2M trainable parameters</li>
  <li>Roughly 1 percent of the model updated</li>
</ul>
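
<p>With PEFT, wiring the adapters in is only a few lines. This is a sketch: the
rank, alpha, and target module names are plausible defaults for a Llama-style
model such as SmolLM, not the exact DistilViT2 settings.</p>

<pre><code class="language-python"># LoRA sketch with PEFT: only low-rank adapters on the attention projections
# are trainable. r, alpha, and target_modules are assumptions here.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

lm = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
lm = get_peft_model(lm, lora_config)
lm.print_trainable_parameters()   # only a small fraction of the total is trainable
</code></pre>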

<p>Training is faster, more stable, and far less memory-intensive. The risk of overfitting drops significantly when working with small datasets.</p>

<p>Deployment also becomes simpler.</p>

<ul>
  <li>The vision encoder exports cleanly to ONNX</li>
  <li>The projection head is trivial</li>
  <li>The decoder is a standard causal LM with past key values</li>
</ul>

<p>There is no cross-attention graph, no encoder cache plumbing, and no exotic export logic. ONNX Runtime becomes a realistic target instead of a constant source of friction.</p>

<h1 id="summary">Summary</h1>

<p>Cross-attention remains a powerful and sometimes necessary tool. For this project, however, it added complexity without delivering better alt text.</p>

<p>Prefix conditioning gives me:</p>

<ul>
  <li>a simpler architecture</li>
  <li>faster iteration</li>
  <li>better tooling compatibility</li>
  <li>easier deployment</li>
  <li>freedom to use modern decoder-only models</li>
</ul>

<p>Initial experiments show that the new architecture produces alt text of
comparable quality to the previous one, with only a 1 to 2 percent CLIP score
difference when trained on the same datasets. The key difference is training
speed, which is roughly five times faster.</p>

<p>Next, my goal is to surpass DistilViT’s current quality by improving the
training dataset, while keeping an architecture that is simple, fast to train,
and flexible enough to accommodate future decoder models.</p>

<h1 id="references">References</h1>

<ul>
  <li>Mokady et al., <em><a href="https://arxiv.org/abs/2111.09734">ClipCap: CLIP Prefix for Image Captioning</a></em>, 2021</li>
  <li>Li &amp; Liang, <em><a href="https://arxiv.org/abs/2101.00190">Prefix-Tuning: Optimizing Continuous Prompts for Generation</a></em>, 2021</li>
  <li>Hu et al., <em><a href="https://arxiv.org/abs/2106.09685">LoRA: Low-Rank Adaptation of Large Language Models</a></em>, 2021</li>
  <li>Hugging Face, <em><a href="https://huggingface.co/docs/transformers/model_doc/vision-encoder-decoder">VisionEncoderDecoderModel documentation</a></em></li>
</ul>

<h1 id="useful-links">Useful Links</h1>

<ul>
  <li><a href="/2025/12/15/better-alt-text-part-1/">Part 1: Dataset Quality and Bias Detection</a></li>
  <li><a href="https://huggingface.co/google/siglip-base-patch16-224">SigLIP on Hugging Face</a></li>
  <li><a href="https://huggingface.co/HuggingFaceTB/SmolLM-135M">SmolLM on Hugging Face</a></li>
  <li><a href="https://github.com/tarekziade/distilvit2">DistilViT2 code</a></li>
</ul>]]></content><author><name>Tarek Ziade</name></author><category term="AI" /><category term="Machine Learning" /><category term="Image Captioning" /><category term="LoRA" /><category term="Prefix Conditioning" /><category term="SigLIP" /><category term="SmolLM" /><category term="Mozilla" /><category term="Accessibility" /><summary type="html"><![CDATA[In Part 1, I explained why high-quality alt text matters, how modern vision–language models can help, and why balanced, carefully curated datasets are essential for training.]]></summary></entry><entry><title type="html">All I Want for Christmas is a Better Alt Text – Part 1</title><link href="https://tarekziade.github.io/2025/12/15/better-alt-text-part-1/" rel="alternate" type="text/html" title="All I Want for Christmas is a Better Alt Text – Part 1" /><published>2025-12-15T00:00:00+00:00</published><updated>2025-12-15T00:00:00+00:00</updated><id>https://tarekziade.github.io/2025/12/15/better-alt-text-part-1</id><content type="html" xml:base="https://tarekziade.github.io/2025/12/15/better-alt-text-part-1/"><![CDATA[<h1 id="context-improving-alt-text-for-firefox">Context: Improving Alt Text for Firefox</h1>

<p>Earlier this year, I built the backend for the <a href="https://hacks.mozilla.org/2024/05/experimenting-with-local-alt-text-generation-in-firefox-nightly/">local alt text generation feature in Firefox</a>. Nearly half of the images on the web still lack alternative text, creating a major accessibility barrier for screen reader users. The goal of this work is straightforward but ambitious: generate high-quality alt text entirely on device, preserving user privacy while improving access to visual content.</p>

<p>The first implementation focused on PDF.js, primarily as a controlled environment to validate the approach. Now that the runtime performance is good enough, the next step is to generalize this capability across the entire browser so that all web images can benefit from meaningful descriptions. Before that generalization, however, improving accuracy is essential.</p>

<p>From a modeling perspective, the system pairs a Vision Transformer (ViT) with DistilGPT-2, for a combined model of roughly 182 million parameters that fits under 200 MB once quantized. Improving this system involves multiple, often competing dimensions: <strong>bias reduction</strong>, <strong>description accuracy</strong>, and <strong>inference speed</strong>. This post focuses on the data side of the problem, specifically dataset quality and bias. Part 2 will look at model-level improvements for accuracy and performance.</p>

<h1 id="first-round-removing-bias-with-gpt-4o">First Round: Removing Bias with GPT-4o</h1>

<p>The original image captions contained several recurring issues:</p>

<ul>
  <li><strong>Gender bias</strong>: skateboarders described as “men”, nurses as “women”</li>
  <li><strong>Age stereotyping</strong>: unnecessary or reductive age descriptors</li>
  <li><strong>Offensive or outdated language</strong>: culturally insensitive terms that no longer belong in a modern dataset</li>
</ul>

<p>To address this, I used GPT-4o to systematically transform captions from Flickr30k and COCO, removing demographic descriptors that were not visually required. The resulting datasets are available on Hugging Face (<a href="https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions-gpt4o">Mozilla/flickr30k-transformed-captions-gpt4o</a>) and were used to train the current Firefox local alt text model.</p>

<p>For more background on this initial effort, see the <a href="https://hacks.mozilla.org/2024/05/experimenting-with-local-alt-text-generation-in-firefox-nightly/">Mozilla Hacks post</a> and the <a href="https://blog.mozilla.org/en/firefox/firefox-ai/help-us-improve-our-alt-text-generation-model/">Firefox blog announcement</a>. <strong>This is the model that is currently shipping in Firefox.</strong></p>

<h1 id="second-round-measuring-what-actually-improved">Second Round: Measuring What Actually Improved</h1>

<p>Qualitative panel testing showed that the transformed captions were generally better received by humans, but that only answered part of the question. What exactly improved, by how much, and what problems remained hidden in the data?</p>

<p>This post documents the second round of work, which focused on building systematic measurement tools to:</p>

<ol>
  <li>Quantify how much bias was actually removed</li>
  <li>Verify that transformed captions still describe the images accurately</li>
  <li>Identify class imbalance and other structural issues</li>
  <li>Lay the groundwork for targeted fixes, including synthetic data generation</li>
</ol>

<p>When training vision-language models, dataset quality is often treated as a secondary concern compared to architecture or training tricks. In practice, the data is the foundation. If the dataset is biased, noisy, or unbalanced, no amount of fine-tuning will fully compensate.</p>

<h1 id="the-problem-space">The Problem Space</h1>

<p>After the GPT-4o transformation, several open questions remained:</p>

<ul>
  <li>Did bias removal actually work in a measurable way?</li>
  <li>Was semantic meaning preserved during transformation?</li>
  <li>Did image–text alignment degrade or improve?</li>
  <li>Are some visual concepts severely underrepresented?</li>
  <li>Can these checks be repeated reliably for future dataset versions?</li>
</ul>

<p>Answering these questions requires more than a single score or benchmark.</p>

<h1 id="a-multi-metric-quality-analysis">A Multi-Metric Quality Analysis</h1>

<p>I built a dataset quality analysis tool that evaluates four complementary dimensions. The emphasis is on improving the training data itself, rather than compensating for data issues at the model level.</p>

<h2 id="1-imagetext-alignment-clip-score">1. Image–Text Alignment (CLIP Score)</h2>

<p>CLIP provides a convenient proxy for how well a caption matches its corresponding image. By embedding both modalities and computing cosine similarity, I obtain a rough but useful alignment score.</p>

<p>A key improvement in this round was upgrading from CLIP ViT-B/32 to ViT-L/14 @ 336 px. The larger model produces lower absolute scores, but it is significantly more discriminative, making it easier to separate strong alignments from weak ones.</p>

<p><strong>Interpretation guidelines</strong>:</p>

<ul>
  <li>Excellent: ≥ 0.35</li>
  <li>Good: 0.30–0.35</li>
  <li>Fair: 0.25–0.30</li>
  <li>Poor: &lt; 0.25</li>
</ul>

<p>On the transformed dataset, I observe scores of <strong>0.311</strong> with ViT-B/32 (Good) and <strong>0.284</strong> with ViT-L/14 @ 336 px (Fair but more informative).</p>
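
<p>For reference, the score itself is just a cosine similarity between CLIP
embeddings. A minimal sketch (not the project's actual evaluation code) looks
like this:</p>

<pre><code class="language-python"># CLIP score sketch: cosine similarity between image and caption embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

def clip_score(image: Image.Image, caption: str) -&gt; float:
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
</code></pre>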

<h2 id="2-caption-fidelity-bertscore">2. Caption Fidelity (BERTScore)</h2>

<p>Removing bias should not come at the cost of semantic drift. To verify this, I used BERTScore with a RoBERTa-large backbone to compare original and transformed captions.</p>

<p>Scores above 0.90 generally indicate that the core meaning is preserved. The transformed dataset achieves <strong>0.904</strong>, which falls comfortably in the “excellent” range.</p>
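
<p>Computing it is straightforward with the bert-score package (a minimal sketch
of the comparison, not the full pipeline):</p>

<pre><code class="language-python"># BERTScore sketch: measure how much meaning the transformation preserved.
from bert_score import score

originals = ["A man rides a skateboard down a ramp."]
transformed = ["A person rides a skateboard down a ramp."]

# F1 close to or above 0.90 suggests the core meaning is preserved.
precision, recall, f1 = score(transformed, originals,
                              model_type="roberta-large", lang="en")
print(f1.mean().item())
</code></pre>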

<h2 id="3-bias-detection-before-and-after">3. Bias Detection Before and After</h2>

<p>Bias reduction is only meaningful if it can be measured. I tracked mentions of protected attributes across seven categories: gender, race or ethnicity, nationality, age, religion, sexual orientation, and disability.</p>

<p>By comparing original and transformed captions on the same samples, I can directly quantify the effect of the transformation. On a 1 000-sample evaluation set, gender mentions dropped from 70 percent to zero, race and ethnicity mentions dropped by 97 percent, and nationality mentions were completely eliminated. Age-related terms remain more common, largely because they are often visually relevant, for example when describing children.</p>
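
<p>The detection itself can stay simple: count how many captions mention terms
from a per-category word list. The sketch below uses tiny illustrative lists;
the real lists are much longer and more carefully curated.</p>

<pre><code class="language-python"># Simplified bias-detection sketch with tiny illustrative term lists.
import re

BIAS_TERMS = {
    "gender": ["man", "woman", "boy", "girl", "male", "female"],
    "age": ["young", "old", "elderly", "teenager"],
    "nationality": ["american", "french", "german"],
}

def mention_rates(captions):
    rates = {}
    for category, terms in BIAS_TERMS.items():
        pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
        hits = sum(1 for c in captions if pattern.search(c))
        rates[category] = hits / len(captions)
    return rates

print(mention_rates(["A man rides a bike.", "A person reads a book."]))
# {'gender': 0.5, 'age': 0.0, 'nationality': 0.0}
</code></pre>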

<h2 id="4-object-distribution-and-imbalance">4. Object Distribution and Imbalance</h2>

<p>Finally, I analyzed object frequency to identify long-tail problems. Using metrics such as the Gini coefficient and Shannon entropy, the tool highlights severe imbalance: thousands of objects appear only a handful of times.</p>

<p>This analysis automatically produces lists of rare objects and sampling weights that can be used for rebalancing during training.</p>
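
<p>Both metrics, and the sampling weights, are easy to derive from raw object
counts. A small sketch of the idea:</p>

<pre><code class="language-python"># Imbalance metrics sketch: Gini coefficient, Shannon entropy, and
# inverse-frequency sampling weights derived from object counts.
import numpy as np

counts = np.array([5000, 1200, 300, 40, 7, 3], dtype=np.float64)  # per-object counts

def gini(x):
    x = np.sort(x)
    n = len(x)
    cum = np.cumsum(x)
    # 1 minus twice the area under the Lorenz curve
    return 1 - 2 * np.sum(cum) / (n * cum[-1]) + 1 / n

def shannon_entropy(x):
    p = x / x.sum()
    return float(-(p * np.log2(p)).sum())

weights = 1.0 / counts
weights /= weights.sum()   # sampling probabilities that favor rare objects

print(f"gini={gini(counts):.3f}, entropy={shannon_entropy(counts):.2f} bits")
</code></pre>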

<h1 id="using-clip-as-a-training-signal">Using CLIP as a Training Signal</h1>

<p>Beyond evaluation, CLIP can also be used to guide training directly. I experimented with a combined loss that adds a CLIP-based alignment term to the standard cross-entropy loss for caption generation.</p>

<p>The intuition is simple: encourage the model to generate captions that are not only fluent, but also visually grounded. Early results suggest modest but consistent gains in CLIP score, at the cost of slower training and higher memory usage.</p>

<h1 id="running-the-quality-analysis">Running the Quality Analysis</h1>

<p>The quality analysis tool integrates directly into the project’s Makefile:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Quick test (100 samples)</span>
make quality-report-quick

<span class="c"># Full analysis on test split</span>
make quality-report <span class="nv">SPLIT</span><span class="o">=</span><span class="nb">test</span>

<span class="c"># Custom analysis</span>
make quality-report <span class="nv">SPLIT</span><span class="o">=</span>train <span class="nv">MAX_SAMPLES</span><span class="o">=</span>1000 <span class="nv">OUTPUT_DIR</span><span class="o">=</span>./my_reports
</code></pre></div></div>

<h1 id="example-dataset-quality-report">Example Dataset Quality Report</h1>

<p>Below is an excerpt from the generated quality report for the full Flickr30k transformed dataset. It illustrates how the metrics come together in practice.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>================================================================================
                             DATASET QUALITY REPORT
================================================================================

Dataset: Mozilla/flickr30k-transformed-captions-gpt4o
Samples: 31 014

IMAGE–TEXT ALIGNMENT (CLIP)
Score: 0.274 ± 0.036   Assessment: FAIR

CAPTION FIDELITY (BERTScore)
Score: 0.899 ± 0.023   Assessment: GOOD

BIAS DETECTION (Original → Transformed)
Gender:         67% → 0%
Race/Ethnicity: 27% → 1%
Nationality:     1% → 0%
Age:            19% → 17%

OBJECT DISTRIBUTION
Gini coefficient: 0.889
Rare classes (&lt;50 samples): 6 210
================================================================================
</code></pre></div></div>

<p>The report confirms that the GPT-4o transformation is highly effective at removing demographic bias while preserving meaning. At the same time, it surfaces two remaining issues: only fair image–text alignment and severe class imbalance.</p>

<h1 id="output-files">Output Files</h1>

<p>The analysis produces the following artifacts:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Directory: quality_reports/
  • summary.json                 - Aggregate metrics in JSON format
  • quality_report.txt           - Human-readable summary report
  • per_example_scores.csv       - Per-sample CLIP, BERT, and bias scores
  • ranked_by_combined.csv       - Samples ranked by combined quality score
  • object_counts.csv            - Object frequency distribution
  • objects_below_50.csv         - Rare / underrepresented objects (≤50 samples)
  • reweighting_probs.csv        - Sampling probabilities for balanced training
  • lorenz_curve.png             - Object distribution inequality visualization
  • top_failures/                - Top failure cases with images and captions
</code></pre></div></div>

<p>These artifacts make it easy to audit dataset quality, compare runs, and target specific weaknesses.</p>

<h1 id="key-takeaways">Key Takeaways</h1>

<ul>
  <li>Dataset quality cannot be captured by a single metric</li>
  <li>Bias removal can be measured and verified quantitatively</li>
  <li>Larger CLIP models are more useful for discrimination, even if absolute scores are lower</li>
  <li>Alignment-aware training objectives show promise</li>
  <li>Class imbalance remains a major, and solvable, issue</li>
</ul>

<h1 id="what-comes-next">What Comes Next</h1>

<p>None of these improvements are shipping yet. They are preparatory steps that make future work safer and more predictable. With solid metrics in place, the next phase is to train improved models, validate gains rigorously, and continue reducing long-tail failures.</p>

<p>The long-term goal remains unchanged: provide high-quality, privacy-preserving alt text for the large fraction of web images that still lack it, and do so in a way that is fair, transparent, and measurable.</p>

<h1 id="references-and-resources">References and Resources</h1>

<h2 id="background">Background</h2>
<ul>
  <li><a href="https://hacks.mozilla.org/2024/05/experimenting-with-local-alt-text-generation-in-firefox-nightly/">Experimenting with local alt text generation in Firefox Nightly</a></li>
  <li><a href="https://blog.mozilla.org/en/firefox/firefox-ai/help-us-improve-our-alt-text-generation-model/">Help us improve our alt text generation model</a></li>
</ul>

<h2 id="datasets">Datasets</h2>
<ul>
  <li><a href="http://web.engr.illinois.edu/~bplumme2/Flickr30kEntities/">Flickr30k Entities Dataset</a></li>
  <li><a href="https://cocodataset.org/">COCO Dataset</a></li>
  <li><a href="https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions-gpt4o">Mozilla Flickr30k transformed captions (GPT-4o)</a></li>
</ul>

<h2 id="metrics">Metrics</h2>
<ul>
  <li><a href="https://arxiv.org/abs/2103.00020">CLIP: Learning Transferable Visual Models From Natural Language Supervision</a></li>
  <li><a href="https://arxiv.org/abs/1904.09675">BERTScore: Evaluating Text Generation with BERT</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Gini_coefficient">Gini coefficient</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">Shannon entropy</a></li>
</ul>

<h2 id="code">Code</h2>
<ul>
  <li><a href="https://github.com/tarekziade/distilvit2">Project repository</a></li>
  <li><a href="https://github.com/tarekziade/distilvit2/blob/main/docs/dataset_quality.md">Dataset quality documentation</a></li>
  <li><a href="https://github.com/tarekziade/distilvit2/blob/main/distilvit/dataset_quality_report.py">Quality analysis tool</a></li>
</ul>]]></content><author><name>Tarek Ziade</name></author><category term="AI" /><category term="Machine Learning" /><category term="Image Captioning" /><category term="Dataset Quality" /><category term="Bias Detection" /><category term="CLIP" /><category term="BERT" /><category term="Mozilla" /><category term="Accessibility" /><summary type="html"><![CDATA[Context: Improving Alt Text for Firefox]]></summary></entry><entry><title type="html">Two Years of Building AI in Firefox</title><link href="https://tarekziade.github.io/2025/12/05/two-years-of-ai-at-mozilla/" rel="alternate" type="text/html" title="Two Years of Building AI in Firefox" /><published>2025-12-05T00:00:00+00:00</published><updated>2025-12-05T00:00:00+00:00</updated><id>https://tarekziade.github.io/2025/12/05/two-years-of-ai-at-mozilla</id><content type="html" xml:base="https://tarekziade.github.io/2025/12/05/two-years-of-ai-at-mozilla/"><![CDATA[<p>When I started working on AI at Mozilla two years ago, I was a Python developer
with a background in web services and three months of machine learning experience
from working on the Nuclia DB project. I was not someone who had trained models
from scratch or built production ML infrastructure. Today, Firefox ships
multiple AI features that run entirely on-device, and I helped build the
infrastructure that makes that possible. This is a retrospective on what we
accomplished and what I learned along the way.</p>

<h2 id="building-the-foundation-the-ml-inference-runtime">Building the Foundation: The ML Inference Runtime</h2>

<p>The first major challenge was creating a runtime that could run machine learning
models directly in Firefox. We needed something that worked across platforms,
respected user privacy, and didn’t require sending data to external servers.</p>

<p>We built the Firefox ML inference engine on top of two core technologies: the
ONNX runtime for executing models, and Transformers.js to simplify the
inference work. The architecture we settled on uses a dedicated content process
for inference, keeping it isolated from the main browser process. Remote
Settings distributes both the runtime and model configurations, while IndexedDB
caches downloaded models locally.</p>

<p>One critical evolution was moving away from WebAssembly to run a pure C++ ONNX
runtime under Transformers.js. This shift gave us significantly better
performance and tighter integration with Firefox’s internals. Getting this
right required deep systems-level work, and I was fortunate to work with
fantastic engineers like Paul Adenot and Serge Guelton who brought the
expertise needed to make it happen.</p>

<p>This multi-process design was crucial. It gave us stability, security, and the
ability to update models without shipping new browser versions. We also created
our own model hub, giving us control over model distribution while still
supporting Hugging Face for developers who want broader model selection.</p>

<p>The API we exposed is deliberately simple. Developers create an engine instance
with a task name and model ID, then run inference either synchronously or with
streaming output. Behind the scenes, Firefox handles downloading models,
managing cache, and choosing the right backend.</p>
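
<p>As a rough illustration of that flow, here is a minimal sketch. The names (<code class="language-plaintext highlighter-rouge">createEngine</code>, <code class="language-plaintext highlighter-rouge">taskName</code>, <code class="language-plaintext highlighter-rouge">modelId</code>, <code class="language-plaintext highlighter-rouge">run</code>) are illustrative assumptions rather than the exact API surface; the Firefox ML documentation linked at the end is the authoritative reference.</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Minimal sketch of the developer-facing flow described above.
// Names, options, and the model ID are illustrative, not the exact Firefox API.
const engine = await createEngine({
  taskName: "image-to-text",     // the task this engine serves
  modelId: "mozilla/distilvit",  // identifier resolved by the model hub
});

// One-shot inference: the model is downloaded and cached on first use,
// and execution happens in the dedicated inference process.
// imageBlob is a placeholder for an image obtained elsewhere.
const result = await engine.run({ args: [imageBlob] });
console.log(result);
</code></pre></div></div>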

<h2 id="the-first-real-project-pdfjs-alt-text">The First Real Project: PDF.js Alt Text</h2>

<p>With the runtime in place, we needed a real feature to prove it worked. PDF.js
alt text generation became that first end-to-end project, and I have written
about it in detail before. But looking back now, it was more than just a
feature. It was the template for everything that came after.</p>

<p>We chose a Vision Transformer paired with a distilled GPT-2 decoder, compressed
to 180 million parameters and under 200MB on disk. The model runs in a couple
of seconds on a laptop, generates descriptions locally, and never sends your
PDF content anywhere. This shipped in Firefox 130, and it set the standard for
how we approach AI: small models, local execution, and privacy by default.</p>

<p>The harder work was not the model architecture. It was dealing with biased
training data and building a validation pipeline. COCO and Flickr30k datasets
carried gender stereotypes and cultural assumptions. We rebuilt the dataset
using GPT-4o annotations to generate cleaner, more neutral captions. Then we
built a human-in-the-loop validation app where users could correct outputs,
feeding those corrections back into retraining. That iterative cycle was what
made the model genuinely useful.</p>

<h2 id="smart-tab-management-and-beyond">Smart Tab Management and Beyond</h2>

<p>Once the runtime was stable and we had proven we could ship a real feature, the
next step was expanding to other use cases. Smart Tabs launched in Firefox 141,
bringing local AI to tab management.</p>

<p>The feature is simple: right-click a tab group, select “Suggest more tabs for
group,” and Firefox analyzes tab titles and descriptions to suggest similar
tabs. Users can accept or reject suggestions. The AI runs entirely on-device,
so your browsing data stays private.</p>
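
<p>One plausible way to rank candidates, sketched below, is to embed each tab’s title and description locally and score candidates by cosine similarity against the tabs already in the group. This is an assumption about the general approach, not Firefox’s actual Smart Tabs code; <code class="language-plaintext highlighter-rouge">embed</code> is a placeholder for whatever local text-embedding model backs the feature.</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical similarity-based tab suggestion, for illustration only.
// embed() stands in for a local text-embedding call returning a number[].
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i &lt; a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function suggestTabs(groupTabs, candidateTabs, embed) {
  const groupVectors = await Promise.all(
    groupTabs.map((tab) => embed(`${tab.title} ${tab.description}`))
  );
  const scored = [];
  for (const tab of candidateTabs) {
    const vector = await embed(`${tab.title} ${tab.description}`);
    // Score each candidate by its best match against the existing group.
    const score = Math.max(...groupVectors.map((g) => cosine(g, vector)));
    scored.push({ tab, score });
  }
  return scored.sort((a, b) => b.score - a.score);
}
</code></pre></div></div>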

<p>This project showed that the infrastructure we built was flexible enough to
handle different tasks. Smart Tabs did not require a new runtime or a new model
distribution system—it reused what we already had. That reusability was proof
the architecture was working.</p>

<p>After Smart Tabs, we added many other small features following the same
pattern: laser-focused models running on-device for specific tasks. Each one
reinforced the core principle: AI should solve real problems without
compromising privacy. The infrastructure we built made it cheap to ship new
capabilities, and the local-first approach meant users stayed in control of
their data.</p>

<h2 id="ai-window-and-the-server-side-challenge">AI Window and the Server-Side Challenge</h2>

<p>The reality is that not all AI features can run locally. Small, specialized
models work well on-device, but larger language models (the kind that can handle
complex conversations and broad knowledge tasks) still need server-side compute.
That is where AI Window comes in.</p>

<p>Announced in November 2025, AI Window is an opt-in feature that brings a
conversational AI assistant directly into Firefox. Unlike our local features,
this required building infrastructure to support server-side inference while
maintaining Firefox’s commitment to user choice and control.</p>

<p>Over the past several months, I have been working on the server-side LLM
service and the overall architecture to make sure Firefox can reliably call
external services when needed. This meant designing APIs, handling failures
gracefully, managing rate limits, and ensuring the system could scale while
still respecting user preferences. The work was less about the models
themselves and more about building the bridge between Firefox and external AI
providers in a way that gives users real control.</p>
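
<p>As a hedged example of what that plumbing looks like, the sketch below retries a request to an LLM endpoint on rate limiting and transient server errors, honoring a <code class="language-plaintext highlighter-rouge">Retry-After</code> hint when the server provides one. The URL and payload shape are placeholders, not the actual Firefox service.</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Illustrative failure handling for a server-side LLM call: retry on
// rate limiting (429) and 5xx errors with exponential backoff.
// The endpoint and payload are placeholders, not the real service.
async function callLLM(payload, { maxRetries = 3 } = {}) {
  for (let attempt = 0; attempt &lt;= maxRetries; attempt++) {
    const response = await fetch("https://example.invalid/v1/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    });
    if (response.ok) {
      return response.json();
    }
    const retriable = response.status === 429 || response.status >= 500;
    if (!retriable || attempt === maxRetries) {
      throw new Error(`LLM request failed with status ${response.status}`);
    }
    // Prefer the server's Retry-After hint (in seconds), otherwise back off
    // exponentially.
    const retryAfter = Number(response.headers.get("Retry-After"));
    const delayMs = retryAfter ? retryAfter * 1000 : 500 * 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
</code></pre></div></div>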

<p>This hybrid approach (local AI for privacy-sensitive tasks, server-side AI for
compute-intensive ones) is where the browser needs to go. But it raises
important questions about privacy.</p>

<h2 id="the-privacy-challenge-for-server-side-ai">The Privacy Challenge for Server-Side AI</h2>

<p>Local AI gives you perfect privacy: your data never leaves your device. But
when a model runs on a server, you are trusting someone else with your prompts,
your documents, and your questions. That trust model needs to change.</p>

<p>I am looking forward to industry standards around end-to-end encryption for
running LLM inference with full privacy guarantees. The technology already
exists. Flower.ai has built federated learning infrastructure with end-to-end
encryption that allows large models to run on remote GPUs while keeping user
data encrypted. Nvidia has Confidential Computing on H100 and Blackwell GPUs,
using hardware-based trusted execution environments to protect code and data
during inference. The performance overhead is minimal (often less than 5%) and
the privacy guarantees are real.</p>

<p>But here is the problem: none of this is part of the de facto OpenAI API
standard that most LLM services use today. If you want to call GPT-4 or Claude
or any major hosted model, there is no standardized way to do it with
end-to-end encryption or confidential compute guarantees. Your data goes to the
server in plaintext, and you have to trust the provider’s privacy policy.</p>

<p>My hope is that it will soon be possible to run inference on the cloud with
strong privacy guarantees as a standard feature, not a niche offering. The
hardware is ready. The cryptographic techniques exist. What we need now is for
the industry to adopt these capabilities as table stakes for AI services. Until
that happens, local AI remains the gold standard for privacy, and server-side
AI remains a compromise.</p>

<h2 id="what-made-this-possible">What Made This Possible</h2>

<p>Building AI features in a browser is not the same as building AI features in a
standalone app or a cloud service. The constraints are different. You have
limited resources, strict privacy requirements, and the need to work across
Windows, macOS, and Linux. Here is what made it work:</p>

<ul>
  <li>
    <p><strong>Starting small</strong>: We did not try to build everything at once. The first
runtime was minimal. The first model was simple. We added complexity only
when we needed it.</p>
  </li>
  <li>
    <p><strong>Privacy as a requirement, not a feature</strong>: Every decision started with “can
this run locally?” If the answer was no, we either changed the approach or
did not build it.</p>
  </li>
  <li>
    <p><strong>Reusable infrastructure</strong>: We built the runtime once and used it for
multiple features. That meant each new AI capability got cheaper to ship.</p>
  </li>
  <li>
    <p><strong>Learning from real users</strong>: The validation app for PDF.js alt text was not
just about improving the model—it was about understanding what real people
needed. User feedback drove every iteration.</p>
  </li>
</ul>

<h2 id="what-i-learned">What I Learned</h2>

<p>Two years ago, I did not know how to train a model or what ONNX was. Now I have
shipped multiple AI features in production. Here is what stuck with me:</p>

<ul>
  <li>
    <p><strong>You do not need a PhD</strong>: Machine learning has a reputation for being
inaccessible, but the tools have gotten good enough that you can learn by
doing. I started with a pre-trained model, fine-tuned it, and kept iterating.
Most of the work was engineering, not research.</p>
  </li>
  <li>
    <p><strong>Data quality beats model size</strong>: We spent more time cleaning datasets and
handling bias than we did optimizing model architecture. A smaller model
trained on better data outperformed a larger model trained on messy data.</p>
  </li>
  <li>
    <p><strong>Privacy is possible</strong>: The narrative around AI assumes everything needs to
run in the cloud. It does not. Local models work. They are fast enough, small
enough, and private by default.</p>
  </li>
  <li>
    <p><strong>Building the process matters more than building the model</strong>: The validation
pipeline, the retraining loop, the distribution system. That infrastructure
was more important than any single model.</p>
  </li>
</ul>

<h2 id="what-is-next">What is Next</h2>

<p>This work is not finished. We plan to iterate on PDF.js alt text, expand
Smart Tabs, and bring AI Window to users who want conversational AI in their
browser. WebNN is coming, and that will give us even better performance for
local models. The Firefox ML runtime is still experimental, but it is stable
enough that other teams are starting to build on it.</p>

<p>The bigger challenge is pushing the industry toward privacy-preserving
server-side AI. Confidential compute and end-to-end encryption for LLM
inference should not be experimental features. They should be the default. I
hope to see more providers adopt these technologies and for standards bodies to
make privacy guarantees a core part of the AI API specifications.</p>

<p>On a personal level, these two years showed me that AI in the browser is not
just possible—it is the right way to do it. Local models give users control.
They protect privacy. And they prove that you do not need to send your data to
a server farm to get intelligent features. But when you do need server-side
compute, it should come with strong privacy guarantees, not just promises.</p>

<p>What excites me the most is running AI locally. That is where the future of
open AI lies: not just open models and open weights, but truly open AI that runs
on your device, under your control, without gatekeepers or surveillance. The
browser is the perfect platform to make that future real.</p>

<p>I am proud of what we built. More importantly, I am excited about what comes next.</p>

<h2 id="useful-links">Useful links</h2>

<h3 id="firefox-features">Firefox Features</h3>
<ul>
  <li><a href="https://firefox-source-docs.mozilla.org/toolkit/components/ml/index.html">Firefox ML Documentation</a></li>
  <li><a href="https://blog.mozilla.org/en/firefox/ai-window/">Mozilla Blog: AI Window</a></li>
  <li><a href="https://blog.mozilla.org/en/firefox/firefox-ai/help-us-improve-our-alt-text-generation-model">Mozilla Blog: Help us improve our alt-text generation model</a></li>
  <li><a href="http://hacks.mozilla.org/2024/05/experimenting-with-local-alt-text-generation-in-firefox-nightly/">Mozilla Hacks: Experimenting with Local Alt-Text Generation</a></li>
  <li><a href="https://blog.mozilla.org/en/mozilla/heres-what-were-working-on-in-firefox/">Mozilla Blog: Here’s what we’re working on in Firefox</a></li>
  <li><a href="https://dig.watch/updates/smart-tab-management-comes-to-firefox-with-local-ai">Smart Tab Management in Firefox</a></li>
</ul>

<h3 id="privacy-preserving-ai">Privacy-Preserving AI</h3>
<ul>
  <li><a href="https://flower.ai/">Flower.ai: Federated AI Framework</a></li>
  <li><a href="https://flower.ai/intelligence/">Flower Intelligence: End-to-end encryption for AI</a></li>
  <li><a href="https://www.nvidia.com/en-us/data-center/solutions/confidential-computing/">NVIDIA Confidential Computing</a></li>
  <li><a href="https://developer.nvidia.com/blog/confidential-computing-on-h100-gpus-for-secure-and-trustworthy-ai/">NVIDIA H100 Confidential Computing for AI</a> (Performance benchmarks showing &lt;5% overhead)</li>
  <li><a href="https://arxiv.org/html/2409.03992v1">ArXiv: Confidential Computing on H100 GPU Performance Study</a></li>
</ul>]]></content><author><name>Tarek Ziade</name></author><category term="AI" /><category term="Firefox" /><category term="Machine Learning" /><category term="Privacy" /><summary type="html"><![CDATA[When I started working on AI at Mozilla two years ago, I was a Python developer with a background in web services and three months of machine learning experience from working on the Nuclia DB project. I was not someone who had trained models from scratch or built production ML infrastructure. Today, Firefox ships multiple AI features that run entirely on-device, and I helped build the infrastructure that makes that possible. This is a retrospective on what we accomplished and what I learned along the way.]]></summary></entry><entry><title type="html">WebNN is the future of browsers AI</title><link href="https://tarekziade.github.io/2025/11/21/why-webnn-is-the-future-of-ai-in-browsers/" rel="alternate" type="text/html" title="WebNN is the future of browsers AI" /><published>2025-11-21T00:00:00+00:00</published><updated>2025-11-21T00:00:00+00:00</updated><id>https://tarekziade.github.io/2025/11/21/why-webnn-is-the-future-of-ai-in-browsers</id><content type="html" xml:base="https://tarekziade.github.io/2025/11/21/why-webnn-is-the-future-of-ai-in-browsers/"><![CDATA[<p>For years, running machine learning in the browser meant juggling GPU support,
WASM fallbacks, and flags. WebNN changes that by giving the web a standard
inference API between JavaScript and hardware. It is the missing piece that
turns the browser into a first-class AI client runtime.</p>

<p>Running AI locally is the long game. A decade from now laptops and phones will
run much larger models natively, and the best experiences won’t require sending
your data off to a cloud service. WebNN is how the web gets there.</p>

<h2 id="what-webnn-really-is">What WebNN really is</h2>

<p>WebNN is a W3C draft specification that exposes a graph-based neural network
API to the web platform. Instead of binding directly to CUDA or Metal, browsers
map WebNN calls to whatever native acceleration they have: DirectML on Windows,
Core ML on macOS and iOS, NNAPI on Android, or a CPU path via TFLite/XNNPACK.
When a CPU path exists, the browser can fall back there. Think of it as <code class="language-plaintext highlighter-rouge">canvas</code>
for neural networks: you provide the graph, the browser picks the fastest safe
backend.</p>

<ul>
  <li>Spec: <a href="https://www.w3.org/TR/webnn/">https://www.w3.org/TR/webnn/</a></li>
  <li>Demos: <a href="https://webmachinelearning.github.io/webnn-samples-intro/">https://webmachinelearning.github.io/webnn-samples-intro/</a></li>
</ul>

<h3 id="webnn-as-a-graph-converter">WebNN as a graph converter</h3>

<p>WebNN is a graph builder and validator. The browser takes the graph you define
in JS, converts it into a static graph aimed at one of the underlying runtimes
in the OS (DirectML, Core ML, NNAPI, TFLite/XNNPACK, or ONNX Runtime on newer
Windows), and hands it to that native library. The heavy lifting lives there:
compilation, scheduling, and kernel selection. WebNN is the portable contract
that keeps your app code unchanged while the browser targets the best backend.</p>
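
<p>A small example makes that contract concrete. The sketch below builds a trivial graph with the WebNN JavaScript API; the overall flow (create a context, describe inputs, build a graph, run it) follows the spec, but the exact descriptor fields and the compute/dispatch call are still evolving across spec drafts, so treat it as an approximation rather than copy-paste code.</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Build and run a tiny WebNN graph: c = relu(a + b).
// The flow follows the spec, but descriptor keys and the compute/dispatch
// call differ between spec drafts and implementations.
const context = await navigator.ml.createContext({ deviceType: "gpu" });
const builder = new MLGraphBuilder(context);

const desc = { dataType: "float32", dimensions: [1, 4] };
const a = builder.input("a", desc);
const b = builder.input("b", desc);
const c = builder.relu(builder.add(a, b));

const graph = await builder.build({ c });

const inputs = {
  a: new Float32Array([1, -2, 3, -4]),
  b: new Float32Array([1, 1, 1, 1]),
};
const outputs = { c: new Float32Array(4) };

// Older drafts expose context.compute(); newer ones use MLTensor + dispatch().
const results = await context.compute(graph, inputs, outputs);
console.log(results.outputs ? results.outputs.c : outputs.c);
</code></pre></div></div>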

<p>In Chromium, WebNN uses DirectML by default on Windows and can use the
OS-shipped ONNX Runtime backend on Windows 11 24H2+, falling back to DirectML
otherwise.</p>

<h3 id="why-not-just-use-webgpu">Why not “just use WebGPU”?</h3>

<p>Libraries like ONNX Runtime Web and TF.js already use WebGPU to get more speed,
but that means treating a graphics API as an inference runtime: writing shaders,
managing bindings, and re-implementing scheduling. WebGPU is great for explicit
control; WebNN is the spec we actually want for AI, with portable graphs,
browser-managed backend choice, and no shader boilerplate.</p>

<h2 id="why-this-matters">Why this matters</h2>

<ul>
  <li><strong>Performance without flags:</strong> WebNN can route to GPU, NPU, or CPU without
developers writing backend-specific code. That means near-native throughput
for models like Whisper Tiny or Segment Anything, but delivered via a web
page.</li>
  <li><strong>Predictable portability:</strong> The standard defines ops once; browsers own the
mapping to the best hardware path they have. Apps no longer maintain separate
WebGPU and WASM code paths.</li>
  <li><strong>Battery-aware:</strong> Because browsers control the scheduling and backend choice,
they can pick energy-efficient accelerators over brute-force GPU usage on
laptops or mobile.</li>
</ul>

<h2 id="the-current-state-and-why-it-feels-real-now">The current state (and why it feels real now)</h2>

<p>Chromium-based browsers ship WebNN behind a flag, and ONNX Runtime Web can
use the WebNN execution provider when present. According to the public
implementation status (<a href="https://webmachinelearning.github.io/webnn-status/">webmachinelearning.github.io/webnn-status</a>), nearly all of the
95 ops in the spec are now covered across Core ML, Windows ML/DirectML, the
WebNN execution provider for ONNX Runtime, and TFLite/XNNPACK (LiteRT), with
only a handful still in flight. That’s enough to make real apps: speech commands,
lightweight summarization, image segmentation, and style transfer.</p>

<p>The momentum is similar to what we saw with WebGPU two years ago: early adopters
can ship progressive enhancements now, and the API will solidify while hardware
vendors line up their drivers.</p>

<p>The big shift is that WebNN moves backend selection into the browser while
keeping a high-level graph API. It is closer to Core ML or DirectML than to raw
GPU programming.</p>

<h2 id="why-i-am-bullish">Why I am bullish</h2>

<p>The web wins by being portable and low friction. AI has been the missing
capability that pushed teams toward native wrappers or cloud-heavy designs.
WebNN gives us a standard, permissionless way to run meaningful AI locally in
the browser while respecting energy and privacy constraints. It unlocks the
boring path to mass adoption: no installs, instant upgrades, and enough
abstraction that developers can stay focused on UX rather than driver matrices.</p>

<p>Now is the time to experiment, measure, and ship progressive AI features. The
future of AI in browsers looks like WebNN.</p>]]></content><author><name>Tarek Ziade</name></author><category term="AI" /><category term="WebNN" /><category term="Browsers" /><summary type="html"><![CDATA[For years, running machine learning in the browser meant juggling GPU support, WASM fallbacks, and flags. WebNN changes that by giving the web a standard inference API between JavaScript and hardware. It is the missing piece that turns the browser into a first-class AI client runtime.]]></summary></entry></feed>