Running SmolLM-135M in rustnn with flexible inputs

2026-02-18

WebNN is emerging as a portable, browser-friendly inference API. But LLMs hit a hard wall: dynamic inputs.

Autoregressive transformers fundamentally mutate state at runtime. KV cache tensors evolve at every step, sequence lengths vary with prompts, and shape expressions flow through operators like Shape, Gather, Concat, Reshape, and Expand.

Today, this does not map cleanly to WebNN’s static-graph constraints.

At step 1, KV cache length is 1. At step 512, KV cache length is 512. That is not a static graph.
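
To make that concrete, here is a rough sketch of how a decoder's KV-cache tensor changes shape between steps; the layout and head counts are only illustrative, not taken from any specific model config.

import numpy as np

# Illustrative only: a typical KV-cache layout is
# (batch, num_heads, seq_len, head_dim); the numbers here are examples.
num_heads, head_dim = 8, 64

k_step_1 = np.zeros((1, num_heads, 1, head_dim), dtype=np.float32)
k_step_512 = np.zeros((1, num_heads, 512, head_dim), dtype=np.float32)

print(k_step_1.shape)    # (1, 8, 1, 64)
print(k_step_512.shape)  # (1, 8, 512, 64)
# A purely static graph would need a single fixed shape for this input.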

Why this matters

If this is not solved, WebNN stays limited to fixed-shape demos for many LLM use cases.

For real product workloads, people need variable prompt lengths, efficient token-by-token decode, and stable latency as context grows. Without that, local browser LLM UX degrades quickly and teams default back to heavier alternatives.

Dynamic inputs in practice

This problem is now well documented in the WebNN Working Group as Issue #883: Support flexible input sizes.

The issue captures exactly what we see in practice: prompt-dependent sequence lengths, KV caches that grow at every decode step, and shape expressions computed at runtime.

In other words: modern inference workloads.

The ONNX Runtime workaround (and why it hurts)

ONNX Runtime WebNN has had to work around this limitation by routing dynamic-shape parts away from WebNN and into WASM execution paths.

It works, and it can make demos pass, but performance suffers badly in real autoregressive generation: you bounce between fast backend code and slow fallback code in the hottest part of the loop.

In preliminary runs, keeping decode on one backend avoids the repeated fallback round-trips that dominate token latency.

So instead of accepting that, we decided to push WebNN support further.

I started this in Malta

I started prototyping this while on vacation in Malta.

What began as a small experiment quickly turned into deep changes across three repositories: converter internals, runtime shape validation, and KV-cache plumbing.

The work happened across webnn-graph (the ONNX-to-WebNN converter), rustnn (the runtime), and pywebnn (the Python bindings).

I also made sure to surface rustnn through Python (pywebnn) very early.

Most ML engineers live in Python land, and I wanted this work to be reachable by that ecosystem immediately: easier model validation, easier parity checks against transformers, and faster feedback from people who already ship models every day.

What changed in practice

The key was to support bounded dynamic dimensions end to end.

Why bounded and not fully dynamic? Because many backends still need strong compile-time guarantees for validation, allocation, and kernel planning. Fully unbounded shapes are hard to optimize and hard to validate safely.

Bounded dynamic dimensions are the practical compromise: keep symbolic runtime flexibility, but define a maximum range so memory and execution planning remain deterministic.

This allows the full autoregressive decode loop to stay inside the WebNN backend, without bouncing into slower fallback paths.

This is also better than the common alternatives: exporting a separate fixed-shape graph per sequence length, rewriting shapes during conversion, or falling back to WASM for the dynamic parts.

For example, you can bound a sequence dimension to 2048 tokens: large enough for real prompts, still finite for backend planning and allocation.

In rustnn, tensor dimensions can now be either static values or dynamic descriptors carrying a name and a maximum size, and actual input shapes are validated against them at runtime:

{
  "inputs": {
    "x": {
      "dataType": "float32",
      "shape": [
        { "name": "batch", "maxSize": 16 },
        128
      ]
    }
  }
}
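
To show what "checked at runtime" means here, a minimal sketch of the validation rule (not rustnn's actual code): a static dimension must match exactly, while a dynamic descriptor accepts any size up to its maxSize.

# Minimal sketch of the validation rule, not rustnn's implementation.
def check_shape(descriptor, runtime_shape):
    if len(descriptor) != len(runtime_shape):
        raise ValueError("rank mismatch")
    for dim, actual in zip(descriptor, runtime_shape):
        if isinstance(dim, int):
            # static dimension: must match exactly
            if actual != dim:
                raise ValueError(f"expected {dim}, got {actual}")
        else:
            # dynamic descriptor: any size from 1 up to maxSize is accepted
            if not 1 <= actual <= dim["maxSize"]:
                raise ValueError(
                    f"'{dim['name']}' must be in [1, {dim['maxSize']}], got {actual}"
                )

shape = [{"name": "batch", "maxSize": 16}, 128]
check_shape(shape, (4, 128))    # passes
check_shape(shape, (32, 128))   # raises: batch exceeds its bound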

In webnn-graph, ONNX conversion can preserve unresolved input dynamics while still lowering shape-driving expressions needed by WebNN.

That lets us keep flexibility where possible while still emitting valid WebNN graphs.
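
As a rough illustration of what "lowering a shape-driving expression" means (this is not webnn-graph's internal representation), consider a Reshape whose target shape is produced by a Shape, Gather, Concat chain: the constant parts fold to numbers, while the genuinely dynamic part survives as a named, bounded dimension. The numbers below are only for illustration.

from dataclasses import dataclass

# Not webnn-graph's internals; just the idea, with example numbers.
@dataclass(frozen=True)
class SymDim:
    name: str
    max_size: int

# ONNX: s = Shape(x); seq = Gather(s, ...); target = Concat(1, seq, 9, 64)
x_shape = [1, SymDim("sequence", 2048), 576]   # input with one bounded dynamic dim
seq = x_shape[1]                               # Gather picks out the sequence dim
target = [1, seq, 9, 64]                       # Concat folds constants, keeps the symbol

print(target)  # [1, SymDim(name='sequence', max_size=2048), 9, 64]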

SmolLM-135M converted and running

With flexible inputs supported end to end, SmolLM-135M converts cleanly: no shape rewriting hacks, no per-length exports, no WASM fallback in the decode loop. The artifacts are published on the Hugging Face Hub as tarekziade/SmolLM-135M-webnn.

Then I built a Python demo in pywebnn/examples/smollm_from_hub.py that downloads the WebNN artifacts from the Hub, loads the graph, runs the full token-by-token decode loop with the KV cache fed back at every step, and optionally checks the output against a transformers baseline.

A few extracts from that demo:

The demo defaults to the Hub-hosted WebNN artifacts:

DEFAULT_MODEL_ID = "tarekziade/SmolLM-135M-webnn"

model_files = resolve_model_files(args.model_id, force=args.force_download)
graph = webnn.MLGraph.load(
    model_files["graph"],
    manifest_path=model_files["manifest"],
    weights_path=model_files["weights"],
)

past_key_values holds the growing KV-cache tensors returned by the previous step. The decode loop feeds them back on every token:

def run_step(token_id: int, position: int) -> np.ndarray:
    inputs = {
        "input_ids": np.array([[token_id]], dtype=np.int64),
        "position_ids": np.array([[position]], dtype=np.int64),
        # the mask covers every position seen so far, including this one
        "attention_mask": np.ones((1, position + 1), dtype=np.int64),
        # feed the cached keys/values from the previous step back in
        **past_key_values,
    }
    outputs = context.compute(graph, inputs)
    ...
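
For orientation, the surrounding loop looks roughly like this. It is an illustrative sketch, not a verbatim extract: it assumes run_step returns the logits for the latest position while refreshing past_key_values from the step's outputs, and that prompt_ids, max_new_tokens, eos_token_id, and tokenizer are set up earlier in the script.

# Illustrative sketch of the surrounding greedy decode loop, not demo code.
generated = list(prompt_ids)

# Prime the KV cache on the prompt, one token at a time.
for pos, token in enumerate(prompt_ids[:-1]):
    run_step(token, pos)

# Greedy decode: feed the newest token back in, pick the argmax, repeat.
for _ in range(max_new_tokens):
    logits = run_step(generated[-1], position=len(generated) - 1)
    next_token = int(np.argmax(logits))
    if next_token == eos_token_id:
        break
    generated.append(next_token)

generated_text = tokenizer.decode(generated[len(prompt_ids):])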

The demo can also run a transformers baseline and fail fast on divergence:

if args.compare_transformers:
    hf_generated, hf_text, hf_prompt_ids = run_transformers_baseline(...)
    ...
    if generated_text != hf_text:
        print("[ERROR] WebNN and transformers generated different text output")
        sys.exit(1)
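
For reference, a baseline like this can be built directly on transformers. Here is a simplified sketch, not the demo's run_transformers_baseline (which returns more than the text and may differ); the model id HuggingFaceTB/SmolLM-135M and greedy decoding are assumptions chosen to stay comparable to the WebNN path.

# Simplified sketch of a transformers baseline; greedy decoding keeps the
# output directly comparable to the WebNN decode loop.
from transformers import AutoModelForCausalLM, AutoTokenizer

def transformers_baseline(prompt: str,
                          model_id: str = "HuggingFaceTB/SmolLM-135M",
                          max_new_tokens: int = 32) -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)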

Correctness checks against transformers are critical. Performance improvements mean nothing if generation diverges.

Lessons learned

What is next

Flexible inputs will likely be important if WebNN is to support real LLM workloads.

Static graphs alone are not enough for modern inference. Bounded flexibility is the pragmatic bridge.

And while this work pushes WebNN forward, we are also giving a lot of love to the TensorRT backend these days, because high-performance local inference matters just as much as API design.