GGML Integration Guide¶
Date: December 8, 2024
Purpose: Guide for adding GGML converter and executor to rustnn
[TARGET] Overview¶
This document outlines the integration of GGML (GPT-Generated Model Language) as a third execution backend for rustnn, alongside ONNX Runtime and CoreML.
Why GGML?
- CPU-optimized inference: Excellent performance on CPUs without GPU
- Quantization support: 4-bit, 8-bit quantized models for reduced memory usage
- LLM-focused: Widely used for large language model inference (llama.cpp, whisper.cpp)
- Cross-platform: Linux, macOS, Windows with various backends (CPU, CUDA, Metal, Vulkan)
- Lightweight: Minimal dependencies, no runtime memory allocation
GGML Background¶
What is GGML?¶
GGML is a C tensor library for machine learning created by Georgi Gerganov. It's designed for:
- Fast and portable tensor operations
- Efficient LLM inference on consumer hardware
- Quantization to reduce model size (JPEG-like compression for tensors)
- Multiple hardware backends (CPU, CUDA, Metal, Vulkan)
Key Resources:
- GGML GitHub
- Introduction to GGML
- GGML Glossary
GGML Architecture¶
Core Concepts:
1. ggml_context: Container holding tensors, graphs, and optionally data
2. ggml_cgraph: Computational graph representing order of operations
3. ggml_backend: Interface for executing graphs (CPU, CUDA, Metal, etc.)
4. ggml_tensor: Tensor data structure with shape and type
5. Deferred execution: Operations build a graph; computation happens at graph_compute()
Workflow:
// 1. Create context
let ctx = ggml_init(...);
// 2. Define tensors and operations
let a = ggml_new_tensor_2d(ctx, type, rows, cols);
let b = ggml_new_tensor_2d(ctx, type, rows, cols);
let result = ggml_add(ctx, a, b);
// 3. Build computation graph
let gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, result);
// 4. Execute on a backend (e.g. one created with ggml_backend_cpu_init())
ggml_backend_graph_compute(backend, gf);
Rust Bindings¶
Available Crates:
- ggml (v0.1.1): Semi-idiomatic Rust bindings (minimal maintenance)
- rusty-ggml (v0.0.8): Idiomatic Rust approach (pre-alpha)
- ggml-sys: Raw C bindings
Recommendation: Use ggml crate (most stable, used by llm library)
Integration Architecture¶
Following rustnn Patterns¶
rustnn uses a converter + executor pattern (the interfaces assumed by this guide are sketched after the backend list below):
Existing Backends:
1. ONNX Runtime: Cross-platform, protobuf → ONNX Runtime execution
2. CoreML: macOS-only, protobuf → CoreML execution
New GGML Backend:
3. GGML: Cross-platform, computation graph → GGML execution
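A simplified sketch of those converter abstractions, inferred from the Phase 1 example later in this document (the real definitions live in src/converters/mod.rs and may differ in detail):
// Inferred shape of the existing converter abstractions; verify against
// src/converters/mod.rs before relying on it.
use crate::error::GraphError;
use crate::graph::GraphInfo;

pub struct ConvertedGraph {
    pub format: &'static str,       // e.g. "ggml"
    pub content_type: &'static str, // MIME type of the serialized payload
    pub data: Vec<u8>,              // serialized graph bytes (may be empty for GGML)
}

pub trait GraphConverter {
    /// Short identifier for the target format.
    fn format(&self) -> &'static str;
    /// Translate a WebNN GraphInfo into the backend's representation.
    fn convert(&self, graph: &GraphInfo) -> Result<ConvertedGraph, GraphError>;
}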
File Structure¶
src/
converters/
mod.rs # Add GgmlConverter registration
onnx.rs
coreml_mlprogram.rs
ggml.rs # NEW: GGML converter
executors/
mod.rs # Add #[cfg(feature = "ggml-runtime")]
onnx.rs
coreml.rs
ggml.rs # NEW: GGML executor
python/
context.rs # Add Backend::Ggml variant
Implementation Plan¶
Phase 1: Converter (GraphInfo → GGML)¶
File: src/converters/ggml.rs
Implementation:
use crate::converters::{ConvertedGraph, GraphConverter};
use crate::error::GraphError;
use crate::graph::{DataType, GraphInfo, Operation, OperandKind};
#[derive(Default)]
pub struct GgmlConverter;
impl GraphConverter for GgmlConverter {
fn format(&self) -> &'static str {
"ggml"
}
fn convert(&self, graph: &GraphInfo) -> Result<ConvertedGraph, GraphError> {
// Convert WebNN graph to GGML computation graph
// Return serialized graph (or placeholder for in-memory graph)
// Strategy: Since GGML graphs are typically built in-memory,
// we may serialize graph structure as JSON/GGUF format or
// keep graph construction in executor
Ok(ConvertedGraph {
format: "ggml",
content_type: "application/octet-stream",
data: vec![], // Placeholder or GGUF format
})
}
}
Key Challenges:
1. In-memory vs serialized: GGML graphs are typically built in-memory, not serialized
2. Format choice:
   - Option A: Serialize as JSON (graph structure only)
   - Option B: Use GGUF format (weights + structure)
   - Option C: Pass GraphInfo directly to executor (no conversion)
3. Quantization: GGML supports quantized tensors; how to handle quantization metadata?
Recommended Approach: Pass GraphInfo directly to executor (Option C) since GGML graphs must be built in-memory with a context. Converter returns a lightweight "marker" indicating GGML format.
Phase 2: Executor (GGML Execution)¶
File: src/executors/ggml.rs
Feature Gate: #[cfg(feature = "ggml-runtime")]
Implementation:
#![cfg(feature = "ggml-runtime")]
use crate::error::GraphError;
use crate::graph::{GraphInfo, OperandDescriptor};
use std::collections::HashMap;
pub struct GgmlOutput {
pub name: String,
pub shape: Vec<usize>,
pub data: Vec<f32>,
}
pub fn run_ggml_with_inputs(
graph: &GraphInfo,
inputs: HashMap<String, GgmlInput>,
) -> Result<Vec<GgmlOutput>, GraphError> {
// 1. Create GGML context
let ctx = ggml::Context::init(...);
// 2. Build GGML computation graph from GraphInfo
let tensors = build_tensors(&ctx, graph, inputs)?;
let cgraph = build_computation_graph(&ctx, graph, &tensors)?;
// 3. Execute graph
let backend = ggml::Backend::cpu_default();
backend.graph_compute(&cgraph)?;
// 4. Extract results
extract_outputs(graph, &tensors)
}
pub struct GgmlInput {
pub name: String,
pub shape: Vec<usize>,
pub data: Vec<f32>,
}
Key Challenges:
1. Tensor creation: Map WebNN operands to GGML tensors
2. Operation mapping: Translate WebNN ops to GGML ops (see the sketch below)
3. Data flow: Connect operations in correct order
4. Memory management: GGML context owns all tensors
5. Backend selection: CPU, CUDA, Metal based on device hints
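A minimal sketch of the operation-mapping step. The method names (op_add, op_mul, op_mul_mat, op_relu) follow the style of the rustformers ggml safe wrapper, and the GraphError::UnsupportedOperation variant is illustrative; both need to be checked against the actual crate and error enum:
fn map_operation(
    ctx: &ggml::Context,
    op_type: &str,
    inputs: &[ggml::Tensor],
) -> Result<ggml::Tensor, GraphError> {
    match op_type {
        // Direct one-to-one mappings
        "add" => Ok(ctx.op_add(&inputs[0], &inputs[1])),
        "mul" => Ok(ctx.op_mul(&inputs[0], &inputs[1])),
        "matmul" => Ok(ctx.op_mul_mat(&inputs[0], &inputs[1])),
        "relu" => Ok(ctx.op_relu(&inputs[0])),
        // Everything else is either composed from primitives or rejected
        other => Err(GraphError::UnsupportedOperation {
            operation: other.to_string(),
        }),
    }
}
Note that ggml_mul_mat's operand layout differs from a textbook matmul (the first operand is treated as a transposed weight matrix), so the WebNN matmul mapping may also need an explicit transpose.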
Phase 3: Feature Flag & Dependencies¶
File: Cargo.toml
Changes:
[features]
default = []
coreml-runtime = ["objc"]
onnx-runtime = ["onnxruntime"]
ggml-runtime = ["ggml"] # NEW
python = ["pyo3"]
[dependencies]
# ... existing dependencies ...
ggml = { version = "0.1", optional = true } # NEW
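With the flag in place, the backend can be compiled locally with cargo build --features ggml-runtime, or together with the python feature for maturin builds (see the Makefile targets later in this document).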
Phase 4: Registration¶
File: src/converters/mod.rs
Changes:
mod coreml_mlprogram;
mod onnx;
mod ggml; // NEW
pub use coreml_mlprogram::CoremlMlProgramConverter;
pub use onnx::OnnxConverter;
pub use ggml::GgmlConverter; // NEW
impl ConverterRegistry {
pub fn with_defaults() -> Self {
let mut registry = Self {
converters: HashMap::new(),
};
registry.register(Box::new(OnnxConverter::default()));
registry.register(Box::new(CoremlMlProgramConverter::default()));
registry.register(Box::new(GgmlConverter::default())); // NEW
registry
}
}
File: src/executors/mod.rs
Changes:
#[cfg(all(target_os = "macos", feature = "coreml-runtime"))]
pub mod coreml;
#[cfg(feature = "onnx-runtime")]
pub mod onnx;
#[cfg(feature = "ggml-runtime")] // NEW
pub mod ggml;
Phase 5: Python API Integration¶
File: src/python/context.rs
Changes:
#[derive(Debug, Clone)]
enum Backend {
OnnxCpu,
OnnxGpu,
CoreML,
Ggml, // NEW
None,
}
impl PyMLContext {
fn select_backend(accelerated: bool, power: &str) -> (Backend, bool) {
// Add GGML selection logic
// GGML is CPU-optimized, so prefer when accelerated=false
if !accelerated {
#[cfg(feature = "ggml-runtime")]
return (Backend::Ggml, false);
}
// Existing logic for ONNX/CoreML...
}
fn compute_ggml(
&self,
graph: &PyMLGraph,
inputs: HashMap<String, Py<PyArray<f32, Dim<IxDyn>>>>,
) -> Result<HashMap<String, Py<PyArray<f32, Dim<IxDyn>>>>, GraphError> {
#[cfg(feature = "ggml-runtime")]
{
use crate::executors::ggml::{run_ggml_with_inputs, GgmlInput};
// Convert inputs to GgmlInput
let ggml_inputs = convert_numpy_to_ggml(inputs)?;
// Execute
let outputs = run_ggml_with_inputs(&graph.graph, ggml_inputs)?;
// Convert outputs back to NumPy
convert_ggml_to_numpy(outputs)
}
#[cfg(not(feature = "ggml-runtime"))]
Err(GraphError::BackendUnavailable {
backend: "GGML".to_string(),
})
}
}
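The convert_numpy_to_ggml / convert_ggml_to_numpy helpers referenced above do not exist yet. A rough sketch of the input direction, assuming the pyo3 GIL-ref API and the rust-numpy crate (a Python token is threaded explicitly here and error handling is elided; adjust to match the conventions already used in context.rs):
use numpy::PyArrayDyn;
use pyo3::prelude::*;
use std::collections::HashMap;
use crate::executors::ggml::GgmlInput;

fn convert_numpy_to_ggml(
    py: Python<'_>,
    inputs: HashMap<String, Py<PyArrayDyn<f32>>>,
) -> HashMap<String, GgmlInput> {
    inputs
        .into_iter()
        .map(|(name, arr)| {
            // Copy the (possibly non-contiguous) NumPy buffer into a flat Vec<f32>.
            let readonly = arr.as_ref(py).readonly();
            let view = readonly.as_array();
            let input = GgmlInput {
                name: name.clone(),
                shape: view.shape().to_vec(),
                data: view.iter().copied().collect(),
            };
            (name, input)
        })
        .collect()
}
The reverse direction would mirror this: rebuild an ndarray ArrayD from each GgmlOutput's shape and data (ArrayD::from_shape_vec) and hand it to NumPy via PyArray::from_owned_array.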
[STATS] WebNN to GGML Operation Mapping¶
Supported Operations¶
| WebNN Operation | GGML Equivalent | Notes |
|---|---|---|
| add | ggml_add | Element-wise addition |
| mul | ggml_mul | Element-wise multiplication |
| matmul | ggml_mul_mat | Matrix multiplication |
| relu | ggml_relu | ReLU activation |
| gelu | ggml_gelu | GELU activation |
| sigmoid | Custom | Needs implementation via composition |
| tanh | Custom | Needs implementation via composition |
| softmax | Custom | Needs implementation via ggml_soft_max |
| transpose | ggml_transpose | Tensor transpose |
| reshape | ggml_reshape_* | Shape transformation |
| conv2d | ggml_conv_1d_* / ggml_conv_2d | Convolution (limited) |
| abs | ggml_abs | Absolute value |
| neg | ggml_neg | Negation |
| div | Custom | Via ggml_div |
Operations Requiring Composition¶
Some WebNN operations require composing multiple GGML operations:
Sigmoid:
// sigmoid(x) = 1 / (1 + exp(-x)), composed from GGML primitives
let neg_x = ggml_neg(ctx, x);
let exp_neg_x = ggml_exp(ctx, neg_x);
let one = ggml_new_f32(ctx, 1.0);
let one_plus_exp = ggml_add1(ctx, exp_neg_x, one);                               // 1 + exp(-x)
let sigmoid = ggml_div(ctx, ggml_repeat(ctx, one, one_plus_exp), one_plus_exp);  // 1 / (1 + exp(-x))
Tanh:
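One possible composition, in the same pseudocode style as the sigmoid example above (it assumes ggml_scale accepts a plain float scale factor, which should be verified against the GGML version in use):
// tanh(x) = (exp(2x) - 1) / (exp(2x) + 1)
let two_x = ggml_scale(ctx, x, 2.0);
let exp_2x = ggml_exp(ctx, two_x);
let numerator = ggml_add1(ctx, exp_2x, ggml_new_f32(ctx, -1.0));   // exp(2x) - 1
let denominator = ggml_add1(ctx, exp_2x, ggml_new_f32(ctx, 1.0));  // exp(2x) + 1
let tanh = ggml_div(ctx, numerator, denominator);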
Operations Not Supported¶
GGML has limited support for some operations:
- Pooling: No direct max_pool2d/avg_pool2d (may need custom implementation)
- Normalization: Limited batch_norm/layer_norm support
- Advanced activations: prelu, elu, leakyRelu require composition
- Quantization ops: Limited to GGML's internal quantization
Challenges & Solutions¶
Challenge 1: In-Memory Graph Construction¶
Problem: GGML graphs must be built in-memory with a context. Cannot serialize to bytes like ONNX/CoreML.
Solution:
- Converter returns a lightweight marker (an empty Vec) instead of serialized bytes
- Executor builds the GGML graph in-memory from GraphInfo and runs it via run_ggml_with_inputs()
Challenge 2: Limited Operation Coverage¶
Problem: GGML has fewer operations than ONNX (focused on LLMs, not general CV/NLP).
Solution:
- Implement missing operations via composition (e.g., sigmoid from exp/div)
- Return clear error for unsupported operations
- Document operation coverage in implementation-status.md
- Focus on LLM-relevant operations initially
Challenge 3: Quantization Integration¶
Problem: GGML's strength is quantization, but WebNN spec doesn't specify quantization formats.
Solution:
- Initially support float32 only (matching ONNX/CoreML)
- Future: Add GGML-specific quantization hints via device selection
- Could use power_preference="low-power" as hint to enable quantization
Challenge 4: Backend Selection¶
Problem: GGML supports multiple backends (CPU, CUDA, Metal). How to select?
Solution:
- Follow WebNN Device Selection Explainer pattern
- accelerated=false → GGML CPU (best use case)
- accelerated=true → GGML CUDA/Metal if available (see the selection sketch below)
- Query available backends at runtime: ggml::Backend::available()
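A rough sketch of how the hint could be translated into a GGML backend choice. GgmlBackendKind and the ggml-cuda / ggml-metal feature names are placeholders; actual accelerator support depends on how the GGML bindings are built:
// Placeholder enum and feature names illustrating the selection policy.
enum GgmlBackendKind {
    Cpu,
    Cuda,
    Metal,
}

fn pick_ggml_backend(accelerated: bool) -> GgmlBackendKind {
    if accelerated {
        // Prefer an accelerator backend when one was compiled in.
        #[cfg(feature = "ggml-cuda")]
        return GgmlBackendKind::Cuda;
        #[cfg(all(target_os = "macos", feature = "ggml-metal"))]
        return GgmlBackendKind::Metal;
    }
    // accelerated=false (or no accelerator compiled in): GGML's primary use case.
    GgmlBackendKind::Cpu
}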
Challenge 5: Rust Bindings Maturity¶
Problem: GGML Rust bindings are in minimal maintenance mode (v0.1.1).
Solution:
- Use stable ggml crate (0.1.1) with limited but working API
- Consider rusty-ggml if need more idiomatic Rust (pre-alpha risk)
- Contribute improvements back to ggml-rs if needed
- Fallback: Use ggml-sys (raw bindings) if safe wrapper insufficient
[TARGET] Implementation Roadmap¶
Phase 1: Proof of Concept (1-2 days)¶
- [ ] Add ggml dependency with feature flag
- [ ] Create minimal executor for basic operations (add, mul, matmul)
- [ ] Test with simple WebNN graph (2 inputs + add + output)
- [ ] Validate tensor I/O works correctly
Phase 2: Core Operations (3-5 days)¶
- [ ] Implement operation mapping for 20 core operations
- [ ] Add tensor shape inference for GGML tensors
- [ ] Implement computation graph building from GraphInfo
- [ ] Add comprehensive unit tests
Phase 3: Python Integration (2-3 days)¶
- [ ] Add Backend::Ggml to context selection
- [ ] Implement compute_ggml() method
- [ ] Add device selection logic (GGML for CPU-only)
- [ ] Test with Python API examples
Phase 4: Documentation & Examples (1-2 days)¶
- [ ] Update docs/implementation-status.md with GGML coverage
- [ ] Update docs/architecture.md with GGML backend
- [ ] Create example: examples/ggml_inference.py
- [ ] Update README.md with GGML backend section
Phase 5: Advanced Features (Future)¶
- [ ] Quantization support (Q4, Q8)
- [ ] Multiple backend selection (CPU/CUDA/Metal)
- [ ] Performance benchmarks vs ONNX
- [ ] LLM-specific optimizations
Total Estimated Time: 7-12 days for phases 1-4
Testing Strategy¶
Unit Tests (Rust)¶
File: src/converters/ggml.rs
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn converts_simple_graph() {
        // create_add_graph() is a shared test helper building a two-input add graph
        let graph = create_add_graph();
        let converter = GgmlConverter::default();
        let result = converter.convert(&graph);
        assert!(result.is_ok());
    }
}
File: src/executors/ggml.rs
#[cfg(all(test, feature = "ggml-runtime"))]
mod tests {
    use super::*;

    #[test]
    fn executes_add_operation() {
        // create_add_graph() / create_test_inputs() are shared test helpers
        let graph = create_add_graph();
        let inputs = create_test_inputs();
        let outputs = run_ggml_with_inputs(&graph, inputs).unwrap();
        assert_eq!(outputs.len(), 1);
        // Verify output values
    }
}
Python Tests¶
File: tests/test_ggml_backend.py
import pytest
import webnn
import numpy as np
@pytest.mark.skipif(not has_ggml_runtime(), reason="GGML runtime not available")
def test_ggml_add():
ml = webnn.ML()
context = ml.create_context(accelerated=False) # Should select GGML
builder = context.create_graph_builder()
x = builder.input("x", [2, 3], "float32")
y = builder.input("y", [2, 3], "float32")
z = builder.add(x, y)
graph = builder.build({"output": z})
inputs = {
"x": np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32),
"y": np.array([[1, 1, 1], [2, 2, 2]], dtype=np.float32),
}
outputs = context.compute(graph, inputs)
expected = np.array([[2, 3, 4], [6, 7, 8]], dtype=np.float32)
np.testing.assert_allclose(outputs["output"], expected)
Makefile Targets¶
# Add to Makefile
.PHONY: ggml-dev
ggml-dev:
maturin develop --features python,ggml-runtime
.PHONY: test-ggml
test-ggml:
cargo test --features ggml-runtime
pytest tests/test_ggml_backend.py -v
References¶
GGML Resources¶
- GGML GitHub Repository
- Introduction to GGML (HuggingFace)
- GGML Glossary
- Experimenting with GGML Tutorial
Rust Bindings¶
WebNN Spec¶
Existing Implementations¶
Summary¶
GGML Integration Value:
- [OK] CPU-optimized inference for environments without GPU
- [OK] Quantization support for memory-constrained devices
- [OK] Cross-platform (Linux, macOS, Windows)
- [OK] LLM-focused operations and optimizations
- [OK] Lightweight with minimal dependencies
Key Design Decisions:
1. Pass GraphInfo directly to executor (no serialization)
2. Focus on LLM-relevant operations initially
3. Use accelerated=false hint to select GGML backend
4. Start with float32, add quantization later
5. Use stable ggml crate (v0.1.1)
Next Steps:
1. Create proof of concept with basic operations
2. Validate tensor I/O and graph execution
3. Expand operation coverage incrementally
4. Integrate with Python API
5. Document and test thoroughly
Status: Planning document (not yet implemented)