Inside the engine
Optimizing around the architecture
A dense transformer, a sparse mixture of experts, and a sub-quadratic hybrid do different work on every token. Each one runs on a kernel built for that work, not a generic one that fits all and wins at none.
Architecture is the model’s structure, fixed before any weight is loaded. It sets the dominant operation per token, and that operation decides where the time goes. Conifer ships kernels specialized to each shape on Metal and CUDA. The gap between a generic kernel and a specialized one is most of your tokens per second.
Dense: the bandwidth wall
The decode step of a dense model is a matrix-vector product against every weight, a GEMV. It does almost no arithmetic per byte it reads, which makes it memory-bandwidth-bound. The kernel streams the quantized weights through the GPU as close to peak bandwidth as it can, dequantizing inline and folding the trivial math in so the arithmetic never holds up the memory stream. No trick beats the bandwidth ceiling here. There are only kernels that sit closer to it.
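The ceiling itself is simple arithmetic: if each decoded token reads every weight once, tokens per second cannot exceed memory bandwidth divided by the bytes of weights. The sketch below is a rough estimate only. The function names are hypothetical, and the parameter count, bit width, and bandwidth are assumed example values, not measurements of Conifer or of any particular GPU. It also ignores quantization scales, activations, and KV-cache reads, all of which push the real ceiling lower.

```python
def decode_ceiling_tokens_per_s(n_params: float, bits_per_weight: float,
                                bandwidth_gb_s: float) -> float:
    """Upper bound on dense decode speed when every weight is read once per token."""
    weight_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


def gemv_flops_per_byte(bits_per_weight: float) -> float:
    """Arithmetic intensity of a GEMV: one multiply and one add per weight read."""
    return 2 / (bits_per_weight / 8)


# Assumed example: 8e9 parameters, 4-bit weights, 400 GB/s of memory bandwidth.
ceiling = decode_ceiling_tokens_per_s(8e9, 4, 400)
intensity = gemv_flops_per_byte(4)
print(f"ceiling: {ceiling:.0f} tokens/s, intensity: {intensity:.0f} FLOPs/byte")
# prints "ceiling: 100 tokens/s, intensity: 4 FLOPs/byte"
```

A few FLOPs per byte is far below what a GPU needs to keep its arithmetic units busy, which is why the read rate, not the math, sets the pace.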
Sparse: routing without waste
A mixture-of-experts model adds a router that picks a few experts per token. The kernel runs only the selected work: it gathers the active experts’ weights, runs them, and never touches the rest. Per-token decode cost tracks the active parameters, not the total, so you get something close to a small model’s decode speed out of a large model’s knowledge. The one constraint it cannot remove is that every expert stays resident in memory, since the next token may route anywhere.
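As a rough illustration of routing, here is a minimal top-k sketch in plain Python. It is not Conifer's kernel: the function names are hypothetical, the experts are toy scalar functions standing in for feed-forward blocks, and normalizing the router weights with a softmax over only the selected experts is one common convention that is assumed here, not something the text specifies.

```python
import math


def top_k_route(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x: float, experts: list, router_logits: list[float], k: int):
    """Run only the routed experts; return the mixed output and the experts touched."""
    routes = top_k_route(router_logits, k)
    y = sum(weight * experts[i](x) for i, weight in routes)
    return y, [i for i, _ in routes]


# Toy experts: each one is a scalar function standing in for a feed-forward block.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
y, touched = moe_forward(1.0, experts, router_logits=[0.0, 5.0, 0.0, 5.0], k=2)
print(touched, y)  # prints "[1, 3] 3.0"
```

Only experts 1 and 3 are evaluated for this token, but the `experts` list still holds all four, which mirrors the residency constraint: the next token's logits may select any of them.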
Hybrid: sequential state
A hybrid or sub-quadratic model swaps most attention for gated convolutions or state-space scans. These carry a small running state forward instead of attending over a cache that grows with context. The kernel work changes in kind, a sequential scan rather than one big parallel matmul, and the memory curve stays nearly flat as the window gets long. A genuinely new state-space shape is why a model is sometimes load-gated: the architecture needs its own kernel before it can run at all.
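The contrast between a running state and a growing cache can be sketched in a few lines. This is a simplified scalar recurrence, not any specific state-space model, and the function names, layer counts, head sizes, and state size are assumed values chosen only to show the shape of the two memory curves.

```python
def linear_scan(xs: list[float], decay: float, gain: float) -> list[float]:
    """Sequential recurrence h_t = decay * h_{t-1} + gain * x_t, emitting h_t each step."""
    h = 0.0
    outputs = []
    for x in xs:
        h = decay * h + gain * x  # each step depends on the previous state
        outputs.append(h)
    return outputs


def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Attention cache: keys and values for every past token in every layer."""
    return 2 * context_len * n_layers * n_kv_heads * head_dim * bytes_per_value


def scan_state_bytes(n_layers, state_dim, bytes_per_value=2):
    """Recurrent state: one fixed-size vector per layer, independent of context."""
    return n_layers * state_dim * bytes_per_value


print(linear_scan([1.0, 1.0, 1.0], decay=0.5, gain=1.0))  # prints "[1.0, 1.5, 1.75]"
for context_len in (1_000, 100_000):
    print(context_len,
          kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8, head_dim=128),
          scan_state_bytes(n_layers=32, state_dim=4096))
```

The cache figure grows a hundredfold between the two context lengths while the state figure does not move. A real hybrid keeps some attention layers, so its curve is nearly flat instead of perfectly flat.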
Attention variants
Attention is not uniform across dense models. The decode kernel reads each variant on its own terms.
- Grouped-query attention (GQA)
- Several query heads share one key/value head, which shrinks the KV cache. The kernel reads the smaller grouped layout straight from memory instead of expanding it out first.
- Sliding-window attention
- Each token attends only to tokens within a bounded distance back. Gemma 2 and Gemma 3 interleave windowed and global layers, so the windowed layers’ cache stops growing once the window is full and only the global layers keep scaling with context.
- Soft-capping
- Gemma 2 bounds its attention logits with a soft cap, a tanh squash the kernel has to apply exactly. Drop it and quality degrades quietly rather than loudly. Honoring it is part of supporting the architecture, not an extra on top.
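The three variants above each reduce to a small piece of index or scalar math, sketched here in Python. The function names are hypothetical, and the head counts, window size, and cap value are assumed for illustration. Whether a window counts the current token is a convention that differs between implementations; this sketch counts it.

```python
import math


def kv_head_for(query_head: int, n_query_heads: int, n_kv_heads: int) -> int:
    """GQA: consecutive query heads share one key/value head."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def window_start(position: int, window: int) -> int:
    """Sliding window: earliest position a token may attend to (inclusive)."""
    return max(0, position - window + 1)


def soft_cap(logit: float, cap: float) -> float:
    """Soft-capping: squash a logit smoothly so it never exceeds cap in magnitude."""
    return cap * math.tanh(logit / cap)


# Assumed shapes: 8 query heads over 2 KV heads, a 4-token window, a cap of 50.
print([kv_head_for(h, 8, 2) for h in range(8)])  # prints "[0, 0, 0, 0, 1, 1, 1, 1]"
print(window_start(10, 4))                       # prints "7"
print(soft_cap(1.0, 50.0), soft_cap(1000.0, 50.0))
```

Small logits pass through `soft_cap` almost unchanged while large ones saturate near the cap. That is why skipping it goes unnoticed on most inputs and only shows up as a gradual loss of quality.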
Why specialize at all
One generic kernel would run every model and lead on none. It cannot exploit GQA’s smaller reads, sit out an MoE’s idle experts, or scan a state-space layer at speed. The throughput comes from the specialization. That is also why support is tracked per architecture and not per model: a new family that reuses a known shape runs the day it lands, and one with a genuinely new shape waits on its kernel.
# support is per shape: a Qwen3 MoE reuses the
# mixture-of-experts path, so it runs on arrival
conifer run --model qwen3-30b-a3b

The shape is the first surface. The next is the exact weights you loaded: their quantization, their fit to your memory, and the cache layout, in Optimizing around the model.