Inside the engine

How the engine works

Making a model fast on your own hardware is four optimizations, not one. Each sits on a different surface, pulls a different lever, and has a different lifetime.


“Optimized” usually means one thing got faster. The work splits more cleanly than that. It lands on four surfaces, and they stay apart so each one gets optimized once, at the moment that work is cheapest, and the results compose. Blur the surfaces together and you re-pay for the whole stack on every token.

The four surfaces

Each surface answers a different question, and each question is settled at a different time. The lifetimes shorten down the table.

Surface | The question it answers | When it is decided
The architecture | What shape is this model? | Ahead of time, per family
The model | Which exact weights, on which machine? | At load time
The prompt | What is in the context right now? | Per request
The tool call | What must the output obey? | Per step in an agent loop

In practice: a kernel compiled once per architecture, a memory fit computed once per load, a prefill run once per prompt, and a grammar constraint applied at every decode step inside a tool call.
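To make the difference in lifetimes concrete, here is a minimal sketch that counts how often each surface's work runs over a made-up workload. The function name, the workload numbers, and the idea of modelling each surface as a counter are illustrative assumptions, not the engine's actual API.

```python
def count_work(families, loads_per_family, prompts_per_load, steps_per_prompt):
    """Walk a hypothetical workload, doing each surface's work at its own scope."""
    counts = {"architecture": 0, "model": 0, "prompt": 0, "tool_call": 0}
    for _ in range(families):
        counts["architecture"] += 1              # kernel: once per family
        for _ in range(loads_per_family):
            counts["model"] += 1                 # memory fit: once per load
            for _ in range(prompts_per_load):
                counts["prompt"] += 1            # prefill: once per prompt
                for _ in range(steps_per_prompt):
                    counts["tool_call"] += 1     # constraint: every decode step
    return counts


# Assumed workload: 1 family, 2 model loads, 10 prompts per load, 50 steps per prompt.
counts = count_work(families=1, loads_per_family=2, prompts_per_load=10, steps_per_prompt=50)
print(counts)
# {'architecture': 1, 'model': 2, 'prompt': 20, 'tool_call': 1000}
```

The counts differ by orders of magnitude, which is the point: work placed in the innermost loop runs a thousand times here, and work placed in the outermost loop runs once.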

Why the split is the design

Each surface has its own lever, and each lever gets pulled at a different rate. That mismatch is the reason to keep them apart.

Architecture is structural and shared.
A kernel tuned for a model’s shape is written once and serves every model in that family afterward. It is the last thing you would want to redo per prompt, so the engine settles it ahead of time and per family, never per request.
The model is specific but stable for a session.
Once a set of weights is resident, its quantization, its memory fit, and its cache layout hold until the model switches. The engine resolves all three once, at load.
The prompt changes every request, but most of it repeats.
Only the new tokens need fresh computation. The lever is prefilling them once and reusing the already-computed context across turns instead of re-processing it from scratch.
The tool call is the tightest loop there is.
It acts at the level of a single decoded token, narrowing what the model is allowed to emit so the output parses by construction rather than by hope.
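The tool-call lever can be sketched as logit masking: at each decode step, every token the grammar does not allow is ruled out before a token is picked. The toy vocabulary, the hand-written state machine, and the function names below are all assumptions for illustration. A real engine would derive the allowed set from a grammar or schema and would not be limited to greedy picking.

```python
import json
import math

# Toy vocabulary (assumed); index 5 is a token the grammar never allows.
VOCAB = ["{", "}", '"name"', ":", '"get_weather"', "hello"]

# Hypothetical grammar as a state machine: state -> {allowed token id: next state}.
GRAMMAR = {
    0: {0: 1},  # must open with {
    1: {2: 2},  # then the key "name"
    2: {3: 3},  # then :
    3: {4: 4},  # then the tool name
    4: {1: 5},  # then }
}
DONE = 5


def mask_logits(logits, allowed_ids):
    """Set every disallowed token's logit to -inf so it can never be picked."""
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]


def constrained_decode(step_logits):
    """Greedy decode where the grammar filters the choices at every step."""
    state, out = 0, []
    for logits in step_logits:
        if state == DONE:
            break
        allowed = GRAMMAR[state]
        masked = mask_logits(logits, allowed)
        token_id = max(range(len(masked)), key=lambda i: masked[i])
        out.append(VOCAB[token_id])
        state = allowed[token_id]
    return "".join(out)


# The unconstrained model "prefers" the illegal token "hello" at every step.
step_logits = [[0.1, 0.2, 0.3, 0.1, 0.2, 9.0]] * 5
result = constrained_decode(step_logits)
print(result)              # {"name":"get_weather"}
print(json.loads(result))  # {'name': 'get_weather'}
```

Even though the raw logits favor an illegal token at every step, the output still parses, because illegal tokens were never candidates.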

Hold the surfaces apart and every optimization stays honest about its cost and composes with the rest. Collapse them and you get the familiar failure: a clever trick that helps one model on one prompt and quietly taxes you everywhere else.

How they compose on a single token

The four are not a menu. A single token in an agent loop crosses all of them in order. The architecture kernel decides how the weights stream and the math is fused. The loaded model decides which quantized bytes move and how much KV cache there is to read. The prompt layer decides how much of the context is already computed and reused versus freshly prefilled. The tool call decides which tokens are even legal at this step. Each layer assumes the one beneath it is already settled, and settling them at different times is what makes that assumption safe.
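The prompt layer's split between reused and freshly prefilled context can be sketched as a longest-common-prefix comparison between what is already cached and what the new request contains. This is a simplified illustration. The function name and the token lists are assumptions, and a real engine would compare token ids against its KV cache rather than strings.

```python
def split_prefill(cached_tokens, request_tokens):
    """Return (reused, fresh): the count of leading tokens that match the
    cache, and the remaining tokens that still need to be prefilled."""
    reused = 0
    for cached, requested in zip(cached_tokens, request_tokens):
        if cached != requested:
            break
        reused += 1
    return reused, request_tokens[reused:]


# Turn 2 of a conversation repeats all of turn 1 and appends new tokens.
turn_1 = ["<sys>", "You", "are", "helpful", "<user>", "Hi"]
turn_2 = turn_1 + ["<asst>", "Hello", "<user>", "Weather?"]

reused, fresh = split_prefill(turn_1, turn_2)
print(reused)  # 6
print(fresh)   # ['<asst>', 'Hello', '<user>', 'Weather?']
```

In this example only four of the ten tokens in the second turn need prefill work. The other six are served from state computed during the first turn.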

That is a kernel chosen per family on Metal or CUDA, a fit computed when weights are pulled from the catalog, a prefill amortized across a conversation, and a constraint enforced one decode step at a time. None of it is configuration. The defaults run the whole chain.