Inside the engine
How the engine works
Making a model fast on your own hardware is four optimizations, not one. Each sits on a different surface, pulls a different lever, and has a different lifetime.
“Optimized” usually means one thing got faster. The work splits more cleanly than that. It lands on four surfaces, and they stay apart so that each one is optimized once, at the moment that work is cheapest, and the results compose. Blur the surfaces together and you re-pay for the whole stack on every token.
The four surfaces
Each surface answers a different question, and each question is settled at a different time. The lifetimes shorten down the table.
| Surface | The question it answers | When it is decided |
|---|---|---|
| The architecture | What shape is this model? | Ahead of time, per family |
| The model | Which exact weights, on which machine? | At load time |
| The prompt | What is in the context right now? | Per request |
| The tool call | What must the output obey? | Per step in an agent loop |
In practice, that means a kernel compiled once per architecture, a memory fit computed once per load, a prefill run once per prompt, and a grammar constraint applied at every decode step inside a tool call.
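One way to picture the four lifetimes is as caches keyed at different scopes. The sketch below is illustrative only: the function names, cache keys, and placeholder return values are assumptions made for this example, not the engine's API. It counts how often each kind of work actually runs across two prompts of five decode steps each.

```python
from collections import Counter

work_done = Counter()  # how many times each kind of work actually ran

_kernels = {}   # keyed by architecture family
_fits = {}      # keyed by (model, machine)
_prefills = {}  # keyed by prompt text

def kernel_for(family):
    # Settled once per architecture family.
    if family not in _kernels:
        work_done["kernel compile"] += 1
        _kernels[family] = f"kernel<{family}>"
    return _kernels[family]

def fit_for(model, machine):
    # Settled once per load of a given model on a given machine.
    key = (model, machine)
    if key not in _fits:
        work_done["memory fit"] += 1
        _fits[key] = f"fit<{model}@{machine}>"
    return _fits[key]

def prefill_for(prompt):
    # Settled once per distinct prompt.
    if prompt not in _prefills:
        work_done["prefill"] += 1
        _prefills[prompt] = f"kv<{len(prompt)} chars>"
    return _prefills[prompt]

def decode_step(family, model, machine, prompt):
    kernel_for(family)
    fit_for(model, machine)
    prefill_for(prompt)
    work_done["grammar constraint"] += 1  # applied at every decode step

# Hypothetical names throughout: one family, one model, two prompts.
for prompt in ["list files", "read config"]:
    for _ in range(5):
        decode_step("family-a", "model-a", "laptop", prompt)

print(dict(work_done))
```

Ten decode steps trigger one kernel compile, one memory fit, two prefills, and ten constraint applications, which is the shortening of lifetimes down the table.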
Why the split is the design
Each surface has its own lever, and each lever gets pulled at a different rate. That mismatch is the reason to keep them apart.
- Architecture is structural and shared. A kernel tuned for a model’s shape is written once and serves every model in that family afterward. It is the last work you would want to redo per prompt, so the engine settles it ahead of time and per family, never per request.
- The model is specific but stable for a session. Once a set of weights is resident, its quantization, its memory fit, and its cache layout hold until the model switches. The engine resolves all three once, at load.
- The prompt changes every request, but most of it repeats. The new tokens are the only part that has to be paid for. The lever is computing them once and reusing the rest of the context across turns instead of re-reading it from scratch.
- The tool call is the tightest loop there is. It acts at the level of a single decoded token, narrowing what the model is allowed to emit so the output parses by construction rather than by hope.
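The load-time memory fit mentioned for the model surface can be sketched as simple arithmetic. This is a simplified assumption, not the engine's actual fit logic: it counts only quantized weights and a 16-bit KV cache, ignores activations and runtime overhead, and uses made-up model dimensions. `plan_fit` and `kv_bytes_per_token` are hypothetical helper names.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    # Keys and values (hence the factor 2) for every layer and KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value

def plan_fit(params, bits_per_weight, layers, kv_heads, head_dim, budget_bytes):
    """Return a fit plan, or None if the weights alone exceed the budget."""
    weight_bytes = params * bits_per_weight // 8
    left_for_cache = budget_bytes - weight_bytes
    if left_for_cache <= 0:
        return None
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim)
    return {
        "weight_bytes": weight_bytes,
        "kv_bytes_per_token": per_token,
        "max_context_tokens": left_for_cache // per_token,
    }

# Illustrative numbers: 8B parameters at 4 bits per weight, 8 GiB budget.
shape = dict(params=8_000_000_000, bits_per_weight=4,
             layers=32, kv_heads=8, head_dim=128)
plan = plan_fit(budget_bytes=8 * 2**30, **shape)
too_small = plan_fit(budget_bytes=2 * 2**30, **shape)

print(plan)       # weights, per-token cache cost, and the context that fits
print(too_small)  # None: the weights do not fit in 2 GiB
```

Because the inputs are fixed for as long as the weights stay resident, the result can be computed once at load and held until the model switches.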
Hold the surfaces apart and every optimization stays honest about its cost and composes with the rest. Collapse them and you get the familiar failure: a clever trick that helps one model on one prompt and quietly taxes you everywhere else.
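The prompt-level lever, reusing already computed context and prefilling only what is new, can be sketched as a longest-common-prefix check. This is a minimal illustration under stated assumptions: `PrefixCache` is a hypothetical class, tokens are plain strings, and the actual KV tensors are left out so only the bookkeeping is shown.

```python
def common_prefix_len(a, b):
    # Number of leading positions where both token lists agree.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixCache:
    def __init__(self):
        self.tokens = []  # tokens whose KV entries are assumed resident

    def prefill(self, tokens):
        """Return (reused, fresh) token counts for this request."""
        reused = common_prefix_len(self.tokens, tokens)
        fresh = len(tokens) - reused
        self.tokens = list(tokens)  # the cache now covers the full context
        return reused, fresh

cache = PrefixCache()
turn1 = ["<sys>", "You", "are", "helpful", "<user>", "hi"]
turn2 = turn1 + ["<assistant>", "hello", "<user>", "list", "files"]

first = cache.prefill(turn1)   # nothing cached yet, so everything is fresh
second = cache.prefill(turn2)  # the first turn is reused, only the tail is fresh
print(first, second)
```

On the second turn only the five appended tokens need prefill; the six tokens from the first turn are served from the cache.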
How they compose on a single token
The four are not a menu. A single token in an agent loop crosses all of them in order. The architecture kernel decides how the weights stream and how the math is fused. The loaded model decides which quantized bytes move and how much KV cache there is to read. The prompt layer decides how much of the context is already computed and reused versus freshly prefilled. The tool call decides which tokens are even legal at this step. Each layer assumes the one beneath it is already settled, and settling them at different times is what makes that assumption safe.
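The last step, deciding which tokens are legal, can be sketched as masking logits against a grammar state before picking a token. Everything here is a toy assumption: the eight-token vocabulary, the hand-written state machine for `{"tool": ...}`, and `fake_logits`, which stands in for a model that would rather emit chatty text.

```python
import json
import math

VOCAB = ["{", "}", '"tool"', ":", '"ls"', '"cat"', "sure", "!"]

# Hypothetical grammar: state -> {legal token: next state}. State 5 is "done".
GRAMMAR = {
    0: {"{": 1},
    1: {'"tool"': 2},
    2: {":": 3},
    3: {'"ls"': 4, '"cat"': 4},
    4: {"}": 5},
}

def constrained_pick(logits, state):
    legal = GRAMMAR[state]
    # Illegal tokens get -inf, so they can never win the argmax.
    masked = [score if tok in legal else -math.inf
              for tok, score in zip(VOCAB, logits)]
    best = max(range(len(VOCAB)), key=masked.__getitem__)
    tok = VOCAB[best]
    return tok, legal[tok]

def fake_logits(step):
    # Stand-in for a real model: "sure" and "!" always score highest.
    return [0.1, 0.1, 0.2, 0.1, 0.3, 0.5, 2.0, 1.5]

state, out = 0, []
while state != 5:
    tok, state = constrained_pick(fake_logits(len(out)), state)
    out.append(tok)

result = "".join(out)
parsed = json.loads(result)  # parses because only legal tokens were emitted
print(result, parsed)
```

Even though the stand-in model prefers "sure" at every step, the mask leaves only grammar-legal tokens in play, so the emitted string is valid JSON by construction.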
That is a kernel chosen per family on Metal or CUDA, a fit computed when weights are pulled from the catalog, a prefill amortized across a conversation, and a constraint enforced one decode step at a time. None of it is configuration. The defaults run the whole chain.