Inside the engine
Optimizing around the prompt
The prompt is the only surface that changes every request, and most of it is the same as last time. The engine’s job is to make the first token fast and the repeated context nearly free.
Every other surface is settled before the request arrives. The kernels are chosen per family, and the weights are fit to your machine at load. The prompt is whatever sits in the context right now, and it shows up fresh on every turn. That makes it the surface where latency is most visible. It also hides the cheapest win in the runtime, because the bulk of a prompt rarely changes from one turn to the next.
Prefill and decode are two different jobs
A generation runs in two phases with opposite shapes. Prefill reads the whole prompt and computes a key and value for every token in it, filling the KV cache that decode will lean on. Every prompt token is known up front, so prefill runs them in parallel: one wide pass over the matrix multiplies that keep the GPU busy. Decode inverts that. It emits one token at a time, each depending on the one before it, so the work is serial and gated by how fast weights stream from memory rather than by arithmetic.
| Phase | What it does | Shape | Bound by |
|---|---|---|---|
| Prefill | Read the prompt, fill the KV cache | All tokens at once, parallel | Compute |
| Decode | Emit the reply, one token per step | Serial, each step depends on the last | Memory bandwidth |
conifer bench reports each separately: TTFT and prefill tok/s for the first phase, decode tok/s for the second.
Time to first token, the pause before the reply starts streaming, is almost all prefill. A short question prefills in a blink. A pasted contract or a long agent context takes a noticeable beat: prefill cost grows with how many tokens the model reads before it can say anything. The engine cuts that beat with the same fused, shape-aware matrix-multiply kernels from optimizing around the architecture, tiled here for the wide parallel pass prefill needs.
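To make the relationship concrete, here is a back-of-the-envelope sketch in Python. It assumes time to first token is prefill alone and that prefill time scales linearly with prompt length, which is a simplification (attention cost grows faster than linearly on long prompts). The function name and the throughput figures are hypothetical, not measurements from any model.

```python
def estimate_latency(prompt_tokens, reply_tokens, prefill_tps, decode_tps):
    """Return (ttft_seconds, total_seconds) from throughput figures.

    Simplified model: TTFT is treated as pure prefill time, and the rest
    of the reply is paced by decode throughput.
    """
    ttft = prompt_tokens / prefill_tps
    decode_time = reply_tokens / decode_tps
    return ttft, ttft + decode_time


# Hypothetical throughput numbers, for illustration only.
PREFILL_TPS = 1000.0  # prompt tokens per second during prefill
DECODE_TPS = 40.0     # reply tokens per second during decode

short_ttft, short_total = estimate_latency(50, 200, PREFILL_TPS, DECODE_TPS)
long_ttft, long_total = estimate_latency(20_000, 200, PREFILL_TPS, DECODE_TPS)

print(f"short question: TTFT ~{short_ttft:.2f}s, total ~{short_total:.2f}s")
print(f"long context:   TTFT ~{long_ttft:.2f}s, total ~{long_total:.2f}s")
```

With these made-up figures the reply itself takes the same time in both cases; only the wait before the first token changes.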
Reusing the prefix you already paid for
The prompt layer rests on one observation. Across a conversation, and far more so inside an agent loop, most of each new prompt is identical to the last: the same system prompt, the same tool definitions, the same documents in context. Only the tail moves, your new message and the model’s growing reply. Recomputing the shared head every turn is pure waste; its keys and values already sit in the cache.
So the engine keeps them. On a reused turn it compares the new prompt’s tokens against the ones retained from the previous turn, finds the longest run that matches from the start, and restores exactly that many cached positions instead of recomputing them. Prefill runs over the divergent suffix alone. The longer and steadier your prefix, the cheaper the next turn, until the marginal prompt is the handful of new tokens you typed.
Two limits. The final prompt token is always recomputed, since the decode loop needs its logits to pick the next word, so even a fully matched prompt runs one forward step. And the engine retains a single prefix per loaded model, the shared head the current session keeps returning to, rather than a content-addressed cache of every prompt it has ever seen. That covers the case that matters, a long static context a chat or an agent turn revisits again and again, without spending memory on prompts you will not revisit.
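The matching rule and the first limit can be sketched in a few lines of Python. This is a minimal illustration of longest-common-prefix reuse over token IDs, not the engine's actual code; the function names and the token values are hypothetical.

```python
def reusable_prefix_len(cached_tokens, prompt_tokens):
    """Count how many cached positions can be restored for a new prompt."""
    matched = 0
    for old, new in zip(cached_tokens, prompt_tokens):
        if old != new:
            break
        matched += 1
    # The final prompt token is always recomputed so decode has its logits,
    # which caps reuse at len(prompt) - 1 even on a full match.
    return min(matched, max(len(prompt_tokens) - 1, 0))


def plan_prefill(cached_tokens, prompt_tokens):
    """Split a prompt into (positions restored, suffix tokens to prefill)."""
    keep = reusable_prefix_len(cached_tokens, prompt_tokens)
    return keep, prompt_tokens[keep:]


cached = [1, 2, 3, 4, 5]  # tokens retained from the previous turn

# Shared head of three tokens, then the prompt diverges.
keep, suffix = plan_prefill(cached, [1, 2, 3, 9, 10])
print(keep, suffix)

# Identical prompt: everything matches, yet one token is still recomputed.
keep_full, suffix_full = plan_prefill(cached, [1, 2, 3, 4, 5])
print(keep_full, suffix_full)
```

Because the sketch holds a single retained sequence, a prompt that diverges at the first token reuses nothing, which mirrors the single-prefix limit described above.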
Sizing the window to the work
Prefill and reuse set how fast the prompt computes. The window sets how much of it the engine holds in memory at once, and that is its own lever. A model ships with a native context length it was trained to attend over, but holding that whole window open reserves a KV cache sized for it, whether or not you ever fill it. On a personal machine that reservation competes with the weights for the same pool.
The fit is settled when the model loads, not per prompt, so the full mechanics live one surface over in optimizing around the model, and the memory math behind it in context & memory. The principle that touches the prompt: size the window to the work in front of you. A chat needs a few thousand tokens of history. A document task needs the document. A 128K window for a short conversation buys nothing and costs cache, and a cache that spills past physical memory drags decode down far worse than a smaller window ever would.
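For a sense of scale, the sketch below uses the standard transformer estimate of KV cache size: two tensors (keys and values) per layer, each holding one vector per KV head per token. The model dimensions are hypothetical, and the engine's own accounting may differ, for instance if it stores the cache at a different precision.

```python
def kv_cache_bytes(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Estimate KV cache size for a window of `context_tokens` tokens."""
    # 2 = one key tensor plus one value tensor per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


# Hypothetical model shape, for illustration only.
MODEL = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2)  # 16-bit cache

GIB = 1024 ** 3
chat_window = kv_cache_bytes(4 * 1024, **MODEL)
full_window = kv_cache_bytes(128 * 1024, **MODEL)

print(f"4K window:   {chat_window / GIB:.2f} GiB")
print(f"128K window: {full_window / GIB:.2f} GiB")
```

For this assumed shape the 128K reservation is 32 times the 4K one, and it is held whether or not the conversation ever fills it.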
Why this compounds in an agent loop
A single chat reply is one prefill and one decode. An agent turn is many: the model reads its context, calls a tool, reads the result, calls another, and each step re-sends the same large preamble with a little more appended. Without reuse, every step re-pays for the entire system prompt and tool grammar. With it, the preamble computes once and the per-step cost collapses to the new tokens. That is the difference between a local model that feels responsive inside a loop and one that goes thoughtful-then-slow.
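The compounding is easy to see by counting prefilled tokens across a loop. The Python sketch below is a simplified model with hypothetical sizes: it treats every step as appending a fixed number of new tokens and ignores the single token that is always recomputed.

```python
def total_prefill_tokens(preamble_tokens, tokens_added_per_step, reuse):
    """Sum the tokens prefilled across all steps of an agent loop."""
    context = preamble_tokens  # system prompt + tool definitions + documents
    cached = 0                 # positions already held from the previous step
    total = 0
    for added in tokens_added_per_step:
        context += added
        if reuse:
            total += context - cached  # only the tokens not yet cached
            cached = context
        else:
            total += context           # the whole context, every step
    return total


# Hypothetical loop: a 6,000-token preamble and five steps of 200 new tokens.
steps = [200] * 5
without_reuse = total_prefill_tokens(6000, steps, reuse=False)
with_reuse = total_prefill_tokens(6000, steps, reuse=True)

print(f"without reuse: {without_reuse} tokens prefilled")
print(f"with reuse:    {with_reuse} tokens prefilled")
```

In this model the cost without reuse grows with the number of steps times the preamble, while the cost with reuse is the final context length, paid once.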
The tool call itself is the next surface down, where decoding is constrained so the output parses by construction. That is optimizing around the tool call. The prompt layer’s contribution is quieter and earlier: get the first token out fast, then make the context you keep re-reading cost almost nothing.
turn 1   prefill  [ system + tools + context ] + [ user ]      ← full
turn 2   reuse    [ system + tools + context ............ ]    ← restored
         prefill  [ + tool result ] + [ next step ]            ← suffix only
turn 3   reuse    [ ...................................... ]   ← restored
         prefill  [ + tool result ] + [ next step ]            ← suffix only
None of this is configuration you fill in. The defaults run prefill, retain the prefix, and size the window for you. For the measured prefill and decode numbers behind any one model, the model ledger is the source.