Inside the engine
Optimizing around the model
An architecture is shared across a model family. One set of weights on one machine is a specific case, and how it fits is settled once, at load.
A kernel is written for a model’s shape and serves every model in that family. A fit is narrower. It turns on the exact weights you chose, the precision they ship in, and how much memory is free the moment you press load. The engine resolves all of that once and holds it for the session: which bytes to map, how the cache is laid out, how long a window to keep open. None of it is a setting you fill in. Resolving at load means the answer stays valid until you switch models, so no per-token work re-pays for it.
How the weights arrive
A GGUF file is mapped, not copied. On an Apple-silicon Mac the GPU reads the same unified memory the CPU does, so the engine wraps the mapped pages directly as device buffers rather than streaming gigabytes across a bus. The weights are addressable the instant the file is mapped, and their pages become resident as they are first read. That is the load-time win of unified memory: a cold start is dominated by reading the file off disk, not by any copy the runtime makes.
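The mapping step can be sketched at the operating-system level with Python's standard `mmap` module. This shows only the "mapped, not copied" half of the story. Wrapping the pages as GPU buffers is not shown, and `map_weights` and the temporary stand-in file are hypothetical names for illustration, not the engine's code.

```python
import mmap
import os
import tempfile


def map_weights(path):
    """Map a file read-only and return (file_object, mapping)."""
    f = open(path, "rb")
    # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
    mapping = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return f, mapping


# Stand-in for a weights file: 1 MiB that starts with the GGUF magic bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"GGUF" + bytes(1024 * 1024 - 4))
    weights_path = tmp.name

f, mapping = map_weights(weights_path)
mapped_len = len(mapping)      # the whole file is addressable right away
magic = bytes(mapping[:4])     # reading these bytes touches the first page

view = memoryview(mapping)     # a zero-copy view over the mapped bytes
view.release()                 # release the view before closing the mapping
mapping.close()
f.close()
os.remove(weights_path)
```

Creating the mapping is cheap because no bytes move until a page is touched, which is why the time goes to the disk reads and not to a copy.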
The precision those bytes carry is fixed by the file you picked. Each catalog row is a different quantization of the same trained weights, and the engine streams them in their stored layout, dequantizing inside the kernel as it reads. The fit math starts from one number. Parameter count times bytes per weight gives the resident cost of the weights, and that figure does not move once they are mapped.
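As a rough sketch of that one number, the function below multiplies parameter count by bits per weight. The 24B parameter count and the bits-per-weight values are illustrative assumptions; block-quantized formats store per-block scales, so their effective bits per weight sit a little above the nominal width.

```python
def weight_bytes(param_count, bits_per_weight):
    """Resident cost of the weights: parameter count times bytes per weight."""
    return param_count * bits_per_weight / 8


PARAMS = 24_000_000_000  # hypothetical 24B-parameter model

# Effective bits per weight are illustrative values, not catalog figures.
for label, bits in [("fp16", 16), ("8-bit, with block scales", 8.5), ("4-bit, with block scales", 4.5)]:
    gigabytes = weight_bytes(PARAMS, bits) / 1e9
    print(f"{label}: {gigabytes:.1f} GB")
```

The same trained weights span a wide range of resident costs depending on the row you pick, and whichever row it is, the figure is fixed for the session.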
The cache the weights imply
Weights are the fixed cost. The variable one is the KV cache, and its size per token follows from this specific model’s shape, computed at load before a single token is generated. The engine walks the layers and sums what each one keeps: two vectors, a key and a value, per key/value head, at the head dimension, in fp16. Layers that carry a small fixed state instead of attention, the convolutional and linear ones in a hybrid model, add nothing to the sum.
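The layer walk described above can be sketched as follows. The `Layer` record and both example shapes are hypothetical; only the rule comes from the text: each attention layer contributes a key and a value per key/value head at the head dimension in fp16, and fixed-state layers contribute nothing.

```python
from dataclasses import dataclass

FP16_BYTES = 2


@dataclass(frozen=True)
class Layer:
    kind: str           # "attention", or a fixed-state kind such as "conv" or "linear"
    n_kv_heads: int = 0
    head_dim: int = 0


def kv_bytes_per_token(layers):
    """Sum the per-token KV cost over the layers that keep keys and values."""
    total = 0
    for layer in layers:
        if layer.kind != "attention":
            continue  # fixed-state layers keep nothing per token
        # Two vectors (key and value) per KV head, head_dim wide, in fp16.
        total += 2 * layer.n_kv_heads * layer.head_dim * FP16_BYTES
    return total


# Hypothetical shapes, chosen only to show the contrast.
dense = [Layer("attention", n_kv_heads=8, head_dim=128)] * 40
hybrid = [Layer("attention", n_kv_heads=8, head_dim=64)] * 6 + [Layer("conv")] * 10

dense_per_token = kv_bytes_per_token(dense)
hybrid_per_token = kv_bytes_per_token(hybrid)
```

Under these assumed shapes the dense model keeps 163,840 bytes per token and the hybrid keeps 12,288, which is the kind of gap parameter count alone does not predict.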
That per-token figure is the lever everything downstream pulls on. The estimate the engine uses to allocate the cache is the same one it hands the desktop to size your window, so the projection and the real allocation never disagree. Two models of similar parameter count can imply very different caches depending on head layout, which is why memory pressure is a per-model fact and not a parameter-count fact.
Sizing the window to the memory you have
A model advertises a native window, often 128K tokens. Allocating the cache for that whole window at load is what used to break large models on personal machines. A 24B transformer asked for its full window pre-allocates roughly 21 GB of keys and values, which on a 36 GB Mac spills the weights and cache into swap and drops decode to about one token a second. The model is correct and unusable at the same time.
So the default is not the native window. When you name no context length, the engine caps the cache at a fixed memory budget, then works the window out from the per-token figure it just computed. A model with a cheap cache clears the budget with its whole native window intact. A model with an expensive one gets the longest window whose cache still fits, far below what its full window would have demanded.
| Model shape | Cache per token | What the default window does |
|---|---|---|
| Hybrid, mostly fixed-state layers | tiny | Keeps its full native window under the budget |
| Large dense transformer | large | Capped to the longest window whose cache still fits the budget |
The per-token cost is computed from the loaded model's own layer layout, so the cap is tailored to the weights you actually loaded, not to a class average.
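A minimal sketch of the default-window rule, assuming a 4 GiB budget and two hypothetical per-token costs. The budget value and the name `default_window` are assumptions for illustration; the text does not state the engine's actual budget.

```python
def default_window(native_window, per_token_bytes, budget_bytes):
    """Longest window whose cache fits the budget, never above the native window."""
    if per_token_bytes == 0:
        return native_window  # nothing is cached per token, so nothing to cap
    return min(native_window, budget_bytes // per_token_bytes)


NATIVE = 128 * 1024         # a 128K-token native window
BUDGET = 4 * 1024**3        # hypothetical 4 GiB default cache budget

DENSE_PER_TOKEN = 163_840   # hypothetical large dense transformer, bytes per token
HYBRID_PER_TOKEN = 12_288   # hypothetical hybrid model, bytes per token

dense_window = default_window(NATIVE, DENSE_PER_TOKEN, BUDGET)
hybrid_window = default_window(NATIVE, HYBRID_PER_TOKEN, BUDGET)

# What the dense model's full native window would have pre-allocated.
full_dense_cache = NATIVE * DENSE_PER_TOKEN
```

With these inputs the dense model is capped to 26,214 tokens, the hybrid keeps all 131,072, and the dense model's full window would have needed 21,474,836,480 bytes, in line with the roughly 21 GB figure above.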
Raise the window when you have a long document and the memory to hold it open. The default exists so a capable model is never quietly slow on a machine that could run it well. It is a floor, not a ceiling. The out-of-memory guard underneath is separate: ask for more than fits and the allocation fails loudly rather than slipping into swap.
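The guard can be pictured as a check that raises instead of proceeding. This is a sketch under stated assumptions: `CacheTooLarge`, `check_requested_window`, and the notion of a single usable-memory figure are hypothetical, and the engine's real check may weigh other allocations.

```python
class CacheTooLarge(MemoryError):
    """Raised when a requested window cannot fit alongside the weights."""


def check_requested_window(requested_tokens, per_token_bytes, weight_bytes, usable_bytes):
    """Return the cache size in bytes, or raise if weights plus cache exceed usable memory."""
    cache_bytes = requested_tokens * per_token_bytes
    needed = weight_bytes + cache_bytes
    if needed > usable_bytes:
        raise CacheTooLarge(
            f"window of {requested_tokens} tokens needs {needed} bytes, "
            f"but only {usable_bytes} are usable"
        )
    return cache_bytes


# Hypothetical figures: 13.5 GB of weights, 163,840 bytes per token, 28 GB usable.
WEIGHTS = 13_500_000_000
USABLE = 28_000_000_000

ok_cache = check_requested_window(32_768, 163_840, WEIGHTS, USABLE)

try:
    check_requested_window(131_072, 163_840, WEIGHTS, USABLE)
    refused = False
except CacheTooLarge:
    refused = True
```

Failing at allocation time keeps the failure visible and immediate, where a silent spill into swap would only show up later as slow decode.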
When the window is full
A capped window settles what fits at load. A long-running session that outgrows the window you set is a separate problem. Dropping the cache and recomputing it from scratch is the expensive answer. Evicting is the cheap one.
- Pinned sinks: The first few tokens of a sequence anchor attention out of all proportion to their content. The engine can pin those few permanently, so evicting later tokens does not destabilize the distribution the way dropping the opening would.
- Rolling window: Past the pinned sinks, the cache holds a moving window of the most recent tokens. New ones arrive, the oldest fall out, and the cache stays a fixed size no matter how long the session runs, so decode speed does not decay as the transcript grows.
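The two pieces together can be sketched as a small bookkeeping structure. `SinkWindowCache` is a hypothetical name, the entries stand in for per-token key/value pairs, and details such as how positions are handled after eviction are left out.

```python
from collections import deque


class SinkWindowCache:
    """Keeps the first n_sinks entries forever plus a rolling window of recent ones."""

    def __init__(self, n_sinks, window):
        self.n_sinks = n_sinks
        self.sinks = []                     # pinned, never evicted
        self.recent = deque(maxlen=window)  # oldest entry drops out when full

    def append(self, entry):
        if len(self.sinks) < self.n_sinks:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)

    def entries(self):
        return self.sinks + list(self.recent)


cache = SinkWindowCache(n_sinks=4, window=8)
for position in range(100):   # token positions stand in for key/value pairs
    cache.append(position)

kept = cache.entries()
```

After 100 tokens the cache still holds 12 entries: positions 0 through 3 as sinks and 92 through 99 as the window, so its size depends on the configuration and not on session length.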
A second lever shrinks the cache rather than bounding it. Keys and values are fp16 by default; storing them in an 8-bit layout roughly halves the cache for a small quality cost from the round-trip, which buys back window inside the same budget. Both levers act on the cache the loaded weights imply, which is why they belong to this surface and not to the kernel beneath it.
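The round-trip cost can be illustrated with a generic symmetric 8-bit scheme that stores one scale per vector. This is an assumed scheme for illustration, not the engine's actual cache layout, and the sample vector is made up.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] with one shared scale."""
    peak = max(abs(v) for v in values)
    scale = peak / 127 if peak else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the codes and their scale."""
    return [c * scale for c in codes]


key = [0.5, -1.27, 0.031, 0.9, -0.002, 0.64, -0.33, 1.0]  # made-up key vector
codes, scale = quantize_int8(key)
restored = dequantize_int8(codes, scale)

# The round trip loses at most half a quantization step per element.
max_error = max(abs(a - b) for a, b in zip(key, restored))

# Storage for one 128-wide vector: fp16 versus 8-bit codes plus one fp16 scale.
head_dim = 128
fp16_bytes = head_dim * 2
int8_bytes = head_dim * 1 + 2
```

For a 128-wide vector that is 130 bytes against 256, close to half, with an error bounded by half a quantization step.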
Reading the fit before you load
All of this runs from defaults: pick weights that fit and the runtime sizes the rest. The catalog answers the fit question on the card, weighing the weights at their precision plus the cache against your usable memory, before you download. To pick by task and let the runtime do the fitting, start at Your first model; for the per-model sizes and measured decode speeds behind any one verdict, the model ledger is the source. The next surface down is the prompt, where the cache stops being a budget and becomes something to reuse across turns.