Context & memory
A transformer remembers the conversation in a cache that grows with every token. On a personal machine that cache, not the weights, decides how much context you can keep.
Context is how much text a model holds in mind at once: the system prompt, the documents you pasted, every turn so far, and the reply it is writing. The window is measured in tokens, roughly three-quarters of a word each. The expensive part is not the tokens. It is the memory the model spends to avoid re-reading them.
The KV cache
Attention compares the current token against every token before it. Each past token contributes a key and a value vector per layer. Computing those vectors is most of the work, and they never change once a token lands in the history, so the engine keeps them around. That store is the KV cache. Drop it, and generating the thousandth token means recomputing keys and values for the nine hundred and ninety-nine before it, on every step.
The cache is what makes decode fast and what makes context costly. It grows by a fixed number of bytes per token you keep, set by the model’s shape: the layer count, the number of key/value heads, and the size of each head. Conifer’s engine stores keys and values in fp16 by default, so every token costs the same fixed amount whether you typed a word or pasted a contract, and the total scales linearly: two tokens cost twice as much as one. A full 128K window costs 128K times the per-token figure, and that total can rival the weights themselves.
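The per-token figure follows directly from the model’s shape. The sketch below is a rough estimate, not Conifer’s own accounting: the function names are made up for illustration, and the shape (40 layers, 8 key/value heads, head size 128) is an assumed example of a mid-sized dense model.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of cache one token adds: a key and a value vector in every layer."""
    # The leading 2 counts the key and the value; bytes_per_value=2 is fp16.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(n_tokens: int, per_token: int) -> int:
    """Total cache for a window of n_tokens; it scales linearly."""
    return n_tokens * per_token


# Assumed shape, for illustration only.
per_token = kv_bytes_per_token(n_layers=40, n_kv_heads=8, head_dim=128)
full_window = kv_cache_bytes(128 * 1024, per_token)

print(f"per token:   {per_token / 1024:.0f} KiB")
print(f"128K window: {full_window / 1e9:.1f} GB")
```

With this assumed shape each token costs 160 KiB, and a full 128K window comes to about 21.5 GB.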
The memory budget
On a phone or a cloud node, the model and its cache live in separate pools. An Apple-silicon Mac has one pool of unified memory shared by the CPU and GPU, and three things compete for it: the weights, the KV cache, and the working scratch a forward pass needs. Whatever is left over still has to serve the system and everything else you have open.
| Item | Scales with | Cost |
|---|---|---|
| Weights | parameters × quantization | fixed once loaded |
| KV cache | context length × model shape | grows with the window |
| Scratch | batch and layer size | small, transient |
Quantization sets how many bytes a weight takes; see the quantization page. The model ledger lists each model’s size, so you can read the weights line before downloading.
Weights you can size up front: parameter count times bytes per weight, which a lower quantization level shrinks. The cache is the part you control after loading, by choosing how large a window to keep open. The two add together, and they have to fit with room to spare. A model that spills past physical RAM into swap stops being a model you want to use.
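The weights line can be estimated before downloading anything. This is a back-of-the-envelope sketch: `weight_bytes` is a hypothetical helper, the 8B parameter count is an arbitrary example, and real quantized files run somewhat larger because they also store scale metadata.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate resident size of the weights in bytes."""
    return n_params * bits_per_weight / 8


N_PARAMS = 8e9  # hypothetical 8B-parameter model

# Size at three quantization levels; metadata overhead is ignored here.
sizes_gb = {bits: weight_bytes(N_PARAMS, bits) / 1e9 for bits in (16, 8, 4)}

for bits, gb in sizes_gb.items():
    print(f"{bits:>2}-bit: ~{gb:.0f} GB")
```

Halving the bits per weight halves the weights line, which leaves more of the budget for the cache.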
Why a 24B model dropped to one token a second
Take a 24B dense model on a 36 GB Mac. At 4-bit the weights are roughly 14 GB, which fits with headroom. The model advertises a native window of 128K tokens, though, and a naive runtime allocates the cache for the whole window at load: around 21 GB of fp16 keys and values for that 24B’s shape. Weights plus cache come to nearly 36 GB on a 36 GB machine.
Nothing crashes. The allocation succeeds because unified memory backs it with swap, and from then on every token has to touch keys and values that no longer live in fast memory. Decode falls off a cliff, from tens of tokens a second to about one. The model is correct and unusable at the same time. Blame does not land on the weights, which fit; it lands on a cache sized for a window almost no chat will ever fill.
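The arithmetic behind that cliff is short. The sketch below uses the rounded figures from this section; the 6 GB reserved for the system and other apps is an assumption chosen for illustration, not a measured value.

```python
GB = 10**9


def memory_left(total_ram: int, weights: int, cache: int) -> int:
    """Bytes remaining for scratch, the OS, and every other app."""
    return total_ram - weights - cache


TOTAL_RAM = 36 * GB          # the 36 GB Mac
WEIGHTS = 14 * GB            # 24B dense at 4-bit, rounded
FULL_WINDOW_CACHE = 21 * GB  # fp16 K/V for the full 128K window, rounded
OTHER_DEMAND = 6 * GB        # assumed: what the rest of the system wants

left = memory_left(TOTAL_RAM, WEIGHTS, FULL_WINDOW_CACHE)
spills = left < OTHER_DEMAND

print(f"left over: {left / GB:.0f} GB, spills to swap: {spills}")
```

Under these assumptions only about 1 GB remains for everything else, so part of the working set ends up in swap.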
How Conifer sizes the window for you
The engine reports the cache cost per token for every model it loads, summed across the layers that actually keep a K/V buffer. The desktop reads that figure and picks a default window that fits free memory rather than requesting the full native one. A small model with a cheap cache keeps its whole window. A large transformer gets capped to the longest window whose cache still fits, so the weights, the cache, and your headroom all coexist. Raise the window when you have a long document and the memory to hold it. The safe default is chosen at load, so it is not a setting you stumble onto only after the model has slowed to a crawl.
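One way to express that policy is sketched below. It is not Conifer’s actual implementation: `pick_window`, the headroom reserve, and the rounding step are hypothetical choices made for illustration, and the per-token costs are assumed values.

```python
GB = 10**9


def pick_window(native_window: int, per_token_bytes: int,
                free_bytes: int, headroom_bytes: int, step: int = 1024) -> int:
    """Longest window (in tokens) whose cache fits the remaining budget."""
    budget = free_bytes - headroom_bytes
    if budget <= 0:
        return 0
    affordable = budget // per_token_bytes
    if affordable >= native_window:
        return native_window              # cheap cache: keep the whole window
    return (affordable // step) * step    # cap, rounded down to a tidy size


# Assumed: 22 GB free after a 14 GB model loads on a 36 GB machine,
# with 6 GB held back for the system and other apps.
large = pick_window(128 * 1024, 163840, 22 * GB, 6 * GB)
small = pick_window(32 * 1024, 32768, 22 * GB, 6 * GB)

print(large, small)
```

Here the large model is capped at 97,280 tokens, while the small one keeps its full 32,768-token window.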
Two ways to keep the cache small
Capping the window is one lever. Two more act on the cache itself rather than on how much of it you keep.
- Quantize the cache. Keys and values are fp16 by default, but the engine can store them in an 8-bit layout, roughly halving the cache for a small quality cost from the quant round-trip. That buys more usable window inside the same budget when context is the constraint.
- Change the architecture. A standard attention layer pays K/V for every token. A hybrid or sub-quadratic model swaps most of those layers for operators that carry a small fixed-size state, so its cache stays nearly flat as the window grows. That is the structural fix when long context is the whole point.
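The effect of both levers shows up in the per-token cost. The sketch below is illustrative only: the shape is assumed, the 8-bit figure ignores the scale metadata a real quantized layout carries, and the hybrid case simply assumes 10 of 40 layers keep a K/V buffer while ignoring the small fixed-size state of the other layers.

```python
GB = 10**9


def kv_bytes_per_token(kv_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Per-token cache cost, counting only layers that keep a K/V buffer."""
    return 2 * kv_layers * n_kv_heads * head_dim * bytes_per_value


def tokens_that_fit(budget_bytes: int, per_token: int) -> int:
    """How many tokens of history a given cache budget can hold."""
    return budget_bytes // per_token


BUDGET = 16 * GB  # assumed cache budget

costs = {
    "fp16, all layers": kv_bytes_per_token(40, 8, 128, 2),
    "8-bit, all layers": kv_bytes_per_token(40, 8, 128, 1),
    "fp16, hybrid (10 K/V layers)": kv_bytes_per_token(10, 8, 128, 2),
}

for name, per_token in costs.items():
    print(f"{name}: {per_token} B/token, "
          f"{tokens_that_fit(BUDGET, per_token)} tokens fit")
```

Under these assumptions the 8-bit cache holds twice as many tokens in the same budget, and the hybrid layout four times as many.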
A sparse MoE model does not help here. It cuts the compute per token, not the cache, so a long window costs it the same K/V as a dense model of the same shape. The sample below shows the budget the engine balances on a 36 GB Mac.
    weights               ~14 GB    24B dense at 4-bit, fixed
    KV cache            +           grows with the window you keep open
      full 128K window    ≈ 21 GB   → spills, ~1 tok/s
      sized to free RAM             → fits, fast
    scratch                         small, transient
                          --------
                          36 GB     unified memory, shared with everything else

Reading context when you pick a model
A model’s native window is a ceiling, not a promise. It tells you how far the weights were trained to attend; your hardware decides how much of that you can afford to keep resident. When the window is the job rather than the conversation, work the problem from both ends. See Long context & RAG for the models and settings that keep a long window fast, and the model ledger for the per-model context lengths and measured decode speeds.