Context & memory


A transformer remembers the conversation in a cache that grows with every token. On a personal machine that cache, not the weights, decides how much context you can keep.


Context is how much text a model holds in mind at once: the system prompt, the documents you pasted, every turn so far, and the reply it is writing. The window is measured in tokens, roughly three-quarters of a word each. The expensive part is not the tokens. It is the memory the model spends to avoid re-reading them.

The KV cache

Attention compares the current token against every token before it. Each past token contributes a key and a value vector per layer. Computing those vectors is most of the work, and they never change once a token lands in the history, so the engine keeps them around. That store is the KV cache. Drop it, and generating the thousandth token means recomputing keys and values for the nine hundred and ninety-nine before it, on every step.
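To put a number on that recomputation, here is a toy count rather than engine code. The helper name `kv_computations` is hypothetical, and it counts key/value computations per layer while generating tokens one at a time.

```python
def kv_computations(num_tokens: int, use_cache: bool) -> int:
    """Count per-layer key/value computations over a whole generation run."""
    total = 0
    for step in range(1, num_tokens + 1):
        # With a cache, only the newest token needs fresh K/V.
        # Without one, every token in the history is recomputed at this step.
        total += 1 if use_cache else step
    return total


with_cache = kv_computations(1000, use_cache=True)
without_cache = kv_computations(1000, use_cache=False)
print(f"with cache: {with_cache}, without cache: {without_cache}")
```

The cached count grows linearly with the number of tokens; the uncached count grows with its square.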

The cache is what makes decode fast and what makes context costly. It grows by a fixed number of bytes per token you keep, set by the model’s shape: the layer count, the number of key/value heads, and the size of each head. Conifer’s engine stores keys and values in fp16 by default, so every token costs the same fixed amount whether it came from a word you typed or a contract you pasted. Two tokens cost twice as much. A full 128K window costs 128K times the per-token figure, and that number can rival the weights themselves.
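The per-token figure can be sketched as a product of the shape terms named above. The function name and the example shape (40 layers, 8 key/value heads, head size 128) are assumptions for illustration, not values read from Conifer or from any particular model card. The only fixed fact is that fp16 takes 2 bytes per value.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of K/V cache one token adds, assuming every layer keeps K/V."""
    # Leading 2 = one key vector plus one value vector per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical shape, chosen only to illustrate the arithmetic.
per_token = kv_bytes_per_token(layers=40, kv_heads=8, head_dim=128)
window = 128 * 1024  # a 128K-token window
full_window_bytes = per_token * window

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{full_window_bytes / 1e9:.1f} GB for the full window")
```

Under these assumptions each token adds 160 KiB, and the full window lands a little over 21 GB.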

The memory budget

On a phone or a cloud node, the model and its cache live in separate pools. An Apple-silicon Mac has one pool of unified memory shared by the CPU and GPU, and three things compete for it: the weights, the KV cache, and the working scratch a forward pass needs. Whatever they leave over has to cover the system and everything else you have open.

What sits in unified memory while a model runs
Item       Scales with                     Cost
Weights    parameters × quantization       fixed once loaded
KV cache   context length × model shape    grows with the window
Scratch    batch and layer size            small, transient

Quantization sets how many bytes a weight takes; see the quantization page. The model ledger lists each model’s size, so you can read the weights line before downloading.

Weights you can size up front: parameter count times bytes per weight, which a lower quantization level shrinks. The cache is the part you control after loading, by choosing how large a window to keep open. The two add together, and they have to fit with room to spare. A model that spills past physical RAM into swap stops being a model you want to use.
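A rough sizing sketch follows, under stated assumptions. The helper names are hypothetical. The 4.5 bits per weight is an assumed effective rate, since 4-bit formats usually store per-group scale factors alongside the weights. The 8 GB of headroom is an arbitrary illustrative reserve, not a Conifer setting.

```python
def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights: parameter count times bytes per weight."""
    return params * bits_per_weight / 8


def fits(weights_gb: float, cache_gb: float, total_gb: float,
         headroom_gb: float) -> bool:
    """True when weights, cache, and reserved headroom fit in physical memory."""
    return weights_gb + cache_gb + headroom_gb <= total_gb


# 24B parameters at an assumed effective 4.5 bits per weight.
weights_gb = weight_bytes(24e9, 4.5) / 1e9
print(f"weights: {weights_gb:.1f} GB")

# Same weights, two different window choices on a 36 GB machine.
print(fits(weights_gb, cache_gb=4.0, total_gb=36.0, headroom_gb=8.0))
print(fits(weights_gb, cache_gb=21.0, total_gb=36.0, headroom_gb=8.0))
```

The weights term is the same in both checks. Only the cache term, which follows from the window you choose, decides whether the total still fits.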

Why a 24B model dropped to one token a second

Take a 24B dense model on a 36 GB Mac. At 4-bit the weights are roughly 14 GB, which fits with headroom. The model advertises a native window of 128K tokens, though, and a naive runtime allocates the cache for the whole window at load: around 21 GB of fp16 keys and values for that 24B’s shape. Weights plus cache come to nearly 36 GB on a 36 GB machine.

Nothing crashes. The allocation succeeds because unified memory backs it with swap, and from then on every token has to touch keys and values that no longer live in fast memory. Decode falls off a cliff, from tens of tokens a second to about one. The model is correct and unusable at the same time. Blame does not land on the weights, which fit; it lands on a cache sized for a window almost no chat will ever fill.

How Conifer sizes the window for you

The engine reports the cache cost per token for every model it loads, summed across the layers that actually keep a K/V buffer. The desktop reads that figure and picks a default window that fits free memory rather than requesting the full native one. A small model with a cheap cache keeps its whole window. A large transformer gets capped to the longest window whose cache still fits, so the weights, the cache, and your headroom all coexist. Raise the window when you have a long document and the memory to hold it. The safe default is chosen at load, so it is not a setting you stumble onto after the model has already slowed to a crawl.
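One way such a default could be computed is sketched below. This is not Conifer's actual logic: the function name, the headroom reserve, and the per-token figures are all assumptions. It takes the per-token cache cost as an input, the way the engine is described as reporting it.

```python
GB = 10**9


def default_window(free_bytes: int, weight_bytes: int, kv_bytes_per_token: int,
                   native_window: int, headroom_bytes: int) -> int:
    """Longest window whose cache fits next to the weights and the headroom."""
    budget = free_bytes - weight_bytes - headroom_bytes
    if budget <= 0:
        return 0  # the weights alone leave no room for any cache
    # Never exceed the window the model was trained for.
    return min(native_window, budget // kv_bytes_per_token)


# Large model, expensive cache (assumed 160 KiB per token): capped below native.
large = default_window(36 * GB, 14 * GB, 163_840, 128 * 1024, 8 * GB)

# Small model, cheap cache (assumed 32 KiB per token): keeps its whole window.
small = default_window(36 * GB, 2 * GB, 32_768, 32 * 1024, 8 * GB)

print(large, large < 128 * 1024)
print(small, small == 32 * 1024)
```

Integer division rounds the window down, so the cache never overshoots the budget by a partial token.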

Two ways to keep the cache small

Capping the window is one lever. Two more act on the cache itself rather than on how much of it you keep.

quantize the cache
Keys and values are fp16 by default, but the engine can store them in an 8-bit layout, roughly halving the cache for a small quality cost from the quant round-trip. That buys more usable window inside the same budget when context is the constraint.
change the architecture
A standard attention layer pays K/V for every token. A hybrid or sub-quadratic model swaps most of those layers for operators that carry a small fixed-size state, so its cache stays nearly flat as the window grows. That is the structural fix when long context is the whole point.
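Both levers show up as terms in the same cache formula. The sketch below uses a hypothetical helper and an assumed shape. It ignores the scale factors a real 8-bit layout stores, and it ignores the small fixed-size state of the non-attention layers in a hybrid, so real figures will sit slightly above these.

```python
def cache_bytes(tokens: int, kv_layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int) -> int:
    """K/V cache size, counting only the layers that actually keep K/V."""
    # Leading 2 = one key vector plus one value vector per layer.
    return 2 * kv_layers * kv_heads * head_dim * bytes_per_value * tokens


tokens = 32 * 1024  # a 32K-token window

# Assumed shape: 40 layers, 8 key/value heads, head size 128.
fp16 = cache_bytes(tokens, kv_layers=40, kv_heads=8, head_dim=128, bytes_per_value=2)

# Lever 1: store keys and values in 8 bits instead of 16.
int8 = cache_bytes(tokens, kv_layers=40, kv_heads=8, head_dim=128, bytes_per_value=1)

# Lever 2: a hypothetical hybrid where only 10 of the 40 layers keep K/V.
hybrid = cache_bytes(tokens, kv_layers=10, kv_heads=8, head_dim=128, bytes_per_value=2)

for name, size in [("fp16", fp16), ("8-bit", int8), ("hybrid", hybrid)]:
    print(f"{name:7s} {size / 1e9:.2f} GB")
```

Nothing in the formula depends on how many experts a model has, only on the layers and heads that hold keys and values.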

A sparse MoE model does not help here. It cuts the compute per token, not the cache, so a long window costs it the same K/V as a dense model of the same shape. The sample below shows the budget the engine balances on a 36 GB Mac.

budget
weights    ~14 GB   24B dense at 4-bit, fixed
KV cache   + grows with the window you keep open
           full 128K window ≈ 21 GB  → spills, ~1 tok/s
           sized to free RAM         → fits, fast
scratch    small, transient
--------
36 GB unified memory, shared with everything else

Reading context when you pick a model

A model’s native window is a ceiling, not a promise. It tells you how far the weights were trained to attend; your hardware decides how much of that you can afford to keep resident. When the window is the job rather than the conversation, work the problem from both ends. See Long context & RAG for the models and settings that keep a long window fast, and the model ledger for the per-model context lengths and measured decode speeds.