Reference
Performance & tuning
Where the tokens per second come from, and the handful of settings worth touching to get more of them.
Most of the speed is spent for you before you open a panel. Kernels are tuned per model and per chip, decode holds the weights in a low-precision format so the memory bus carries fewer bytes per token, and the runtime sizes the model to your machine at load time. The engine pages cover why that part is fast. What follows is the short list of knobs you can move, and what each one trades.
Where the time goes
A request runs in two phases with different costs. Prefill reads your prompt and fills the cache. It is compute-bound and scales with how many tokens you sent. Decode writes the reply one token at a time, and on a local machine it is memory-bound: every new token re-reads the model weights from memory, so sustained tokens per second comes down to how many bytes the chip moves per step.
A long prompt costs you once, at the start, as a slower first token. A long conversation costs you the whole way through, because the KV cache grows with every turn and takes both memory and bandwidth with it. The levers below move one of those two numbers.
The levers that matter
These live under Settings → Advanced Options in the studio. Defaults run fast on the hardware you have, so touch them only when something gives you a reason to.
- Max context
- The ceiling on the working window, in tokens.
0means “use the model’s full native window,” the right choice on a machine with room to spare. A lower number caps the window below the model’s native size, which reserves less memory for the cache and frees bandwidth for decode. Reach for this one first when a large model crawls. - KV cache quantization
- The cache stores keys and values for every token in the window. Holding them at
q8orq4instead offp16halves or quarters its memory footprint, so a long context fits where it otherwise would not. You pay a small quality loss on long-range recall, not a speed penalty. - Weight quantization
- How tightly the weights themselves are packed. Conifer defaults to
Q4_K_Mbecause decode moves fewer bytes per token at low precision. A higher level likeQ5_K_MorQ8_0buys back a little quality at the cost of memory and speed. The owner page is Quantization. - Prefix cache
- Reuse of the cache computed for a shared conversation prefix, so a follow-up turn does not re-prefill everything you already said. The output is identical, just faster on multi-turn chats. Leave it on unless you are chasing a cache bug.
| Lever | Buys you | Costs you |
|---|---|---|
| Lower max context | Less cache memory, more decode bandwidth | A shorter window to work in |
| q8 / q4 KV cache | A long context that fits | Slightly weaker long-range recall |
| Higher weight quant | A little more answer quality | More memory, fewer tokens per second |
| Prefix cache on | Faster follow-up turns | Nothing; identical output |
Memory is the budget
A model running at a token a second is under memory pressure. The cache pre-allocates the entire window up front, so a large transformer at a 128K native window can reserve tens of gigabytes for the cache alone, on top of the weights. Push the total past physical RAM and decode falls into swap.
Conifer guards against this. With no explicit ceiling set, the runtime sizes the default window to what fits free memory, floored so short chats always work and clamped to what the model natively supports. A model whose architecture carries almost no cache, like a hybrid or sub-quadratic model, keeps its full window because there is nothing large to budget for. The mechanics are in Context & memory.
GPU offload and the metal underneath
Apple silicon has no offload dial to turn. The CPU and GPU share one pool of unified memory, so the weights are already on the GPU with no copy across a bus and nothing partial to split. The only lever is whether the working set fits that pool, which is the memory budget above.
On a discrete GPU the picture is different, because the card has its own memory and the working set has to fit it rather than one shared pool. That is where a VRAM ceiling starts to matter: the largest window whose cache fits the card, with longer requests clamped to fit rather than rejected. Conifer runs on Apple’s Metal GPU today, with a portable CPU path that works anywhere a model loads; broader GPU backends are on the way.
Pick before you tune
The biggest performance decision is which model you load, and it lands before any of these knobs. A smaller model, or a sparse mixture-of-experts model that activates a fraction of its parameters per token, can run several times faster than a dense model of the same total size. For measured decode speeds per model, see the models ledger; to match a model to the RAM you have, see choosing by hardware. If a model will not load, or generation stalls outright instead of just running slow, that is Troubleshooting, and any term here that wants a one-line definition is in the Glossary.