Performance & tuning


Where the tokens per second come from, and the handful of settings worth touching to get more of them.


Most of the speed is already won for you before you open a settings panel. Kernels are tuned per model and per chip, decode keeps the weights in a low-precision format so the memory bus carries fewer bytes per token, and the runtime sizes the model to your machine at load time. The engine pages cover why that part is fast. What follows is the short list of knobs you can move, and what each one trades.

Where the time goes

A request runs in two phases with different costs. Prefill reads your prompt and fills the cache. It is compute-bound and scales with how many tokens you sent. Decode writes the reply one token at a time, and on a local machine it is memory-bound: every new token re-reads the model weights from memory, so sustained tokens per second comes down to how many bytes the chip moves per step.
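
The memory-bound claim can be turned into a back-of-envelope ceiling. This is a rough sketch, not Conifer's own accounting: it assumes every decoded token reads all the weights exactly once and ignores cache traffic and compute. The function name, the 8-billion-parameter model, and the 100 GB/s bandwidth figure are illustrative assumptions, and the bit widths are nominal, ignoring the per-block scale overhead that real formats add.

```python
def decode_ceiling_tps(params_billion: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on decode tokens/s if each token re-reads every weight once."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


# Hypothetical machine: 8B-parameter dense model, 100 GB/s memory bandwidth.
for label, bits in [("4-bit", 4.0), ("5-bit", 5.0), ("8-bit", 8.0)]:
    print(f"{label}: {decode_ceiling_tps(8, bits, 100):.1f} tok/s ceiling")
# 4-bit: 25.0 tok/s ceiling
# 5-bit: 20.0 tok/s ceiling
# 8-bit: 12.5 tok/s ceiling
```

Real throughput lands below this ceiling, but the ratio between rows is the useful part: halving the bytes per weight roughly doubles the best case.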

A long prompt costs you once, at the start, as a slower first token. A long conversation costs you the whole way through, because the KV cache grows with every turn and takes both memory and bandwidth with it. Each lever below moves one of those two costs.

The levers that matter

These live under Settings → Advanced Options in the studio. Defaults run fast on the hardware you have, so touch them only when something gives you a reason to.

Max context
The ceiling on the working window, in tokens. 0 means “use the model’s full native window,” the right choice on a machine with room to spare. A lower number caps the window below the model’s native size, which reserves less memory for the cache and frees bandwidth for decode. Reach for this one first when a large model crawls.
KV cache quantization
The cache stores keys and values for every token in the window. Holding them at q8 or q4 instead of fp16 roughly halves or quarters the cache's memory footprint, so a long context fits where it otherwise would not. You pay a small quality loss on long-range recall, not a speed penalty.
Weight quantization
How tightly the weights themselves are packed. Conifer defaults to Q4_K_M because decode moves fewer bytes per token at low precision. A higher-precision level like Q5_K_M or Q8_0 buys back a little quality at the cost of memory and speed. The full treatment is on the Quantization page.
Prefix cache
Reuse of the cache computed for a shared conversation prefix, so a follow-up turn does not re-prefill everything you already said. The output is identical, just faster on multi-turn chats. Leave it on unless you are chasing a cache bug.

What each lever moves, and what it costs

| Lever | Buys you | Costs you |
| --- | --- | --- |
| Lower max context | Less cache memory, more decode bandwidth | A shorter window to work in |
| q8 / q4 KV cache | A long context that fits | Slightly weaker long-range recall |
| Higher weight quant | A little more answer quality | More memory, fewer tokens per second |
| Prefix cache on | Faster follow-up turns | Nothing; identical output |
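
To see why the prefix cache speeds up follow-up turns without changing the output, it helps to count what still needs prefilling. The sketch below is a minimal illustration, assuming the cache is keyed on token IDs and reuse stops at the first mismatch; the function names and token values are hypothetical, and Conifer's actual matching logic may differ.

```python
def shared_prefix_len(cached: list[int], incoming: list[int]) -> int:
    """Number of leading tokens the new request shares with the cached one."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


def tokens_to_prefill(cached: list[int], incoming: list[int]) -> int:
    """Tokens that still need a prefill pass after reusing the shared prefix."""
    return len(incoming) - shared_prefix_len(cached, incoming)


turn_one = [11, 12, 13, 14, 15]            # tokens already in the cache
follow_up = turn_one + [16, 17, 18]        # same history plus a new message
edited = [11, 12, 99, 14, 15, 16]          # history changed at the third token

print(tokens_to_prefill(turn_one, follow_up))  # 3
print(tokens_to_prefill(turn_one, edited))     # 4
```

A follow-up that extends the conversation prefills only the new tokens. Editing an earlier message invalidates everything from the edit onward, which is why a rewritten history feels like a fresh prompt.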

Memory is the budget

A model crawling along at about a token a second is almost always under memory pressure. The cache pre-allocates the entire window up front, so a large transformer at a 128K native window can reserve tens of gigabytes for the cache alone, on top of the weights. Push the total past physical RAM and decode falls into swap.
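
The "tens of gigabytes" figure follows from the standard cache-size arithmetic: one key and one value vector per layer, per KV head, per token. The sketch below works it through for a hypothetical model shape (80 layers, 8 KV heads, head dimension 128); those numbers are assumptions for illustration, not a specific model from the ledger, and the q8 and q4 element sizes are nominal, ignoring per-block scale overhead.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_element: float) -> int:
    """Bytes reserved for keys and values across the whole window."""
    # The leading 2 counts one key vector and one value vector per position.
    return int(2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_element)


GIB = 2**30
window = 128 * 1024  # a 128K-token window

for label, size in [("fp16", 2.0), ("q8", 1.0), ("q4", 0.5)]:
    total = kv_cache_bytes(80, 8, 128, window, size)
    print(f"{label}: {total / GIB:.0f} GiB")
# fp16: 40 GiB
# q8: 20 GiB
# q4: 10 GiB
```

The cost scales linearly with the window, so halving max context saves as much cache memory as dropping from fp16 to q8.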

Conifer guards against this. With no explicit ceiling set, the runtime sizes the default window to what fits free memory, floored so short chats always work and clamped to what the model natively supports. A model whose architecture carries almost no cache, like a hybrid or sub-quadratic model, keeps its full window because there is nothing large to budget for. The mechanics are in Context & memory.
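
The fit, floor, and clamp steps described above can be sketched in a few lines. This is an illustration of the idea, not Conifer's implementation: the function name, the 4096-token floor, the choice to subtract the weights from free memory, and the per-token cache cost are all assumptions made for the example.

```python
def default_window(free_bytes: int, weight_bytes: int,
                   cache_bytes_per_token: int, native_window: int,
                   floor_tokens: int = 4096) -> int:
    """Pick a default window: what fits, floored, then clamped to native."""
    if cache_bytes_per_token <= 0:
        # Architectures with almost no per-token cache keep the full window.
        return native_window
    budget = max(0, free_bytes - weight_bytes)
    fits = budget // cache_bytes_per_token
    return min(native_window, max(floor_tokens, fits))


GIB = 2**30
PER_TOKEN = 2 * 80 * 8 * 128 * 2  # hypothetical fp16 cache cost: 327,680 bytes/token
NATIVE = 128 * 1024

print(default_window(32 * GIB, 20 * GIB, PER_TOKEN, NATIVE))   # 39321
print(default_window(21 * GIB, 20 * GIB, PER_TOKEN, NATIVE))   # 4096 (floor applies)
print(default_window(128 * GIB, 20 * GIB, PER_TOKEN, NATIVE))  # 131072 (native clamp)
```

The three calls cover the three regimes: a window sized to the leftover memory, a tight machine where the floor keeps short chats working, and a roomy one where the model's native limit is the only cap.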

GPU offload and the metal underneath

Apple silicon has no offload dial to turn. The CPU and GPU share one pool of unified memory, so the weights are already on the GPU with no copy across a bus and nothing partial to split. The only lever is whether the working set fits that pool, which is the memory budget above.

On a discrete GPU the picture is different, because the card has its own memory and the working set has to fit it rather than one shared pool. That is where a VRAM ceiling starts to matter: the largest window whose cache fits the card, with longer requests clamped to fit rather than rejected. Conifer runs on Apple GPUs through Metal today, with a portable CPU path that works anywhere a model loads; broader GPU backends are on the way.

Pick before you tune

The biggest performance decision is which model you load, and it lands before any of these knobs. A smaller model, or a sparse mixture-of-experts model that activates a fraction of its parameters per token, can run several times faster than a dense model of the same total size. For measured decode speeds per model, see the models ledger; to match a model to the RAM you have, see choosing by hardware. If a model will not load, or generation stalls outright instead of just running slowly, see Troubleshooting. Any term here that needs a one-line definition is in the Glossary.