skip to content
Long context & RAG

Choosing a model

Long context & RAG

When the window is the bottleneck, the model you pick and the size you run it at decide whether long context stays fast or crawls.


The native window is the headline number on most model cards, and the one most likely to mislead you. What a model can hold and what it stays fast at on your machine are two different numbers. A 128K window you cannot fit in memory is not a feature; it drops decode to a token a second. Below: the models that give you real working context, and the one setting that keeps the window honest.

What actually costs you

Two things grow with context, and they grow differently. The KV cache grows with the length of the window, because every token you keep in context leaves its keys and values behind for the next token to attend to. Attention work grows with the square of the sequence in a standard transformer, because each new token looks back at every prior one.

The KV cache bites first on a local machine. The engine reserves it up front, sized to the window you run at, so a long window books a large block of memory before you type anything. On a unified-memory Mac that block competes with the weights for the same RAM. Push past what is physically there and decode falls off a cliff into swap. The Context & memory page works the arithmetic. The window is a memory decision before it is a capability one.

The architectures that stay fast long

Two model classes change the shape of that cost. Both are worth reaching for when context is your constraint.

Hybrid and sub-quadratic models
These swap some or all of the quadratic attention for a recurrent state that carries forward at a fixed size. The KV cache stays small however long the conversation runs, so memory does not balloon with the window and decode holds its speed as history grows. On a small machine, this is the cheapest path to long context.
Sparse and MoE models
A mixture-of-experts fires only a few experts per token, so it decodes at a small model’s speed even at depth. The KV cache does not shrink, and every expert has to sit in memory. Given the RAM, it reads a long document with the depth of a large model and the latency of a small one.

A dense transformer still handles long inputs fine when it fits. Its cost is the steepest of the three, so the window you can afford on it is the smallest.

Models to reach for

Concrete picks, sorted by the machine you are on. Native window is the ceiling the weights support; what you run at depends on your memory, which the next section handles.

ModelNative windowWhy
Qwen 3 30B A3B 2507256KMoE: large-model reading at small-model latency, if you have ~32GB+
Qwen 3 Coder 30B A3B256KRepository-scale code over a long window, agentic tool format
Gemma 3 (4B / 12B / 27B)128KA long window at every size, from a 4B that fits anywhere
Llama 3.1 8B128KA dependable dense long-context workhorse for summarizing

Native windows from the model cards. Gemma 3 replaces Gemma 2's short 8K window with 128K; that jump is the reason to prefer it for long inputs at small sizes.

Read decode speed, download size, and intelligence for each off the model ledger instead of guessing from the window. A long window on a slow model is still slow.

Set the working window, not the native one

The setting that matters most for long context is the one you almost never touch. Conifer sizes a working window to the memory you have free, instead of running every model at its full native ceiling. That ceiling is a trap: a 24B transformer at 128K reserves roughly 20GB of KV cache on top of its weights, past physical RAM on a 36GB Mac, and decode collapses to a crawl. The auto window keeps the cache inside free RAM and floors at a length short chats never reach. You get the longest context that stays fast, not the longest the card advertises.

RAG and feeding documents

Retrieval trades window length for selection. Instead of stuffing a whole corpus into the prompt, you fetch the few passages that bear on the question and pass only those. The window stays short, the KV cache stays small, decode stays fast, and you sidestep a known failure mode: models attend less reliably to the middle of a very long prompt.

Run retrieval against a local model through the local server, which speaks the OpenAI-compatible API your RAG framework already targets. Point the framework’s base URL at Conifer and embeddings and generation both run on your hardware, with nothing leaving the machine.

terminal
conifer serve --model qwen3-30b-a3b

Where to go next

When long documents are the whole workload, the Law and Science & research pages cover the careful-reading case. Not sure the window is even your constraint? Start from How to choose or work back from your hardware.