Choosing a model
By your hardware
Memory is the ceiling on a local machine. Work the decision from there: start with the RAM you have, and read off the models that fit and stay fast.
The other pages in this section start from the work. This one starts from the metal. A model sits in memory the whole time it runs, and on your own hardware that budget is fixed: the weights take a block, the KV cache takes another, and whatever the operating system and your open apps already hold is spoken for before you begin. Overshoot and the machine starts swapping, which drags decode down to a token a second. Pick from what fits and it stays quick.
What fills the memory
Two things share the budget, and they grow for different reasons. The weights are the larger block and the easier one to predict. At Conifer’s 4-bit default, a dense model needs roughly its parameter count in bytes plus a little overhead, so an 8B lands near 5GB and a 32B near 20GB before any context. The sparse case catches people out. A mixture-of-experts decodes at a small model’s speed, since only a few experts fire per token, but it still has to keep every expert resident. Qwen 3 30B A3B activates about 3B parameters per token and occupies roughly 18GB all the same. Speed and footprint come apart there.
The KV cache is the second block. The engine reserves it up front, sized to the window you run at, so a longer context costs memory before you type a word. To keep the two blocks inside what you have, the runtime caps the default reservation near 2GB and fits a working window to the free memory it reads. The arithmetic lives in Context & memory. A model’s real cost is its weights plus the window you mean to run, not the weights alone.
Where each budget lands
Read your machine’s RAM, then read across. These are starting points for the everyday case, not ceilings. The figures are weights at 4-bit; leave a few gigabytes of headroom for the cache and the rest of the system.
| You have | Start with | Weights | Why this one |
|---|---|---|---|
| 8 to 16 GB | Qwen 3 4B Instruct 2507 | ~2.5 GB | Loads fast, leaves room for the cache and your apps, and answers above its size. |
| 16 to 32 GB | Qwen 3 8B | ~5 GB | The everyday default: quick to load, steady enough to drive tools. |
| 32 to 64 GB | Qwen 3 30B A3B 2507 | ~18 GB | MoE depth at small-model latency, once all the experts fit. |
| 64 GB and up | Llama 3.3 70B / Qwen 2.5 72B | ~43 to 47 GB | Frontier-class dense quality when memory stops being the constraint. |
Footprints are approximate 4-bit weights from the model cards. Measured decode speeds and exact download sizes are on the model ledger.
The boundaries are soft. A 36GB Mac runs a 32B with headroom and a 30B-A3B comfortably; a 70B fits only if you squeeze the window and accept slower decode. When the model you want is bigger than your budget, take the smaller one that fits over the large one that thrashes. An 8B at full speed beats a 32B that crawls.
The app reads the machine for you
None of this arithmetic falls on you. Each model card carries a fit verdict, computed against the memory Conifer reads on your machine, so the call is made before you download. It resolves to one of three honest states.
- Fits
- The weights and a working cache land with comfortable headroom. Run it and forget about memory.
- Tight
- It fits, with little room to spare. A long session, other apps, or a growing cache can eat the slack, so keep the window modest.
- Won’t fit
- The weights exceed usable memory. The card names what it needs and what you have, in gigabytes, rather than letting you find out the slow way.
When the memory ceiling can’t be read, the card says so plainly rather than guess a confident “fits.” The verdict leans conservative on purpose: it calls a close one tight before it promises headroom that isn’t there.
When the model is bigger than the budget
Two levers buy real room before you give up on a size. Start with the window. Conifer fits a working context to free memory by default, but when a large model sits on the line you can lower the max-context ceiling in advanced options and reclaim the cache reservation directly. A shorter window is a smaller KV block, and that block is often what tips a “tight” model into thrashing.
The other lever is architecture. A hybrid or sub-quadratic model keeps a tiny cache that barely grows with length, so its memory cost is nearly all weights and the window stops mattering. When long context is what pushes you over budget, that class is the cheapest way to stay fast on a small machine. The Long context & RAG page covers the window in depth.
conifer serve --model qwen3-30b-a3b --ctx 8192Where to go next
You now have a short list that fits. To choose among it by the work you do, go back to How to choose and follow your task, or jump straight to Coding, Reasoning & math, or General assistant. For the exact footprint, decode speed, and intelligence of any model on one axis, read it off the model ledger.