Choosing a model
How to choose
Running a model is the easy part. Picking one out of a hundred is the real decision, and four questions settle it: the task, the tuning, the modality, and the hardware you have.
The runtime handles the mechanical part. It fits your pick to the memory on hand, picks a quantization that lands without swap, and sizes the context window so decode stays fast. The one thing a config file can’t answer is which model is good at the work you do. Four questions below cut a hundred candidates to a short list. Skip to the last section to let the defaults decide.
The task
The task moves the most. A 7B model trained on code out-writes a much larger general model at code, then loses to it on prose. Start from the work in front of you, not the leaderboard. Each page below carries its own short list and the reasons behind every pick.
| If you're doing | Go to |
|---|---|
| Writing or reasoning about code | Coding |
| Math, proofs, hard multi-step problems | Reasoning & math |
| Drafting, editing, and conversation | Writing & advice |
| Technical reading and research questions | Science & research |
| Contracts, citations, careful reading | Law |
| Long documents or retrieval-grounded work | Long context & RAG |
| A bit of everything | General assistant |
The tuning
The same weights ship in different tunings, and the tuning shifts what a model is good for more than a few billion parameters would. Three of them matter.
- Instruct
- Trained to follow instructions and hold a conversation. The default; almost every pick across this section is one.
- Reasoning
- Thinks step by step in the open before it answers. Worth the extra tokens on math, proofs, and tangled multi-step problems; wasted on a one-line question.
- Base
- Untuned text completion. Reach for it only for fill-in-the-middle and similar completion jobs, never for chat.
For when each one wins, and why a reasoning model can beat a larger instruct model on a hard problem, see Base, instruct & reasoning.
The modality
Conifer runs the text path today. Several catalog models are vision-capable upstream, but image input rides a separate projector file the runtime doesn’t load yet, and audio isn’t exposed at all. Choose for text. Text, vision & audio tracks what the GGUF path does and doesn’t carry.
The hardware
Capability costs memory, and on a local machine memory is the ceiling. A dense model needs roughly its parameter count in bytes at the 4-bit default, so an 8B lands near 5GB of weights plus room for the KV cache. A sparse model decodes at a small model’s speed because only a few experts fire per token, but it still has to hold every expert in memory: Qwen 3 30B A3B activates about 3B parameters yet occupies roughly 18GB.
The runtime won’t let a model thrash, but it can’t conjure RAM. By your hardware works backward from your machine to the models that fit and stay fast. Per-model footprints and decode speeds live on the model ledger.
If you don’t want to think about it
Keep a well-rounded 8B instruct model resident and trade up only when a task asks for it. Qwen 3 8B and Llama 3.1 8B both decode comfortably on a laptop and cover most everyday work. Either one is a baseline to judge everything else against.
| You have | Start with | Why |
|---|---|---|
| 8 to 16 GB | Qwen 3 4B Instruct 2507 | Loads in ~2.5 GB and punches far above its size. |
| 16 to 32 GB | Qwen 3 8B | The everyday default: fast, capable, and steady enough for tool use. |
| 32 GB and up | Qwen 3 30B A3B 2507 | MoE depth at small-model speed, once the experts fit. |
Footprints and measured decode speeds are on the model ledger.
Serve the model you settle on over the OpenAI-compatible API on localhost and point your existing tools at it.
conifer serve --model qwen3-8b