skip to content
General assistant

Choosing a model

General assistant

One model you keep loaded for everything that isn’t a specialist’s job: questions, drafts, quick lookups, the occasional tool call.


The all-rounder stays resident, so its weights are in memory before you start typing. It does not have to win a benchmark. It has to be good enough across the board, steady when you hand it a tool, and quick enough that the wait never registers. On local hardware that is a modern 8B instruct model. Start there. Size up only where you hit a wall.

The safe default

An 8-billion-parameter instruct model is the floor that clears the bar. It answers everyday questions coherently, redrafts text, summarizes a document, and handles light code. Decode runs at tens of tokens a second on a laptop, and at four-bit quantization the weights fit a 16 GB machine with room left for the KV cache.

Qwen 3 8B is the pick for one model and no more decisions. It has a switchable thinking mode for the harder prompts, follows instructions closely, and ships a 32K context window that extends to 128K. Tool calls are workable, not airtight: it gets the simple ones right and slips on a long chain, a limit of the size rather than the runtime. If your work leans on long inputs more than reasoning, Llama 3.1 8B trades a little polish for a native 128K window.

When the 8B isn’t enough

Two failure modes tell you to size up: the model loses the thread on a multi-step task, or its tool calls turn flaky inside an agent loop. Both are capacity problems. Both clear once you spend more memory on the model.

ModelSizeKindWhy step up to it
Qwen 3 8B8.2Bdensethe default; one model for nearly everything on 16 GB
Qwen 3 14B14.8Bdensenoticeably stronger reasoning and reliable tool use, if you have ~24 GB
Qwen 3 30B A3B30.5Bsparselarge-model depth at small-model decode speed, given 32 GB for the experts
Mistral Small 3.1 24B24Bdensea fast, Apache-licensed dense generalist with dependable tool use

Parameter counts, measured decode speeds, and intelligence scores live on the model ledger. This page makes the call; the ledger carries the numbers.

The dense step up

Qwen 3 14B is the same family one size larger, so nothing about how you prompt it changes. Reasoning sharpens, and tool calling crosses from workable to reliable. That matters the moment an agent strings several calls together. It asks for more memory and decodes slower than the 8B. That is the price of the depth.

The sparse step up

Qwen 3 30B A3B is a sparse mixture-of-experts model. It holds 30B of parameters, but a router fires only about 3B per token, so it answers with a large model’s knowledge while decoding at roughly a small model’s speed. Memory is the catch. Every expert sits in RAM whether or not it activates, so the full weights occupy most of a 32 GB machine. Given that headroom, it is the best everyday model on this page, and its tool calling holds up.

Keeping it resident

An all-rounder earns its keep by already being loaded. A cold model pays a one-time cost to read weights off disk into memory before the first token; once resident, that cost is gone until you evict it. Pick one default, keep it warm, and let the specialists load on demand when a task needs them.

The same model backs the local API. Serve it once and any OpenAI-compatible client reaches it on localhost. That is the shortest path to making your everyday model the default everywhere.

terminal
conifer serve --model qwen3-8b

When several tools want different models, name any of them per request and the runtime loads it on demand, evicting cold ones to stay inside your memory budget.

Where to go next

A generalist is the right answer until a task has a sharper one. Code rewards a code-tuned model. Proofs and hard multi-step problems reward a reasoning model. Very long inputs are their own problem, covered in long context and RAG. When memory is the binding constraint, work backward from the RAM you have in By your hardware. The four questions behind all of these live on How to choose.