Choosing a model
Coding
For code, a smaller model trained on code usually beats a larger general one, and it decodes faster while doing it.
“Coding” is two jobs wearing one word. Generation writes the function and refactors the file. Reasoning about code explains the bug and plans the change. Most code-tuned models handle both; a few lean hard one way. Everything on the short list below is an instruct model unless the row says otherwise.
The short list
Five models cover almost every local coding setup, from a 16 GB laptop running a 7B to a 32 GB machine running a sparse 30B at small-model speed.
| Model | Size | Kind | Good for |
|---|---|---|---|
| Qwen 3 Coder 30B A3B | 30.5B | sparse | the strongest local coder, if you have the RAM for the experts |
| Qwen 2.5 Coder 7B | 7.6B | dense | the everyday coder; fast on a laptop |
| Qwen 2.5 Coder 1.5B | 1.5B | dense | autocomplete-class, runs on almost anything |
| Qwen 3 8B | 8.2B | dense | a general model that codes well, when you want one model for everything |
| DeepSeek R1 Distill Qwen 14B | 14.8B | reasoning | planning a change, untangling a subtle bug |
Sizes and measured decode speeds live on the model ledger. “reasoning” marks a reasoning-tuned distill; see Base, instruct & reasoning for what that changes.
How to pick among them
Two questions decide it. How much memory you have, and whether the task is generation or reasoning.
By the RAM you have
- 16 GB or less
- Qwen 2.5 Coder 7B is the sweet spot. Drop to the 1.5B for instant completions on a small machine, trading depth for latency.
- 32 GB or more
- Qwen 3 Coder 30B A3B. It is a sparse model, so only a few experts fire per token and it decodes like something far smaller, but the full weights still sit in memory. Give it the headroom and it is the best local choice for code.
Generation versus reasoning
Writing and refactoring want a coder model. Reach for Qwen Coder and let it produce the diff. Planning a multi-file change, or chasing a bug that hides behind two layers of indirection, rewards a reasoning model. An R1 distill thinks out loud before it answers. It catches more, and it costs a burst of tokens up front for the privilege.
A note on autocomplete
Inline completion (fill-in-the-middle) is a base-model job, not an instruct one. The model finishes the code around your cursor instead of chatting about it, so chat-style tuning gets in the way. The small Qwen Coder builds are the right size and shape for it. The tuning distinction is laid out in Base, instruct & reasoning.
Wire it into your editor
Pick a model, serve it on localhost, and point your editor or agent at the local address.
conifer serve --model qwen2.5-coder-7bThe local API speaks the OpenAI-compatible protocol, so Claude Code, Cursor, and Continue connect by swapping the base URL and nothing else. When the generated code has to parse on the first try, pin it to a grammar with structured output. It constrains decoding at the token level, so the output is valid by construction.