Choosing a model
Reasoning & math
Some questions are won by thinking longer before answering. Reasoning models spend tokens to do that, and on proofs, multi-step word problems, and tricky logic, those tokens buy a better answer.
A normal instruct model answers from the first token. A reasoning model writes out a chain of working first, inside a hidden scratch span, then commits to an answer. That extra pass is the whole mechanism. On a hard algebra problem or a multi-step deduction, the working catches the mistake a one-shot answer would have shipped. On a question with a short, obvious answer, the same working is latency you paid for nothing. Knowing which case you’re in is most of the choice. For the line between base, instruct, and reasoning tuning, see Base, instruct & reasoning.
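The scratch-span-then-answer shape can be sketched in a few lines. This is a minimal illustration, assuming the raw completion wraps its working in `<think>...</think>` markers, as R1-style models commonly do; the helper name `split_working` is hypothetical and is not part of any API described on this page.

```python
import re

# Assumption: the raw completion wraps its working in <think>...</think>.
# Other models may mark the scratch span differently.
THINK_SPAN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_working(raw: str) -> tuple[str, str]:
    """Return (working, answer) from a raw completion string."""
    match = THINK_SPAN.search(raw)
    if match is None:
        # No scratch span: the model answered from the first token.
        return "", raw.strip()
    working = match.group(1).strip()
    # The answer is whatever text lies outside the scratch span.
    answer = (raw[:match.start()] + raw[match.end():]).strip()
    return working, answer


raw = "<think>2x + 3 = 11, so 2x = 8, so x = 4.</think>\nx = 4"
working, answer = split_working(raw)
print(working)
print(answer)
```

An instruct-style reply with no scratch span passes through unchanged, with an empty string for the working.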
The reasoning-tuned line
The clearest reasoners in the catalog are the DeepSeek R1 distills: R1’s step-by-step behavior trained onto a smaller, locally runnable backbone (Qwen or Llama). They think out loud before answering, and for their size they lead on math and structured logic. They are specialists. Tool calling is not what they’re built for, and a small one can loop inside its own reasoning, so keep them on the problem and off the chat.
| Model | Size | Good for |
|---|---|---|
| DeepSeek R1 Distill Qwen 32B | 32.8B | the hardest local reasoning and math, on 32GB+ |
| DeepSeek R1 Distill Qwen 14B | 14.8B | strong math and logic at a comfortable size |
| DeepSeek R1 Distill Qwen 7B | 7.6B | step-by-step problem solving on a laptop |
| Phi-4 | 14.7B | math, logic, and STEM Q&A (gated until Metal-verified) |
Sizes, measured decode speed, and AIME/GPQA figures live on the model ledger. The 7B and 14B distills run on modest hardware; the 32B wants a 32GB+ machine.
One thinking model doubles as a competent generalist. DeepSeek R1 0528 Qwen3 8B puts R1 reasoning on a Qwen 3 backbone, so it reasons step by step yet still holds an ordinary conversation. Microsoft’s smaller Phi-3.5-mini holds its own on logic at a fraction of the size, trading away world knowledge for it.
Thinking modes vs. dedicated reasoners
Not every reasoner is a separate model. The Qwen 3 line ships a switchable thinking mode: one set of weights that answers directly when you want speed and reasons step by step when you turn it on. A capable Qwen 3 (8B, 14B, or the 30B-A3B sparse model) with thinking enabled covers most of the ground an R1 distill does, and stays a dependable tool caller with thinking off. The cost of thinking is the same either way: whether it comes from a dedicated reasoner or a switched-on mode, the model writes its working before it writes your answer, and that adds latency.
- Reach for a dedicated R1 distill when the task is the reasoning itself (competition math, proofs, puzzles, careful multi-step deduction) and nothing else is riding on the same turn.
- Reach for a Qwen 3 with thinking on when you want one resident model that reasons when asked and still calls tools, drafts, and chats the rest of the time.
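The two rules above can be written as a small per-turn router. This is a sketch under stated assumptions, not a recommended implementation: the R1 distill ID is the one used in this page's serve example, `qwen3-8b` is a hypothetical model ID chosen for illustration, and `build_request` is a made-up helper name. The `enable_thinking` field follows the request shown later on this page.

```python
# The R1 distill ID comes from this page's serve example.
R1_DISTILL = "deepseek-r1-distill-qwen-14b"
# Hypothetical ID, used only for illustration; check the catalog for real IDs.
QWEN3 = "qwen3-8b"


def build_request(prompt: str, reasoning_heavy: bool, needs_tools: bool) -> dict:
    """Pick a model and a thinking setting for one turn."""
    if reasoning_heavy and not needs_tools:
        # The task is the reasoning itself: use the dedicated reasoner.
        model, thinking = R1_DISTILL, True
    else:
        # One resident model: think only when the turn calls for it.
        model, thinking = QWEN3, reasoning_heavy
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "enable_thinking": thinking,
    }


proof = build_request("Prove that 17 is prime.", reasoning_heavy=True, needs_tools=False)
lookup = build_request("What's on my calendar?", reasoning_heavy=False, needs_tools=True)
print(proof["model"], proof["enable_thinking"])
print(lookup["model"], lookup["enable_thinking"])
```

Keeping the choice in one function makes it easy to change the policy later, for example by sending every turn to the Qwen 3 model when memory only fits one resident model.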
When the extra tokens pay off
Reasoning isn’t free. A thinking model can emit hundreds of tokens of working before the first word of its answer, and on local hardware every one of those tokens is wall-clock time you sit through. The working also fills the context window, so a long reasoning trace eats room a long document would need. Spend it where it changes the answer.
- Worth it: multi-step arithmetic and algebra, word problems, proofs, constraint and logic puzzles, anything where a confident wrong answer is the failure mode.
- Skip it: lookups, short factual questions, rewriting, summaries, and tight back-and-forth chat, where the working only slows the reply and rarely changes it.
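A back-of-envelope estimate makes the tradeoff concrete. The sketch below assumes the wait is roughly working tokens divided by decode speed and that the working counts against the same context window as the prompt. All figures are illustrative, not measurements; real decode speeds are on the model ledger. The function name `thinking_cost` is hypothetical.

```python
def thinking_cost(working_tokens: int, decode_tokens_per_s: float,
                  context_window: int, prompt_tokens: int) -> dict:
    """Estimate the wait and the context room left after a reasoning trace."""
    # Every token of working is decoded before the answer starts.
    wait_seconds = working_tokens / decode_tokens_per_s
    # The working shares the context window with the prompt.
    context_left = context_window - prompt_tokens - working_tokens
    return {"wait_seconds": wait_seconds, "context_left": context_left}


# Illustrative numbers only: 600 tokens of working at 20 tokens/s,
# with a 1,500-token prompt in an 8,192-token window.
cost = thinking_cost(working_tokens=600, decode_tokens_per_s=20.0,
                     context_window=8192, prompt_tokens=1500)
print(cost)  # {'wait_seconds': 30.0, 'context_left': 6092}
```

With these made-up figures the working costs half a minute before the first word of the answer, which is a fair price for a proof and a poor one for a lookup.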
Reading the working, not just the answer
The reasoning span is part of the value, not noise. The studio keeps it separate from the final answer so you can read the chain and check where a result came from. The same separation holds over the local API. Opt into reasoning on a request and the model’s working returns in its own field, never mixed into the answer text, so a downstream parser sees clean output.
```sh
# serve a reasoner, then ask with thinking on
conifer serve --model deepseek-r1-distill-qwen-14b
curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-qwen-14b",
    "messages": [{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    "enable_thinking": true
  }'
```

When you need the answer in a fixed shape (a number, a JSON object, one of a set of labels), let the model reason freely and constrain only the final output with structured output. The working stays free-form; the answer is valid by construction.
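On the client side, reading the working and the answer separately might look like the sketch below. It assumes an OpenAI-style chat completion body in which the working arrives in a field named `reasoning_content`; that field name, the sample payload, and the helper `read_turn` are all assumptions for illustration, so check the local API reference for the real shape. The label check stands in for whatever fixed shape the structured output enforces.

```python
import json

# Hypothetical response body. The "reasoning_content" field name is an
# assumption; this page only says the working returns in its own field.
sample_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "reasoning_content": "17 has no divisor between 2 and 4, so it is prime.",
            "content": '{"label": "prime"}',
        }
    }]
}

ALLOWED_LABELS = {"prime", "composite"}


def read_turn(response: dict) -> tuple[str, dict]:
    """Return (working, parsed answer), rejecting answers outside the label set."""
    message = response["choices"][0]["message"]
    # The working is free-form text, kept apart from the answer.
    working = message.get("reasoning_content", "")
    # The answer text is expected to be a JSON object.
    answer = json.loads(message["content"])
    if answer.get("label") not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {answer!r}")
    return working, answer


working, answer = read_turn(sample_response)
print(answer["label"])  # prime
```

Because the working never lands in `content`, the parser can treat the answer as strict JSON and still keep the chain around for review.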
Where to go next
Reasoning overlaps with two neighboring tasks. For planning a change or untangling a subtle bug, a reasoner pairs well with a coder, which the Coding page lays out. For graduate-level technical reading and research questions, see Science & research. To put the candidates side by side on AIME, GPQA, and decode speed, the model ledger carries the numbers this page only points at.