Choosing a model
Writing & advice
For drafting, editing, and conversation, the model that reads as a person matters more than the one that tops a benchmark.
Benchmarks reward a correct answer. Writing rewards a good one: the sentence that lands, the paragraph that holds a voice, the edit that fixes the prose without sanding it flat. None of that shows up in a coding pass rate or a math score, so the leaderboard misleads you here. A model two tiers down on the ledger can out-write a heavier reasoner when its tuning leaned toward fluent, well-mannered text.
Haven’t picked a size or quant yet? Read How to choose first. This page assumes the machine is decided and the question is which families handle tone well.
What reads as natural
Fluency comes from instruction tuning and the post-training data, not from raw parameter count. A model trained on careful, varied prose writes careful, varied prose. One trained mostly to pass exams sounds like it is passing an exam. The tell is rhythm. Stiff models open every paragraph the same way, lean on the same connectives, and pad with summary sentences; the ones that read as human vary their length and get to the point.
Two patterns matter for this work. Instruct models answer in your register and follow an editing instruction literally, which is what a rewrite needs. Reasoning models think step by step before they answer. That is right for a proof and wrong for a paragraph: the thinking adds latency, and the prose comes out over-explained. For the distinction, see Base, instruct, and reasoning.
Which models to reach for
Google’s Gemma line has a reputation for conversational polish, and it holds up at sizes a laptop can run. The Mistral and Qwen generalists sit close behind, with stronger tool use if your writing task also touches files or a calendar. The table below is a starting shortlist, not the catalog; per-model size, speed, and intelligence live on the ledger.
| Model | Why for writing | Context |
|---|---|---|
| Gemma 3 12B | Polished writing and strong multilingual prose, with room to spare on a 36 GB machine. | 128K |
| Gemma 3 27B | The most fluent open Gemma; reach for it when the draft has to be good, not just done. | 128K |
| Mistral Small 3.1 24B | Fast, even-handed prose with dependable tool use for writing that also acts on your files. | 128K |
| Qwen 3 8B | A well-rounded everyday writer when memory is the constraint; keep thinking mode off. | 32K |
Context is the native window; larger ones can be extended. See the ledger for exact sizes and measured speed.
On a tighter budget, Qwen 3 4B Instruct (2507 refresh) loads in a couple of gigabytes and drafts cleanly for its size, and Gemma 2 9B still writes well on short inputs. The catch on the older Gemma 2 line is an 8K context window: fine for a chat reply, too small for a long document. Gemma 3 lifts that to 128K. The Gemma family also carries a few architecture quirks worth knowing under heavy use, covered in The Gemma family.
Editing reads differently from drafting
A blank-page draft and a line edit ask for different behavior, and the gap widens at smaller sizes.
Drafting from scratch
Generation forgives a smaller model. An 8B writes a serviceable first pass and you bring the judgment. Decode speed is what you feel, since the model produces every token of a long reply: a model that runs at a comfortable tok/s on your hardware beats a slightly smarter one that crawls.
Editing and feedback
Revision is harder than it looks. The model has to hold your text in context, change only what you asked, and leave the voice intact. Small models drift. They rewrite the whole paragraph when you asked to fix one clause, or they smooth a distinctive sentence into something generic. A larger generalist or a 24B-class model holds the line. On a long piece, the constraint is the window, not the size, and Long context & RAG covers how to keep that fast.
Why this lands differently locally
Writing is where people paste the things they would never send to a cloud endpoint: the unfinished manuscript, the resignation letter, the medical note, the message to a difficult relative. A model on your own hardware reads all of it, and none of it leaves the machine. No account, no log, no retention window to read. That guarantee is the default, not a setting to flip; see The local-first guarantee.
Want one resident all-rounder that also writes well? See General assistant. To work back from the RAM you actually have, start at By your hardware.
To write against a local model from any editor or client, point it at the runtime’s OpenAI-compatible server:
conifer serve --model gemma-3-12b-it