How models work
Dense models
Every parameter runs on every token. That one fact pins a dense model’s memory and its speed, and it is why dense is the default.
A dense model is the plain transformer. Producing one token reads all of its weights, every layer, no router and nothing to skip. The whole network fires on each step, and none of the work is conditional. It is the easiest class to reason about, the best supported across backends, and the one to reach for until you have a reason not to. Two numbers matter, memory and decode speed, and both fall out of “all of it, every token.”
Memory is the parameter count
The weights are the bill. Their size is the parameter count times the bytes each weight takes at your quantization, and for a dense model at rest that is nearly the whole bill:
- f16, 2 bytes per weight: Full half precision. An 8B model is about 16 GB. Rarely worth it locally.
- Q8_0, around 1 byte per weight: Near-lossless 8-bit. An 8B model is about 8 GB.
- Q4_K_M, around 0.56 bytes per weight: The local default: 4-bit, with the quality-sensitive tensors kept at higher precision. An 8B model lands near 4.5 GB for very little quality cost. Every decode figure in the catalog is measured at this quant.
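The sizes quoted for each quant follow from one multiplication. A minimal sketch, assuming 1 GB means 10^9 bytes and using the per-weight figures listed here; the function name `weight_gb` is made up for illustration:

```python
# Approximate bytes per weight for each quantization, as listed above.
BYTES_PER_WEIGHT = {"f16": 2.0, "Q8_0": 1.0, "Q4_K_M": 0.56}


def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    # billions of params * bytes per weight = billions of bytes = GB
    return params_billion * BYTES_PER_WEIGHT[quant]


if __name__ == "__main__":
    for quant in BYTES_PER_WEIGHT:
        print(f"8B at {quant}: ~{weight_gb(8.0, quant):.1f} GB")
```

These are estimates of the weights alone. Real files carry some metadata, and mixed-precision quants such as Q4_K_M only average out to the quoted figure.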
The weights leave out one thing: the running conversation. Every token you keep open adds to the KV cache, and on a dense model that cache grows linearly with context. Push the window far enough and it outweighs the model itself. That crossover is where a hybrid architecture starts to pay off.
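To get a feel for where that crossover sits, here is a rough sketch. It assumes the standard attention cache layout (one key and one value vector per layer, per KV head, per token), an f16 cache, and the published Llama 3.1 8B shape of 32 layers, 8 KV heads, and a head dimension of 128. A runtime may quantize or page the cache differently, so treat the numbers as an upper-end estimate rather than what any particular engine allocates.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added by one token of context."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Assumed model shape: Llama 3.1 8B with an f16 (2-byte) cache.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

# Weights at Q4_K_M, using the ~0.56 bytes per weight figure from above.
weights_bytes = 8.0e9 * 0.56

# Context length at which the cache is as large as the weights.
crossover = weights_bytes / per_token

if __name__ == "__main__":
    for ctx in (8_192, 32_768, 131_072):
        print(f"{ctx:>7} tokens: {per_token * ctx / 1e9:.1f} GB of KV cache")
    print(f"cache matches the weights near {crossover:,.0f} tokens")
```

Under these assumptions each token costs 128 KiB of cache, so the cache catches up with a 4.5 GB set of weights at roughly 34,000 tokens of context.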
Decode is bandwidth-bound
Generating a token streams every weight from memory through the compute units once. The arithmetic is cheap; the read is not. Decode speed is memory bandwidth divided by the bytes touched per token, and for a dense model that is the entire weight set. Double the parameters and you roughly halve the tokens per second. The catalog shows the line plainly on an M3 Max at Q4_K_M:
| Model | Params | tok/s | MMLU |
|---|---|---|---|
| Llama 3.2 1B | 1.2B | 299 | 49.3 |
| Llama 3.2 3B | 3.2B | 118 | 63.4 |
| Qwen 3 4B | 4.0B | 90 | 73.0 |
| Llama 3.1 8B | 8.0B | 49 | 69.4 |
| Qwen 3 8B | 8.2B | 46 | 76.9 |
| Qwen 3 14B | 14.8B | ~25 | 81.1 |
| Qwen 3 32B | 32.8B | ~12 | 83.6 |
A value marked ~ is a projection from a measured sibling's bandwidth, not a measurement. Tok/s falls almost exactly in inverse proportion to size because the work per token is almost all memory traffic: tok/s tracks bytes read, and for a dense model the bytes read scale with the parameter count.
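One way to check the bandwidth-bound claim against the table is to multiply each row back out. If decode speed is bandwidth divided by bytes read per token, then bytes per token times tok/s should come out roughly constant. The sketch below uses only the table's figures and the ~0.56 bytes per weight quoted for Q4_K_M. The result is an implied effective bandwidth, not a hardware specification.

```python
# (model, params in billions, tok/s) copied from the table above.
# The last two rows are the projected ("~") values.
ROWS = [
    ("Llama 3.2 1B", 1.2, 299),
    ("Llama 3.2 3B", 3.2, 118),
    ("Qwen 3 4B", 4.0, 90),
    ("Llama 3.1 8B", 8.0, 49),
    ("Qwen 3 8B", 8.2, 46),
    ("Qwen 3 14B", 14.8, 25),
    ("Qwen 3 32B", 32.8, 12),
]

BYTES_PER_WEIGHT = 0.56  # Q4_K_M, approximate


def implied_bandwidth_gbs(params_billion: float, tok_s: float) -> float:
    """GB read per token times tokens per second = effective GB/s."""
    return params_billion * BYTES_PER_WEIGHT * tok_s


if __name__ == "__main__":
    for name, params, tok_s in ROWS:
        gbs = implied_bandwidth_gbs(params, tok_s)
        print(f"{name:<14} {gbs:6.0f} GB/s")
```

Every row lands between roughly 200 and 220 GB/s, across a 27x spread in parameter count. Turned around, the same arithmetic gives a quick estimate for a model not in the table: divide that effective bandwidth by the model's size in GB at your quant.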
That is the whole trade. A bigger dense model knows more and answers slower, in lockstep, with no router picking which one you get on a given turn. The model ledger carries the full per-model figures if you want to read the slope yourself.
When to choose dense
Dense is the predictable choice: a known architecture, wide kernel support across Metal, CUDA, and CPU, and quality that does not hinge on a router making the right call. Size up until decode stops feeling fast enough on your hardware, then stop. On a typical laptop that ceiling sits around 7B to 8B. Given more unified memory, 14B and 32B run fine for work that does not need to keep pace with your typing.
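The sizing rule can be written down as a small filter: keep the models that decode fast enough and whose weights fit, then take the largest. This is an illustrative sketch, not part of any tool. `pick_dense` and its thresholds are hypothetical, the tok/s figures are the M3 Max numbers from the table and will differ on other hardware, and the memory check counts weights only, ignoring the KV cache.

```python
# (model, params in billions, tok/s on an M3 Max at Q4_K_M) from the table.
CATALOG = [
    ("Llama 3.2 1B", 1.2, 299),
    ("Llama 3.2 3B", 3.2, 118),
    ("Qwen 3 4B", 4.0, 90),
    ("Llama 3.1 8B", 8.0, 49),
    ("Qwen 3 8B", 8.2, 46),
    ("Qwen 3 14B", 14.8, 25),  # projected
    ("Qwen 3 32B", 32.8, 12),  # projected
]


def pick_dense(min_tok_s: float, memory_gb: float,
               bytes_per_weight: float = 0.56):
    """Largest model that meets the speed floor and fits the memory budget.

    Returns the catalog row, or None if nothing qualifies.
    """
    fits = [
        row for row in CATALOG
        if row[2] >= min_tok_s and row[1] * bytes_per_weight <= memory_gb
    ]
    return max(fits, key=lambda row: row[1], default=None)


if __name__ == "__main__":
    print(pick_dense(min_tok_s=40, memory_gb=8))   # interactive use
    print(pick_dense(min_tok_s=10, memory_gb=24))  # background work
```

Choosing by parameter count mirrors the "size up, then stop" advice. Choosing by a benchmark score instead would be an equally reasonable variation.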
Run one
Name a dense model and that is enough. The runtime fits the quant and the context window to your machine. From the local API:
    conifer serve --model qwen3-8b

A new dense release that reuses a supported architecture, say another Qwen 3 dense, runs the day its weights publish with no engine work. The companion choice, which tuning of those weights to pull, lives on Base, instruct & reasoning.