How models work
Every model Conifer runs is open weights you download once and run on your own hardware. Two things decide whether a model fits your machine: the numbers in the catalog, and the architecture under each row. This page reads both.
The model ledger gives each model three numbers: size, decode speed on Apple silicon, and a published intelligence score. Those are enough to choose on. What the ledger leaves out is the architecture, which governs how much memory a model takes and how fast it runs on your box. Read the numbers first, then the shape under them.
Read a model in four numbers
A catalog entry is a set of weights, the trained parameters, plus the architecture that runs them. Four properties settle whether it fits your machine and your task.
- parameters
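- Total weight count, in billions. Paired with a quantization it fixes the on-disk and in-memory size of the weights: a 7B model at Q4_K_M is roughly 4 GB. Bigger models tend to know more, and a bigger dense model runs slower; for a Mixture-of-Experts model, speed follows active parameters instead.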
- active parameters
- How many parameters run per token. For a dense model that equals the total. For a Mixture-of-Experts model it is far smaller. A 30B A3B holds 30B parameters but activates about 3B on each token. Active parameters drive speed; total parameters drive memory.
- decode speed
- Tokens per second once a reply is streaming, the rate you feel while you read. To a first approximation it is memory bandwidth divided by the bytes touched per token, so it tracks active parameters and quantization, not the total parameter count sitting in memory.
- intelligence
- A published benchmark score: MMLU on the ledger, the Artificial Analysis composite on the chart. A proxy for capability, not a guarantee. Check it against your own task.
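The size and speed rules above reduce to a few lines of arithmetic. This is a minimal sketch, assuming Q4_K_M averages about 4.8 bits per weight and a machine with 100 GB/s of memory bandwidth. Both figures are illustrative assumptions rather than catalog values, and the function names are hypothetical. The speed result is an upper bound, since it ignores the KV cache and compute overhead.

```python
def weight_bytes(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in bytes."""
    return params_billions * 1e9 * bits_per_weight / 8


def decode_tokens_per_second(
    active_params_billions: float,
    bits_per_weight: float,
    bandwidth_gb_s: float,
) -> float:
    """Bandwidth-bound estimate: bytes available per second / bytes read per token."""
    bytes_per_token = weight_bytes(active_params_billions, bits_per_weight)
    return bandwidth_gb_s * 1e9 / bytes_per_token


Q4_K_M_BITS = 4.8      # assumed average bits per weight, not an exact figure
BANDWIDTH_GB_S = 100   # assumed memory bandwidth of the machine

# Dense 7B: total and active parameters are the same.
print(f"7B dense weights:  {weight_bytes(7, Q4_K_M_BITS) / 1e9:.1f} GB")
print(f"7B dense decode:   {decode_tokens_per_second(7, Q4_K_M_BITS, BANDWIDTH_GB_S):.1f} tok/s")

# 30B A3B: memory follows the 30B total, speed follows the ~3B active.
print(f"30B A3B weights:   {weight_bytes(30, Q4_K_M_BITS) / 1e9:.1f} GB")
print(f"30B A3B decode:    {decode_tokens_per_second(3, Q4_K_M_BITS, BANDWIDTH_GB_S):.1f} tok/s")
```

Under these assumptions the MoE model needs over four times the memory of the dense 7B yet decodes faster, which is the split between total and active parameters described above.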
Three architecture classes
Every model in the catalog belongs to one of three classes. The class names the shape of the memory-against-speed trade, so read it before you download anything.
| Class | Active / token | Memory | Best when |
|---|---|---|---|
| Dense | all parameters | = parameters | you want predictable, well-supported quality |
| Mixture-of-Experts | a few experts | = all experts (large) | you have the RAM and want speed with breadth |
| Hybrid & sub-quadratic | all, but cheaply | tiny KV cache | you need long context to stay fast |
Dense pays full price for predictability. MoE buys a large model's knowledge at a small model's compute, if you can hold every expert in memory. Hybrids spend their savings on context length.
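The "tiny KV cache" entry in the table is easier to weigh against a number. In a standard attention model, the cache holds one key vector and one value vector per layer, per KV head, per token of context. Here is a rough sketch, assuming 16-bit cache entries. The layer and head counts are illustrative and not taken from any catalog model, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # assumes 16-bit cache entries
) -> int:
    """Full-attention KV cache size: keys plus values for every layer and token."""
    keys_and_values = 2
    return (
        keys_and_values
        * n_layers
        * n_kv_heads
        * head_dim
        * context_tokens
        * bytes_per_value
    )


# Illustrative shape: 32 layers, 8 KV heads, head dimension 128.
for context in (4_096, 32_768, 131_072):
    size_gib = kv_cache_bytes(32, 8, 128, context) / 2**30
    print(f"{context:>7} tokens of context: {size_gib:.1f} GiB of KV cache")
```

The cost grows linearly with context on top of the weights, so a design that shrinks or bounds this cache keeps long context affordable in both memory and bandwidth.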
Each class has its own page, with the memory math and the catalog models worked through: Dense models, Sparse & MoE models, and Hybrid & sub-quadratic. The Gemma family is dense, but a heavy embedding table and sliding-window attention give it enough quirks to read on its own terms before you run one.
Families and tuning
Inside the classes, the catalog groups by family: a lineage that shares a tokenizer, a training recipe, and an architecture. Picking a family is picking a temperament.
- Llama, Meta’s line: the most broadly compatible weights, and the safe default at every size.
- Qwen: the widest range in the catalog, from a 0.6B that runs anywhere to 32B dense and the 30B-A3B MoE. Strong at code and reasoning.
- Gemma, Google’s open lineage: capable for its size, with the architecture its own page lays out.
- DeepSeek: reasoning-distilled variants of Qwen and Llama, plus the V2-Lite MoE.
- Phi, Microsoft’s small data-curated models: they answer above their parameter count.
- LFM2, Liquid’s hybrid line: sub-quadratic, the fastest route to long context on a small machine.
The family fixes the architecture. A second axis cuts across all of them, the tuning: the same weights ship as a base model, an instruct model, or a reasoning model, and that choice shapes behavior as much as size does. Base, instruct & reasoning covers which to pull, and why almost everything wants the instruct one. For what a model can take in and produce, see Text, vision & audio.
What “supported” means
Conifer runs a model when the engine has a kernel for its architecture, not its name. A release that reuses a supported architecture, another Qwen 3 dense, say, runs the day its weights publish, with no engine work. A genuinely new architecture is load-gated until the engine ships a kernel for it: it sits in the catalog so you know it is coming, but it will not load yet. The changelog records when each architecture goes live.
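The rule is a lookup on architecture rather than on model name. The sketch below illustrates that idea only. It is not Conifer's implementation, and every identifier in it (`CatalogEntry`, `SUPPORTED_ARCHITECTURES`, `can_load`, the architecture strings and model names) is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical set of architectures the engine has kernels for.
SUPPORTED_ARCHITECTURES = {"llama-dense", "qwen3-dense"}


@dataclass(frozen=True)
class CatalogEntry:
    name: str          # release name shown in the catalog
    architecture: str  # the shape the engine needs a kernel for


def can_load(entry: CatalogEntry, supported: set[str] = SUPPORTED_ARCHITECTURES) -> bool:
    """A model loads when its architecture is supported, whatever its name is."""
    return entry.architecture in supported


# A new release on an existing architecture loads immediately.
new_release = CatalogEntry(name="some-new-qwen3-dense", architecture="qwen3-dense")
# A new architecture stays listed but load-gated until a kernel ships.
new_shape = CatalogEntry(name="some-new-model", architecture="brand-new-arch")

print(new_release.name, "loads:", can_load(new_release))
print(new_shape.name, "loads:", can_load(new_shape))
```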
Run one
Name a supported model and the runtime does the rest, fitting the quant and the context window to your machine. From the local API:
    conifer serve --model qwen3-8b
Why that one line runs fast, kernels matched to the model shape and memory-aware fitting, lives in Inside the engine. To pick by task instead of architecture, start at Choosing a model, which names the family, class, and tuning for each recommendation and links back here.