Quantization

Quantization stores each weight in fewer bits than it was trained with; the Q-level names how aggressively the weights are compressed.

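Training typically runs in 16-bit floating point, two bytes per weight, which puts a 14B model around 28 GB. The common 4-bit recipe averages a little under 5 bits per weight once per-block scales and a few higher-precision tensors are counted, so the same model lands near 8 GB, small enough to stay resident on a laptop with room left for the KV cache. The cost is a little accuracy.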
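The arithmetic behind those sizes can be sketched as parameters times bits per weight, divided by eight. The bits-per-weight figures below are approximate averages for these recipes, not values taken from the Conifer catalog, and `approx_size_gb` is a hypothetical helper name.

```python
# Rough on-disk size: parameters x effective bits per weight / 8 bits per byte.
# ASSUMPTION: these bits-per-weight values are approximate averages, not catalog data.
APPROX_BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q4_K_M": 4.85,
}


def approx_size_gb(n_params: float, recipe: str) -> float:
    """Estimate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[recipe]
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    for recipe in APPROX_BITS_PER_WEIGHT:
        print(f"{recipe:7s} ~{approx_size_gb(14e9, recipe):.1f} GB")
```

The estimate covers weights only; the tokenizer and metadata stored alongside them add a comparatively small amount.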

GGUF, the file you download

GGUF is the on-disk format Conifer loads: one file with the quantized weights, the tokenizer, the chat template, and a header naming the architecture. Each catalog row is the same trained weights at a different quantization.
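As a rough illustration of what sits at the front of such a file, the sketch below reads the fixed-size preamble described in the public GGUF specification (version 2 and later): a 4-byte magic, a version number, a tensor count, and a metadata key/value count. It says nothing about how Conifer's own loader works, and `read_gguf_header` is a hypothetical name. The architecture, tokenizer, and chat template live in the metadata key/value section that follows, which this sketch does not parse.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(stream) -> dict:
    """Read the fixed-size GGUF preamble from a binary stream (version 2+ layout)."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count.
    raw = stream.read(20)
    if len(raw) != 20:
        raise ValueError("truncated GGUF header")
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


if __name__ == "__main__":
    # Synthetic header bytes, so the example runs without a real model file.
    fake = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
    print(read_gguf_header(io.BytesIO(fake)))
```

To inspect a real download, open it with `open(path, "rb")` and pass the file object in place of the `BytesIO` buffer.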

Reading a Q-level

A name like Q4_K_M packs three facts: the number is the nominal bit budget per weight, K marks a recipe that quantizes in small blocks so each region gets its own scale, and the trailing S / M / L (small, medium, large) is the size mix, with larger mixes keeping more of the sensitive tensors at higher precision. M is the balanced middle, and the one most builds ship.
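The naming scheme can be unpacked mechanically. The parser below is a minimal sketch that handles only the name shapes appearing on this page (`Q4_K_M`, `Q6_K`, `Q8_0`); `QLevel` and `parse_q_level` are hypothetical names, and other recipe families would need a wider pattern.

```python
import re
from typing import NamedTuple, Optional


class QLevel(NamedTuple):
    bits: int            # nominal bit budget per weight
    k_quant: bool        # True when the name carries the K marker
    mix: Optional[str]   # "S", "M", "L", or None when no mix suffix is present


# Q<bits>_K, Q<bits>_K_<S|M|L>, or Q<bits>_<digit> (for example Q8_0).
_PATTERN = re.compile(r"^Q(\d+)_(?:(K)(?:_([SML]))?|(\d+))$")


def parse_q_level(name: str) -> QLevel:
    """Split a recipe name into its bit budget, K marker, and size mix."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized Q-level name: {name!r}")
    bits, k_marker, mix, _variant = match.groups()
    return QLevel(bits=int(bits), k_quant=(k_marker == "K"), mix=mix)


if __name__ == "__main__":
    for name in ("Q4_K_M", "Q6_K", "Q8_0"):
        print(name, parse_q_level(name))
```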

Common recipes, sized for a 14B model

| Recipe | 14B on disk | What it is for |
| --- | --- | --- |
| `Q4_K_M` | ~8 GB | The default. Near-full quality at the smallest sane size. |
| `Q6_K` | ~12 GB | Effectively lossless for most work. |
| `Q8_0` | ~15 GB | Indistinguishable from full precision. |

What you give up

Down to Q6_K the loss is hard to measure. At Q4_K_M it is measurable but small. The drop becomes noticeable below 4 bits, and a small model feels it sooner than a large one.
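The trend can be seen in a toy experiment. The sketch below uses a simplified scheme, symmetric round-to-nearest with one scale per block of 32 weights, which is an assumption made for illustration and not the actual K-quant layout. It measures round-trip error on random weights at several bit widths; the absolute numbers mean little, but the error grows as the bit budget shrinks.

```python
import math
import random


def quantize_block(block, bits):
    """Quantize one block with a single shared scale and return the restored values."""
    qmax = 2 ** (bits - 1) - 1          # largest representable integer level
    peak = max(abs(w) for w in block)
    if peak == 0.0:
        return list(block)
    scale = peak / qmax
    # Round to the nearest level, clamp to the range, then map back to floats.
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in block]


def rms_error(weights, bits, block_size=32):
    """Root-mean-square difference between original and round-tripped weights."""
    total = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        restored = quantize_block(block, bits)
        total += sum((w - r) ** 2 for w, r in zip(block, restored))
    return math.sqrt(total / len(weights))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for a weight tensor; real weights are not Gaussian noise.
    weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]
    for bits in (8, 6, 4, 3, 2):
        print(f"{bits}-bit  rms error {rms_error(weights, bits):.6f}")
```

Weight error is only a proxy: what matters in practice is the effect on model output, which this sketch does not measure.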

Why the runtime picks one for you

You can pick a recipe by hand, but you rarely should: Conifer offers the largest model and the best quantization that still leave headroom for the context window you want open.
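One way to read that rule is as a filter followed by a ranking. The sketch below is a guess at the shape of such a selection, not Conifer's actual logic: `Build`, `pick_build`, the quality ordering, and the 7B row are all hypothetical, and it assumes model size takes priority over quantization level. The 14B sizes come from the table above.

```python
from dataclasses import dataclass
from typing import Optional

# ASSUMPTION: quality ranking of the recipes on this page, lowest to highest.
QUALITY_ORDER = ["Q4_K_M", "Q6_K", "Q8_0"]


@dataclass(frozen=True)
class Build:
    model: str
    params_b: float   # parameter count in billions
    recipe: str
    size_gb: float    # weights on disk, roughly what must stay resident


def pick_build(catalog, memory_gb: float, context_headroom_gb: float) -> Optional[Build]:
    """Return the largest model, then best recipe, that fits beside the context budget."""
    budget = memory_gb - context_headroom_gb
    fitting = [b for b in catalog if b.size_gb <= budget]
    if not fitting:
        return None
    return max(fitting, key=lambda b: (b.params_b, QUALITY_ORDER.index(b.recipe)))


if __name__ == "__main__":
    catalog = [
        Build("example-14b", 14, "Q4_K_M", 8.0),
        Build("example-14b", 14, "Q6_K", 12.0),
        Build("example-14b", 14, "Q8_0", 15.0),
        Build("example-7b", 7, "Q8_0", 7.5),   # hypothetical row for illustration
    ]
    print(pick_build(catalog, memory_gb=16, context_headroom_gb=4))
    print(pick_build(catalog, memory_gb=10, context_headroom_gb=2))
```

Under these assumptions a 14B build at a lower quantization wins over a 7B build at a higher one whenever both fit, which is one reasonable reading of "largest model and best quantization".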

The model ledger lists per-model sizes and decode speeds. To pick by task, start at Your first model.
