Quantization


A model trains in 16-bit floats and ships in fewer bits per weight. Quantization is the recipe that shrinks each weight; the Q-level names how hard it shrinks.


The weights you download are not the weights the model trained on. Training runs in 16-bit floating point, two bytes per parameter, which puts a 14B model around 28 GB. Quantization stores each weight in fewer bits and reconstructs an approximate float at read time. The same 14B at the common 4-bit recipe lands near 8 GB, small enough to keep resident on a laptop with room left for the KV cache. You pay a little accuracy for that. The trick is spending the bits you keep where they change the output.
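The size figures above are plain multiplication: parameter count times bytes per weight. A minimal sketch of that arithmetic, assuming decimal gigabytes and the roughly 0.56 bytes per weight that a 4-bit recipe costs once block overhead is counted (the function name is illustrative, not part of any tool):

```python
GB = 1e9  # decimal gigabytes (assumption: sizes are quoted in GB, not GiB)


def weight_bytes(params: float, bytes_per_weight: float) -> float:
    """Approximate size of the weights alone, ignoring tokenizer and metadata."""
    return params * bytes_per_weight


fp16_gb = weight_bytes(14e9, 2.0) / GB   # 16-bit floats: two bytes per weight
q4_gb = weight_bytes(14e9, 0.56) / GB    # common 4-bit recipe, overhead included

print(f"14B at 16-bit: {fp16_gb:.1f} GB")
print(f"14B at 4-bit:  {q4_gb:.1f} GB")
```

The gap between the two results is the memory that quantization frees up for the KV cache and everything else on the machine.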

GGUF, the file you download

GGUF is the on-disk format Conifer loads. One file carries the quantized weights, the tokenizer, the chat template, and a header that names the architecture (llama, qwen3, gemma2, and so on). The engine reads that header, matches it to a forward pass, and streams the weights in their stored precision. Nothing to convert, no separate config to wire up. The recipe is baked into the file, and the file is the unit you pick.
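To make the header concrete, here is a hedged sketch that reads the fixed-size prefix of a GGUF file as the published GGUF specification describes it for version 2 and later: a 4-byte magic, a 32-bit version, then 64-bit tensor and metadata counts, little-endian. The architecture name itself lives further in, in the metadata key-value section (under the key `general.architecture`), which this sketch does not parse. The function and class names are illustrative and say nothing about how Conifer's loader is written.

```python
import io
import struct
from typing import BinaryIO, NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(f: BinaryIO) -> GGUFHeader:
    """Read the fixed prefix of a GGUF file (assumes v2+ and little-endian)."""
    magic = f.read(4)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    fixed = f.read(20)  # uint32 version + uint64 tensor count + uint64 kv count
    if len(fixed) != 20:
        raise ValueError("truncated GGUF header")
    version, tensor_count, kv_count = struct.unpack("<IQQ", fixed)
    return GGUFHeader(version, tensor_count, kv_count)


# Demo on a synthetic header rather than a real download.
fake_file = io.BytesIO(b"GGUF" + struct.pack("<IQQ", 3, 291, 24))
header = read_gguf_header(fake_file)
print(header)
```

With a real file, the same function would be called on `open(path, "rb")`.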

That is why one model shows up in the catalog as several rows. Each row is a different GGUF: the same trained weights at a different quantization, with its own size on disk and its own quality. The name is the recipe.

Reading a Q-level

A name like Q4_K_M packs three facts. The number is the nominal bit budget per weight. K marks a k-quant, which splits a tensor into 256-weight super-blocks and stores compact scales (and, in recipes such as Q4_K, minimums) for the smaller sub-blocks inside each one, so quiet regions and loud ones each get their own range instead of one scale for the whole tensor. The trailing letter, S / M / L, is the size mix: how many extra bits go to the tensors that hurt most under rounding, such as attention and embeddings, while the rest are squeezed. M is the balanced middle, and the one most builds ship.
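The benefit of per-block ranges can be seen in a toy round trip. The sketch below is a deliberately simplified stand-in for a k-quant: it quantizes each block to 4-bit codes with its own scale and minimum, stored as plain floats, whereas real k-quants also quantize the scales themselves and pack the codes. The tensor is synthetic, with one quiet block and one loud block.

```python
import random


def quantize(values, bits):
    """Affine quantization: map values onto 2**bits levels between min and max."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return scale, lo, codes


def dequantize(scale, lo, codes):
    """Reconstruct approximate floats from the stored scale, minimum, and codes."""
    return [lo + scale * c for c in codes]


def roundtrip_error(values, bits, block_size):
    """Mean absolute error after quantizing each block independently."""
    total = 0.0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale, lo, codes = quantize(block, bits)
        restored = dequantize(scale, lo, codes)
        total += sum(abs(a - b) for a, b in zip(block, restored))
    return total / len(values)


rng = random.Random(0)
quiet = [rng.uniform(-0.01, 0.01) for _ in range(256)]  # small-magnitude region
loud = [rng.uniform(-1.0, 1.0) for _ in range(256)]     # large-magnitude region
tensor = quiet + loud

whole_tensor_error = roundtrip_error(tensor, bits=4, block_size=len(tensor))
per_block_error = roundtrip_error(tensor, bits=4, block_size=256)

print(f"one range for the tensor: {whole_tensor_error:.5f}")
print(f"one range per block:      {per_block_error:.5f}")
```

With a single range, the quiet weights all collapse onto one or two levels sized for the loud ones; with a range per block, they keep sixteen levels of their own.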

Bytes per weight is the unit that matters, because it sets both the download and the memory footprint. A k-quant block carries a few bytes of scales on top of the packed values, so the real cost runs a little above the headline bit count.
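The overhead can be worked out from the block layouts. The figures below follow the layouts used by llama.cpp's reference implementation as best I understand them, so treat them as assumptions rather than a specification; the class and dictionary names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockLayout:
    weights: int         # weights covered by one block
    bits: int            # packed bits per weight
    overhead_bytes: int  # scales, minimums, and block-level constants

    @property
    def bytes_per_weight(self) -> float:
        packed = self.weights * self.bits / 8
        return (packed + self.overhead_bytes) / self.weights


# Assumed layouts (per llama.cpp's block structs, not taken from this page).
LAYOUTS = {
    "Q4_K": BlockLayout(256, 4, 12 + 4),  # 12 bytes of 6-bit scales/mins + two fp16 constants
    "Q5_K": BlockLayout(256, 5, 12 + 4),
    "Q6_K": BlockLayout(256, 6, 16 + 2),  # 16 one-byte scales + one fp16 constant
    "Q8_0": BlockLayout(32, 8, 2),        # one fp16 scale per 32 weights
}

for name, layout in LAYOUTS.items():
    print(f"{name}: {layout.bytes_per_weight:.4f} bytes per weight")
```

A shipped file mixes tensor types (that is what the S / M / L suffix controls), so its average can sit slightly off the per-type figure.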

The recipes you will actually meet, ordered by size
| Recipe | Bytes / weight | 14B on disk | What it is for |
| --- | --- | --- | --- |
| Q4_K_M | ~0.56 | ~8 GB | The default. Near-full quality at the smallest sane size. |
| Q5_K_M | ~0.7 | ~10 GB | A step closer to the original, when memory allows. |
| Q6_K | ~0.82 | ~12 GB | Effectively lossless for most work. |
| Q8_0 | ~1.06 | ~15 GB | 8-bit, indistinguishable from full precision. |

Bytes per weight are the on-disk cost including each block's scales, so a 4-bit recipe lands near 0.56 rather than 0.5. Sizes are rounded for a 14B model; the catalog lists the exact bytes per row.

Below Q4_K_M live the i-quants (IQ3, IQ4_XS), which encode weights against a codebook of allowed values to claw back quality at roughly three to four bits per weight. They earn their keep on very large models, where the gap between fitting in memory and not fitting is the whole question. On a model that already fits, reach for them only when you have to.

What you give up

Quantization error is not uniform across the curve. From full precision down to Q6_K, the loss is hard to measure on real tasks. At Q4_K_M it shows up, but small: a slightly worse worst case on long reasoning chains and tight formatting. The drop turns real below 4-bit, and a small model feels it sooner than a large one, since it has fewer parameters to absorb the rounding. A 4-bit 14B (about 8 GB) typically beats an 8-bit 7B (about 7.4 GB) at nearly the same memory budget. That is why the better trade is almost always more parameters at lower precision, not fewer at higher.

Why the runtime picks one for you

You can pick a recipe by hand, but you rarely should. Conifer reads the machine it is on, then offers the largest model and the best quantization that still leave headroom for the context window you want open. The fit verdict on each catalog row runs the same arithmetic: weights at the row’s bytes per weight, plus the KV cache, against usable unified memory.
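The fit arithmetic can be sketched as follows. This is an illustration under stated assumptions, not Conifer's actual code: the KV cache is sized with the standard formula (keys and values, per layer, per KV head, per token) at 16-bit precision, and the model shape below is hypothetical rather than taken from any real 14B model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    params: float     # total parameter count
    n_layers: int
    n_kv_heads: int
    head_dim: int


def kv_cache_bytes(shape: ModelShape, context_tokens: int,
                   bytes_per_element: float = 2.0) -> float:
    """Keys plus values for every layer, KV head, and token in the window."""
    return (2 * shape.n_layers * shape.n_kv_heads * shape.head_dim
            * context_tokens * bytes_per_element)


def fits(shape: ModelShape, bytes_per_weight: float,
         context_tokens: int, usable_bytes: float) -> bool:
    """Weights at the row's rate, plus the KV cache, against usable memory."""
    weights = shape.params * bytes_per_weight
    return weights + kv_cache_bytes(shape, context_tokens) <= usable_bytes


# Hypothetical shape, chosen only to make the numbers concrete.
shape = ModelShape(params=14e9, n_layers=40, n_kv_heads=8, head_dim=128)

cache_gb = kv_cache_bytes(shape, 32768) / 1e9
print(f"KV cache at 32k tokens: {cache_gb:.2f} GB")
print("fits in 16 GB usable:", fits(shape, 0.56, 32768, 16e9))
print("fits in 12 GB usable:", fits(shape, 0.56, 32768, 12e9))
```

Note how the context window moves the verdict: the same row can fit at a short context and fail at a long one.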

The rule it encodes: take Q4_K_M unless you have memory to spare, and when you do, step up to Q5_K_M or Q6_K on the same model before you reach for a different one. Trading a 4-bit large model for an 8-bit small one points the wrong way.
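One way to express that rule in code, as a rough sketch: walk the ladder on a single model and keep the highest-precision recipe whose weights still fit. The budget here is whatever memory is left for weights after the KV cache is set aside; the function name and the ladder are illustrative, with rates taken from the table above.

```python
from typing import Optional

# Recipes ordered from smallest to largest, with approximate bytes per weight.
LADDER = [("Q4_K_M", 0.56), ("Q5_K_M", 0.70), ("Q6_K", 0.82)]


def pick_recipe(params: float, weight_budget_bytes: float) -> Optional[str]:
    """Return the highest-precision recipe that fits, or None if none does."""
    choice = None
    for name, bytes_per_weight in LADDER:
        if params * bytes_per_weight <= weight_budget_bytes:
            choice = name  # keep stepping up while the weights still fit
    return choice


for budget_gb in (6, 8, 10, 12):
    print(f"{budget_gb} GB for weights ->", pick_recipe(14e9, budget_gb * 1e9))
```

A `None` result is the signal to look at a smaller model; until then, the search stays on the same model.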

Quantization sets how many bytes each weight costs. How many weights stay resident at all is the other half. A dense model holds every weight on every token; an MoE holds them all but reads a few, and both pay the quant rate on what they store. The model ledger lists the per-model sizes, decode speeds, and intelligence scores. To pick by task and let the runtime size the fit, start at Your first model.
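The dense versus MoE distinction comes down to two different products of the same rate. A small sketch, using a hypothetical MoE shape (30B stored, 3B read per token) purely for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    total_params: float     # every weight stored and kept resident
    active_params: float    # weights actually read for one token
    bytes_per_weight: float

    @property
    def resident_gb(self) -> float:
        return self.total_params * self.bytes_per_weight / 1e9

    @property
    def read_per_token_gb(self) -> float:
        return self.active_params * self.bytes_per_weight / 1e9


dense = Footprint(total_params=14e9, active_params=14e9, bytes_per_weight=0.56)
moe = Footprint(total_params=30e9, active_params=3e9, bytes_per_weight=0.56)  # hypothetical

for label, fp in (("dense 14B", dense), ("MoE 30B/3B", moe)):
    print(f"{label}: {fp.resident_gb:.1f} GB resident, "
          f"{fp.read_per_token_gb:.2f} GB read per token")
```

The MoE needs more memory to hold but touches less of it per token, which is why the two numbers are worth tracking separately.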