The Gemma family

Google’s open lineage runs dense and punches above its size. It also carries enough architectural quirks to be worth reading on its own terms before you run one.


Gemma is a dense family, so the memory and speed rules for dense models apply unchanged: weights dominate the budget, and decode speed tracks parameter count. The architecture is what earns it a page. It leaves the Llama and Qwen mainstream in three places, and each one changes how a Gemma fits your machine or how faithfully an engine has to run it.

What makes Gemma different

Three details set Gemma apart from a same-size Llama. None is exotic. The engine has to handle all three, and one of them shows up in your memory budget.

A heavy embedding table
Gemma trains on a very large vocabulary, around 256k tokens, so its token-embedding matrix is correspondingly large. On a small model that table is a real fraction of the weights: a 2B Gemma carries more fixed overhead than a 2B Llama, so price it in when you size one against your memory. The embeddings are also scaled by the square root of the hidden size before the first layer, a Gemma detail the loader has to apply or the model drifts.
Sliding-window attention
Most Gemma layers attend to a local window, not the full sequence. That keeps the KV cache smaller at long context than a pure global-attention model of the same size. Gemma 2 alternates local and global layers one for one. Gemma 3 widens that to five local layers per global one, part of how it reaches a 128k window where Gemma 2 stopped at 8k.
Logit soft-capping
Gemma 2 bounds its attention scores and its output logits with a tanh soft cap (50 on attention, 30 on the final logits) rather than letting them run free. Get the cap wrong and the model does not crash. It degrades quietly, the worst failure mode to debug. An engine either honors this exactly or gets it subtly wrong.
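
The three details above can be sketched in a few lines of plain Python. This is a minimal illustration, not any engine's code: the function names are hypothetical, and the layer pattern, window size, and layer count used here are placeholder assumptions, not Gemma's actual configuration. The caps (50 and 30) and the local-to-global ratios come from the text above.

```python
import math

ATTN_CAP = 50.0   # Gemma 2 attention-score cap, per the text above
FINAL_CAP = 30.0  # Gemma 2 final-logit cap, per the text above


def scale_embedding(vec: list[float], hidden_size: int) -> list[float]:
    """Multiply an embedding vector by sqrt(hidden_size) before the first layer."""
    factor = math.sqrt(hidden_size)
    return [v * factor for v in vec]


def soft_cap(x: float, cap: float) -> float:
    """Tanh soft cap: close to x when |x| is small, bounded by +/-cap."""
    return cap * math.tanh(x / cap)


def cached_positions(seq_len: int, window: int,
                     local_per_global: int, n_layers: int) -> int:
    """Count token positions kept in the KV cache, summed over layers.

    Assumption (hypothetical layout): layers repeat as `local_per_global`
    local layers followed by one global layer.
    """
    total = 0
    for layer in range(n_layers):
        is_global = (layer + 1) % (local_per_global + 1) == 0
        total += seq_len if is_global else min(seq_len, window)
    return total
```

With a placeholder 12-layer stack, a 1,024-token window, and a 32,768-token sequence, `cached_positions` gives 202,752 positions at one local layer per global and 75,776 at five per global, against 393,216 if every layer were global. `soft_cap(1000.0, FINAL_CAP)` saturates at 30, while a small score such as 1.0 passes through almost unchanged.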

The generations

Each generation changes the architecture instead of only retraining the same design, so what runs depends as much on the engine version as on the weights. Gemma 3 adds vision upstream; Conifer runs the GGUF text path (see Text, vision & audio).

Gemma in the catalog · decode on M3 Max · Q4_K_M · 512-token prompt

Model | Params | Context | tok/s | MMLU
Gemma 2 2B | 2.6B | 8k | 138 | 51.3
Gemma 2 9B | 9.2B | 8k | ~41 | 71.3
Gemma 3 4B | 4.3B | 128k | 66 | 59.6
Gemma 4 12B | 11.9B | 128k | 15 | 74.5

~ marks a figure projected from a measured sibling's bandwidth, not a measurement. Gemma 4 is the newest generation and the most capable per parameter in the family. For the full ledger, see /models/.
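
One way to read a ~ projection: if decode is bound by memory bandwidth, tokens per second scale inversely with the bytes of weights read per token. The sketch below assumes that model and uses parameter count as a stand-in for weight size at the same quantization. The function name is hypothetical, and the catalog's actual projection method is not given here.

```python
def project_decode_tps(sibling_tps: float, sibling_params_b: float,
                       target_params_b: float) -> float:
    """Project decode speed from a measured sibling.

    Assumes decode is bandwidth-bound, so tok/s is inversely
    proportional to weight size (approximated by parameter count).
    """
    if sibling_params_b <= 0 or target_params_b <= 0:
        raise ValueError("parameter counts must be positive")
    return sibling_tps * sibling_params_b / target_params_b


# Gemma 2 2B (measured, from the table) projected onto Gemma 2 9B.
estimate = project_decode_tps(138.0, 2.6, 9.2)
print(f"{estimate:.1f} tok/s")
```

Under these assumptions the estimate comes out at about 39 tok/s, near but not equal to the table's ~41, so treat it as a rough sanity check and not as the catalog's method.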

Choosing a Gemma

Gemma 2 2B is one of the fastest useful models in the catalog, which makes it a good small assistant to keep resident, so long as your inputs stay under its 8k window. Gemma 3 4B trades some of that speed for capability and a 128k context, the window that lets it hold a long document at all. Reach for Gemma 4 12B when you want the family’s strongest answer and can pay for the slower decode. Whichever you pick, count the embedding table when you decide what fits.
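
Counting the embedding table is simple arithmetic: vocabulary size times hidden size. The sketch below assumes a hidden size of 2,304 for a 2B-class model and a 32k vocabulary for the comparison. Both are illustrative values not taken from this page, the function names are hypothetical, and the total parameter count is held fixed to keep the comparison simple.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in a token-embedding matrix of shape (vocab, hidden)."""
    return vocab_size * hidden_size


def embedding_share(vocab_size: int, hidden_size: int,
                    total_params: float) -> float:
    """Fraction of a model's parameters that sit in the embedding table."""
    return embedding_params(vocab_size, hidden_size) / total_params


HIDDEN = 2304   # assumed hidden size, illustrative only
TOTAL = 2.6e9   # Gemma 2 2B parameter count, from the table

large_vocab = embedding_share(256_000, HIDDEN, TOTAL)  # ~256k vocabulary, per the text
small_vocab = embedding_share(32_000, HIDDEN, TOTAL)   # hypothetical 32k vocabulary
print(f"256k vocab: {large_vocab:.1%}, 32k vocab: {small_vocab:.1%}")
```

Under these assumptions the table is roughly 23% of the parameters with the large vocabulary, against about 3% with the smaller one.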

To see what runs today, name a Gemma in the CLI and let the runtime fit it to your memory:

terminal
conifer run --model gemma-3-4b