Sparse & MoE models


A small router picks a few experts to run on each token, so the model computes like a small one and remembers like a large one, as long as you can keep every expert in memory.


“Sparse” means most of the network sits idle on any given token. The form in the catalog is Mixture-of-Experts (MoE). It takes the dense feed-forward block a dense model runs in full and splits it into many parallel experts. A small router reads each token, scores the experts, and fires only the top few. The catalog writes that as 30B A3B: 30 billion parameters in total, about 3 billion active on any one token. Qwen’s 30B-A3B holds 128 experts and lights up 8 of them per token.
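
The naming convention can be read mechanically. The sketch below is a minimal, hypothetical helper (`parse_moe_label` is not part of any catalog or engine API) that assumes labels follow the `<total>B A<active>B` pattern described above and pulls out the two numbers.

```python
import re

def parse_moe_label(label: str) -> tuple[float, float]:
    """Return (total_billions, active_billions) from a label such as '30B A3B'.

    Assumes the '<total>B A<active>B' pattern; a space or hyphen may separate the parts.
    """
    match = re.search(r"(\d+(?:\.\d+)?)B[\s-]*A(\d+(?:\.\d+)?)B", label)
    if match is None:
        raise ValueError(f"no total/active pair found in {label!r}")
    return float(match.group(1)), float(match.group(2))

total, active = parse_moe_label("Qwen 3 30B A3B")
print(f"{active / total:.0%} of the weights are active on any one token")
```

The active share of the weights (about 10% here) is larger than the share of experts that fire (8 of 128), because the parts of the model outside the expert blocks run on every token.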

Compute of a small model, memory of a large one

A dense model pays one bill: every weight it stores, it also reads on every token. MoE splits that bill into two halves that move in opposite directions.

  • Speed follows the active parameters. Decode is bandwidth-bound, so tokens per second come from the bytes moved per token, not the bytes on disk. A 30B-A3B streams roughly 3B worth of weights each step and decodes at about the speed of a 3B dense model.
  • Memory follows the total parameters. The router can call any expert on the next token, so every expert has to be resident before the first one runs. A 30B-A3B wants ~30B of weights in RAM, around 18 GB at Q4_K_M, even though it runs like a 3B.

That is the whole trade. You get a large model’s breadth at a small model’s decode speed, and you pay a large model’s memory to keep the option open. Give it the RAM and it is the best speed-for-knowledge ratio in local inference. Starve it and an MoE spills to swap, which throws away the one thing that made it worth running.
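
The two halves of the bill can be put into rough numbers. This is a back-of-envelope sketch, not a benchmark: the 4.8 bits per weight is an assumed average chosen because it reproduces the ~18 GB figure above, and the 400 GB/s memory bandwidth is a hypothetical machine. The decode figure is a ceiling that ignores KV cache reads and every other overhead.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Size of a set of weights in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8

def decode_ceiling_tok_s(active_billions: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if each token only has to stream the active weights."""
    return bandwidth_gb_s / weights_gb(active_billions, bits_per_weight)

BITS_Q4_K_M = 4.8   # assumption: approximate average bits per weight
BANDWIDTH = 400.0   # assumption: hypothetical memory bandwidth in GB/s

resident = weights_gb(30, BITS_Q4_K_M)   # what has to sit in RAM
streamed = weights_gb(3, BITS_Q4_K_M)    # what is read per token
moe_ceiling = decode_ceiling_tok_s(3, BITS_Q4_K_M, BANDWIDTH)
dense_ceiling = decode_ceiling_tok_s(30, BITS_Q4_K_M, BANDWIDTH)

print(f"resident {resident:.1f} GB, streamed per token {streamed:.1f} GB")
print(f"decode ceiling: MoE {moe_ceiling:.0f} tok/s vs dense 30B {dense_ceiling:.0f} tok/s")
```

Under these assumptions the memory bill is that of the 30B model while the bandwidth-bound decode ceiling is ten times that of a dense model of the same total size.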

What the router actually does

The router is a tiny matrix. It maps each token to a score per expert, takes the top few, and normalizes their weights. The engine runs that selection on the GPU and gathers only the chosen experts, so no per-layer scores travel back to the CPU. You pay a little bookkeeping. In return, with 8 of 128 experts chosen, the other 120 stay parked in memory and untouched, which is why decode stays fast.
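
The selection step can be sketched in a few lines. This is a toy illustration in plain Python on the CPU, not the engine's GPU kernel: the hidden size of 16, the random router matrix, and the choice to normalize with a softmax over the selected scores are all assumptions made for the example, and real models may normalize differently.

```python
import math
import random

def router_scores(token, router_matrix):
    """One score per expert: a plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, token)) for row in router_matrix]

def route(scores, top_k):
    """Pick the top_k experts by score and return {expert_index: mixing_weight}."""
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    peak = max(scores[i] for i in chosen)                    # subtract for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}           # weights sum to 1

rng = random.Random(0)
HIDDEN, EXPERTS, TOP_K = 16, 128, 8   # toy hidden size; expert counts from the text
router_matrix = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(HIDDEN)]

mix = route(router_scores(token, router_matrix), TOP_K)
print(f"{len(mix)} experts run, {EXPERTS - len(mix)} stay idle")
```

Only the experts named in `mix` would have their feed-forward weights read for this token; their outputs are then combined using the mixing weights.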

MoE models in the catalog

Conifer runs the Qwen 3 30B-A3B line on Metal today, including the Thinking 2507 refresh. A 2026 wave added more MoEs from across the field — OpenAI’s gpt-oss, IBM’s hybrid Granite 4.0, Baidu’s ERNIE 4.5, InclusionAI’s Ring — listed so you can see them coming. Most wait on an engine kernel for their architecture; one is a workstation-class download.

Model                           Total    Active   Status
Qwen 3 30B A3B (2507)           30.5B    ~3B      runs today · qwen3moe
Qwen 3 Coder 30B A3B            30.5B    ~3B      runs today · qwen3moe
Qwen 3 30B A3B Thinking 2507    30.5B    ~3B      runs today · qwen3moe
GPT-OSS 20B                     21B      ~3.6B    load-gated · gpt-oss
Granite 4.0 H Tiny              7B       ~1B      load-gated · granitehybrid
Granite 4.0 H Small             32B      ~9B      load-gated · granitehybrid
ERNIE 4.5 21B A3B Thinking      21B      ~3B      load-gated · ernie4_5-moe
Ring mini 2.0                   16B      ~1.4B    load-gated · bailingmoe2
DeepSeek V2 Lite                15.7B    2.4B     load-gated · deepseek2
Qwen 3 235B A22B                235B     22B      sharded · large-RAM only

“Runs today” means the engine has a Metal kernel for the architecture. “Load-gated” means it is listed but not yet runnable, pending a kernel. The 235B ships as multi-part weights for workstation-class memory. For decode speeds and intelligence scores, see the model ledger.

The Coder variant carries the same sparsity as the Instruct 2507. It is tuned for agentic coding and uses a function-call format built for editor and terminal agents. Whether a model is instruct, base, or reasoning is a separate axis from how it’s wired; an MoE can be any of them.

Sizing one before you download

Read the total, never the active count, to decide whether a model fits. The weights dominate the budget. On top of them sits the KV cache, which grows with the context you keep open. A rough fit for a 30B-A3B at Q4_K_M:

memory
weights   ~18 GB   (all 128 experts resident, Q4_K_M)
KV cache  + grows with context length
--------
budget     keep headroom; 32 GB unified memory and up
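
The same budget can be worked as arithmetic. This is a rough sketch under stated assumptions: the KV cache formula is the standard one for a cache that stores one key and one value vector per layer per token, but the layer count, KV head count, head size, 16-bit cache, and 8 GB of headroom below are illustrative values, not figures from a model card. Read the real ones from the model's configuration before trusting the result.

```python
def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB: keys and values (2 tensors) per layer, per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9

def fits(memory_gb: float, weights_gb: float, cache_gb: float, headroom_gb: float) -> bool:
    """True if weights, cache, and headroom for everything else fit in memory."""
    return weights_gb + cache_gb + headroom_gb <= memory_gb

WEIGHTS_GB = 18.0    # from the text: 30B-A3B at Q4_K_M
HEADROOM_GB = 8.0    # assumption: room left for the OS and other apps

# Assumed, illustrative shape: 48 layers, 4 KV heads of size 128, 16-bit cache.
cache = kv_cache_gb(context_tokens=32_768, n_layers=48, n_kv_heads=4, head_dim=128)

print(f"KV cache at 32k context: {cache:.2f} GB")
print("fits in 32 GB:", fits(32, WEIGHTS_GB, cache, HEADROOM_GB))
print("fits in 16 GB:", fits(16, WEIGHTS_GB, cache, HEADROOM_GB))
```

The cache term scales linearly with the context you keep open, so doubling the window doubles that line of the budget while the weights stay fixed.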

When to reach for one

Reach for an MoE when you want stronger answers than a same-speed dense model gives and you have the memory headroom to hold every expert. The clean case is 32 GB or more of unified memory running a 30B-A3B. It answers like a model far larger than its decode speed would suggest, which makes it a good everyday flagship on a capable Mac. Below that memory line, stay dense.

Long context wants a different lever. An MoE shrinks compute per token but leaves the cache untouched, so when the window is the bottleneck the answer is a hybrid model, which keeps the KV cache small to begin with. Picking by task instead of by architecture? The “Choosing a model” page names the right one for each job.