Hybrid & sub-quadratic

A dense transformer keeps a KV cache that grows with every token. Hybrid and sub-quadratic models hold most of that cost flat, so long context stays fast on a small machine.

Attention in a standard transformer is quadratic in sequence length, and its KV cache grows by a fixed number of bytes for every token you keep. That cache is what turns a fast model slow once the conversation runs long. Hybrid and sub-quadratic architectures attack the cache itself. They swap most attention layers for cheaper operators that carry a small, fixed-size state instead of a growing K/V buffer, and keep a few real attention layers for the long-range recall that pure attention does best.

What the cache costs

A dense attention layer remembers the past by storing a key and a value vector for every token it has seen. Double the context and you double those bytes: the cache is linear in tokens and unbounded until you cap the window. Sub-quadratic layers fold the past into a state of fixed size instead. Conifer’s engine recognizes two kinds, and both add zero bytes of KV per context position:

short convolution (LFM2 LIV): A gated short convolution mixes each token with a handful of recent ones through fixed filter taps. Decode is an O(1) step that updates a tiny rolling buffer, so the cost per token holds constant as context grows.
linear attention / SSM (DeltaNet): A recurrent state-space layer summarizes everything seen so far in one fixed-width state and updates it token by token. This is the line behind the newest large open models, where it replaces most of the quadratic attention outright.

A hybrid keeps a few true attention layers in the stack, so it still pays a real KV cache, just a much smaller one. Its memory curve stays nearly flat as the window fills. A dense model of the same size climbs steadily until it spills.

Memory is the limit on your own machine

At long context the cache, not the weights, overruns the budget first. A 24B dense model can spend more on K/V than on its own parameters once the window is large, and on a unified-memory Mac that overrun drives decode toward a crawl. The engine’s answer is to size the window to free RAM. The model’s answer is to make the cache small to begin with, which is what a hybrid does.

The savings buy length: more chat history, a long file, or a stack of retrieved documents, none of it forcing the slowdown a dense model would hit at the same depth. Short, one-shot prompts gain nothing here. The flat curve only pays off once the context is genuinely long.

Hybrid models in the catalog

Conifer’s hybrid line is LFM2, the Liquid Foundation Models: a short-convolution-and-attention design built for on-device use. They decode unusually fast for their size and hold that speed as context grows. A 2026 wave broadened the line with Mamba-2 hybrids from across the field — IBM’s Granite 4.0, TII’s Falcon-H1, AI21’s Jamba — most still waiting on an engine kernel for their architecture.

Model	Architecture	Profile
LFM2 / LFM2.5	short conv + attention	near-zero KV cache, long context stays fast
LFM2 MoE	hybrid + sparse experts	the flat-cache win with MoE breadth
Granite 4.0 H Tiny / H Small	Mamba-2 + attention + sparse experts	hybrid and MoE at once; tiny KV at 128K · load-gated
Falcon-H1 7B	parallel attention + Mamba-2	18 languages, 128K context at low KV · load-gated
Jamba Reasoning 3B	SSM-Transformer (Mamba + attention)	256K context on an edge 3B · load-gated

“Load-gated” means listed but not yet runnable, pending an engine kernel for the architecture. The Granite and LFM2 MoE rows layer expert routing on top of the hybrid layout; for what sparse routing buys and costs, see the MoE page.

LFM2-MoE puts a router over the same hybrid layout, so it carries both ideas at once. For what expert routing buys, and the memory bill it adds, read Sparse & MoE models. The newer state-space line (DeltaNet in the largest open releases) uses the recurrent SSM path rather than short convolutions; the engine handles both as zero-KV layers.

When to choose hybrid

Hybrid wins when context length is the constraint: long chats, document-grounded work, or any job where a same-size dense model would bog down as the window fills. For a quick prompt, dense is simpler and just as fast, so the choice only matters once the window is long. The model ledger lists size and decode speed per model; the hybrid advantage is the part a single-prompt number cannot show.

terminal

conifer serve --model lfm2.5-1.2b

What the cache costs#

Memory is the limit on your own machine#

Hybrid models in the catalog#

When to choose hybrid#

What the cache costs

Memory is the limit on your own machine

Hybrid models in the catalog

When to choose hybrid