JIT multi-model loading

Name any model in the request and the server loads it on demand, evicting the one it was holding to stay inside the machine’s memory budget.

A hosted API has a fleet behind it, so you can name a different model on every request and something, somewhere, already has it warm. conifer serve runs on one machine with a fixed amount of unified memory, and a set of weights large enough to be worth running rarely fits twice. JIT loading keeps the OpenAI contract working under that constraint. The model field selects from your local catalog, and the runtime loads whatever you name at the moment you name it, not at boot.

The catalog, not the filesystem

At startup the server reads a fixed name-to-path catalog from your local registry: the same list conifer list prints and the CLI manages. A request’s model field resolves to a catalog entry or nothing. It is never read as a filesystem path, so no request can talk the server into loading an arbitrary .gguf off disk. What you registered is what loads, and /v1/models reports that set without warming a single byte.
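The lookup rule is small enough to sketch in Python. This is an illustration, not conifer's implementation: the `CATALOG` contents, the paths, and the `resolve` and `list_models` helpers are all hypothetical. The point is only that the request string is used as a dictionary key and is never opened as a path.

```python
from typing import Optional

# Hypothetical catalog: registered name -> weights path, fixed at startup.
CATALOG = {
    "qwen3-8b": "/models/qwen3-8b.gguf",
    "llama3-8b": "/models/llama3-8b.gguf",
}


def resolve(model_field: str) -> Optional[str]:
    """Return the registered path for a catalog name, or None.

    The request string is only ever a lookup key, so a value such as
    "../../secret.gguf" simply misses; nothing is opened on its behalf.
    """
    return CATALOG.get(model_field)


def list_models() -> list[str]:
    """Names a /v1/models-style listing would report; loads nothing."""
    return sorted(CATALOG)
```

Passing a real path such as `/models/qwen3-8b.gguf` to `resolve` returns `None` here, because only registered names are keys.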

Name a model that isn’t in the catalog and the request still succeeds. Stock OpenAI clients hard-code ids like gpt-4o or local; rejecting those would break every one of them. So an unknown name is advisory. Whatever model is already resident answers, and the response echoes the real id that did the work. The id you read back is the truth, never the string you sent.
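As a rough Python sketch of that advisory rule, assuming a hypothetical `pick_model` helper and placeholder catalog names: a catalog name is honored, anything else falls through to the resident model, and the return value is the id the response would echo.

```python
from typing import Optional

CATALOG = {"qwen3-8b", "llama3-8b"}  # hypothetical registered names


def pick_model(requested: str, resident: Optional[str]) -> Optional[str]:
    """Decide which model id answers a request and is echoed back.

    - A catalog name is honored as-is.
    - Any other name is advisory: the resident model answers.

    Returns None when the name is unknown and nothing is resident. The
    text above does not say what happens in that case, so this sketch
    leaves the decision to the caller.
    """
    if requested in CATALOG:
        return requested
    return resident
```

With `llama3-8b` resident, `pick_model("gpt-4o", "llama3-8b")` returns `"llama3-8b"`. A client that cares which model answered should read the id from the response (the `model` field in OpenAI-style responses) rather than trust the string it sent.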

Load, swap, free

The engine holds one model at a time. That is a choice, not a gap. Two sets of weights and their KV caches fighting for the same unified memory is how a machine starts paging to disk and drops to a token a second. “Multi-model” here means an orderly rotation through one resident slot, not a warm pool.

When a request names a catalog model that isn’t resident, the server unloads the current one and loads the named one in its place. The swap waits for in-flight generations to drain first, so it never yanks a model out from under a turn that is still streaming. A swap costs the cold load: reading the weights, rebuilding the cache. Ask for the resident model again and you pay none of that.
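One way to picture the drain-then-swap ordering is a single slot guarded by a condition variable. The Python below is a sketch under that assumption, not the engine's actual scheduler. `SingleSlot`, `begin_turn`, `end_turn`, and the `cold_loads` counter are hypothetical names, and `_load` stands in for the real work of reading weights.

```python
import threading


class SingleSlot:
    """One resident model; a swap waits for in-flight turns to drain."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0      # turns currently generating
        self._swapping = False   # True while a swap is pending or running
        self.resident = None     # name of the loaded model, if any
        self.cold_loads = 0      # how many times weights were (re)loaded

    def begin_turn(self, name: str) -> None:
        with self._cond:
            # New turns queue behind a swap that is already pending.
            while self._swapping:
                self._cond.wait()
            if self.resident != name:
                self._swapping = True
                # Drain: wait until no turn is still streaming.
                while self._in_flight > 0:
                    self._cond.wait()
                self._load(name)
                self._swapping = False
                self._cond.notify_all()
            self._in_flight += 1

    def end_turn(self) -> None:
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def _load(self, name: str) -> None:
        # Stand-in for unloading the old weights and reading the new ones.
        self.resident = name
        self.cold_loads += 1
```

Calling `begin_turn` twice for the same name increments `cold_loads` only once, which is the "ask for the resident model again and you pay none of that" case.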

keep_alive: how long a model lingers

A loaded model holds its memory until something tells it to let go. keep_alive is that instruction, the same field Ollama exposes, so clients that already set it work unchanged. Send it per request in the JSON body. A background reaper checks the resident model on an interval and unloads it once it has sat idle past its window, handing that unified memory back to the rest of the machine. The next request that names the model loads it again.

keep_alive values
Value	Meaning
`5m`	A duration string. Unload after this much idle time. Also 45s, 1h30m, and so on.
`300`	A bare number is read as seconds, so this is the same idle window as 5m.
`0`	Unload the moment the turn finishes. Memory back immediately, a cold load next time.
`-1`	Pin the model in memory. The reaper never touches it.

The idle clock starts when a turn ends, not when it begins, so a long generation never counts against the window. A keep_alive the server can’t parse is rejected with a 400, never silently replaced by a default, so a typo can’t quietly pin a model you wanted freed.
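The value table can be read as a small parser. The Python sketch below is one possible reading, not the server's code. `parse_keep_alive` is a hypothetical name. It accepts only the h, m, and s units that appear in the examples above, and it rejects negatives other than -1. Both of those choices are assumptions. A `ValueError` stands in for the 400.

```python
import math
import re

_UNITS = {"h": 3600, "m": 60, "s": 1}
_DURATION = re.compile(r"(?:\d+[hms])+")   # e.g. 45s, 5m, 1h30m
_PART = re.compile(r"(\d+)([hms])")
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")   # bare seconds, as a string


def parse_keep_alive(value) -> float:
    """Return the idle window in seconds; math.inf means pinned.

    Raises ValueError for anything unparseable, which a server would
    map to HTTP 400 instead of falling back to a default.
    """
    if isinstance(value, bool):  # JSON true/false is not a duration
        raise ValueError(f"invalid keep_alive: {value!r}")
    if isinstance(value, (int, float)):
        seconds = float(value)
    elif isinstance(value, str):
        text = value.strip()
        if _NUMBER.fullmatch(text):
            seconds = float(text)
        elif _DURATION.fullmatch(text):
            return float(sum(int(n) * _UNITS[u] for n, u in _PART.findall(text)))
        else:
            raise ValueError(f"invalid keep_alive: {value!r}")
    else:
        raise ValueError(f"invalid keep_alive: {value!r}")

    if seconds == -1:
        return math.inf          # pinned: the reaper never touches it
    if not (math.isfinite(seconds) and seconds >= 0):
        raise ValueError(f"invalid keep_alive: {value!r}")
    return seconds
```

Under these assumptions `"5m"` and `300` parse to the same window, `0` parses to zero seconds, and a typo such as `"5 mins"` raises instead of quietly selecting a default.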

The last request wins, as in Ollama. A request carrying its own keep_alive replaces the policy for the resident model. Send no per-request value and the server-wide default applies: the --keep-alive flag on conifer serve, which defaults to 5m.
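Put together, the policy and the idle clock might look like the Python sketch below. Everything in it is illustrative. `Resident`, `start_turn`, `end_turn`, and `should_unload` are hypothetical names. It tracks a single turn at a time, and it reads "the server-wide default applies" as meaning a request with no keep_alive resets the policy to the default. That reading is an assumption.

```python
import math
from dataclasses import dataclass
from typing import Optional

DEFAULT_KEEP_ALIVE = 300.0  # seconds; mirrors the documented 5m default


@dataclass
class Resident:
    name: str
    keep_alive: float = DEFAULT_KEEP_ALIVE  # math.inf means pinned
    idle_since: Optional[float] = None      # None while a turn is running

    def start_turn(self, keep_alive: Optional[float] = None) -> None:
        # Last request wins: a per-request value replaces the policy,
        # and a request without one falls back to the server default.
        self.keep_alive = DEFAULT_KEEP_ALIVE if keep_alive is None else keep_alive
        self.idle_since = None

    def end_turn(self, now: float) -> None:
        # The idle clock starts when the turn ends, not when it begins.
        self.idle_since = now

    def should_unload(self, now: float) -> bool:
        # The question a periodic reaper would ask on each tick.
        if self.idle_since is None:
            return False
        return now - self.idle_since >= self.keep_alive
```

A window of `0.0` makes `should_unload` true the instant the turn ends, and `math.inf` makes it false forever, matching the `0` and `-1` rows of the table.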

Pinning versus rotating

Two patterns cover most setups. To run one model and keep every request instant, pin it with -1 and let it own its memory for the life of the server. To switch between models without stranding RAM on whichever one you touched last, leave a finite window so an idle model frees itself instead of waiting for the next swap to evict it. Picking the model is a separate question, covered in How to choose and sized against your hardware in By your hardware.
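On the client side, the two patterns differ only in the keep_alive value sent with the request. A small Python sketch, where `chat_body` is a hypothetical helper and the model name is borrowed from the curl example on this page:

```python
import json


def chat_body(model: str, prompt: str, keep_alive=None) -> str:
    """Build a chat-completions JSON body, adding keep_alive only if given."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if keep_alive is not None:
        body["keep_alive"] = keep_alive
    return json.dumps(body)


# Pattern 1: one model, pinned for the life of the server.
pinned = chat_body("qwen3-8b", "ping", keep_alive=-1)

# Pattern 2: rotating between models; an idle one frees itself after 10 minutes.
rotating = chat_body("qwen3-8b", "ping", keep_alive="10m")
```

Leaving `keep_alive` out of the body entirely defers to the server-wide default.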

Seeing what is resident

Two listing surfaces answer different questions, neither one loading anything. /v1/models reports the whole servable catalog: everything you could load by naming it. The Ollama-style status surface reports the resident model and how much of its idle window is left, so you can tell at a glance whether the next request runs hot or pays for a swap. Wire shapes for both live on Chat completions.
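To read the catalog from a script, something like the Python sketch below may be enough. It assumes /v1/models returns the OpenAI list shape (an object with a `data` array of entries carrying an `id`), which this page does not spell out. `catalog_ids` is a hypothetical helper and the sample body is a placeholder.

```python
import json


def catalog_ids(models_response: str) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models JSON body."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]


# Sample body in the assumed OpenAI list shape; the id is a placeholder.
sample = '{"object": "list", "data": [{"id": "qwen3-8b", "object": "model"}]}'
print(catalog_ids(sample))
```

In practice the body would come from a GET against the server, for example `http://localhost:8080/v1/models` if you use the host and port from the curl example on this page.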

```sh
# load on first use, then let it free itself after 10 idle minutes
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b",
    "messages": [{"role": "user", "content": "ping"}],
    "keep_alive": "10m"
  }'
```

All of this runs on the loopback bind. No weights and no prompts cross the machine boundary at any point in the rotation, which is the property the local-first guarantee spells out in full.
