conifer serve

Run an OpenAI-compatible endpoint over your local models, then point any client you already have at it.


Most tools that talk to a language model speak one dialect: the OpenAI HTTP API. conifer serve answers in that dialect from your own machine. It opens an HTTP listener over the models in your local registry, so a request bound for a hosted API lands on weights in your RAM instead. The bytes never leave the box, and the client reads the response as if nothing changed.

Start the server

Name a model and the process loads it at startup. Leave the flag off and the first request decides what to warm.

terminal
conifer serve --model qwen3-8b

The server binds 127.0.0.1:8080 and holds the foreground until you stop it with Ctrl-C. A model name is whatever the registry records; conifer list prints what is available to serve. One registered model is enough to go live.
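Because the process holds the foreground, a script that launches it usually needs to wait until the listener is up before sending traffic. A minimal sketch, assuming the default 127.0.0.1:8080 bind; `wait_for_port` is a hypothetical helper, not part of conifer:

```python
import socket
import time


def wait_for_port(host: str = "127.0.0.1", port: int = 8080,
                  timeout: float = 30.0, interval: float = 0.25) -> bool:
    """Return True once a TCP connect to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect only proves the listener is up,
            # not that a model has finished loading.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Call `wait_for_port()` after starting the server in another terminal or process. A `True` result means the socket accepts connections; the first request may still pay the model load time.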

The base URL

Every client exposes one knob for this: a base URL. Point it at the address below and leave the rest of the config as you would for a hosted provider.

Setting    Value
Base URL   http://localhost:8080/v1
API key    any non-empty string (ignored)
Model      a registered name, e.g. qwen3-8b

The server runs no auth. The key field that some clients insist on filling is read and discarded; the gate is the loopback bind, not a token.
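For a client you write yourself, the same three settings map onto a plain HTTP request. The sketch below uses only the Python standard library. It assumes the request and response bodies follow the OpenAI chat-completion shape; the `Authorization` header is there only because many clients send one, and `build_chat_request` and `chat` are illustrative names:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # the base URL from the settings above
API_KEY = "unused"                     # any non-empty string; the server discards it


def build_chat_request(prompt: str, model: str = "qwen3-8b",
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions route."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def chat(prompt: str, model: str = "qwen3-8b") -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        payload = json.load(resp)
    # Assumed OpenAI-style response: choices[0].message.content
    return payload["choices"][0]["message"]["content"]
```

With the server running, `print(chat("ping"))` should print the model's reply.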

The routes are the ones you expect. Chat goes to /v1/chat/completions, with both streaming and non-streaming replies; /v1/models lists what is loadable; and /v1/completions covers the older text-completion shape. Request and response parameters are documented on the Chat completions page. For schema-constrained replies, see Structured output.
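Streaming replies in the OpenAI dialect arrive as server-sent events: one `data:` line per chunk, closed by a `data: [DONE]` sentinel. The sketch below assumes conifer follows that wire format and that streaming is requested with `"stream": true` in the request body. `iter_stream_text` is a hypothetical helper that pulls the text fragments out of such a stream:

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from an OpenAI-style SSE chat stream."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

A response object returned by `urllib.request.urlopen` iterates line by line as bytes, so it can be passed to `iter_stream_text` directly and each fragment printed as it arrives.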

Serving more than one model

The catalog the server exposes is your whole registry, not just the model you booted with. Name a different one in the model field of a request and the runtime loads it on demand, unloading the resident model to do it: the engine holds one model at a time, so this is a swap through a single slot, not a warm pool. --keep-alive sets how long an idle model holds its weights before they are freed. For the mechanics, and how a swap drains in-flight work first, see JIT multi-model loading.
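To see which names a request may put in its model field, read the catalog from /v1/models. A minimal sketch, assuming the response follows the OpenAI list shape (a `data` array of objects with an `id`); the function names are illustrative:

```python
import json
import urllib.request


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model names from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = "http://localhost:8080/v1") -> list[str]:
    """Fetch the catalog from a running server and return the model names."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/models") as resp:
        return parse_model_ids(json.load(resp))
```

Any name returned here can go in the model field of the next request. If it differs from the resident model, expect that request to wait for the swap.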

Common flags

--model
    A registry name or a path to a .gguf to load at startup. Optional: leave it off and the server starts empty, loading on the first request.
--port
    The TCP port to listen on. Defaults to 8080.
--host
    The interface to bind. Defaults to 127.0.0.1. Read the note below before changing it.
--keep-alive
    How long an idle model stays resident: seconds, a duration like 5m, -1 to pin it forever, or 0 to unload after every request. Defaults to 5m.
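If you start the server from a script or a test harness, the flags above can be assembled into an argument list. This is a sketch, not a conifer API: `serve_argv` and `start_server` are hypothetical helpers, and the code assumes every flag takes its value as a separate argument, the way `--model qwen3-8b` does in the earlier example.

```python
import subprocess
from typing import Optional


def serve_argv(model: Optional[str] = None, port: int = 8080,
               host: str = "127.0.0.1", keep_alive: str = "5m") -> list[str]:
    """Build the command line for `conifer serve` from the documented flags."""
    argv = ["conifer", "serve", "--host", host, "--port", str(port),
            "--keep-alive", keep_alive]
    if model is not None:
        # Without --model the server starts empty and loads on first request.
        argv += ["--model", model]
    return argv


def start_server(**kwargs) -> subprocess.Popen:
    """Launch the server as a child process; the caller stops it when done."""
    return subprocess.Popen(serve_argv(**kwargs))
```

For example, `start_server(model="qwen3-8b", keep_alive="30m")` launches the server with a longer idle window; call `terminate()` on the returned process to stop it.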

Wiring up a client

Anything built for the OpenAI API works against this endpoint once you swap the base URL. That covers the coding agents people already keep open: Continue, Aider, Cline, Zed, and any editor extension that takes a custom base URL. The walkthrough for Claude Code and Cursor is on Use with Claude Code & Cursor.

To confirm the server is answering, hit it with curl:

terminal
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-8b","messages":[{"role":"user","content":"ping"}]}'