CLI & local API
conifer serve
Run an OpenAI-compatible endpoint over your local models, then point any client you already have at it.
Most tools that talk to a language model speak one dialect: the OpenAI HTTP API. conifer serve answers in that dialect from your own machine. It opens an HTTP listener over the models in your local registry, so a request bound for a hosted API lands on weights in your RAM instead. The bytes never leave the box, and the client reads the response as if nothing changed.
Start the server
Pass --model with a model name and the process loads it at startup. Leave the flag off and the first request decides what to warm.
conifer serve --model qwen3-8b
The server binds 127.0.0.1:8080 and holds the foreground until you stop it with Ctrl-C. A model name is whatever the registry records; conifer list prints what is available to serve. One registered model is enough to go live.
The base URL
Every client exposes one knob for this: a base URL. Point it at the address below and leave the rest of the config as you would for a hosted provider.
| Setting | Value |
|---|---|
| Base URL | http://localhost:8080/v1 |
| API key | any non-empty string (ignored) |
| Model | a registered name, e.g. qwen3-8b |
The server runs no auth. The key field that some clients insist on filling is read and discarded; the gate is the loopback bind, not a token.
The routes are the ones you expect. /v1/chat/completions handles chat, with streaming and non-streaming replies; /v1/models lists what is loadable; and /v1/completions covers the older text-completion shape. Request and response parameters are documented on the Chat completions page. For schema-constrained replies, see Structured output.
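A streamed reply is easier to reason about with a parser in hand. The sketch below assumes the server frames streamed chat replies the way the hosted OpenAI API does: server-sent-event lines of the form `data: {json}`, closed by `data: [DONE]`. This page only states that streaming is supported, so treat that framing as an assumption. `iter_content` is a hypothetical helper name, and the sample events are hand-written, not captured server output.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from an OpenAI-style SSE chat stream.

    Assumption: each event is a line 'data: {json}' and the stream
    ends with 'data: [DONE]', as in the hosted OpenAI API.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample events (illustrative, not real server output).
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"po"}}]}',
    'data: {"choices":[{"delta":{"content":"ng"}}]}',
    "data: [DONE]",
]
reply = "".join(iter_content(sample))
```

Against a live server, the same function could be fed the decoded lines of a response to a request sent with `"stream": true`.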
Serving more than one model
The catalog the server exposes is your whole registry, not just the model you booted with. Name a different one in the model field of a request and the runtime loads it on demand, unloading the resident model to do it: the engine holds one model at a time, so this is a swap through a single slot, not a warm pool. --keep-alive sets how long an idle model holds its weights before they are freed. For the mechanics, and how a swap drains in-flight work first, see JIT multi-model loading.
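The single-slot behaviour can be pictured with a toy model. This is not conifer's implementation, only a sketch of the rules stated above: one resident model, a swap when a request names a different one, and an idle timer set by --keep-alive. The class `SingleSlot`, its fields, and the model name `other-model` are hypothetical, and the sketch ignores how a real swap drains in-flight work.

```python
from typing import Optional


class SingleSlot:
    """Toy model of a one-model-at-a-time runtime. Illustrative only."""

    def __init__(self, keep_alive: float) -> None:
        # Seconds of idle time allowed; -1 pins, 0 unloads after each request.
        self.keep_alive = keep_alive
        self.resident: Optional[str] = None
        self.last_used = 0.0
        self.loads = 0  # how many times weights were (re)loaded

    def _expire(self, now: float) -> None:
        # Free the slot once the idle window has passed (never when pinned).
        if self.resident is None or self.keep_alive < 0:
            return
        if now - self.last_used >= self.keep_alive:
            self.resident = None

    def request(self, model: str, now: float) -> None:
        self._expire(now)
        if self.resident != model:
            self.resident = model  # swap: the previous model is dropped
            self.loads += 1
        self.last_used = now
        if self.keep_alive == 0:
            self.resident = None  # unload straight after the request


slot = SingleSlot(keep_alive=300)
slot.request("qwen3-8b", now=0)
slot.request("qwen3-8b", now=10)     # still resident, no reload
slot.request("other-model", now=20)  # swap out qwen3-8b
slot.request("qwen3-8b", now=30)     # swap back, a third load
```

The point of the toy is the load counter: alternating between two models pays a load on every switch, which is the cost of a single slot compared with a warm pool.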
Common flags
- --model: A registry name or a path to a .gguf file to load at startup. Optional: leave it off and the server starts empty, loading on the first request.
- --port: The TCP port to listen on. Defaults to 8080.
- --host: The interface to bind. Defaults to 127.0.0.1. Read the note below before changing it.
- --keep-alive: How long an idle model stays resident, given as seconds, a duration like 5m, -1 to pin it forever, or 0 to unload after every request. Defaults to 5m.
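The --keep-alive value takes several shapes, so a small parser makes the rules concrete. This is a sketch of how such a value could be interpreted, not the parser conifer uses. Bare seconds, 5m, -1, and 0 come from the description above; the s and h suffixes are assumptions, and `parse_keep_alive` is a hypothetical name.

```python
import re
from typing import Optional

# Assumed unit suffixes; only "m" (as in 5m) is confirmed by the flag description.
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_keep_alive(value: str) -> Optional[float]:
    """Return the idle window in seconds, or None when the model is pinned (-1)."""
    text = value.strip()
    if text == "-1":
        return None  # pinned: never unload on idle
    match = re.fullmatch(r"(\d+)([smh]?)", text)
    if match is None:
        raise ValueError(f"unrecognised keep-alive value: {value!r}")
    number, unit = match.groups()
    # A bare number is read as seconds.
    return float(int(number) * _UNITS.get(unit, 1))
```

Under these assumptions the default 5m comes out as 300 seconds, and 0 yields an idle window of zero, meaning unload after every request.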
Wiring up a client
Anything built for the OpenAI API works against this endpoint once you swap the base URL. That covers the coding agents people already keep open: Continue, Aider, Cline, Zed, and any editor extension that takes a custom base URL. The walkthrough for Claude Code and Cursor is on Use with Claude Code & Cursor.
To confirm the server is answering, hit it with curl:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen3-8b","messages":[{"role":"user","content":"ping"}]}'
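The same check can be made from Python with only the standard library. This is a minimal sketch, assuming the response follows the OpenAI chat-completion shape (`choices[0].message.content`); `build_chat_request` and `ask` are hypothetical helper names, and the `Authorization` header is optional here since the server discards the key.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"


def build_chat_request(model: str, prompt: str,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions route."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Any non-empty key will do; the server reads and discards it.
            "Authorization": "Bearer local",
        },
        method="POST",
    )


def ask(model: str, prompt: str) -> str:
    """Send one chat turn and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        data = json.load(resp)
    # Assumed OpenAI-compatible response shape.
    return data["choices"][0]["message"]["content"]
```

With the server from the start of this page running, `ask("qwen3-8b", "ping")` should return the model's reply as a string.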