skip to content
Chat completions

CLI & local API

Chat completions

The /v1/chat/completions endpoint, its parameters, and the streaming and non-streaming shapes that come back.


This is the endpoint most clients hit. You send a list of messages and a few sampling knobs. The server renders the chat template, runs the model in your RAM, and returns an OpenAI chat.completion. The wire shape matches the hosted API field for field, so a client written against OpenAI works the moment its base URL points at the local server. conifer serve covers bringing the server up and setting that URL.

The request body

Two fields are required. model is a name from your registry; messages is the conversation so far. Each message is a { "role", "content" } pair, where role is system, user, or assistant. The template that turns those messages into tokens belongs to the model, so the same request reads differently to a Gemma than to a Qwen. Everything else is optional and falls back to a default.

Optional parameters and their defaults
ParameterDefaultMeaning
temperature0.0Sampling temperature. The default is greedy decoding: the highest-probability token every step.
top_p1.0Nucleus sampling cutoff. 1.0 keeps the full distribution; lower values trim the tail. Clamped to [0, 1].
max_tokens512Cap on tokens generated this turn. Hitting it ends the turn with finish_reason “length.”
stopnoneA string or array of strings. Generation halts at the first match and the matched text is removed from the reply.
seed0Fixes the sampler. The same seed and inputs reproduce the same output; a different seed diverges.
streamfalseWhen true, tokens arrive as server-sent events instead of one buffered reply.

Fields the server does not model are accepted and ignored, so an existing OpenAI request body needs no trimming.

Reasoning and structured output

Two parameters change what comes back rather than how it gets sampled. enable_thinking is off by default: any <think> span a reasoning model emits is stripped from content and discarded, leaving plain prose. Set it to true and that span comes back separately in a reasoning_content field, never mixed into the answer. To force the reply into a JSON object or a schema, use response_format: the engine constrains decoding at the token level, so the output is valid by construction. That path has its own page, Structured output.

A request can also carry keep_alive to set how long the model serving it stays resident after the turn, the same idle TTL the --keep-alive flag sets globally. Name a model that is not yet loaded and the runtime loads it on demand, evicting a cold one to stay in budget. That is JIT multi-model loading.

The response

A non-streaming call returns one JSON object with object: "chat.completion". The reply text sits at choices[0].message.content. Next to it, choices[0].finish_reason says why the turn ended, and a usage block accounts for prompt and completion tokens. Three finish reasons appear:

stop
The model produced its turn-end token, or a stop string matched. The ordinary case: a complete reply.
length
Generation hit max_tokens first. The reply is a truncated fragment; raise the cap to let it finish.
tool_calls
The model chose to call a tool you advertised in tools. The parsed calls are in message.tool_calls and content is usually empty.

Streaming

Set stream: true and the response becomes a text/event-stream of chat.completion.chunk events. The first chunk carries the role. Each later chunk carries a slice of the answer in choices[0].delta.content as that token decodes. A final content-less chunk reports the finish_reason, then the stream closes with the literal line data: [DONE]. Tool calls stream the same way, as delta.tool_calls fragments you concatenate by index.

A request you can run

With a server up, this sends a two-message conversation and prints the buffered reply. Add "stream": true to watch the same answer arrive token by token.

terminal
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b",
    "messages": [
      {"role": "system", "content": "You are terse."},
      {"role": "user", "content": "Name three sub-quadratic attention variants."}
    ],
    "temperature": 0.7,
    "max_tokens": 200
  }'

Those fields are the whole contract. To wire this into an editor agent instead of curl, see Use with Claude Code & Cursor. None of it touches the network past loopback, which is the point of the local-first guarantee.