Structured output


Ask the server for JSON and get JSON, enforced one token at a time so the reply parses on the first try.


A model that almost always returns valid JSON is a model that sometimes returns broken JSON. The usual workaround is to parse the reply, catch the failure, and ask again. That costs a round trip and still leaves a tail of cases it never resolves. Conifer closes the gap differently. Add a response_format to a chat request and the server constrains decoding so the only output it can produce is valid JSON. Nothing to retry, because nothing invalid ever leaves the decoder.

Three modes of response_format

The field lives on the standard /v1/chat/completions request and follows the OpenAI shape, so a client that already sets it against a hosted provider works unchanged against conifer serve. Its type takes one of three values.

| type | What decoding is constrained to |
| --- | --- |
| text | No constraint. A plain turn; the same as omitting the field. |
| json_object | Any single, well-formed JSON object. Shape is up to the model; syntax is guaranteed. |
| json_schema | One document matching the schema you supply, down to keys, types, and which fields are required. |

The schema case wraps the schema as the OpenAI API does, in a json_schema object with a name and the schema itself. A strict flag is accepted for wire compatibility and ignored: the decoder enforces the schema exactly, so a loose mode would have nothing to relax.
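As a rough illustration, the three values can be written out as request fragments. This is a sketch in Python that uses only the shapes described above; the helper name `response_format_for` and the example schema are made up for illustration and are not part of Conifer.

```python
import json

# Hypothetical example schema, not taken from Conifer's documentation.
PERSON_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["name"],
    "properties": {"name": {"type": "string"}},
}


def response_format_for(mode, schema=None, name="answer"):
    """Build the response_format value for one of the three modes."""
    if mode == "text":
        return {"type": "text"}  # same effect as omitting the field
    if mode == "json_object":
        return {"type": "json_object"}  # syntax guaranteed, shape left to the model
    if mode == "json_schema":
        if schema is None:
            raise ValueError("json_schema mode needs a schema")
        # The schema is wrapped in a json_schema object with a name.
        return {"type": "json_schema", "json_schema": {"name": name, "schema": schema}}
    raise ValueError(f"unknown response_format type: {mode!r}")


if __name__ == "__main__":
    for mode in ("text", "json_object", "json_schema"):
        print(json.dumps(response_format_for(mode, PERSON_SCHEMA)))
```

The sketch leaves out strict, since the server accepts the flag and ignores it.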

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b-instruct",
    "messages": [
      {"role": "user", "content": "Give me the capital and population of France."}
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "country_fact",
        "schema": {
          "type": "object",
          "additionalProperties": false,
          "required": ["capital", "population"],
          "properties": {
            "capital": {"type": "string"},
            "population": {"type": "integer"}
          }
        }
      }
    }
  }'
```

The reply’s content is a JSON string that parses and matches the schema. The model still picks the values: which city, which number. What it cannot pick is a token that would make the document unparseable or off-schema.
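Because the content arrives as a JSON string rather than a nested object, the client still has to decode it once. The sketch below shows that step with the Python standard library. It assumes the response body follows the OpenAI layout (choices[0].message.content), which this page states for the request but not explicitly for the response; `post_chat`, `structured_content`, and `SAMPLE_REPLY` are illustrative names, and the sample reply is hand-written, not captured server output.

```python
import json
import urllib.request


def post_chat(payload, base_url="http://localhost:8080"):
    """POST a chat request to the server and return the decoded response body."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def structured_content(response_body):
    """Decode the message content (assumes OpenAI-style choices[0].message)."""
    content = response_body["choices"][0]["message"]["content"]
    return json.loads(content)  # content is a JSON string, so decode it once


# Hand-written stand-in for a server reply, used only to exercise the parser.
SAMPLE_REPLY = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": json.dumps({"capital": "Paris", "population": 68000000}),
            }
        }
    ]
}

if __name__ == "__main__":
    fact = structured_content(SAMPLE_REPLY)
    print(fact["capital"], fact["population"])
```

With constrained decoding there is no try/except around json.loads in this sketch; a client that also talks to unconstrained endpoints would still want one.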

Enforced, not asked for

Prompting a model to “reply in JSON” is a request it can ignore. This is not a request. At every decode step the engine knows which token ids keep the output a legal continuation of the schema, and it masks out every other token before the sampler sees it. The constraint is the same SchemaConstraint the forced tool-call path uses, applied to one generation instead of a loop. The mechanics, the byte-level automaton, and why running it on every token stays cheap are covered on the “optimizing around the tool call” page.
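The masking step can be pictured with a toy decoder. This is a simplified sketch, not Conifer's SchemaConstraint: it works on whole strings instead of a byte-level automaton, the six-token vocabulary is invented, and the "schema" is just a JSON boolean. What it keeps from the description above is the order of operations: compute the legal token ids, mask everything else, then let the sampler (here, greedy argmax) choose.

```python
import math

VOCAB = ["tr", "ue", "fal", "se", "{", "hello"]  # invented toy vocabulary
TARGETS = ["true", "false"]                       # toy "schema": a JSON boolean


def allowed_ids(prefix):
    """Token ids whose text keeps the output a legal start of some target."""
    return {
        i
        for i, tok in enumerate(VOCAB)
        if any(target.startswith(prefix + tok) for target in TARGETS)
    }


def mask_logits(logits, allowed):
    """Set every disallowed token's logit to -inf so it can never be picked."""
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def constrained_greedy(logits_per_step):
    """Greedy decode, applying the mask before each token is chosen."""
    out = ""
    for logits in logits_per_step:
        if out in TARGETS:  # the document is complete; stop decoding
            break
        masked = mask_logits(logits, allowed_ids(out))
        best = max(range(len(VOCAB)), key=masked.__getitem__)
        out += VOCAB[best]
    return out


if __name__ == "__main__":
    # The raw logits favor "hello" and "{" at both steps; the mask removes them.
    steps = [
        [0.1, 0.0, 0.2, 0.0, 1.0, 5.0],
        [0.3, 0.9, 0.1, 0.2, 2.0, 4.0],
    ]
    print(constrained_greedy(steps))
```

The model's preferences still decide between the legal options: here "fal" beats "tr" at the first step because its logit is higher, which mirrors the point that the model picks the values while the constraint picks the shape.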

What the schema can express

The constraint compiler covers the constructs that tool-call and extraction schemas actually use. Anything past that, it refuses rather than pretending to honor a constraint it cannot enforce. A schema outside the subset returns a 400 at request time, before any weights load, so an unsupported schema never degrades into silently unconstrained text.

Scalars: string, number, integer (no fractional part), boolean, and null.

Composites: object with a fixed, ordered property set and required keys, and array with a single item schema.

Choice and constants: enum and const for fixed values, anyOf and oneOf for a choice between shapes.

Open vs. closed objects: additionalProperties: false pins an object to exactly its declared keys; true lets the model add free-form ones. The empty schema {} and json_object both mean “any JSON object.”
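To make the subset concrete, here is a schema that uses each construct named above, together with a rough client-side lint that flags any keyword the list does not mention. Treat both as assumptions: `SUPPORTED_KEYWORDS` is one reading of the list above rather than an exhaustive set published by Conifer, `TICKET_SCHEMA` and `unsupported_keywords` are invented for this example, and the server's 400 remains the real authority on what it accepts.

```python
# One reading of the constructs listed above; not an official keyword list.
SUPPORTED_KEYWORDS = {
    "type", "properties", "required", "additionalProperties",
    "items", "enum", "const", "anyOf", "oneOf",
}

# Hypothetical schema exercising scalars, composites, choice, and constants.
TICKET_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["kind", "priority", "tags"],
    "properties": {
        "kind": {"const": "ticket"},
        "priority": {"enum": ["low", "medium", "high"]},
        "tags": {"type": "array", "items": {"type": "string"}},
        "assignee": {"anyOf": [{"type": "string"}, {"type": "null"}]},
    },
}


def unsupported_keywords(schema, path="$"):
    """Return (path, keyword) pairs for keywords outside SUPPORTED_KEYWORDS."""
    found = []
    if not isinstance(schema, dict):
        return found
    for key, value in schema.items():
        if key not in SUPPORTED_KEYWORDS:
            found.append((path, key))
        elif key == "properties":
            # Keys here are property names, not keywords; check their schemas.
            for name, sub in value.items():
                found += unsupported_keywords(sub, f"{path}.{name}")
        elif key == "items":
            found += unsupported_keywords(value, f"{path}[]")
        elif key in ("anyOf", "oneOf"):
            for i, sub in enumerate(value):
                found += unsupported_keywords(sub, f"{path}.{key}[{i}]")
    return found


if __name__ == "__main__":
    print(unsupported_keywords(TICKET_SCHEMA))
```

A lint like this only saves a round trip during development. A keyword it flags, such as pattern, is simply one the list above does not mention, which is not the same as proof that the server rejects it.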

Structured output and tool calls

Both ride the same constrained decoder, which is why you cannot use both in one turn. Sending response_format alongside tools returns a 400. A tool call is already one schema-constrained generation wrapped in an agent loop, and two grammars on a single decode leave it ambiguous which one owns the output. Pick the one that matches the job.

| Reach for | When you want |
| --- | --- |
| response_format | A typed answer back from one turn: an extraction, a classification, a parse. |
| a tool call | The model to choose an action and pass arguments, then act on the result and continue. |
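Since sending response_format together with tools is rejected with a 400, a client can refuse the combination before the request leaves the process. The following is a minimal sketch of such a guard; `build_chat_request` is a hypothetical helper, not part of any Conifer client library.

```python
def build_chat_request(model, messages, response_format=None, tools=None):
    """Assemble a chat request body, refusing the combination the server rejects."""
    if response_format is not None and tools:
        raise ValueError(
            "response_format and tools cannot be sent in the same turn"
        )
    body = {"model": model, "messages": messages}
    if response_format is not None:
        body["response_format"] = response_format
    if tools:
        body["tools"] = tools
    return body


if __name__ == "__main__":
    request = build_chat_request(
        "qwen3-8b-instruct",
        [{"role": "user", "content": "Classify this ticket."}],
        response_format={"type": "json_object"},
    )
    print(sorted(request))
```

Failing locally gives the caller a stack trace that points at the offending call site, which is usually easier to act on than an HTTP error surfaced later.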

A forced single tool call is the closest the tool path gets to plain structured output: it pins one function and constrains its arguments to that tool's schema.
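To see how close the two are, the same schema can be expressed both ways. The sketch below assumes the server accepts the OpenAI-style tools and tool_choice fields for pinning a single function; this page only confirms the OpenAI shape for response_format, so treat the tool half as an assumption. The helper names are invented for the example.

```python
COUNTRY_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["capital", "population"],
    "properties": {
        "capital": {"type": "string"},
        "population": {"type": "integer"},
    },
}


def as_response_format(name, schema):
    """Structured output: the schema constrains the reply's content."""
    return {
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": name, "schema": schema},
        }
    }


def as_forced_tool_call(name, schema):
    """Forced tool call: the schema constrains one function's arguments.

    Uses the OpenAI-style tools/tool_choice layout (assumed, not confirmed here).
    """
    return {
        "tools": [
            {"type": "function", "function": {"name": name, "parameters": schema}}
        ],
        "tool_choice": {"type": "function", "function": {"name": name}},
    }


if __name__ == "__main__":
    print(sorted(as_response_format("country_fact", COUNTRY_SCHEMA)))
    print(sorted(as_forced_tool_call("country_fact", COUNTRY_SCHEMA)))
```

The two fragments carry the same schema, and only one of them should go on a given request. Under the OpenAI layout the difference shows up on the way back: a tool call returns its JSON as function arguments, whereas response_format returns it as the message content.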

The agent side starts on the agents page, and the request and response parameters that surround response_format are documented on the conifer serve page. Like everything the server does, the schema, the prompt, and the JSON it produces stay on the machine. That boundary is the local-first guarantee.