skip to content
Optimizing around the tool call

Inside the engine

Optimizing around the tool call

When a model calls a tool, the output has to parse. The engine makes that true by construction, one decode step at a time.


A tool call is the tightest loop the engine runs. The model emits a small piece of structured text, the runtime parses it, a tool runs, its result comes back, and the model decides what to do next. The whole loop rests on one fragile assumption: the structured text actually parses. One stray brace or an invented field name breaks the call, and a local model slips more often than a frontier one. So the engine removes the slip rather than betting the model avoids it.

Valid structure by construction

Let the model write whatever it wants, parse the result, and retry on failure: that burns tokens and still fails sometimes. Constrained decoding inverts the order. A tool’s arguments are described by a JSON schema, and the engine compiles that schema into a check it runs at every decode step: given the bytes emitted so far, which token ids are a legal next piece of a schema-valid document?

Decoding produces a logit per token. Before sampling, the engine masks every token that would break the schema down to negative infinity, so the sampler can only choose among legal continuations. The model still picks the content: which city, which file path, which number. What it cannot pick is a token that makes the JSON unparseable. The only output the loop can produce is output that obeys the schema.

The check is byte-level and cached, so running it on every token is cheap. The compiled schema drives a small pushdown automaton over the JSON byte stream; a token is allowed when feeding all of its bytes keeps the automaton alive. The vocabulary is bucketed once by first byte, so each step tests only the tokens whose first byte the automaton currently permits, and identical automaton states (common inside a long string or number) reuse a mask the engine already computed. Per-step overhead stays small enough to run on every token without slowing decode.

The agent loop

Tool calls arrive in a known wire shape. The chat template tells the model to wrap each call in a delimited block carrying a function name and an arguments object. The runtime scans the completion for those blocks, extracts each into a name and a parsed arguments value, and keeps any surrounding prose as the turn’s reply. A truncated or malformed block is never dropped silently: its raw text stays in the content, so a half-finished call surfaces instead of vanishing.

The tool_choice field decides whether a call is optional, forbidden, or mandatory, and that decision sets whether the constraint runs at all.

How tool_choice shapes the step
tool_choiceWhat the engine does
noneTools are advertised but emission is off. The model answers in prose.
autoThe model decides. The runtime parses any call blocks it emits; the default.
requiredSome call must happen. The step is primed with the call marker; one advertised tool pins the schema.
{ name }Exactly this function. Arguments are constrained to that tool's schema, valid by construction.

When a single tool is forced, the engine wraps its parameter schema in an outer object that pins the function name as a constant, so the masked decode can only emit that one call.

The loop runs until the model stops calling tools and returns a final answer. Each tool result feeds back as context for the next step, and that step is decoded under the same rules. Conifer runs the loop on your own hardware: the model, the tool, the result, and the decision about what to do with it stay on the machine. For what a tool is allowed to touch in the first place, see the grant model.

Tool output is untrusted

A tool reads a web page, a calendar entry, or a file you did not write. Whatever comes back is data, not instruction, and a model that cannot tell the difference sits one crafted sentence away from following orders hidden in a document. Every tool result is untrusted input from the moment it re-enters the context.

The concrete risk is a forged role boundary. Chat templates use control markers to open and close turns, and a tool result that contains one of those markers verbatim could fake a second system turn or a prior assistant agreement. The runtime defuses that at the chokepoint where untrusted text enters a rendered prompt: it breaks any control marker so the tokenizer reads it as ordinary text instead of a role token. A human still sees the same characters; the model can no longer be tricked into reopening a privileged turn.

The full treatment of why output is wrapped this way, and what it does and does not contain, lives in untrusted output. The guarantee that none of this leaves your hardware lives in the local-first guarantee.

Where this sits

The tool call is the last of the four surfaces and the shortest-lived. A kernel is chosen once per architecture, a memory fit once per load, a prefill once per prompt. The grammar constraint applies at every decode step inside a single call, then resets for the next one. Read the other three from how the engine works, and start the agent side at agents.