Text, vision & audio

Conifer runs the text path. A few catalog models can see images in their upstream form, but that input isn’t wired up yet, and audio isn’t exposed at all. This page tells you exactly what works now so you can choose a model accordingly.


A model’s modality is the kind of data it reads and writes. The catalog tracks it as a capability per model, and the one that’s fully live is text. That’s what nearly all local work asks for: chat, code, summaries, drafting, and tool calls.

Text

Every catalog model runs text in, text out. It’s the path the runtime, the studio, and the local API are built around, and it works on every architecture Conifer supports, from a 1B dense model to a 30B-A3B mixture-of-experts. The speed and intelligence numbers on the ledger measure this path and nothing else.

Vision, coming

A handful of catalog models ship with sight in their upstream release. Gemma 3 is the clearest case: the original weights take an image alongside the prompt. Conifer loads the GGUF text path of those models. Image input rides a separate piece, a vision projector that turns pixels into tokens the language model can read, and that projector isn’t wired into the runtime yet.

A vision-capable model runs as a text model in Conifer right now, and the catalog says so on each one. Gemma 3 4B loads as the GGUF text path with its vision capability unexposed; the larger Gemma 3 weights need a separate projector file. Until the projector lands, read every catalog model as text-in.

How a vision model behaves in Conifer today

| Model | Upstream | In Conifer today |
| --- | --- | --- |
| Gemma 3 | text + vision | text only (GGUF text path) |
| Dense / MoE text models | text | text, fully supported |

The projector is the missing piece for image input, not the weights. The text half of a vision model already runs at full speed; the image half waits on the projector.
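The rule above can be sketched as a small lookup. This is an illustration only: the `CatalogModel` class, its field names, and the `gemma-3-4b` identifier are hypothetical and do not describe Conifer's actual catalog schema. It shows how a model's upstream capability and its usable inputs differ while the projector is not wired in.

```python
from dataclasses import dataclass


# Hypothetical catalog entry; the field names are illustrative, not Conifer's schema.
@dataclass(frozen=True)
class CatalogModel:
    name: str
    upstream_modalities: frozenset  # what the original release supports
    projector_wired: bool = False   # stays False until a vision projector is wired in

    def effective_inputs(self) -> frozenset:
        """Inputs the model can actually use right now."""
        inputs = {"text"}
        # Vision only counts when the upstream model has it AND the projector is wired.
        if "vision" in self.upstream_modalities and self.projector_wired:
            inputs.add("vision")
        return frozenset(inputs)


# "gemma-3-4b" is an assumed identifier; "qwen3-8b" comes from the curl example on this page.
gemma = CatalogModel("gemma-3-4b", frozenset({"text", "vision"}))
qwen = CatalogModel("qwen3-8b", frozenset({"text"}))

for model in (gemma, qwen):
    print(model.name, sorted(model.effective_inputs()))
```

Both models report text as their only usable input, which matches the table: the upstream capability is recorded, but it does not change what the model can read today.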

Audio, not yet

There’s no speech-in or speech-out surface today. You can’t hand Conifer a clip to transcribe, and the chat path won’t read a reply aloud. A local voice runtime is taking shape inside the engine, but nothing in the app exposes it. Treat audio as absent, not partial.

Need transcription or speech right now? Do it outside Conifer and pass the text in. Whisper or any local transcriber writes the words, and Conifer reads them like any other prompt.
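A minimal sketch of that hand-off, assuming the transcriber has already produced plain text. The endpoint and model name are taken from the curl example on this page; the helper names, the prompt wording, and the OpenAI-style response shape read in `send` are assumptions, not confirmed Conifer behavior.

```python
import json
import urllib.request

# Taken from the curl example on this page; adjust if your server listens elsewhere.
ENDPOINT = "http://localhost:11434/v1/chat/completions"


def build_request(transcript: str, model: str = "qwen3-8b") -> dict:
    """Wrap already-transcribed text in a plain text-only chat request."""
    prompt = "Summarize this transcript:\n\n" + transcript.strip()
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def send(payload: dict, endpoint: str = ENDPOINT) -> str:
    """POST the payload and return the reply text.

    Assumes the response follows the common OpenAI-style shape
    (choices[0].message.content); that shape is not confirmed by this page.
    """
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, with the local server running and a transcript already on disk
# (transcript.txt is a hypothetical file written by your transcriber):
#
#     with open("transcript.txt", encoding="utf-8") as f:
#         print(send(build_request(f.read())))
```

Keeping `build_request` separate from `send` means the transcription step stays entirely outside Conifer: whatever tool produced the words, the request that reaches the server is ordinary text.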

Modalities at the API

The same rule holds at the local server. The chat completions endpoint, the one nearly every client already speaks, takes text and returns text. A request that attaches an image runs as if the image weren’t there, because the text path is the only path the loaded weights expose.

terminal
# text in, text out: the supported path
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-8b","messages":[{"role":"user","content":"hello"}]}'
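Because an attached image is ignored rather than rejected, a client may prefer to drop it explicitly and tell the user. The sketch below assumes requests use the common OpenAI-style content parts (`{"type": "text"}` and `{"type": "image_url"}`); this page does not confirm which multimodal request shape the local server parses, and `text_only` is a hypothetical helper.

```python
def text_only(messages: list) -> tuple:
    """Return (messages with plain-string content, count of non-text parts dropped)."""
    cleaned, dropped = [], 0
    for msg in messages:
        content = msg.get("content", "")
        if isinstance(content, list):
            texts = []
            for part in content:
                # Keep text parts; count everything else (images, audio, unknown types).
                if isinstance(part, dict) and part.get("type") == "text":
                    texts.append(part.get("text", ""))
                else:
                    dropped += 1
            content = "\n".join(texts)
        cleaned.append({**msg, "content": content})
    return cleaned, dropped


# Example request body with one text part and one image part.
# The data URL is a truncated placeholder, not a real image.
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this picture?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ],
}]

cleaned, dropped = text_only(messages)
print(cleaned)
print(dropped)
```

A client can check `dropped` before sending and warn that the image will not be seen, instead of letting the model answer as if no image had been attached.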