Reference
Glossary
One line per term, with a link to the page that defines it in full. The line places the word; the link explains it.
Each term is defined once, somewhere deeper in the docs, and repeated nowhere. The line under each word is enough to recognize it in context. The link is where the full explanation lives, trade-offs and numbers attached.
Terms
- Agent
- A local model handed a set of tools and the authority to use them. It can read a folder or check a calendar instead of only answering in text.
- Base model
- Raw weights trained only to predict the next token, before any instruction tuning. It continues text rather than following a request.
- Bandwidth
- The rate at which weights stream from memory into the compute units. Because decode reads every active weight per token, bandwidth is the wall most models hit, not raw math throughput.
- Context window
- How many tokens a model can attend to at once. Everything inside it occupies the KV cache, so a larger window costs memory.
- Decode
- The second phase of generation, producing the reply one token at a time. Its speed in tokens per second is what you feel while a model answers.
- Dense model
- Reads every parameter on every token. Speed and memory stay easy to predict, and support is the most complete in the catalog.
- GGUF
- The single-file weights format Conifer loads. It holds the model and its quantization in one container the engine maps straight into memory.
- Grant
- A scoped, revocable permission bound to one agent. The kernel checks every tool call against the grants you set, and an agent reaches nothing you did not grant.
- Hybrid model
- An architecture that mixes attention with sub-quadratic layers to keep a small KV cache, so it stays fast as the context grows long.
- Instruct model
- A base model tuned to follow instructions and hold a conversation. The default tuning for chat and most tasks.
- JIT loading
- Name any model in a request and the runtime loads it on demand, evicting cold models to stay inside the memory budget.
- KV cache
- The stored keys and values for every token already in context, kept so the model need not re-read the prompt each step. It grows with the window and is the memory cost most people miss.
- Local-first
- The default that a model runs entirely on your hardware with no outbound call. There is no cloud to opt out of, and nothing leaves the machine on its own.
- Modality
- The kind of input a model accepts: text, vision, or audio. The GGUF text path Conifer runs today is the text modality.
- MoE
- Mixture-of-Experts, a sparse design where a router fires only a few experts per token. You get a large model’s knowledge at a small model’s decode speed, as long as every expert fits in memory.
- Parameter
- One learned number in a model’s weights. The count (7B, 30B) sets how much it knows, and for an MoE the catalog also lists the active parameters that move per token.
- Prefill
- The first phase of generation, reading the whole prompt in one pass to fill the KV cache before the first token. It sets how long you wait for a reply to begin.
- Quantization
- Storing weights at lower numeric precision to shrink the file and the bandwidth per token. A little quality traded for a much smaller memory footprint.
- Reasoning model
- A model tuned to think in explicit steps before answering. It is stronger on hard math and multi-step problems, and slower because it writes that work out.
- Serve
- Run the runtime as an OpenAI-compatible server on localhost, so any existing client points at your machine instead of a cloud endpoint.
- Structured output
- Constraining generation to a JSON schema or grammar at the token level, so the output parses by construction rather than by hope.
- Token
- The unit a model reads and writes, a word piece rather than a character. Speed is measured in tokens per second, and the context window is counted in tokens.
- Untrusted output
- Anything a tool returns (a file, a note, a web page) wrapped as external data before the model sees it, so instructions hidden inside cannot pose as commands from you.
- Weights
- The trained numbers that are the model. Loading a model means reading its weights into memory, and their size after quantization is the largest part of the memory budget.
See also
When a model will not load or a reply crawls, the symptoms and fixes are in Troubleshooting. To pull more tokens per second from a model that already runs, the levers worth touching are in Performance & tuning. Every binding in the studio is listed under Keyboard shortcuts, and what shipped and when is in the Changelog.