Getting started
Your first chat
From a cold app to a reply streaming across the screen, on hardware you already own.
A local runtime makes the most sense once you watch it answer. This walkthrough runs from the studio’s landing screen to a streaming reply, then names the three phases in between. It assumes the app is installed and a model is on disk. If either is missing, install the app and download a model first, then come back.
Open the studio
Launch Conifer and a quiet screen meets you: a greeting, a model picker, and two doors marked chat and code. The picker only sets which weights you mean. It loads nothing. RAM stays untouched until you walk through a door, so the landing screen is instant no matter how large the model you selected.
Pick a small model for this first run. A 4B or 8B model answers in a breath on most laptops and leaves room to feel the speed; a 30B model works too but asks more of your memory. The models ledger lists every model with its size and measured decode speed, and By your hardware works backward from the RAM you have. Click chat.
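To get a feel for why a 30B model asks more of your memory than a 4B one, here is a rough back-of-the-envelope sketch. It is not how the models ledger or By your hardware computes anything. The function names, the bits-per-weight values, and the headroom figure are all assumptions for illustration, and the estimate covers the weights only, not the conversation cache or the rest of the system.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB.

    Hypothetical helper: parameters times bits per weight, divided by 8
    to get bytes. Real model files carry extra metadata on top of this.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float,
         ram_gib: float, headroom_gib: float = 4.0) -> bool:
    """Return True if the weights plus an assumed headroom fit in RAM.

    headroom_gib is an illustrative allowance for the OS, other apps,
    and the conversation cache. It is a guess, not a measured value.
    """
    return weight_gib(params_billion, bits_per_weight) + headroom_gib <= ram_gib


# Illustrative comparison on a machine with 16 GiB of RAM,
# assuming 4 bits per weight for both models.
small_ok = fits(4, 4, ram_gib=16)
large_ok = fits(30, 4, ram_gib=16)
```

Under these assumed numbers the 4B model fits comfortably and the 30B model does not, which is the kind of answer a RAM-first view is meant to give you before you load anything.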
Send the first message
Open chat on a freshly chosen model and the runtime reads the weights off disk into memory. That is the cold start, and it is the one wait you pay per model rather than per message. After it, the model is resident: it sits in RAM and answers the next prompt with no reload.
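The split between choosing a model and loading it can be pictured with a small toy. This is a sketch of the behavior described above, not Conifer's actual code: the `Runtime` class, its method names, and the model names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Runtime:
    """Toy runtime: selecting a model is free, opening chat pays the load."""
    selected: Optional[str] = None   # what the picker points at
    resident: Optional[str] = None   # what is actually held in memory
    cold_starts: List[str] = field(default_factory=list)  # log of loads

    def select(self, model: str) -> None:
        # The picker only records a choice; nothing is read from disk.
        self.selected = model

    def open_chat(self) -> str:
        if self.selected is None:
            raise RuntimeError("pick a model first")
        if self.resident != self.selected:
            # Cold start: replace whatever was resident with the new weights.
            self.resident = self.selected
            self.cold_starts.append(self.selected)
        # Already resident: no reload, answer straight away.
        return self.resident


rt = Runtime()
rt.select("small-4b")   # hypothetical model name
loads_after_select = len(rt.cold_starts)
rt.open_chat()          # first open pays the cold start
rt.open_chat()          # second open finds the model resident
loads_after_two_opens = len(rt.cold_starts)
```

Selecting costs nothing, the first open records one cold start, and the second open records none. Selecting a different model and opening chat again would add exactly one more.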
Type a question and send it. A cursor appears, holds for a moment, then text arrives a few words at a time. The pause before the first word is the model reading your prompt. The flow after it is the model writing the answer.
What just happened
Three phases turn a prompt into a streaming reply. They run in order, every turn, and together they account for most of how the studio’s timing feels.
| Phase | What it does | How it feels |
|---|---|---|
| Load | Weights move from disk into memory, once per model. | A one-time wait, then gone. |
| Prefill | The model reads your whole prompt to produce the first token. | The pause before the first word. |
| Decode | The model writes the rest, one token at a time. | Text streaming in. |
Decode speed is the tok/s number on the models ledger. Prefill scales with how long your prompt is; decode scales with how long the answer is.
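The three phases add up to a simple timing model. The sketch below is only an approximation for building intuition: the function name and every number in the example are made up for illustration, and real prefill and decode rates vary with the model and the hardware.

```python
def estimate_turn_seconds(prompt_tokens: int, answer_tokens: int,
                          prefill_tps: float, decode_tps: float,
                          load_seconds: float = 0.0) -> float:
    """Rough wall-clock time for one turn: load + prefill + decode.

    prefill_tps and decode_tps are tokens per second. load_seconds is
    zero once the model is resident.
    """
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("rates must be positive")
    prefill = prompt_tokens / prefill_tps   # grows with prompt length
    decode = answer_tokens / decode_tps     # grows with answer length
    return load_seconds + prefill + decode


# Illustrative numbers only: a 200-token prompt and a 300-token answer.
first_turn = estimate_turn_seconds(200, 300, prefill_tps=500.0,
                                   decode_tps=30.0, load_seconds=4.0)
follow_up = estimate_turn_seconds(200, 300, prefill_tps=500.0,
                                  decode_tps=30.0)
```

With these assumed rates the decode term dominates both turns, and the only difference between the first turn and the follow-up is the one-time load.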
The streaming is decode happening live. Each token is produced, then handed to the screen, so the answer reads itself out instead of landing all at once. For why decode is the part worth optimizing and how the runtime keeps it fast, see how the engine works.
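The produce-then-hand-off pattern maps naturally onto a generator. This is a minimal sketch with a stand-in for the model: `decode` and `stream_to_screen` are hypothetical names, and the token list is hard-coded where a real runtime would compute each token.

```python
import sys
from typing import Iterable, Iterator, List, TextIO


def decode(tokens: List[str]) -> Iterator[str]:
    """Stand-in for the decode loop: yields one token at a time."""
    for tok in tokens:
        # A real runtime would compute the next token here.
        yield tok


def stream_to_screen(token_iter: Iterable[str],
                     out: TextIO = sys.stdout) -> str:
    """Write each token as soon as it arrives and return the full reply."""
    parts = []
    for tok in token_iter:
        out.write(tok)
        out.flush()          # show the token now instead of buffering
        parts.append(tok)
    return "".join(parts)


reply = stream_to_screen(decode(["Local", " models", " stream", "."]))
```

Because the consumer pulls tokens one at a time, the first words reach the screen while later ones are still being produced, which is why the answer reads itself out.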
Nothing left the machine
No request went to a server. The prompt, the weights, the cache holding the conversation, and the reply all stayed in your machine’s memory. There is no account to sign into and no key to paste because there is no remote endpoint to authenticate against. That is the local-first guarantee: the default, not a setting you turned on.
Keep going
Send a follow-up. The model is already resident, so the load wait is gone and only prefill and decode remain. The conversation so far lives in a cache the runtime reuses, so later turns in a long chat stay responsive instead of re-reading everything from scratch.
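One way to picture why later turns stay responsive is to count how much of the conversation is new on each turn. The source does not describe the cache's internals, so the sketch below is an assumption-heavy toy: it treats the cache as a remembered token prefix and only counts tokens, whereas a real runtime would store computed model state. `ConversationCache` and its method are hypothetical names.

```python
from typing import List


class ConversationCache:
    """Toy cache: remembers the tokens already read in this conversation."""

    def __init__(self) -> None:
        self.tokens: List[str] = []

    def prefill(self, conversation: List[str]) -> int:
        """Return how many tokens must be newly read for this turn."""
        shared = 0
        for cached, incoming in zip(self.tokens, conversation):
            if cached != incoming:
                break            # history diverged; reuse stops here
            shared += 1
        fresh = len(conversation) - shared
        self.tokens = list(conversation)   # remember the full history
        return fresh


cache = ConversationCache()
turn_one = ["What", "is", "prefill", "?"]
turn_two = turn_one + ["It", "reads", "the", "prompt", ".", "And", "decode", "?"]

read_first = cache.prefill(turn_one)    # everything is new
read_second = cache.prefill(turn_two)   # only the tokens after turn one
```

On the second turn only the tokens added since the first turn count as new work, even though the conversation as a whole has tripled in length.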
- Switch models from the picker. The runtime unloads the resident weights and loads the new set, so expect one cold start on the swap.
- Try a different task. Coding, reasoning, and long-document work each favor different models; How to choose narrows the field in four questions.
- Give a model tools. Files, notes, and a calendar are one grant away inside the studio. Read Agents before you wire one up, and read the grant model to see why nothing reaches a folder you did not hand it.