Announcement
Sage: one app for local AI.
Sage is the Conifer desktop app — the engine, the tools, and the framework in a single download. Open it, choose a model, and start a conversation. Everything runs on the machine in front of you.
Running capable AI on your own hardware has meant assembling parts: a runtime, a model, the right quantization, extensions, and a separate tool for each workflow. Sage removes that assembly. It is a single application that brings the inference engine, the tools, and the framework integration together, with more than 30 open-weight models behind one picker.
The model layer stays transparent and stays out of the way. Sage handles download, storage, quantization, memory fit, and hardware-aware execution, so the work of getting a model to run well on your machine is already done. Open the app, pick a model, and start typing.
Why this matters
Local AI is no longer limited by model capability alone. Open-weight models are good enough for most everyday work. What remained was friction — accessibility, trust, and setup overhead. Sage is built to remove that friction, so a local model becomes a tool you reach for rather than a system you configure.
What’s included
Sage is one application, not a kit to assemble:
- The inference engine, with kernels tuned per architecture and per chip.
- The tools a model needs to be useful — files, search, and more — behind explicit, deny-by-default grants.
- Framework integration, so the app, the CLI, and a local OpenAI-compatible API share one runtime.
- A model picker spanning more than 30 open-weight models, from 0.5B to the largest open releases.
- A chat surface you can open and use right away.
Benchmarks
Simple to use does not mean unmeasured. The engine is benchmarked head-to-head against the field on the same hardware, with the same models and workloads, and nothing leaving the device.
On an Apple M3 Max with the Metal backend, decode runs close to the memory-bandwidth wall — the limit that ultimately caps local token generation — reaching up to about 89% of the chip’s theoretical bandwidth on the 7–8B models. Against llama.cpp on identical Q4_K_M weights, Conifer leads on decode across every model tested and sits at parity on prefill. MLX, Apple’s own framework, still leads on decode for several models; Conifer is ahead on prefill for many.
| Model | Conifer | llama.cpp |
|---|---|---|
| LFM2-350M | 764 | 506 |
| Llama-3.2-1B | 270 | 206 |
| Qwen2.5-7B | 57 | 53 |
| Llama-3.1-8B | 55 | 51 |
| Gemma-3-12B | 26 | 24 |
Apple M3 Max (36 GB), Metal backend, Q4_K_M weights, 512-token prompt and 128-token decode, best of three runs. Per-model figures live in the model ledger.
A few highlights from the same run:
- Fastest decode: 764 tok/s on LFM2-350M.
- Lowest energy: 0.06 joules per token on LFM2-350M.
- Best efficiency: about 16 tokens per second per watt on LFM2-350M.
- Local marginal cost: from $0.0031 per million tokens — electricity, not an API bill — with your data staying on-device.
On Windows with NVIDIA, the CUDA backend reaches parity with llama.cpp on decode; CUDA prefill and the Vulkan backend are still being optimized.
Why local
Cloud AI is increasingly priced and scaled for the frontier, which is more than most everyday work needs. Open-weight models now cover that work. You should not have to weigh cloud against local for every task. Sage defaults to on-device execution — fast, private, and predictable — and keeps your prompts, documents, and outputs on the machine in front of you.
What’s next
The engine keeps improving, and the app layer around it keeps growing.
Quick start
- Download Conifer for your platform.
- Open the app.
- Pick a model.
- Start chatting.