How the engine works

The engine is the runtime behind the free tier: it makes models fast on your own hardware, through four optimizations on four surfaces, each with a different lifetime.

In the gateway, your own machine is tier 0, the lane that costs nothing and takes the everyday majority of queries. The engine makes that lane fast, with Metal kernels measured faster than llama.cpp and at per-byte parity with Apple’s MLX.

The four surfaces

The architecture: What shape is this model? Settled ahead of time, per family: the Metal kernels are specialized to the layer pattern before any weights ever load.
The model: Which exact weights, on which machine? Settled at load time: the runtime fits the file to the memory actually present and decides which bytes move.
The prompt: What is in the context right now? Settled per request: context that is already computed is reused, not recomputed.
The tool call: What must the output obey? Settled per step in an agent loop: decoding is constrained so only tokens legal for the format can be emitted.

Each surface settles at a different time, so each is optimized once, when that work is cheapest.
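To make the per-request surface concrete, here is a minimal sketch of prefix reuse: it compares the incoming token ids with those of the previous request and reports how much context can be kept versus how much must be computed. The `PrefixCache` class and its `prefill` method are hypothetical names invented for illustration, not the engine's API, and a real runtime would keep attention state rather than bare token ids.

```python
def shared_prefix_len(cached, incoming):
    """Count the leading tokens the two sequences have in common."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


class PrefixCache:
    """Toy stand-in for a per-request context cache (hypothetical)."""

    def __init__(self):
        self.tokens = []  # token ids of the most recent request

    def prefill(self, incoming):
        """Return (tokens reused, tokens that still need computing)."""
        reused = shared_prefix_len(self.tokens, incoming)
        to_compute = len(incoming) - reused
        self.tokens = list(incoming)  # the new request becomes the cache
        return reused, to_compute


cache = PrefixCache()
first = cache.prefill([1, 2, 3])         # cold cache: nothing to reuse
second = cache.prefill([1, 2, 3, 4, 5])  # same context plus two new tokens
third = cache.prefill([1, 2, 9])         # diverges after the second token
```

In an agent loop each step usually appends to the previous context, so under this assumption most of every request falls into the reused part.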

How they compose on a single token

A single token in an agent loop passes through all four in order: the kernel decides how the weights stream, the loaded model decides which bytes move, the prompt layer decides how much context is already computed, and the tool-call constraint decides which tokens are legal. None of this needs configuration; the defaults run the whole chain.
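The last link in that chain, restricting output to legal tokens, can be sketched as a mask over the logits before a token is chosen. This is an illustration under stated assumptions: `mask_logits`, `pick_token`, the toy logits, and the legal set are all invented here, and a real implementation would derive the legal set from the tool-call format at each step.

```python
import math


def mask_logits(logits, legal_ids):
    """Return a copy of logits with every illegal token set to -inf."""
    return [x if i in legal_ids else -math.inf for i, x in enumerate(logits)]


def pick_token(logits, legal_ids):
    """Greedy choice restricted to the legal tokens (hypothetical helper)."""
    if not legal_ids:
        raise ValueError("no legal tokens for this step")
    masked = mask_logits(logits, legal_ids)
    return max(range(len(masked)), key=lambda i: masked[i])


# Toy vocabulary of four tokens; assume only ids 1 and 2 are legal here.
logits = [5.0, 1.0, 3.0, 4.5]
legal = {1, 2}
chosen = pick_token(logits, legal)
```

Token 0 has the highest raw score, but it is outside the legal set, so the choice falls to the best-scoring legal token instead.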
