School of QuestSchool of Quest
·Next cohort · by application·A program · online · worldwide
§ 11Stack · SoQ LLM

SoQ LLM. The cache contract your SDK is breaking.

Most teams running multi-pass generation pay for the same context five times. We don't. The reason is a small discipline that costs nothing to adopt and is impossible to half-do — five rules, written down, used every day against our own pipeline. Most LLM-cost tooling sells you something else.

§ 01 / Why this is the lever

Most multi-pass pipelines are paying 5x what they should.

Our synthesis pipeline runs five passes against each session transcript — summary, highlights, chapters, flags, tag suggestions. The transcript is long; it is the same long thing for all five passes. Without cache discipline, you pay for that prefix five times. With cache discipline, you pay for it once and four near-empty calls thereafter. On a provider that supports prompt caching with a published cache discount around 90%, the structural saving on the marginal cost of each multi-pass run is in that range.

Most teams running multi-pass generation are leaving most of this on the table. Three reasons we see again and again. They trust SDK defaults that quietly break cacheability. They do not realise the cache-hit metadata is sitting in the response object. Their middleware reorders messages between calls without telling them. The discipline below is what we use against our own synthesis pipeline.

§ 02 / The discipline

Five rules. No half-measures.

Rule 01

Declare the cache contract

Provider defaults will not cache for you. Mark the prefix every time — Anthropic cache_control, OpenRouter name:, the equivalent on whatever you call. The marker is the contract; without it you are paying full price for a prompt the provider would happily cache.

Rule 02

The prefix is sacred

A typo fix, an injected request id, a date stamp, a system-message edit — any of them changes the prefix hash and kills the cache for every subsequent pass. Lock the prefix. Mutate everything else. If your pipeline depends on cache hits, diff the prefix in CI.

Rule 03

Your SDK is lying to you

SDKs normalise messages, reorder content blocks, merge system instructions into one. They do it silently and they can do it differently across versions. Test against prompts you know should cache; verify hits in the response metadata; pin SDK versions if your pipeline depends on it. The cache layer is the place to be paranoid, not trusting.

Rule 04

Long, stable prefix. Small, varied suffix.

The shared part — transcript, brief, document, system harness — never moves. All pass-specific work goes after the cache, in the smallest message you can write. The shape of every call: long, stable, identical-across-passes prefix; tiny, varied, per-pass suffix. If you cannot describe what changes between two calls in one sentence, the cache will not hit.

Rule 05

Cache-hit ratio is the one metric

Total-token dashboards hide cache regressions. Track cached_tokens as a fraction of prompt_tokens per call. Alert when the ratio drops on a pipeline that should be hitting. A regression here is invisible until the bill arrives, which is the worst place to find it.

§ 03 / Where it runs

One pipeline today. The patterns generalise.

We run synthesis on every Zoom session our cohorts produce, through the harness above. Same five passes, same model, same outputs across runs — the structural delta from the cache discipline is the difference between a synthesis bill that is a line item and one that is a meeting.

The harness ships inside an Avira install as part of the synthesis path. The patterns above are public and apply to any multi-pass pipeline against a long shared context — RAG-with-rerank, multi-judge eval, multi-perspective synthesis, anything where the same prefix gets seen more than once. If you want the harness sooner than you can write it yourself, talk to us.

Stop paying for the same context five times.

If your synthesis bill is climbing and you cannot tell which SDK call is breaking the cache, we will walk your pipeline. The discipline above, applied to yours.