§ 02 / The discipline
Five rules. No half-measures.
Rule 01
Declare the cache contract
Provider defaults will not cache for you. Mark the prefix every time — Anthropic cache_control, OpenRouter name:, the equivalent on whatever you call. The marker is the contract; without it you are paying full price for a prompt the provider would happily cache.
Rule 02
The prefix is sacred
A typo fix, an injected request id, a date stamp, a system-message edit — any of them changes the prefix hash and kills the cache for every subsequent pass. Lock the prefix. Mutate everything else. If your pipeline depends on cache hits, diff the prefix in CI.
Rule 03
Your SDK is lying to you
SDKs normalise messages, reorder content blocks, merge system instructions into one. They do it silently and they can do it differently across versions. Test against prompts you know should cache; verify hits in the response metadata; pin SDK versions if your pipeline depends on it. The cache layer is the place to be paranoid, not trusting.
Rule 04
Long, stable prefix. Small, varied suffix.
The shared part — transcript, brief, document, system harness — never moves. All pass-specific work goes after the cache, in the smallest message you can write. The shape of every call: long, stable, identical-across-passes prefix; tiny, varied, per-pass suffix. If you cannot describe what changes between two calls in one sentence, the cache will not hit.
Rule 05
Cache-hit ratio is the one metric
Total-token dashboards hide cache regressions. Track cached_tokens as a fraction of prompt_tokens per call. Alert when the ratio drops on a pipeline that should be hitting. A regression here is invisible until the bill arrives, which is the worst place to find it.