The cache we tried not to build
We tried to train an LLM to be reliable enough that we wouldn’t need a cache. It got better and better and never got good enough — because the thing we needed wasn’t high reliability, it was determinism, and you can’t reach determinism by adding nines.
TL;DR
- We built an LLM assistant that edits docs and opens a GitHub PR. It wouldn’t converge: re-run it on its own output and it kept making small edits.
- Our first fix was more training data — a deliberate bet that a stable-enough model would let us skip building a cache (and the Redis tier behind it). Reliability as a substitute for infrastructure.
- The bet failed for a non-obvious reason: the output unit is a PR diff a human reviews, so a single spurious one-line change taints the whole PR. The failure unit is the review, not the line — and no realistic number of nines survives that.
- The fix was the cache we tried to avoid. But it did more than save tokens: caching each rewritten line back as “already correct” makes the system converge by construction — the deterministic layer enforces the property we couldn’t train in.
- The tell: whenever one bad output poisons a whole artifact (a PR, a transaction, a release), “high reliability” is not a weaker version of the guarantee you need. It’s a different thing. Build the boring stateful layer.
The setup
We were building an LLM assistant that edits documentation — a recent base LLM we fine-tuned and run ourselves, not a frontier model behind an API. You point it at a file, it rewrites the lines that need work, and it opens a GitHub pull request with the changes. The output isn’t a chat reply — it’s a diff a human reviews and merges.
Early on we hit a stability problem. Run the assistant on a file, then run it again on its own output, and it would keep suggesting changes. Small ones — a word here, a comma there — but it never cleanly stopped. For a tool that’s supposed to fix a file and then leave it alone, that’s a real defect: every re-run looks like unfinished work.
The fix we reached for first wasn’t code. It was more training data. And the reason we reached for it is the part worth writing about, because it was a deliberate bet about infrastructure, not about model quality.
The bet: make the model reliable enough to delete the cache
The reasoning went like this. If the model were stable enough — if it reliably left an already-correct line alone — we wouldn’t need to remember anything between runs. No cache of past edits. No store keyed by content. No stateful layer at all. The model’s own reliability would stand in for an entire piece of infrastructure we’d otherwise have to build, deploy, secure, and operate (in our case, a Redis tier, with all the provisioning and firewalling that implies).
That’s an appealing trade. Stateful infrastructure is a tax you pay forever; if you can make the model good enough, you get to skip it. So we grew the dataset, added “this line is already correct, leave it” examples, and retrained — chasing a stability rate high enough that we could just not build the cache.
It got better. Going from roughly 2k to 18k cleaned examples bought a real, visible jump in stability. But the returns were already the shape every scaling curve has — each increment buys less than the last — and the next increment was expensive: another order of magnitude of data meant a major data-cleaning and prep effort. We didn’t take that swing, and the reason we didn’t is the actual lesson. Even in the best case, scaling-law math says 10× more data buys a fraction of the previous gain, asymptoting toward a stability ceiling below 100% — and the determinism we needed sat on the other side of it. We’d have been paying more, for less, to chase a target the curve may never reach at any affordable amount of data.
So “it did not get good enough” wasn’t simply that we fell short of a target. It was that the target was the wrong shape for the tool we were using.
Why high reliability still loses
Two things defeated the bet, and they compound.
The model can’t promise determinism. We never got the per-line stability as high as we wanted. A probabilistic model can’t guarantee a deterministic property; you can push the probability of “no change on a clean line” toward one, but you can’t reach it.
The failure unit is the PR, not the line. This is the part that actually killed it. The output isn’t consumed one line at a time — it’s a pull request covering every changed line in a file. Say each line is independently 99.99% stable. A few-hundred-line file re-run a couple of times is several hundred independent chances for the model to twitch — you’re rolling those dice hundreds of times per PR, and across every PR the tool ever opens the misses accumulate into a near-certainty. And one nonsense one-line change is all it takes: the reviewer sees it, frowns, and concludes the tool can’t be trusted.
So the cost of a single miss is borne by the entire artifact, not amortized across all the lines that were fine. “Almost always idempotent” rounds down to “unreliable” the moment a human judges the aggregate.
You cannot approximate your way into a property that has to be absolute. Determinism isn’t a high probability — it’s a different kind of thing, and no amount of training buys it.
The fix: make the cache enforce what the model couldn’t
So we built the cache after all. But once we did, it did something better than save tokens. It made the system converge by construction, taking ownership of the property we’d failed to train into the model.
Convergence as a cache entry. When the assistant rewrites a line, we store two entries, not one. The obvious entry maps the original line to its rewrite. The second — the one that matters — takes the rewritten line and records it as “already correct, no change needed.”
That second entry is the whole game. The model was unreliable at recognizing “this line is already fixed, leave it” — so we stopped asking it to. The first time a line is rewritten, the cache records that its output is a fixpoint. On any later run, a line matching that output is served straight from cache as already-correct, and the model never sees it. Convergence stops being a behavior we hope the model exhibits and becomes a fact the cache enforces. Feed the assistant’s own output back in and it’s stable — not because the model finally learned restraint, but because the deterministic layer never gives it the chance to fiddle.
Invalidation as a cache key. The same layer cleans up a staleness problem the model also can’t reason about. A suggestion cached under one model and prompt is wrong the instant either changes, but nothing tells the model that. The general rule is simple: every input that can change the output belongs in the key. In practice that means the model version and the prompt version. Fold them into the key, and changing either shifts every key; the old entries don’t get deleted, they just become unreachable. Invalidation becomes a pure function of inputs the code already knows — no purge job, no model judgment involved.
And we still verify convergence — but not by trusting the model to self-report it. Convergence-by-construction only covers lines that hit the cache; a near-miss (reformatted whitespace, a re-wrapped line) still reaches the model and can drift, and a model or prompt bump cold-starts every key. So a separate offline harness re-runs the assistant against its own output and measures the real fixpoint rate, misses included. Deterministic checking, not a model asked to grade itself.
The lesson, generalized
The shape of this mistake is more common than the specific bug. We had a property that needed to be absolute — convergence, in the eyes of a human reviewing a diff — and we tried to satisfy it with a component that only deals in probabilities. Training made the probability higher. It could never make it certain, and certainty was the actual requirement.
The reframe that fixed it: stop trying to make the probabilistic component reliable enough to stand in for determinism, and put the deterministic property where it belongs — in deterministic code. The model proposes; the infrastructure around it decides what’s stale, what’s already done, and what counts as finished. We thought the cache was a cost to avoid. It turned out to be the only place the guarantee could actually live.
The tell, in hindsight, was the unit of failure. Whenever one bad output poisons a whole artifact — a PR, a transaction, a release — “high reliability” isn’t a weaker version of the guarantee you need; it’s a different thing wearing the same clothes. That’s the signal to build the boring stateful layer you were hoping to skip, because the nines were never going to add up to the one you actually needed.