Never Trust the Model
Lessons from a Production Crash and an AI Tutor's Lies

Photo & Feedback: Vincent Tjeng
Case 1: The SIGSEGV That Had No Pattern
We had a server crashing with segfaults. No obvious pattern: a binary rollback didn't help, core dumps didn't provide useful information, and experiments didn't surface anything unusual. The crashes happened only occasionally, but frequently enough to make me lose sleep at night.
After extensive log analysis, we traced it to a vector out-of-bounds access. The root cause: an engineer was processing the output of a language model that was supposed to return three categories of information, each on a separate line. The code parsed the response by splitting on newlines, assumed there would always be more than three elements, and immediately erased the first three entries from the vector. No bounds check. No validation that the model actually returned what was expected.
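For illustration, here's a minimal sketch of the pattern in Python (the production code was C++, and the category names here are invented):

```python
def parse_categories(model_output: str):
    lines = model_output.split("\n")
    # BUG: assumes the model always returns the three expected lines.
    # When it returns two, the indexing below fails; in the C++ original,
    # the equivalent vector erase ran past the end and segfaulted.
    topic, sentiment, action = lines[0], lines[1], lines[2]
    return topic, sentiment, action
```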
For months, this worked fine — the model reliably followed the instruction format. Then, access to the newer preview model we were using expired, forcing us to fall back to an older, stable version. The older model had weaker instruction-following capabilities. It would occasionally return two lines instead of three, or merge categories together. The code didn't check, and the vector erase went out of bounds. SIGSEGV.
Why This Was Hard to Catch
The engineer had compared model performance — but only on end-to-end metrics. The overall quality numbers looked fine. The edge case where the model deviated from the expected output format was rare enough that it didn't show up in aggregate benchmarks. And with language models, edge cases are effectively infinite — you can't test every possible output format variation.
The Real Fix
This isn't about one engineer making a mistake. When you build infrastructure that consumes model output, you're building on top of a non-deterministic foundation. Any code that treats model output as structurally guaranteed is carrying a latent bug — and it will surface exactly when you least expect it: during a model migration, a config change, or a subtle shift in input distribution.
"Treat LLM output like untrusted user input" is the slogan, but the engineering response has to be more concrete than that. A few specific things that, in retrospect, would have caught this:
Constrained decoding / structured output. Modern inference APIs and serving frameworks (OpenAI's JSON mode, the Gemini API's structured output, Outlines, xgrammar) can constrain decoding to a grammar at the token level. Done right, this turns format adherence from a probabilistic property into a hard guarantee. But it isn't free — constrained decoding can hurt throughput on complex grammars, and on models with weaker instruction-following it can produce nonsense content that fits the schema but says nothing useful. So it's a tool, not a cure-all. The right pattern is: use constrained decoding to enforce shape, then still validate semantics.
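As a concrete sketch of that pattern, here is what it could look like with the OpenAI Python SDK's JSON-schema response format; the three-field schema is hypothetical:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical three-category schema. "strict" constrains decoding to this
# shape at the token level, so the structure is guaranteed by the server.
schema = {
    "type": "object",
    "properties": {
        "topic": {"type": "string"},
        "sentiment": {"type": "string"},
        "action": {"type": "string"},
    },
    "required": ["topic", "sentiment", "action"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Categorize this ticket: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "categories", "strict": True, "schema": schema},
    },
)
# Shape is now enforced; the content of each field still needs validation.
```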
Schema validation in the consumer. Even with structured output, the consumer code should parse against an explicit schema (Pydantic, proto, JSON schema) and route validation failures to a designed error path. The bug in our case was implicit: an erase(0, 3) on a vector with no length check. With a schema, the same code becomes parsed = Schema.validate(model_output); use(parsed.field_a, parsed.field_b, parsed.field_c) — and the failure becomes a typed exception instead of a SIGSEGV.
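A minimal sketch of that consumer-side pattern with Pydantic; the field names and the sentiment vocabulary are invented for illustration:

```python
from typing import Literal

from pydantic import BaseModel, ValidationError

class Categories(BaseModel):
    topic: str
    sentiment: Literal["positive", "negative", "neutral"]  # semantic check, not just shape
    action: str

def parse_response(model_output: str) -> Categories:
    lines = [line for line in model_output.splitlines() if line.strip()]
    if len(lines) != 3:
        raise ValueError(f"expected 3 lines, got {len(lines)}")
    return Categories(topic=lines[0], sentiment=lines[1], action=lines[2])

try:
    parsed = parse_response("billing\npositive\nescalate")
except (ValueError, ValidationError):
    parsed = None  # designed error path: count a metric, fall back, or drop the request
```

The same malformed output that segfaulted the server now lands in a typed, observable error path.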
Monitoring on output distribution, not just end-to-end metrics. This was the actual gap in our case. The team had benchmarked the model swap on end-to-end quality and decided it was fine. Nobody monitored the shape of the output — how often it returned exactly three lines, how often the first line matched the expected category vocabulary. Distribution monitoring on producer output catches drift that aggregate quality metrics smooth over.
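A sketch of what that could look like: a counter on output shape, with the metric names invented for illustration. In production you'd emit these to a real metrics client (Prometheus, StatsD) and alert on drift:

```python
from collections import Counter

shape_counts: Counter = Counter()  # stand-in for a real metrics client

def record_output_shape(model_output: str, expected_first_line_vocab: set) -> None:
    lines = [line for line in model_output.splitlines() if line.strip()]
    shape_counts[f"llm.output.line_count.{len(lines)}"] += 1
    if lines and lines[0] not in expected_first_line_vocab:
        shape_counts["llm.output.first_line_out_of_vocab"] += 1
```

A dashboard on `llm.output.line_count.*` could have surfaced the two-line outputs as soon as the model swap landed.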
Canary the model migration. A model swap should look more like a binary rollout than a config flip. Send a small fraction of traffic to the new model, compare both quality and output distribution against the old model, and only ramp up if both look healthy. The team had infrastructure for this for code rollouts but didn't apply it to the model swap.
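A sketch of the routing half of that, with the ramp fraction invented for illustration:

```python
import hashlib

CANARY_FRACTION = 0.01  # e.g. ramp 1% to 5% to 25% to 100% while both signals stay healthy

def pick_model(request_id: str) -> str:
    # Stable hash so a given request always hits the same arm during the ramp.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return "model-new" if bucket < CANARY_FRACTION * 10_000 else "model-old"
```

Comparing the two arms on both end-to-end quality and output-shape distributions is what separates this from a config flip.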
Stepping back: the bug isn't really about LLMs being non-deterministic. It's about an implicit contract between a producer and a consumer breaking when the producer changed. The same class of bug shows up every time an upstream service revises a response schema, or a data pipeline silently shifts a column type. What makes LLM output more dangerous is that the contract was never explicit in the first place — there was no schema, no version, no validation hook. The fix isn't "trust LLMs less" — it's "make the contract explicit, and then enforce it."
Case 2: When My AI Tutor Fabricated an Architecture Argument
While studying Google Vizier's distributed architecture, I had a long conversation with Gemini about how Vizier handles concurrent workers requesting hyperparameter suggestions simultaneously. For context, Vizier is Google's internal black-box optimization service — workers request parameter suggestions via RPC, run training, and report results back. The paper describes its storage layer only as a "Persistent Database."
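The worker loop, sketched in Python with an invented client interface (not Vizier's actual API):

```python
def worker_loop(study_client, run_training, budget: int) -> None:
    # study_client and run_training are hypothetical stand-ins: the client
    # wraps the RPC interface, run_training is the user's training job.
    for _ in range(budget):
        trial = study_client.suggest()         # RPC: request a parameter suggestion
        metric = run_training(trial.params)    # run training with those parameters
        study_client.report(trial.id, metric)  # RPC: report the result back
```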
The Fabrication
Gemini claimed that Vizier relies on Spanner's strong consistency to prevent concurrent workers from receiving identical parameter suggestions — arguing that without it, thousands of workers could read the same stale snapshot and get duplicate configurations, causing "catastrophic" compute waste. It even produced a direct quote from the paper: "All state is stored in a distributed database (Google's Spanner [8]), which provides high availability and consistency."
This sounded convincing. Strong consistency preventing duplicate work in a distributed system — textbook argument. It had a citation with a specific reference number. I moved on, fully believing I'd seen this sentence in the paper.
That quote doesn't exist. The word "Spanner" appears nowhere in the Vizier paper. Not once. And reference [8]? It's Desautels et al.'s work on Gaussian Process Bandit Optimization — completely unrelated to Spanner.
Gemini fabricated the quote, the attribution, and the reference number. And the fabrication was convincing enough that even when I later reviewed the blog draft with Claude, I didn't question the Spanner reference — I genuinely believed I had read that sentence. It had been planted in my memory by a hallucination.
This is the most dangerous thing about LLM hallucinations: they can contaminate your own memory. Once you've "seen" a fabricated quote with a specific citation, your brain files it away as something you've verified. It took going back to the actual PDF and ctrl-F'ing "Spanner" to realize the entire foundation of the argument was invented.
Beyond the fabricated citation, Gemini also inflated the scenario. The paper mentions scalability to "thousands of parallel trial evaluations per study," but a typical Study has tens to hundreds of workers. And "catastrophic" compute waste from a handful of duplicate Bayesian optimization trials among hundreds? That's a mild inefficiency, not a disaster.
The Pushback
I asked Gemini for a specific reference. It pointed to sections of the paper, but the evidence wasn't there. I asked it to check the open-source Vizier repository. At that point, Gemini reversed its position — admitting that concurrency is actually handled at the algorithm layer (through "pending trials", i.e., hallucinated observations in the Bayesian model), not at the database layer.
What's interesting about that real design — once you strip away the fabrication — is that the Vizier team chose an algorithmic solution over an infrastructural one. The naive distributed-systems instinct is "use strong consistency at the database to prevent two workers from getting the same configuration." That works, but it solves the wrong problem. The right problem is "ensure that subsequent suggestions account for what we've already handed out, even if those trials haven't finished yet."
The pending-trials approach handles this elegantly: when worker A is given configuration X, the GP adds (X, predicted_outcome) as a soft observation. The next worker's suggestion naturally biases away from X, because the GP "thinks" it has already explored that region. No locking, no serializable transactions, no global ordering — the deduplication emerges from the optimization itself, and the global information about what's been explored is preserved. Strong consistency at the database would have been a hammer; pending trials is a scalpel.
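A toy version of the idea, using a scikit-learn GP over an invented candidate grid; this is the "constant liar" flavor of the heuristic (the Desautels et al. line of work), not Vizier's actual implementation:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def suggest_next(completed_x, completed_y, pending_x, candidates):
    """Suggest the next trial, treating pending trials as soft observations.

    completed_x/completed_y: finished trials. pending_x: configurations handed
    out but not yet reported. candidates: parameter vectors to choose from.
    """
    gp = GaussianProcessRegressor()
    gp.fit(np.asarray(completed_x), np.asarray(completed_y))
    if pending_x:
        # Hallucinate outcomes for pending trials at the GP's own predicted
        # mean, then refit. Posterior uncertainty near pending points
        # collapses, so the acquisition below naturally steers away from
        # regions that have already been handed out.
        pending_y = gp.predict(np.asarray(pending_x))
        gp.fit(np.vstack([completed_x, pending_x]),
               np.concatenate([completed_y, pending_y]))
    mean, std = gp.predict(np.asarray(candidates), return_std=True)
    return candidates[int(np.argmax(mean + 2.0 * std))]  # simple UCB acquisition
```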
The thing that bothers me about the original fabrication, in retrospect, is that the design choice Vizier actually made is more interesting than the one Gemini invented. A more careful reader would have spotted "Spanner for trial deduplication" as a strange architectural decision — strong consistency is expensive, and there's a cleaner algorithmic option sitting right next to the problem. I didn't push on that intuition. I trusted the citation.
The Sycophancy Problem
Here's what bothered me most: Gemini didn't just correct the factual error. It abandoned its entire engineering argument — including the parts that were defensible.
The fabricated quote was a real problem. But the underlying observation that "concurrent workers without coordination would explore redundantly" was correct, and the question of how to coordinate them is a real and interesting design problem. Vizier's actual answer — algorithmic, not infrastructural — was a sharper version of the same conversation. When I pushed back on the citation, Gemini didn't say "you're right, the citation is wrong, but the underlying concurrency problem is real, and here's how Vizier actually solves it." It collapsed to "okay, it's only a slight inefficiency, never mind."
The model has no conviction. It doesn't distinguish between "I was wrong about the facts" and "I was wrong about the reasoning." It just capitulates. The result, ironically, was that I had to do extra work to recover the correct engineering argument from the wreckage of the fabricated one.
Both of these experiences taught me the same thing from different angles. In production, the crash happened because the code treated an LLM's output as having a guaranteed format. In learning, I almost published a blog post built on a fabricated citation because I treated an LLM's confident citation as a guarantee. The failure mode is the same: treating output from a probabilistic system as if it carried the kind of contract you'd expect from a typed interface.
The server crash cost us debugging hours. The hallucination almost cost me credibility. The fix in both cases is the same: make the contract explicit, and verify before building on top of it.
What's Next
This experience reinforced something I wrote about in my first blog: you can't build good infrastructure for a system you don't understand. Next up, I'm working through transformer internals with Karpathy's neural network series — same approach, infra perspective.