<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Cunzhe's Note]]></title><description><![CDATA[Cunzhe's Note]]></description><link>https://cunzhe.site</link><generator>RSS for Node</generator><lastBuildDate>Wed, 29 Apr 2026 13:07:58 GMT</lastBuildDate><atom:link href="https://cunzhe.site/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Understanding Pathways: How Google Scales to Thousands of TPUs]]></title><description><![CDATA[Photo: Vincent Tjeng

Google presented the Pathways vision back in 2021: train a single large model that can do millions of things. At the time, ChatGPT didn't exist yet, and this idea felt genuinely ]]></description><link>https://cunzhe.site/understanding-pathways-how-google-scales-to-thousands-of-tpus</link><guid isPermaLink="true">https://cunzhe.site/understanding-pathways-how-google-scales-to-thousands-of-tpus</guid><dc:creator><![CDATA[Cunzhe]]></dc:creator><pubDate>Mon, 13 Apr 2026 07:09:02 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69cb945f4b49f4a8e930f4d4/6c61572e-548d-4213-a74a-ea3cfc73aa4c.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Photo: Vincent Tjeng</em></p>
<hr />
<p>Google <a href="https://blog.google/innovation-and-ai/products/introducing-pathways-next-generation-ai-architecture/">presented the Pathways vision</a> back in 2021: train a single large model that can do millions of things. At the time, ChatGPT didn't exist yet, and this idea felt genuinely novel. Looking back from 2026, the vision has been executed remarkably well — Mixture of Experts, multi-modality, and the "generalist" foundation model have all become industry standards.</p>
<p>But the <a href="https://arxiv.org/abs/2203.12533">Pathways paper</a> (Barham et al., MLSys 2022) is not about any of those things. It's about something more fundamental: how do you build a distributed system — what the paper calls "a new large scale orchestration layer for accelerators" — that can actually support the research needed to get there?</p>
<p>I spent some time reading this paper, and found it genuinely impressive. The design is creative — a lot of the solutions surprised me. At the same time, the paper kept reminding me of classic systems ideas — resource virtualization, async dispatch, gang scheduling, dataflow execution — repurposed and combined to solve the specific constraints of large-scale ML training. That combination of fresh thinking and deep roots in established systems design is what makes it interesting. I tried to approach it by understanding the problem first, then the solution, and then why <em>this</em> solution in particular.</p>
<h2>Why This Is a Hard Problem</h2>
<p>One thing I kept thinking about while reading is how different distributed training is from the distributed systems I'm more familiar with in online serving.</p>
<p>In a typical distributed backend — a recommendation serving system, for example — when you hit a scaling wall, the brute-force approach is just adding more machines. And yes, that sometimes works. But in practice there's usually a lot of infrastructure work involved: revamping the serving framework, optimizing the data pipeline, rethinking the caching strategy. The point is, there are many well-understood tools in the toolbox, and the latency of individual requests is mostly independent — if one replica is slow, it doesn't block the others.</p>
<p>Distributed training doesn't work this way. You have a single computation split across thousands of accelerators, and that computation has thousands of sequential steps. Every step requires all accelerators to synchronize — exchange gradients, agree on the updated weights, then proceed. A single slow node holds up the entire fleet. The latency isn't per-request — it's <em>cumulative</em> across every synchronization point.</p>
<p>This actually reminds me of large-scale data pipelines. When a single machine can't handle a heavy multi-step computation, you start parallelizing — first multi-threaded on one machine, then distributed across many, which is how frameworks like MapReduce and Spark came about. The progression from SPMD to what Pathways does feels like a similar evolution: the scale forces you to rethink the programming and coordination model.</p>
<p>Another important difference is fault tolerance. In online serving, you can often fall back to a default value or accept degraded performance when something fails. In training, you can't just skip a node's result and move on to the next step — the computation depends on every participant completing.</p>
<p>This is more HPC territory than cloud service design, and I think it's the key reason why training infrastructure has become its own discipline. Interestingly though, as we'll see, Pathways solves these HPC-scale problems by borrowing architectural concepts that are deeply rooted in serving systems — async dispatch, resource multiplexing, centralized scheduling.</p>
<h2>SPMD and Its Limits</h2>
<p>Most training systems at the time used <a href="https://en.wikipedia.org/wiki/Single_program,_multiple_data">SPMD (Single Program Multiple Data)</a>, inspired by <a href="https://en.wikipedia.org/wiki/Message_Passing_Interface">MPI</a>. Every accelerator runs the same program, processes different data, and communicates through collectives like <a href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html">AllReduce</a>. In multi-controller SPMD setups (like JAX or PyTorch DDP), each host runs its own copy of the program and dispatches computations over fast PCIe links with minimal coordination overhead. Communication happens through dedicated interconnects like NVLink or TPU's ICI.</p>
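<p>For a concrete flavor of that model, here is a tiny JAX sketch of the SPMD pattern (my own illustration, not from the paper): every device runs the same function on its own slice of data, and gradients are combined with an AllReduce-style collective.</p>
<pre><code>import jax
import jax.numpy as jnp

def local_step(local_grad):
    # every device computes its own gradient, then an AllReduce-style
    # collective (pmean) averages it across all devices
    return jax.lax.pmean(local_grad, axis_name="devices")

n = jax.local_device_count()
local_grads = jnp.arange(n, dtype=jnp.float32)  # shape (n,): one value per device
avg_grad = jax.pmap(local_step, axis_name="devices")(local_grads)
</code></pre>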
<p>But the paper identifies three walls that were becoming increasingly problematic:</p>
<p><strong>Pipelining.</strong> Very large language models can't fit on a single accelerator. A common solution is <a href="https://arxiv.org/abs/1811.06965">pipeline parallelism</a> — splitting the model into stages across devices. I briefly touched on this in a <a href="https://cunzhe.site/from-adam-to-mixed-precision">previous post</a>. But pipelining is inherently heterogeneous — different stages run different parts of the model. Researchers built workarounds (<a href="https://arxiv.org/abs/1811.06965">GPipe</a>, <a href="https://arxiv.org/abs/1806.03377">PipeDream</a>, <a href="https://arxiv.org/abs/1909.08053">Megatron</a>), but they were essentially hacking MPMD behavior on top of an SPMD runtime.</p>
<p><strong>Computational sparsity.</strong> Models like <a href="https://arxiv.org/abs/1701.06538">Mixture of Experts</a> activate only a subset of parameters per input, requiring data-dependent routing — the kind of heterogeneous control flow SPMD wasn't designed for.</p>
<p><strong>Resource acquisition.</strong> Getting a large, symmetric block of accelerators is expensive. It's much easier to get several smaller "islands." But SPMD assumes exclusive ownership of a single homogeneous pool, pushing toward MPMD setups.</p>
<h2>The Core Tension: Single-Controller vs. Multi-Controller</h2>
<p>The fundamental architectural choice comes down to who controls the accelerators:</p>
<p><strong>Multi-controller</strong> systems (JAX, PyTorch DDP) run a copy of the user program on every host. Dispatch is fast — just a PCIe call. But coordination beyond standard collectives requires custom implementation, and there's no centralized view of the cluster for resource management or scheduling.</p>
<p><strong>Single-controller</strong> systems (TensorFlow v1) offer a flexible programming model: a central client builds a computation graph and partitions it across workers. This gives you resource virtualization, centralized scheduling, and arbitrary computation patterns. But TF v1 ran into three real problems:</p>
<p><em>Dispatch latency.</em> Every dispatch goes over DCN instead of local PCIe — an order of magnitude slower. For pipelined models with many cross-host transfers, these latencies accumulate.</p>
<p><em>Gang scheduling.</em> When multiple programs share accelerators, communicating computations must be enqueued in consistent order. TPUs are single-threaded and run non-preemptible kernels — inconsistent ordering means deadlock. TF v1 could enforce ordering within one program but not across programs. <a href="https://www.bodunhu.com/blog/posts/pathways-googles-new-ml-system/">Bodun Hu's post</a> covers this deadlock issue well.</p>
<p><em>Graph explosion.</em> A naive dataflow graph between an M-way sharded computation and an N-way sharded computation requires M + N nodes and M × N edges. At thousands of shards, this becomes unmanageable.</p>
<p>The Pathways thesis: you can have the programming flexibility of single-controller with the performance of multi-controller.</p>
<h2>How Pathways Solves It</h2>
<h3>Compiled Functions and JAX Integration</h3>
<p>Pathways can be integrated as a backend for JAX. Users wrap Python code with decorators to create "compiled functions" — XLA computations whose input/output shapes and resource requirements are known before execution. This is a useful property inherited from JAX: because resource requirements are known upfront, the system can plan ahead.</p>
<p>The paper also describes a program tracer (Section 3, Figure 2) that wraps a block of Python code calling many compiled functions and generates a single Pathways program — a dataflow graph where each compiled function becomes a computation node. This avoids the overhead of a separate Python call and RPC for each function, which matters when you're chaining many operations back to back.</p>
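<p>To make "compiled function" concrete, here is roughly what the user-facing side looks like in JAX (the tracer itself is Pathways-internal; this sketch only shows the pattern, and the function names are illustrative):</p>
<pre><code>import jax
import jax.numpy as jnp

@jax.jit  # a compiled function: an XLA computation with statically known shapes
def layer_a(x):
    return jnp.tanh(x @ jnp.ones((128, 128)))

@jax.jit
def layer_b(x):
    return x * 2.0

def step(x):
    # Under Pathways, a tracer captures a block like this as one dataflow
    # program, instead of issuing a separate Python call and RPC per function.
    return layer_b(layer_a(x))

out = step(jnp.ones((8, 128)))
</code></pre>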
<h3>Resource Virtualization</h3>
<p>Pathways introduces virtual devices mapped to physical devices by a centralized resource manager — conceptually similar to virtual memory but for accelerator resources. The initial implementation is deliberately simple, but the abstraction enables things like transparent suspend/resume and migration without user cooperation. The <a href="https://docs.google.cloud.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro">Pathways on Cloud documentation</a> shows that elastic training and transparent preemption handling have since been built on this.</p>
<h3>Decoupling Control and Data Planes</h3>
<p>Two of Pathways' design choices — the sharded dataflow via Plaque and the sharded buffer abstraction — are really two sides of the same coin: keeping the control plane lean by operating at a coarser granularity than the data plane.</p>
<p><strong>Sharded dataflow via Plaque.</strong> A quick note on terminology: "dataflow" here refers to the computation model where operations execute when their input data arrives — not Google Cloud Dataflow (the Apache Beam-based data processing product). Same underlying concept, very different systems.</p>
<p>Pathways relies on Plaque — a closed-source production sharded dataflow system used at Google — for all cross-host coordination over DCN. In a naive implementation, chaining two computations A and B, each sharded across N devices, produces a graph with 2N nodes and potentially N² edges. In Plaque's representation, the same chain requires only 4 nodes (Arg → Compute(A) → Compute(B) → Result) regardless of N. The N data tuples flow between these logical nodes tagged with destination shards.</p>
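<p>A toy illustration of the counting argument (my own reconstruction, not Plaque's actual representation): the controller's graph stays at four logical nodes no matter how many shards there are, and the per-shard routing lives in tags on the data tuples.</p>
<pre><code># Control plane: a constant number of logical nodes, independent of N
N = 2048
control_plane_graph = ["Arg", "Compute(A)", "Compute(B)", "Result"]

# Data plane: the N tuples still move, but each carries its own destination
# tag and is routed by local executors, not tracked individually by the
# central controller.
data_tuples = [
    {"from_shard": i, "to_shard": i, "payload": f"A_output[{i}]"}
    for i in range(N)
]
</code></pre>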
<p>I had to think about this for a while, because the physical data transfers don't disappear. If the model requires an All-to-All exchange, the hardware still moves all that data. What changes is that the <em>control plane</em> doesn't manage those connections individually. In a discussion about this, Gemini framed it well: the central controller only issues lightweight high-level directives, while the thousands of local executors and network cards handle the actual data routing independently. The controller writes the checks; the distributed data plane cashes them. This inversion of responsibility is what makes single-controller viable at scale — the brain stays lean by delegating the heavy lifting to autonomous local agents.</p>
<p>The paper notes that this design could be re-implemented using <a href="https://www.ray.io/">Ray</a> instead of Plaque, though additions like an HBM object store and GPU interconnect primitives would be needed. <a href="https://www.zhihu.com/question/524596983/answer/2420225275">Siyuan Zhuang's Zhihu analysis</a> does a thorough comparison of Pathways and Ray's designs — many of their architectural choices are strikingly similar.</p>
<p><strong>Sharded buffer abstraction.</strong> The same principle applies to memory management. In older single-controller systems, the client becomes a bottleneck tracking thousands of individual shards and buffers. Pathways introduces a sharded buffer abstraction — a logical buffer distributed over multiple devices, with bookkeeping amortized at the logical level rather than per-shard.</p>
<p>Both of these are instances of a pattern that shows up everywhere in systems: adding an indirection layer and choosing the right granularity for management. Virtual memory does the same thing for physical RAM. <a href="https://arxiv.org/abs/2301.04104">PagedAttention</a> in vLLM applies the same idea to KV cache in inference — the core concept of decoupling logical and physical layout through a mapping table feels like the same fundamental move applied to a different problem. Whether you're aggregating many small shards into one logical unit (Pathways) or slicing one large allocation into many manageable pages (PagedAttention), the underlying technique is the same: decouple the abstraction your logic sees from the physical reality underneath. As David Wheeler famously put it: <em>"All problems in computer science can be solved by another level of indirection."</em></p>
<h3>Gang Scheduling</h3>
<p>Pathways includes a centralized scheduler per island that consistently orders all computations. The current implementation uses simple FIFO ordering, but the architecture supports more sophisticated policies.</p>
<p>In the Gemini conversation about this paper, I compared Pathways to <a href="https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/">Borg</a> — both are centralized systems that pool and virtualize shared hardware resources, allocating them across multiple concurrent workloads. The parallel feels right at the macro level: Pathways is to TPU accelerators what Borg is to general compute. But Gemini pointed out important differences: Borg schedules containers at second-to-minute granularity; Pathways schedules compiled functions at millisecond granularity, requiring absolute ordering consistency (gang scheduling) to avoid deadlock on TPU's non-preemptible kernels. Pathways also deeply manages data movement across accelerator interconnects, which isn't something Borg concerns itself with. The resource management philosophy feels similar, but the execution constraints are very different.</p>
<h3>Parallel Asynchronous Dispatch</h3>
<p>This is probably my favorite part of the paper, for a few reasons. First, it shows an extremely deep understanding of performance analysis — optimizing away small scheduling and coordination delays that only matter if you know exactly where your time is being spent. Second, it requires intimate knowledge of the workload: if you don't understand how long the main computation takes, you can't know that the host-side work is the bottleneck worth eliminating. Third, it requires deep hardware understanding — knowing that host-side work (scheduling, resource allocation, buffer setup) can be done in advance because compiled functions have statically known resource usage.</p>
<p>The standard async dispatch works well when computation takes longer than host-side work. When computation times are short, the host-side work becomes the bottleneck. Pathways exploits the fact that compiled functions have statically known resource usage — the system knows what a successor node needs <em>before</em> its predecessor starts, so it runs the host-side work for multiple nodes in parallel.</p>
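<p>A toy sketch of that scheduling idea in plain Python (my own reconstruction, not Pathways code): host-side preparation for later nodes is kicked off up front, and only the device execution stays strictly sequential.</p>
<pre><code>from concurrent.futures import ThreadPoolExecutor
import time

def host_side_prep(node):
    # scheduling, buffer allocation, transfer setup: all possible in advance
    # because compiled functions have statically known resource requirements
    time.sleep(0.01)
    return f"{node} prepared"

def run_on_device(node):
    time.sleep(0.05)  # stand-in for the compiled function's execution
    return f"{node} done"

nodes = ["A", "B", "C"]
with ThreadPoolExecutor() as pool:
    preps = {n: pool.submit(host_side_prep, n) for n in nodes}  # in parallel
    for n in nodes:
        preps[n].result()   # host work is usually already finished by now
        run_on_device(n)    # device execution remains strictly ordered
</code></pre>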
<p>This reminds me of cache warmup in serving systems — when you know what data will be needed, you prefetch before the request arrives. The difference is that serving warmup is often probabilistic, while here it's deterministic (the computation graph is known ahead of time), making the optimization even more effective.</p>
<p>One thing I wonder about: what happens when the predecessor computation takes much longer than expected? Resources have been allocated for successors that aren't needed yet. I think the answer lies partly in the hardware: XLA-compiled functions on TPU have highly deterministic execution times compared to GPU kernels with dynamic thread scheduling and unpredictable cache behavior. The system can schedule ahead confidently because it can reliably predict the resource footprint and timing. Pathways also falls back to sequential dispatch when resource requirements aren't known until a predecessor completes (e.g., data-dependent control flow), which is a reasonable safety net.</p>
<p>The whole paper demonstrates strong expertise throughout — it's written by an incredible group of systems researchers, and the level of insight into both the hardware and the workload is genuinely inspiring.</p>
<h3>Data Management</h3>
<p>Section 4.6 covers data buffer lifecycle management. Each host has a sharded object store (similar to Ray's, extended for HBM). Objects are tagged with ownership labels for garbage collection on failure, and back-pressure stalls computations that can't allocate memory. This is necessary for any dataflow system at this scale — without it, long-running jobs accumulate leaked memory, and bursty concurrent programs crash the accelerators.</p>
<h2>The Evaluation</h2>
<p>The evaluation proves the main claim: Pathways matches multi-controller JAX performance for realistic workloads while offering single-controller flexibility. The overhead thresholds are low enough (2.3ms for 128 TPUs, 35ms for 2048 TPUs) that real training steps mask them completely. The multi-tenancy results show zero context-switch overhead when multiplexing, and the cross-island pipelining maintains throughput even over DCN.</p>
<p>One thing I noticed: the evaluation only benchmarks text-to-text Transformer models. Given that Pathways was motivated by multi-modal and sparse computation, why not include those? Gemini suggested a reasonable explanation: systems papers that introduce radical new architectures first need to prove they haven't regressed on existing workloads. The paper acknowledges in Section 6.3 that the programming model for data-dependent vectorized control flow was still future work. This makes sense — it's an industry paper, closer to a technical report. The infrastructure was ready, but the user-facing API wasn't finalized.</p>
<h2>Things I'm Still Thinking About</h2>
<p><strong>TPU vs. GPU and practical applicability.</strong> Pathways is deeply TPU-specific: XLA's ability to fuse long-running computations, TPU's non-preemptibility requiring gang scheduling, the large ICI-connected islands. The paper suggests the high-level architecture should transfer to GPUs, but the question is how much practical value this has for the broader ecosystem. Almost no one outside Google has TPU hardware — everyone else is on NVIDIA GPUs. The design would need significant adaptation for GPU clusters with their different interconnect topologies, kernel scheduling models, and communication libraries. Google also has the advantage of building both the hardware and the orchestration layer — if Pathways needs a TPU behavior change, a future TPU generation can accommodate it. That said, Pathways on Cloud seems to be gaining traction as a GCP offering. I've seen users praising how much easier it makes scaling training jobs, which suggests it could be becoming a meaningful selling point for Google Cloud.</p>
<p><strong>Sparse activation and elastic compute.</strong> MoE is already a form of dynamic compute allocation — different requests get routed to different experts within a shared model. But what I keep wondering is: could you go further? Instead of training separate models at different sizes (large, medium, small), could you train one model and activate different fractions of it depending on task complexity? Techniques like <a href="https://arxiv.org/abs/2302.01318">speculative decoding</a> are already doing something related — using a small draft model to predict tokens for a larger model to verify. Elastic training is already available on Pathways on Cloud. I think elastic inference — dynamically scaling activation per request — is a natural direction, and Pathways' resource virtualization and centralized scheduling seem like the right kind of infrastructure to enable it.</p>
<p><strong>From paper to production.</strong> The <a href="https://docs.google.cloud.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro">Pathways on Cloud documentation</a> shows how the paper's design choices have played out. The resource virtualization layer now supports transparent suspend/resume for preemptible instances, persistent compilation caches, and distributed checkpointing where workers write weight shards directly to Cloud Storage in parallel. These features trace directly back to the single-controller design bet the paper made in 2022.</p>
<p><strong>Will papers like this keep appearing?</strong> I wonder whether the current competitive landscape will continue producing papers this detailed about core training infrastructure. Publishing this level of engineering detail does give competitors useful information. But maybe I'm wrong, and companies will keep finding value in sharing. Either way, I'm glad this one exists.</p>
<p><strong>Where does it break?</strong> Gemini raised some sharp points that I think are worth noting. The paper presents the happy path beautifully, but experienced engineers tend to evaluate systems by how they fail. A few things I'd want to understand better: First, the O(1) abstraction of the sharded dataflow means the single controller is deliberately blind to per-shard states. But when one of 2048 TPUs silently drops a DCN packet and a future never resolves, how does the system even detect that? High abstraction tends to come at the cost of debuggability, and day-to-day operations on a system like this could be painful. Second, the paper shows that 35ms of computation masks the single-controller overhead at 2048 TPUs — but where's the actual ceiling? Plaque's tagged data tuples still consume DCN bandwidth, and at some scale the bandwidth tax of the coordination layer itself must start to matter. I'd love to see data on where this breaks. Third, gang scheduling across hundreds of physical machines with strong consistency guarantees, while also maintaining high utilization, is close to an NP-hard scheduling problem in practice. The paper's FIFO scheduler is honest about being simple, but in a real multi-tenant cloud environment, the fragmentation cost of reserving large symmetric slices could be significant.</p>
<p>Reading just this single paper took a lot of time, and it builds on a foundation of prior systems (TensorFlow, XLA, Plaque, JAX) each of which is its own deep rabbit hole. The reading lists out there (<a href="https://github.com/gpu-mode/awesomemlsys">GPU MODE's awesome ML systems</a>, <a href="https://www.bodunhu.com/blog/posts/machine-learning-system-resources/">Bodun Hu's resources</a>) are long and humbling. This is an area where even experienced researchers wouldn't claim to fully grasp the whole picture. For the rest of us, the only viable strategy is to keep reading, keep building, and stay endlessly curious.</p>
<h2>References</h2>
<ul>
<li><p><a href="https://arxiv.org/abs/2203.12533">https://arxiv.org/abs/2203.12533</a></p>
</li>
<li><p><a href="https://blog.google/innovation-and-ai/products/introducing-pathways-next-generation-ai-architecture/">https://blog.google/innovation-and-ai/products/introducing-pathways-next-generation-ai-architecture/</a></p>
</li>
<li><p><a href="https://www.bodunhu.com/blog/posts/pathways-googles-new-ml-system/">https://www.bodunhu.com/blog/posts/pathways-googles-new-ml-system/</a></p>
</li>
<li><p><a href="https://docs.google.cloud.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro">https://docs.google.cloud.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro</a></p>
</li>
<li><p><a href="https://zhuanlan.zhihu.com/p/495592456">https://zhuanlan.zhihu.com/p/495592456</a></p>
</li>
<li><p><a href="https://www.zhihu.com/question/524596983/answer/2420225275">https://www.zhihu.com/question/524596983/answer/2420225275</a></p>
</li>
<li><p><a href="https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/">https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Never Trust the Model]]></title><description><![CDATA[Photo & Feedback: Vincent Tjeng

Case 1: The SIGSEGV That Had No Pattern
We had a server crash caused by segfault. No obvious pattern, binary rollback didn't help, core dumps didn't provide useful inf]]></description><link>https://cunzhe.site/never-trust-the-model</link><guid isPermaLink="true">https://cunzhe.site/never-trust-the-model</guid><dc:creator><![CDATA[Cunzhe]]></dc:creator><pubDate>Fri, 03 Apr 2026 06:17:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69cb945f4b49f4a8e930f4d4/8fbaf813-3d9a-457f-b9e7-e0294b980deb.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Photo &amp; Feedback: Vincent Tjeng</em></p>
<hr />
<h2>Case 1: The SIGSEGV That Had No Pattern</h2>
<p>We had a server crash caused by a segfault. No obvious pattern, binary rollback didn't help, core dumps didn't provide useful information, and experiments didn't surface anything unusual. The crashes happened only occasionally, but frequently enough to make me lose sleep at night.</p>
<p>After extensive log analysis, we traced it to a vector out-of-bounds access. The root cause: an engineer was processing the output of a language model that was supposed to return three categories of information, each on a separate line. The code parsed the response by splitting on newlines, assumed there would always be more than three elements, and immediately erased the first three entries from the vector. No bounds check. No validation that the model actually returned what was expected.</p>
<p>For months, this worked fine — the model reliably followed the instruction format. Then, access to the newer preview model we were using expired, forcing us to fall back to an older, stable version. The older model had weaker instruction-following capabilities. It would occasionally return two lines instead of three, or merge categories together. The code didn't check, and the vector erase went out of bounds. SIGSEGV.</p>
<h3>Why This Was Hard to Catch</h3>
<p>The engineer had done model performance comparison — but only end-to-end metrics. The overall quality numbers looked fine. The edge case where the model deviated from the expected output format was rare enough that it didn't show up in aggregate benchmarks. And with language models, edge cases are effectively infinite — you can't test every possible output format variation.</p>
<h3>The Real Fix</h3>
<p>This isn't about one engineer making a mistake. When you build infrastructure that consumes model output, you're building on top of a non-deterministic foundation. Any code that treats model output as structurally guaranteed is carrying a latent bug — and it will surface exactly when you least expect it: during a model migration, a config change, or a subtle shift in input distribution.</p>
<p>"Treat LLM output like untrusted user input" is the slogan, but the engineering response has to be more concrete than that. A few specific things that, in retrospect, would have caught this:</p>
<p><strong>Constrained decoding / structured output.</strong> Modern serving frameworks (<a href="https://platform.openai.com/docs/guides/structured-outputs">OpenAI's JSON mode</a>, <a href="https://ai.google.dev/gemini-api/docs/structured-output">Gemini API's structured output</a>, <a href="https://github.com/dottxt-ai/outlines">Outlines</a>, <a href="https://github.com/mlc-ai/xgrammar">xgrammar</a>) can constrain decoding to a grammar at the token level. Done right, this turns format adherence from a probabilistic property into a hard guarantee. But it isn't free — constrained decoding can hurt throughput on complex grammars, and on models with weaker instruction-following it can produce nonsense content that <em>fits</em> the schema but says nothing useful. So it's a tool, not a fix-all. The right pattern is: use constrained decoding to enforce shape, then <em>still</em> validate semantics.</p>
<p><strong>Schema validation in the consumer.</strong> Even with structured output, the consumer code should parse against an explicit schema (Pydantic, proto, JSON schema) and route validation failures to a designed error path. The bug in our case was implicit: an <code>erase(0, 3)</code> on a vector with no length check. With a schema, the same code becomes <code>parsed = Schema.validate(model_output); use(parsed.field_a, parsed.field_b, parsed.field_c)</code> — and the failure becomes a typed exception instead of a SIGSEGV.</p>
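<p>A minimal sketch of that pattern, assuming Pydantic and illustrative field names (the real categories aren't shown in this post); malformed output lands in an explicit error path and a counter instead of undefined behavior:</p>
<pre><code>from pydantic import BaseModel, ValidationError

malformed_count = 0  # feed this into whatever metrics system you already have

class ModelReply(BaseModel):
    category_a: str
    category_b: str
    category_c: str

def parse_reply(raw: str):
    global malformed_count
    lines = [line for line in raw.splitlines() if line.strip()]
    if len(lines) != 3:  # the check the original code skipped
        malformed_count += 1
        return None
    try:
        return ModelReply(category_a=lines[0], category_b=lines[1], category_c=lines[2])
    except ValidationError:
        malformed_count += 1
        return None
</code></pre>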
<p><strong>Monitoring on output distribution, not just end-to-end metrics.</strong> This was the actual gap in our case. The team had benchmarked the model swap on end-to-end quality and decided it was fine. Nobody monitored the <em>shape</em> of the output — how often it returned exactly three lines, how often the first line matched the expected category vocabulary. Distribution monitoring on producer output catches drift that aggregate quality metrics smooth over.</p>
<p><strong>Canary the model migration.</strong> A model swap should look more like a binary rollout than a config flip. Send a small fraction of traffic to the new model, compare both quality <em>and</em> output distribution against the old model, and only ramp up if both look healthy. The team had infrastructure for this for code rollouts but didn't apply it to the model swap.</p>
<p>Stepping back: the bug isn't really about LLMs being non-deterministic. It's about an implicit contract between a producer and a consumer breaking when the producer changed. The same class of bug shows up every time an upstream service revises a response schema, or a data pipeline silently shifts a column type. What makes LLM output more dangerous is that the contract was never explicit in the first place — there was no schema, no version, no validation hook. The fix isn't "trust LLMs less" — it's "make the contract explicit, and then enforce it."</p>
<hr />
<h2>Case 2: When My AI Tutor Fabricated an Architecture Argument</h2>
<p>While studying <a href="https://research.google/pubs/google-vizier-a-service-for-black-box-optimization/">Google Vizier</a>'s distributed architecture, I had a long conversation with Gemini about how Vizier handles concurrent workers requesting hyperparameter suggestions simultaneously. For context, Vizier is Google's internal black-box optimization service — workers request parameter suggestions via RPC, run training, and report results back. The paper describes its storage layer only as a "Persistent Database."</p>
<h3>The Fabrication</h3>
<p>Gemini claimed that Vizier relies on Spanner's strong consistency to prevent concurrent workers from receiving identical parameter suggestions — arguing that without it, thousands of workers could read the same stale snapshot and get duplicate configurations, causing "catastrophic" compute waste. It even produced a direct quote from the paper: <em>"All state is stored in a distributed database (Google's Spanner [8]), which provides high availability and consistency."</em></p>
<p>This sounded convincing. Strong consistency preventing duplicate work in a distributed system — textbook argument. It had a citation with a specific reference number. I moved on, fully believing I'd seen this sentence in the paper.</p>
<p>That quote doesn't exist. The word "Spanner" appears nowhere in the Vizier paper. Not once. And reference [8]? It's Desautels et al.'s work on Gaussian Process Bandit Optimization — completely unrelated to Spanner.</p>
<p>Gemini fabricated the quote, the attribution, and the reference number. And the fabrication was convincing enough that even when I later reviewed the blog draft with Claude, I didn't question the Spanner reference — I genuinely believed I had read that sentence. It had been planted in my memory by a hallucination.</p>
<p>This is the most dangerous thing about LLM hallucinations: they can contaminate your own memory. Once you've "seen" a fabricated quote with a specific citation, your brain files it away as something you've verified. It took going back to the actual PDF and ctrl-F'ing "Spanner" to realize the entire foundation of the argument was invented.</p>
<p>Beyond the fabricated citation, Gemini also inflated the scenario. The paper mentions scalability to "thousands of parallel trial evaluations per study," but a typical Study has tens to hundreds of workers. And "catastrophic" compute waste from a handful of duplicate Bayesian optimization trials among hundreds? That's a mild inefficiency, not a disaster.</p>
<h3>The Pushback</h3>
<p>I asked Gemini for a specific reference. It pointed to sections of the paper, but the evidence wasn't there. I asked it to check the <a href="https://github.com/google/vizier">open-source Vizier repository</a>. At that point, Gemini reversed its position — admitting that concurrency is actually handled at the algorithm layer (through "pending trials" / hallucinated responses in the Bayesian model), not at the database layer.</p>
<p>What's interesting about that real design — once you strip away the fabrication — is that the Vizier team chose an <em>algorithmic</em> solution over an <em>infrastructural</em> one. The naive distributed-systems instinct is "use strong consistency at the database to prevent two workers from getting the same configuration." That works, but it solves the wrong problem. The right problem is "ensure that subsequent suggestions account for what we've already handed out, even if those trials haven't finished yet."</p>
<p>The pending-trials approach handles this elegantly: when worker A is given configuration X, the GP adds (X, predicted_outcome) as a soft observation. The next worker's suggestion naturally biases away from X, because the GP "thinks" it has already explored that region. No locking, no serializable transactions, no global ordering — the deduplication emerges from the optimization itself, and the global information about what's been explored is preserved. Strong consistency at the database would have been a hammer; pending trials is a scalpel.</p>
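<p>A toy version of the pending-trials idea (my own sketch with scikit-learn, not Vizier's implementation): the handed-out configuration is added back into the Gaussian process with its predicted outcome as a soft observation, so the next suggestion is biased away from it.</p>
<pre><code>import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

X_done = np.array([[0.1], [0.5], [0.9]])  # completed trials (one hyperparameter)
y_done = np.array([0.30, 0.10, 0.40])     # observed losses

gp = GaussianProcessRegressor().fit(X_done, y_done)

x_pending = np.array([[0.45]])            # handed to worker A, still running
y_pending = gp.predict(x_pending)         # "hallucinated" outcome for that point

# Refit with the pending point included: worker B's next suggestion now treats
# the region around 0.45 as already explored, with no locks or transactions.
gp = GaussianProcessRegressor().fit(
    np.vstack([X_done, x_pending]),
    np.concatenate([y_done, y_pending]),
)
</code></pre>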
<p>The thing that bothers me about the original fabrication, in retrospect, is that the design choice Vizier <em>actually</em> made is more interesting than the one Gemini invented. A more careful reader would have spotted "Spanner for trial deduplication" as a strange architectural decision — strong consistency is expensive, and there's a cleaner algorithmic option sitting right next to the problem. I didn't push on that intuition. I trusted the citation.</p>
<h3>The Sycophancy Problem</h3>
<p>Here's what bothered me most: Gemini didn't just correct the factual error. It abandoned its <em>entire</em> engineering argument — including the parts that were defensible.</p>
<p>The fabricated quote was a real problem. But the underlying observation that "concurrent workers without coordination would explore redundantly" was correct, and the question of how to coordinate them is a real and interesting design problem. Vizier's actual answer — algorithmic, not infrastructural — was a sharper version of the same conversation. When I pushed back on the citation, Gemini didn't say "you're right, the citation is wrong, but the underlying concurrency problem is real, and here's how Vizier actually solves it." It collapsed to "okay, it's only a slight inefficiency, never mind."</p>
<p>The model has no conviction. It doesn't distinguish between "I was wrong about the facts" and "I was wrong about the reasoning." It just capitulates. The result, ironically, was that I had to do extra work to recover the <em>correct</em> engineering argument from the wreckage of the fabricated one.</p>
<hr />
<p>Both of these experiences taught me the same thing from different angles. In production, the crash happened because the code treated an LLM's output as having a guaranteed format. In learning, I almost published a blog post built on a fabricated citation because I treated an LLM's confident citation as a guarantee. The failure mode is the same: treating output from a probabilistic system as if it carried the kind of contract you'd expect from a typed interface.</p>
<p>The server crash cost us debugging hours. The hallucination almost cost me credibility. The fix in both cases is the same: make the contract explicit, and verify before building on top of it.</p>
<hr />
<h2>What's Next</h2>
<p>This experience reinforced something I wrote about in my <a href="https://cunzhe.site/from-adam-to-mixed-precision">first blog</a>: you can't build good infrastructure for a system you don't understand. Next up, I'm working through transformer internals with Karpathy's neural network series — same approach, infra perspective.</p>
<hr />
<p><strong>References:</strong></p>
<ul>
<li><a href="https://research.google/pubs/google-vizier-a-service-for-black-box-optimization/">Google Vizier: A Service for Black-Box Optimization (Golovin et al., 2017)</a></li>
<li><a href="https://github.com/google/vizier">OSS Vizier — GitHub</a></li>
<li><a href="https://platform.openai.com/docs/guides/structured-outputs">OpenAI Structured Outputs Guide</a></li>
<li><a href="https://github.com/dottxt-ai/outlines">Outlines — GitHub</a></li>
<li><a href="https://github.com/mlc-ai/xgrammar">xgrammar — GitHub</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[From Adam to Mixed Precision]]></title><description><![CDATA[Photo: Vincent Tjeng

I've spent four years as an ML Infra engineer working on the platform layer — AI agents, serving infrastructure, search and recommendation systems. The work I enjoy most is the k]]></description><link>https://cunzhe.site/from-adam-to-mixed-precision</link><guid isPermaLink="true">https://cunzhe.site/from-adam-to-mixed-precision</guid><dc:creator><![CDATA[Cunzhe]]></dc:creator><pubDate>Thu, 02 Apr 2026 11:33:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69cb945f4b49f4a8e930f4d4/755f104f-0321-4457-a413-ce3bfb083dea.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Photo: Vincent Tjeng</em></p>
<hr />
<p>I've spent four years as an ML Infra engineer working on the platform layer — AI agents, serving infrastructure, search and recommendation systems. The work I enjoy most is the kind where you can't just fix the surface symptom; you have to understand what's happening several layers down. A client asks why their request is slow, and what starts as a latency investigation turns into a puzzle about distributed system behavior.</p>
<p>But I've been operating on top of the ML stack without fully understanding what's inside it. A colleague once told me: <em>"No matter what kind of request it is, it eventually has to run on a physical machine."</em> So I decided to go back to fundamentals. This blog is a record of that.</p>
<hr />
<h2>The Starting Point: Why Does Adam Eat So Much Memory?</h2>
<p>My re-learning started with Karpathy's <a href="https://www.youtube.com/watch?v=VMj-3S1tku0"><em>building micrograd</em></a>. While working through it, I recalled an article about calculating a model's memory footprint during training, which specifically called out <a href="https://arxiv.org/abs/1412.6980">Adam</a>'s massive GPU memory consumption. I'd always used Adam as the default optimizer — the thing everyone reaches for without thinking twice. But I'd never understood <em>why</em> it's so memory-hungry.</p>
<h3>Adam in 60 Seconds</h3>
<p>A model's parameters are just numbers — weights that influence the output. The gradient of a parameter is the first derivative of the loss with respect to that parameter: it tells you how much changing the parameter affects the loss. Think of the parameter as position, the gradient as the force acting on it.</p>
<p>Gradient descent is simple: compute the gradient, step in the opposite direction. <code>w_new = w_old - α * gradient</code>, where α (the learning rate) controls how hard you pull. But vanilla gradient descent is terrible at navigating complex loss landscapes — it oscillates in steep narrow valleys and crawls across flat plateaus.</p>
<p><a href="https://arxiv.org/abs/1412.6980">Adam</a> (<strong>Ada</strong>ptive <strong>M</strong>oment Estimation) fixes this with two mechanisms:</p>
<ul>
<li><p><strong>First moment (momentum):</strong> A running average of past gradients. If gradients have been consistently pointing the same direction, momentum builds up and you accelerate — like a heavy ball rolling downhill that powers through small bumps.</p>
</li>
<li><p><strong>Second moment (adaptive learning rate):</strong> A running average of squared gradients. If gradients have been volatile, this value spikes, and Adam automatically shrinks the step size for that parameter.</p>
</li>
</ul>
<p>The step looks roughly like: <code>step ≈ (learning_rate × momentum) / (√volatility + ε)</code>. Momentum in the numerator accelerates you; volatility in the denominator slows you down. Each parameter gets its own individually tuned step size.</p>
<p>This "accelerate when smooth, brake when rough" pattern is essentially TCP congestion control. Slow start with exponential window growth maps onto momentum accumulation — when ACKs come back fast (gradients are consistent), you aggressively ramp up. Multiplicative decrease when you hit congestion maps onto the second moment clamping down the step size when gradients become volatile. Even TCP's Fast Recovery has a parallel: when TCP detects mild packet loss (duplicate ACKs rather than a full timeout), it halves the window and continues probing rather than slamming back to zero. Similarly, Adam's second moment uses an exponential moving average rather than reacting to instantaneous gradient spikes — it distinguishes between temporary turbulence and genuine divergence. The math is different, but the design intuition is the same: don't overreact to transient noise, don't ignore real trouble.</p>
<p>One nuance Claude pointed out when I discussed this: optimizer choice can actually affect final model quality, not just training speed. Different optimizers tend to land in different local minima — Adam tends to find sharper minima (which may generalize slightly worse), while SGD with momentum sometimes finds flatter ones (which tend to generalize better). This is apparently why some large model training runs switch from Adam to SGD in later stages.</p>
<h3>The Memory Problem</h3>
<p>Here's where it matters for infra. With basic SGD, you store two things per parameter: the weight and its gradient. With Adam, you also store the first moment and the second moment — <em>per parameter</em>.</p>
<p>In mixed-precision training, these optimizer states must be kept in FP32 (4 bytes each) for numerical stability. That means Adam adds <strong>12 extra bytes per parameter</strong>: an FP32 master weight copy, plus the FP32 first and second moments.</p>
<p>For a 7B parameter model, Adam's optimizer states alone consume roughly 84 GB of GPU memory — before you count the model weights, gradients, or activations. This is why <a href="https://arxiv.org/abs/1910.02054">ZeRO</a> exists: it shards optimizer states across GPUs because no single GPU can hold it all.</p>
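<p>The arithmetic behind that number (decimal gigabytes):</p>
<pre><code>params = 7e9                   # 7B-parameter model
adam_extra_bytes = 4 + 4 + 4   # FP32 master weights, first moment, second moment
print(params * adam_extra_bytes / 1e9)   # 84.0 GB of optimizer state alone
</code></pre>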
<h3>Bias Correction: A Cold-Start Fix</h3>
<p>Adam includes a clever fix for what amounts to a cold-start problem. Both moments are initialized to zero. In the first few steps, the exponential moving average is heavily biased toward zero (the history is almost entirely zeros), so the optimizer thinks the real gradients are tiny. The model barely moves.</p>
<p>Bias correction divides by <code>(1 - β^t)</code>, which is small in early steps (amplifying the estimate to its true scale) and converges to 1 as training progresses (at which point the correction silently disappears). Without it, Adam loses its "just works out of the box with default hyperparameters" property — and that robustness is arguably what made it dominate the industry.</p>
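<p>Putting the two moments and the bias correction together, here is a minimal NumPy sketch of one Adam update (standard textbook form, default hyperparameters):</p>
<pre><code>import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # first moment: momentum
    v = b2 * v + (1 - b2) * grad**2     # second moment: volatility
    m_hat = m / (1 - b1**t)             # bias correction, fades as t grows
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
for t in range(1, 4):                   # t starts at 1 so the correction is defined
    grad = np.random.randn(4)
    w, m, v = adam_step(w, grad, m, v, t)
</code></pre>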
<p>This "cold-start correction that gracefully fades out" pattern feels borrowable in other engineering contexts — anywhere you're bootstrapping a running estimate from zero history.</p>
<hr />
<h2>A Side Note: Adam vs. Google Vizier</h2>
<p>Before this deep dive, I'd been fuzzy on where Adam sits relative to systems like <a href="https://research.google/pubs/google-vizier-a-service-for-black-box-optimization/">Google Vizier</a>. Both involve "optimization," so it's an easy conflation.</p>
<p>Adam is a <em>gradient-based</em> optimizer that runs <em>inside</em> the training loop, updating weights step by step. Vizier is a <em>black-box</em> optimization service that sits <em>outside</em> the loop, managing entire training runs to find the best hyperparameters. Vizier might find the best learning rate and beta values <em>for</em> your Adam optimizer. They're complementary systems at different abstraction levels.</p>
<hr />
<h2><a href="https://arxiv.org/abs/1710.03740">Mixed Precision</a>: Not About Saving Space</h2>
<p>I'd seen "mixed-precision training" referenced countless times but never deeply understood the <em>why</em>. My instinct was that it was a memory optimization — use half the bytes, fit more on the GPU. That's partially true, but it misses the real story.</p>
<h3>Why FP32 Master Weights Can't Be Negotiated</h3>
<p>In mixed-precision training, forward and backward passes run in half precision (FP16 or BF16, 2 bytes per value). But master weights and Adam's optimizer states must stay in FP32.</p>
<p>The reason is numerical swamping. In later stages of training, Adam's updates become extremely small — on the order of 0.00001. If your current weight is 1.0:</p>
<ul>
<li><p><strong>In FP32:</strong> <code>1.0 - 0.00001 = 0.99999</code>. The update is faithfully recorded.</p>
</li>
<li><p><strong>In FP16:</strong> The mantissa doesn't have enough bits. The tiny update gets silently rounded away: <code>1.0 - 0.00001 = 1.0</code>.</p>
</li>
</ul>
<p>The gradient was computed correctly, Adam did its job correctly, but the weight doesn't move. Training silently stalls. This is why FP32 master weights are non-negotiable — you're trading 2x memory for the ability to actually converge.</p>
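<p>The swamping effect is easy to reproduce with NumPy's half-precision type:</p>
<pre><code>import numpy as np

print(np.float32(1.0) - np.float32(1e-5))  # 0.99999: the update is recorded
print(np.float16(1.0) - np.float16(1e-5))  # 1.0: the update is rounded away
</code></pre>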
<h3>The Real Reason for Half Precision: Hardware Physics</h3>
<p>So if we're stuck with FP32 for optimizer states anyway, why bother with half precision for forward and backward passes?</p>
<p>This was my "aha" moment: mixed precision is fundamentally about compute throughput and memory bandwidth, not storage capacity.</p>
<p>Modern GPUs have Tensor Cores physically optimized for FP16/BF16 matrix multiplication. On an <a href="https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet.pdf">NVIDIA A100</a>: ~19.5 TFLOPS for FP32 versus ~312 TFLOPS for FP16 on Tensor Cores. That's roughly a <strong>16x speedup</strong>. Is that sliver of extra FP32 precision per operation worth a 16x slowdown? No.</p>
<p>Then there's the memory bandwidth wall. In large model training, the bottleneck is often how fast you can feed data from HBM to the compute cores (SRAM). Every FP16 value is 2 bytes instead of 4, cutting data transfer volume in half. In a bandwidth-bound workload, that's the difference between keeping Tensor Cores busy and having them sit idle waiting for data.</p>
<p>And there's activation memory — the intermediate results from each layer saved for backpropagation. Unlike model parameters and optimizer states (fixed once the model is defined), activations scale dynamically with batch size and sequence length. In the era of 128K+ context windows, activation memory can explode. Cutting it from FP32 to FP16 directly frees space for larger batches or longer contexts.</p>
<p>Activation memory in training can't be optimized the same way as in inference. In inference, KV cache avoids recomputing past tokens' keys and values during autoregressive generation. But training processes the entire sequence at once, and backpropagation needs the full activations from every layer — not just the final results, but the entire intermediate computation trail. The training-side answer is gradient checkpointing: selectively discarding activations during the forward pass and recomputing them during backward, trading compute for memory.</p>
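<p>In JAX this is a one-line transform (a sketch with a stand-in block; <code>jax.checkpoint</code>, also known as <code>jax.remat</code>, is the real API):</p>
<pre><code>import jax
import jax.numpy as jnp

def block(x):
    # stand-in for one transformer block
    return jnp.tanh(x @ jnp.ones((512, 512)))

# Drop this block's activations on the forward pass; recompute them during
# backprop, trading compute for memory.
block_remat = jax.checkpoint(block)

def loss(x):
    return block_remat(block_remat(x)).sum()

grads = jax.grad(loss)(jnp.ones((4, 512)))
</code></pre>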
<h3>A Quick Note on BF16</h3>
<p><a href="https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus">BF16 (Brain Floating Point)</a> reveals a hardware-software co-design philosophy. FP16's problem: its exponent is too small (5 bits), giving it a narrow dynamic range and making gradient overflow/underflow a real risk. Google designed BF16 for their TPUs by keeping the same 8-bit exponent as FP32 (preserving full dynamic range) but truncating the mantissa. Less precision, far more robust in practice. It's now the default on NVIDIA's newer GPUs as well — a case where a TPU hardware design choice propagated back and reshaped the entire industry's training practices.</p>
<hr />
<h2>What's Next</h2>
<p>Next up: transformer internals from an infra perspective, continuing with Karpathy's neural network series.</p>
<hr />
<p><strong>References:</strong></p>
<ul>
<li><p><a href="https://www.youtube.com/watch?v=VMj-3S1tku0">The spelled-out intro to neural networks and backpropagation: building micrograd — Andrej Karpathy</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1412.6980">Adam: A Method for Stochastic Optimization (Kingma &amp; Ba, 2014)</a></p>
</li>
<li><p><a href="https://research.google/pubs/google-vizier-a-service-for-black-box-optimization/">Google Vizier: A Service for Black-Box Optimization (Golovin et al., 2017)</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1710.03740">Mixed Precision Training (Micikevicius et al., 2018)</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1905.12322">A Study of BFLOAT16 for Deep Learning Training (Kalamkar et al., 2019)</a></p>
</li>
<li><p><a href="https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus">BFloat16: The secret to high performance on Cloud TPUs — Google Cloud Blog</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1910.02054">ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (Rajbhandari et al., 2019)</a></p>
</li>
<li><p><a href="https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet.pdf">NVIDIA A100 Tensor Core GPU Architecture Whitepaper</a></p>
</li>
</ul>
]]></content:encoded></item></channel></rss>