Understanding Pathways: How Google Scales to Thousands of TPUs


Photo: Vincent Tjeng


Google presented the Pathways vision back in 2021: train a single large model that can do millions of things. At the time, ChatGPT didn't exist yet, and this idea felt genuinely novel. Looking back from 2026, the vision has been executed remarkably well — Mixture of Experts, multi-modality, and the "generalist" foundation model have all become industry standards.

But the Pathways paper (Barham et al., MLSys 2022) is not about any of those things. It's about something more fundamental: how do you build a distributed system — what the paper calls "a new large scale orchestration layer for accelerators" — that can actually support the research needed to get there?

I spent some time reading this paper, and found it genuinely impressive. The design is creative — a lot of the solutions surprised me. At the same time, the paper kept reminding me of classic systems ideas — resource virtualization, async dispatch, gang scheduling, dataflow execution — repurposed and combined to solve the specific constraints of large-scale ML training. That combination of fresh thinking and deep roots in established systems design is what makes it interesting. My approach was to understand the problem first, then the solution, and finally why this particular solution.

Why This Is a Hard Problem

One thing I kept thinking about while reading is how different distributed training is from the distributed systems I'm more familiar with in online serving.

In a typical distributed backend — a recommendation serving system, for example — when you hit a scaling wall, the brute-force approach is just adding more machines. And yes, that sometimes works. But in practice there's usually a lot of infrastructure work involved: revamping the serving framework, optimizing the data pipeline, rethinking the caching strategy. The point is, there are many well-understood tools in the toolbox, and the latency of individual requests is mostly independent — if one replica is slow, it doesn't block the others.

Distributed training doesn't work this way. You have a single computation split across thousands of accelerators, and that computation has thousands of sequential steps. Every step requires all accelerators to synchronize — exchange gradients, agree on the updated weights, then proceed. A single slow node holds up the entire fleet. The latency isn't per-request — it's cumulative across every synchronization point.

This actually reminds me of large-scale data pipelines. When a single machine can't handle a heavy multi-step computation, you start parallelizing — first multi-threaded on one machine, then distributed across many, which is how frameworks like MapReduce and Spark came about. The progression from SPMD to what Pathways does feels like a similar evolution: the scale forces you to rethink the programming and coordination model.

Another important difference is fault tolerance. In online serving, you can often fall back to a default value or accept degraded performance when something fails. In training, you can't just skip a node's result and move on to the next step — the computation depends on every participant completing.

This is more HPC territory than cloud service design, and I think it's the key reason why training infrastructure has become its own discipline. Interestingly though, as we'll see, Pathways solves these HPC-scale problems by borrowing architectural concepts that are deeply rooted in serving systems — async dispatch, resource multiplexing, centralized scheduling.

SPMD and Its Limits

Most training systems at the time used SPMD (Single Program Multiple Data), inspired by MPI. Every accelerator runs the same program, processes different data, and communicates through collectives like AllReduce. In multi-controller SPMD setups (like JAX or PyTorch DDP), each host runs its own copy of the program and dispatches computations over fast PCIe links with minimal coordination overhead. Communication happens through dedicated interconnects like NVLink or TPU's ICI.
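To make the multi-controller SPMD picture concrete, here is a minimal JAX sketch (my own illustration, not from the paper): each device runs the same step on its own shard of the batch, and the only cross-device coordination is a collective.

```python
import jax
import jax.numpy as jnp

# Every device runs the same program on its own data shard; synchronization
# happens only through the collective (an AllReduce-style mean).
def step(local_batch):
    local_grad = jnp.mean(local_batch ** 2)              # stand-in for a gradient
    return jax.lax.pmean(local_grad, axis_name="devices")

spmd_step = jax.pmap(step, axis_name="devices")

n = jax.local_device_count()
sharded_batch = jnp.ones((n, 8))    # leading axis: one shard per local device
print(spmd_step(sharded_batch))     # identical synchronized value on each device
```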

But the paper identifies three walls that were becoming increasingly problematic:

Pipelining. Very large language models can't fit on a single accelerator. A common solution is pipeline parallelism — splitting the model into stages across devices. I briefly touched on this in a previous post. But pipelining is inherently heterogeneous — different stages run different parts of the model. Researchers built workarounds (GPipe, PipeDream, Megatron), but they were essentially hacking MPMD behavior on top of an SPMD runtime.

Computational sparsity. Models like Mixture of Experts activate only a subset of parameters per input, requiring data-dependent routing — the kind of heterogeneous control flow SPMD wasn't designed for.

Resource acquisition. Getting a large, symmetric block of accelerators is expensive. It's much easier to get several smaller "islands." But SPMD assumes exclusive ownership of a single homogeneous pool, pushing toward MPMD setups.

The Core Tension: Single-Controller vs. Multi-Controller

The fundamental architectural choice comes down to who controls the accelerators:

Multi-controller systems (JAX, PyTorch DDP) run a copy of the user program on every host. Dispatch is fast — just a PCIe call. But coordination beyond standard collectives requires custom implementation, and there's no centralized view of the cluster for resource management or scheduling.

Single-controller systems (TensorFlow v1) offer a flexible programming model: a central client builds a computation graph and partitions it across workers. This gives you resource virtualization, centralized scheduling, and arbitrary computation patterns. But TF v1 ran into three real problems:

Dispatch latency. Every dispatch goes over DCN instead of local PCIe — an order of magnitude slower. For pipelined models with many cross-host transfers, these latencies accumulate.

Gang scheduling. When multiple programs share accelerators, communicating computations must be enqueued in consistent order. TPUs are single-threaded and run non-preemptible kernels — inconsistent ordering means deadlock. TF v1 could enforce ordering within one program but not across programs. Bodun Hu's post covers this deadlock issue well.

Graph explosion. A naive dataflow graph between an M-way sharded computation and an N-way sharded computation requires M + N nodes and M × N edges. At thousands of shards, this becomes unmanageable.

The Pathways thesis: you can have the programming flexibility of single-controller with the performance of multi-controller.

How Pathways Solves It

Compiled Functions and JAX Integration

Pathways can be integrated as a backend for JAX. Users wrap Python code with decorators to create "compiled functions" — XLA computations whose input/output shapes and resource requirements are known before execution. This is a useful property inherited from JAX: because resource requirements are known upfront, the system can plan ahead.
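Here is a hedged sketch of what that property looks like from the JAX side (the actual Pathways decorators aren't public, so plain jax.jit stands in): shapes and dtypes are known before anything executes on an accelerator.

```python
import jax
import jax.numpy as jnp

# jax.jit stands in for a Pathways "compiled function": an XLA computation
# whose input/output shapes and resource needs are known ahead of execution.
@jax.jit
def layer(w, x):
    return jnp.tanh(x @ w)

w = jnp.zeros((512, 512))
x = jnp.zeros((32, 512))

# The output footprint is known without running anything on a device,
# which is what lets the runtime plan (and dispatch) ahead.
print(jax.eval_shape(layer, w, x))  # ShapeDtypeStruct(shape=(32, 512), dtype=float32)
```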

The paper also describes a program tracer (Section 3, Figure 2) that wraps a block of Python code calling many compiled functions and generates a single Pathways program — a dataflow graph where each compiled function becomes a computation node. This avoids the overhead of a separate Python call and RPC for each function, which matters when you're chaining many operations back to back.

Resource Virtualization

Pathways introduces virtual devices mapped to physical devices by a centralized resource manager — conceptually similar to virtual memory but for accelerator resources. The initial implementation is deliberately simple, but the abstraction enables things like transparent suspend/resume and migration without user cooperation. The Pathways on Cloud documentation shows that elastic training and transparent preemption handling have since been built on this.
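As a toy illustration of the indirection (my own sketch, not the Pathways API): client code only ever sees stable virtual device names, and the resource manager can remap them without the client's cooperation.

```python
# Toy sketch of device virtualization (not the Pathways API): the client holds
# stable virtual device ids; the resource manager owns the mapping to physical
# devices and can change it, e.g. after a preemption or migration.
class ResourceManager:
    def __init__(self, physical_devices):
        self.mapping = {f"vdev:{i}": d for i, d in enumerate(physical_devices)}

    def resolve(self, vdev):
        return self.mapping[vdev]

    def migrate(self, vdev, new_physical_device):
        self.mapping[vdev] = new_physical_device   # invisible to the client

rm = ResourceManager(["tpu:0", "tpu:1"])
print(rm.resolve("vdev:0"))      # tpu:0
rm.migrate("vdev:0", "tpu:7")    # e.g. the original device was preempted
print(rm.resolve("vdev:0"))      # tpu:7
```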

Decoupling Control and Data Planes

Two of Pathways' design choices — the sharded dataflow via Plaque and the sharded buffer abstraction — are really two sides of the same coin: keeping the control plane lean by operating at a coarser granularity than the data plane.

Sharded dataflow via Plaque. A quick note on terminology: "dataflow" here refers to the computation model where operations execute when their input data arrives — not Google Cloud Dataflow (the Apache Beam-based data processing product). Same underlying concept, very different systems.

Pathways relies on Plaque — a closed-source production sharded dataflow system used at Google — for all cross-host coordination over DCN. In a naive implementation, chaining two computations A and B, each sharded across N devices, produces a graph with 2N nodes and potentially N² edges. In Plaque's representation, the same chain requires only 4 nodes (Arg → Compute(A) → Compute(B) → Result) regardless of N. The N data tuples flow between these logical nodes tagged with destination shards.
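A back-of-the-envelope comparison (my own illustration, using the counts from the paper) makes the control-plane savings obvious:

```python
# Control-plane graph size for chaining two computations, each sharded N ways.
def naive_graph(n):
    nodes = 2 * n        # one dataflow node per shard of A and of B
    edges = n * n        # worst case: every shard of A feeds every shard of B
    return nodes, edges

def plaque_style_graph(n):
    # Arg -> Compute(A) -> Compute(B) -> Result, regardless of N;
    # the N data tuples are tagged with destination shards instead.
    return 4, 3

for n in (8, 256, 2048):
    print(f"N={n:5d}  naive={naive_graph(n)}  plaque-style={plaque_style_graph(n)}")
```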

I had to think about this for a while, because the physical data transfers don't disappear. If the model requires an All-to-All exchange, the hardware still moves all that data. What changes is that the control plane doesn't manage those connections individually. In a discussion about this, Gemini framed it well: the central controller only issues lightweight high-level directives, while the thousands of local executors and network cards handle the actual data routing independently. The controller writes the checks; the distributed data plane cashes them. This inversion of responsibility is what makes single-controller viable at scale — the brain stays lean by delegating the heavy lifting to autonomous local agents.

The paper notes that this design could be re-implemented using Ray instead of Plaque, though additions like an HBM object store and GPU interconnect primitives would be needed. Siyuan Zhuang's Zhihu analysis does a thorough comparison of Pathways and Ray's designs — many of their architectural choices are strikingly similar.

Sharded buffer abstraction. The same principle applies to memory management. In older single-controller systems, the client becomes a bottleneck tracking thousands of individual shards and buffers. Pathways introduces a sharded buffer abstraction — a logical buffer distributed over multiple devices, with bookkeeping amortized at the logical level rather than per-shard.

Both of these are instances of a pattern that shows up everywhere in systems: adding an indirection layer and choosing the right granularity for management. Virtual memory does the same thing for physical RAM. PagedAttention in vLLM applies the same idea to KV cache in inference — the core concept of decoupling logical and physical layout through a mapping table feels like the same fundamental move applied to a different problem. Whether you're aggregating many small shards into one logical unit (Pathways) or slicing one large allocation into many manageable pages (PagedAttention), the underlying technique is the same: decouple the abstraction your logic sees from the physical reality underneath. As David Wheeler famously put it: "All problems in computer science can be solved by another level of indirection."

Gang Scheduling

Pathways includes a centralized scheduler per island that consistently orders all computations. The current policy is plain FIFO, but the architecture leaves room for more sophisticated scheduling policies.
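The core requirement is easier to see in a toy sketch (names and structure are mine, not the paper's): a single per-island queue fixes one global order, and every device enqueues computations in exactly that order.

```python
from collections import deque

class Device:
    def __init__(self, name):
        self.name = name
    def enqueue(self, program, computation):
        print(f"{self.name} <- {program}:{computation}")

# Toy per-island scheduler, FIFO as in the paper's current policy. Because one
# queue fixes the order, no two devices can enqueue communicating computations
# from different programs in conflicting orders (the deadlock case).
class IslandScheduler:
    def __init__(self, devices):
        self.devices = devices
        self.queue = deque()

    def submit(self, program, computation):
        self.queue.append((program, computation))

    def dispatch(self):
        while self.queue:
            program, computation = self.queue.popleft()
            for d in self.devices:          # gang: same order everywhere
                d.enqueue(program, computation)

sched = IslandScheduler([Device("tpu:0"), Device("tpu:1")])
sched.submit("job-A", "step_42")
sched.submit("job-B", "step_7")
sched.dispatch()
```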

In the Gemini conversation about this paper, I compared Pathways to Borg — both are centralized systems that pool and virtualize shared hardware resources, allocating them across multiple concurrent workloads. The parallel feels right at the macro level: Pathways is to TPU accelerators what Borg is to general compute. But Gemini pointed out important differences: Borg schedules containers at second-to-minute granularity; Pathways schedules compiled functions at millisecond granularity, requiring absolute ordering consistency (gang scheduling) to avoid deadlock on TPU's non-preemptible kernels. Pathways also deeply manages data movement across accelerator interconnects, which isn't something Borg concerns itself with. The resource management philosophy feels similar, but the execution constraints are very different.

Parallel Asynchronous Dispatch

This is probably my favorite part of the paper, for a few reasons. First, it shows an extremely deep understanding of performance analysis — optimizing away small scheduling and coordination delays that only matter if you know exactly where your time is being spent. Second, it requires intimate knowledge of the workload: if you don't understand how long the main computation takes, you can't know that the host-side work is the bottleneck worth eliminating. Third, it requires deep hardware understanding — knowing that host-side work (scheduling, resource allocation, buffer setup) can be done in advance because compiled functions have statically known resource usage.

The standard async dispatch works well when computation takes longer than host-side work. When computation times are short, the host-side work becomes the bottleneck. Pathways exploits the fact that compiled functions have statically known resource usage — the system knows what a successor node needs before its predecessor starts, so it runs the host-side work for multiple nodes in parallel.
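Here is a minimal sketch of the idea (structure and names are mine): because resource needs are static, the host-side preparation for later nodes can run in parallel while the devices are still busy with earlier ones.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def host_side_prep(node):
    # Allocate output buffers, compute transfer schedules, etc. All of this
    # depends only on statically known shapes, not on upstream results.
    time.sleep(0.01)
    return f"buffers+schedule for {node}"

def run_on_device(node, prep):
    time.sleep(0.02)                       # the actual accelerator computation
    return f"{node} done ({prep})"

nodes = ["A", "B", "C"]
with ThreadPoolExecutor() as pool:
    # Kick off host-side work for every node up front, in parallel...
    preps = {n: pool.submit(host_side_prep, n) for n in nodes}
    # ...while the devices run the nodes back to back without waiting on it.
    for n in nodes:
        print(run_on_device(n, preps[n].result()))
```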

This reminds me of cache warmup in serving systems — when you know what data will be needed, you prefetch before the request arrives. The difference is that serving warmup is often probabilistic, while here it's deterministic (the computation graph is known ahead of time), making the optimization even more effective.

One thing I wonder about: what happens when the predecessor computation takes much longer than expected? Resources have been allocated for successors that aren't needed yet. I think the answer lies partly in the hardware: XLA-compiled functions on TPU have highly deterministic execution times compared to GPU kernels with dynamic thread scheduling and unpredictable cache behavior. The system can schedule ahead confidently because it can reliably predict the resource footprint and timing. Pathways also falls back to sequential dispatch when resource requirements aren't known until a predecessor completes (e.g., data-dependent control flow), which is a reasonable safety net.

The whole paper demonstrates strong expertise throughout — it's written by an incredible group of systems researchers, and the level of insight into both the hardware and the workload is genuinely inspiring.

Data Management

Section 4.6 covers data buffer lifecycle management. Each host has a sharded object store (similar to Ray's, extended for HBM). Objects are tagged with ownership labels for garbage collection on failure, and back-pressure stalls computations that can't allocate memory. This is necessary for any dataflow system at this scale — without it, long-running jobs accumulate leaked memory, and bursty concurrent programs crash the accelerators.
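A toy version of those two mechanisms (my own sketch, not Ray's or Pathways' API): allocations are tagged with an owner so they can be reclaimed if that program fails, and producers block when the store is full.

```python
import threading

class HostObjectStore:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.objects = {}                  # object id -> (owning program, size)
        self.cv = threading.Condition()

    def allocate(self, obj_id, nbytes, owner):
        with self.cv:
            while self.used + nbytes > self.capacity:
                self.cv.wait()             # back-pressure: stall the producer
            self.used += nbytes
            self.objects[obj_id] = (owner, nbytes)

    def reclaim_owner(self, owner):
        # If `owner` fails, everything it produced can be garbage-collected.
        with self.cv:
            for obj_id, (o, nbytes) in list(self.objects.items()):
                if o == owner:
                    self.used -= nbytes
                    del self.objects[obj_id]
            self.cv.notify_all()           # wake producers waiting on space
```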

The Evaluation

The evaluation supports the main claim: Pathways matches multi-controller JAX performance for realistic workloads while offering single-controller flexibility. The single-controller overhead is fully masked once each computation exceeds a modest threshold (2.3 ms at 128 TPUs, 35 ms at 2048 TPUs), which real training steps easily clear. The multi-tenancy results show zero context-switch overhead when multiplexing, and cross-island pipelining maintains throughput even over DCN.

One thing I noticed: the evaluation only benchmarks text-to-text Transformer models. Given that Pathways was motivated by multi-modal and sparse computation, why not include those? Gemini suggested a reasonable explanation: systems papers that introduce radical new architectures first need to prove they haven't regressed on existing workloads. The paper acknowledges in Section 6.3 that the programming model for data-dependent vectorized control flow was still future work. This makes sense — it's an industry paper, closer to a technical report. The infrastructure was ready, but the user-facing API wasn't finalized.

Things I'm Still Thinking About

TPU vs. GPU and practical applicability. Pathways is deeply TPU-specific: XLA's ability to fuse long-running computations, TPU's non-preemptibility requiring gang scheduling, the large ICI-connected islands. The paper suggests the high-level architecture should transfer to GPUs, but the question is how much practical value this has for the broader ecosystem. Almost no one outside Google has TPU hardware — everyone else is on NVIDIA GPUs. The design would need significant adaptation for GPU clusters with their different interconnect topologies, kernel scheduling models, and communication libraries. Google also has the advantage of building both the hardware and the orchestration layer — if Pathways needs a TPU behavior change, a future TPU generation can accommodate it. That said, Pathways on Cloud seems to be gaining traction as a GCP offering. I've seen users praising how much easier it makes scaling training jobs, which suggests it could be becoming a meaningful selling point for Google Cloud.

Sparse activation and elastic compute. MoE is already a form of dynamic compute allocation — different requests get routed to different experts within a shared model. But what I keep wondering is: could you go further? Instead of training separate models at different sizes (large, medium, small), could you train one model and activate different fractions of it depending on task complexity? Techniques like speculative decoding are already doing something related — using a small draft model to predict tokens for a larger model to verify. Elastic training is already available on Pathways on Cloud. I think elastic inference — dynamically scaling activation per request — is a natural direction, and Pathways' resource virtualization and centralized scheduling seem like the right kind of infrastructure to enable it.

From paper to production. The Pathways on Cloud documentation shows how the paper's design choices have played out. The resource virtualization layer now supports transparent suspend/resume for preemptible instances, persistent compilation caches, and distributed checkpointing where workers write weight shards directly to Cloud Storage in parallel. These features trace directly back to the single-controller design bet the paper made in 2022.

Will papers like this keep appearing? I wonder whether the current competitive landscape will continue producing papers this detailed about core training infrastructure. Publishing this level of engineering detail does give competitors useful information. But maybe I'm wrong, and companies will keep finding value in sharing. Either way, I'm glad this one exists.

Where does it break? Gemini raised some sharp points that I think are worth noting. The paper presents the happy path beautifully, but experienced engineers tend to evaluate systems by how they fail. A few things I'd want to understand better:

First, the O(1) abstraction of the sharded dataflow means the single controller is deliberately blind to per-shard states. But when one of 2048 TPUs silently drops a DCN packet and a future never resolves, how does the system even detect that? High abstraction tends to come at the cost of debuggability, and day-to-day operations on a system like this could be painful.

Second, the paper shows that 35ms of computation masks the single-controller overhead at 2048 TPUs — but where's the actual ceiling? Plaque's tagged data tuples still consume DCN bandwidth, and at some scale the bandwidth tax of the coordination layer itself must start to matter. I'd love to see data on where this breaks.

Third, gang scheduling across hundreds of physical machines with strong consistency guarantees, while also maintaining high utilization, is close to an NP-hard scheduling problem in practice. The paper's FIFO scheduler is honest about being simple, but in a real multi-tenant cloud environment, the fragmentation cost of reserving large symmetric slices could be significant.

Reading just this single paper took a lot of time, and it builds on a foundation of prior systems (TensorFlow, XLA, Plaque, JAX) each of which is its own deep rabbit hole. The reading lists out there (GPU MODE's awesome ML systems, Bodun Hu's resources) are long and humbling. This is an area where even experienced researchers wouldn't claim to fully grasp the whole picture. For the rest of us, the only viable strategy is to keep reading, keep building, and stay endlessly curious.
