Two Network Topologies: Hierarchical vs. Flat
The Gigawatt Machine: NVIDIA, Google, and the Engineering of Scale (Part 6 of 12)
1. Introduction: Beyond the Node
In the previous articles, we explored how NVIDIA and Google define the “compute node”—NVIDIA through progressive rack-scale integration (72 to 572 GPUs) and Google through configurable Pods (up to 8,960 TPUs). Now we must confront what happens when you leave the node.
A trillion-parameter model cannot fit within any single node, no matter how large. Training GPT-5 or Gemini requires tens of thousands of chips working in perfect synchrony across hundreds of racks distributed throughout a data center. The challenge is not computational—each chip knows how to multiply matrices—but communications: how do you move petabytes of data per second across a building-scale fabric without turning the network into a bottleneck?
The physics of this problem forced both companies to abandon the traditional “fat tree” network topology used in cloud data centers. Instead, they engineered fundamentally different solutions:
NVIDIA’s approach: The Hierarchical Federation. A cluster is a federation of ultra-dense rack-scale supercomputers (NVL72, and in the future NVL144 and NVL572 nodes) connected by layers of electrical switches arranged in a Rail-Optimized topology. Data moves through a three-tier hierarchy: copper NVLink inside the rack, InfiniBand between racks within a pod, and inter-pod networking at the data center scale.
Google’s approach: The Flat Optical Mesh. A cluster is a single, massive optical fabric where every TPU connects to its six neighbors through dedicated circuit-switched paths. There is no hierarchy—the “network” is just the 3D Torus mesh of ICI links, with Optical Circuit Switches (OCS) dynamically reconfiguring the topology as needed.
This article examines how these two topologies handle the traffic patterns of AI training, the synchronized storms of data that would collapse a conventional network, and what happens when hardware fails in a supercomputer.
2. The Two-Domain Architecture: Super-Node vs. Scale-Out
Before diving into each company’s specific implementation, we must understand a fundamental pattern that has emerged in both architectures: the bifurcation of networking into two distinct domains.
As models scale beyond 100,000 chips, physics forces a structural reality that neither company explicitly planned but both discovered independently. A trillion-parameter cluster requires two domains with fundamentally different characteristics.
Domain 1: The “Super-Node” Fabric (Proprietary)
This is the domain of maximum bandwidth and tightest coupling. Here, the “network” acts less like a cable and more like a motherboard bus. Chips in this domain communicate as if they were all on the same die.
Defining characteristics:
Proprietary technology: Custom interconnects optimized for minimum latency
Tight coupling: Sub-microsecond latency, often with hardware-coherent memory
Physical constraints: Limited by the physics of the interconnect medium (copper range or optical circuit-switch scale)
Traffic type: Handles the most bandwidth-intensive operations that require instant synchronization
The strategic purpose: Keep the highest-traffic communication patterns—specifically operations like Tensor Parallelism, where a single matrix multiply is split across multiple chips—entirely within this domain. By doing so, these operations never touch the slower external network.
NVIDIA’s implementation: The GB200 NVL72 rack, where 72 GPUs connect via a copper backplane using NVLink 5.0, creating a 130 TB/s shared-memory domain. The rack’s physical boundaries (roughly 1 meter) are defined by copper signal integrity limits.
Google’s implementation: The TPU Pod, where thousands of chips (up to 9,216 in TPU v7) connect via the ICI optical mesh with Optical Circuit Switches. Unlike NVIDIA’s copper-bound rack, optical fibers allow the Pod to span multiple physical racks, and Google can provision domains of various sizes from the Pod. This flexibility comes at the cost of requiring compile-time scheduling rather than hardware coherence.
Domain 2: The Scale-Out Fabric (Standard)
When a workload exceeds the boundaries of Domain 1, it must cross into the external network. Scaling the proprietary super-node fabric across an entire building is impractical due to signal degradation, cabling complexity, and the operational risks of creating a single massive fault domain.
Defining characteristics:
Standard protocols: Based on industry standards (InfiniBand, Ethernet) for interoperability
Packet-switched: Dynamic routing, buffering, and congestion control
Building-scale: Can span hundreds of meters using optical transceivers
Multi-purpose: Must also connect to storage, CPUs, and external networks
The strategic purpose: Connect the super-node islands together, bridge to storage systems, and provide access to the outside world. Performance here matters, but flexibility and interoperability matter more.
NVIDIA’s implementation: InfiniBand (Quantum-X800) or Converged Ethernet (Spectrum-X), using Rail-Optimized topologies and in-network computing (SHARP) to maintain high performance while providing resilience through dynamic routing.
Google’s implementation: Jupiter Data Center Network, using standard Ethernet frames with customized protocols (Swift) for precise timing. This layer connects TPU Slices in Multislice configurations and bridges to Google’s massive storage infrastructure.
The Convergence Nobody Planned
Both companies have arrived at a similar two-domain structure through completely different paths:
Google started with a flat mesh (the ICI fabric) within a single Pod. To scale beyond the Pod maximum (9,216 chips for TPU v7), they developed Multislice—technology that connects multiple Pods via packet-switched Ethernet (Jupiter DCN). This created Domain 2.
NVIDIA started with a hierarchy (discrete servers connected by networks) and is aggressively expanding Domain 1 to encompass more compute. The NVL72 rack (and future NVL144, NVL572) turns what used to require networking into a single tightly-coupled node.
Both topologies now have two domains. Google’s flat mesh uses scale-out networking to grow beyond the Pod. NVIDIA’s hierarchy uses massive “flat” nodes to reduce networking. Zoom out, and we see huge islands of determinism connected by oceans of dynamic networking.
The philosophical difference remains: NVIDIA believes the network should adapt to hardware imperfections. Google believes the hardware should be perfectly configured by software. But the two-domain structure—proprietary super-nodes plus standard scale-out—has become the inevitable architecture of the Gigawatt era.
With this framework established, we can now examine how each company implements these two domains in practice.
3. NVIDIA’s Implementation: The Hierarchical Federation
NVIDIA’s architecture treats the supercomputer as a hierarchy of networks, each layer optimized for different bandwidth requirements and traffic patterns. Understanding this hierarchy requires examining each layer and how they interconnect.
Domain 1: The Rack as Super-Node (NVLink)
As established in Article 5, the GB200 NVL72 rack contains 72 GPUs connected by a massive copper backplane. This creates a 130 TB/s all-to-all fabric using NVLink 5.0, where every GPU can talk to every other GPU in the rack at full speed through passive copper traces.
NVIDIA’s roadmap expands this rack-scale integration: the Vera Rubin NVL144 (2026) will integrate 144 GPUs, and the Vera Rubin Ultra NVL572 (2027) will reach 572 GPUs—all within a single tightly-coupled super-node. This progressive expansion of Domain 1 is NVIDIA’s strategy for reducing dependency on the external network: the larger the rack-scale unit, the more computation stays within the ultra-low-latency domain.
Why copper: At these bandwidths (1.8 TB/s per GPU), electrical signals degrade rapidly. The NVL72 rack is engineered to keep all 72 GPUs within the maximum distance that passive copper can handle—roughly 1 meter. This avoids the latency and power tax of optical transceivers. (Saves ~200 nanoseconds per hop converting between light and electricity. Saves ~20 kW per rack in power consumption.)
The strategic implication: NVIDIA architected the rack-scale super-node to handle the most bandwidth-intensive operations—specifically Tensor Parallelism, where a single matrix operation is split across multiple GPUs—entirely within the copper domain. As the roadmap progresses from 72 → 144 → 572 GPUs, it’s possible to keep the highest-traffic operations for even trillion-parameter models within Domain 1, avoiding the external network entirely for the most latency-sensitive work.
The result: To the software, the entire rack looks like one giant GPU. NCCL (NVIDIA’s communication library) sees the rack as a single shared-memory domain where communication is instantaneous and lossless.
Domain 2: Connecting the Super-Nodes (InfiniBand)
When workloads require more than 72 GPUs, they must connect multiple NVL72 racks. Each rack maintains its own isolated NVLink domain—the copper backplane cannot extend beyond the physical cabinet. These separate NVLink domains must be bridged by a scale-out network: InfiniBand (or Ethernet).
As NVIDIA’s roadmap expands the rack-scale node (144 GPUs with Vera Rubin NVL144 in 2026, 572 GPUs with Vera Rubin Ultra NVL572 in 2027), the fundamental architecture remains the same. Domain 1 grows larger—keeping more computation within the low-latency copper or proprietary interconnect. Domain 2 (InfiniBand or Ethernet) provides the scale-out fabric connecting these progressively larger super-nodes.
For this scale-out layer, NVIDIA uses the Quantum-X800 InfiniBand switch—their fastest network switch for AI workloads.
The specifications:
Bandwidth: 800 Gbps per port (InfiniBand XDR speed)
Latency: Sub-130 nanoseconds port-to-port
Radix: 144 ports per switch
Scale: To build a non-blocking network for 576 GPUs (8 NVL72 racks), approximately 60-80 switches are required
The Rail-Optimized Topology
This is where NVIDIA’s architecture becomes radically different from conventional data center networking. Instead of building a single large network where any server can talk to any server (a “fat tree”), NVIDIA segregates the network into 72 parallel, independent networks called “Rails.”
To understand Rails, start with the scaling challenge: each NVL72 rack is an isolated 72-GPU NVLink domain. To build a cluster larger than 72 GPUs, you must deploy multiple racks and connect them via InfiniBand. Each GPU in a rack has a dedicated external network connection, allowing it to communicate with corresponding GPUs in other racks.
How Rails work:
Rail 0 connects GPU #0 from every NVL72 rack in the cluster
Rail 1 connects GPU #1 from every NVL72 rack
Rail 2 connects GPU #2 from every NVL72 rack
...and so on through Rail 71
GPU #0 never competes for bandwidth with GPU #1 (which uses Rail 1), GPU #2 (Rail 2), or any other GPU. Each GPU position within the rack has its own physically isolated network path to corresponding GPUs in all other racks.
The result: Elimination of contention. By segregating traffic based on GPU position within the rack, the Rail-Optimized topology ensures that the massive, synchronized flows of AI training never cross paths. It effectively creates 72 separate, non-blocking supercomputers operating in parallel across all racks.
As the rack-scale node grows (144 GPUs in NVL144, 572 GPUs in NVL572), the Rail principle scales proportionally. An NVL144 cluster would use 144 Rails; an NVL572 cluster would use 572 Rails. The larger the Domain 1 unit, the more Rails required—but the architecture remains the same: one Rail per GPU position, connecting corresponding GPUs across all super-nodes in the cluster.
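The rail mapping is mechanical enough to sketch in a few lines of Python. This is a toy model of the idea, not NVIDIA's actual fabric manager; the 8-rack cluster follows the 576-GPU example from this section, and the one-switch-per-rail assumption is ours:

```python
def rail_for(gpu_position: int) -> int:
    """In a Rail-Optimized topology, the rail is simply the GPU's
    position within its rack: GPU #k of every rack shares Rail k."""
    return gpu_position

def rail_members(rail: int, num_racks: int, gpus_per_rack: int = 72):
    """All (rack, gpu_position) endpoints on one rail: one GPU per rack."""
    assert 0 <= rail < gpus_per_rack
    return [(rack, rail) for rack in range(num_racks)]

# 8 NVL72 racks -> 576 GPUs partitioned into 72 independent rails.
racks, gpus_per_rack = 8, 72
assert rail_for(17) == 17                    # GPU #17 always rides Rail 17
assert len(rail_members(0, racks)) == racks  # 8 endpoints share each rail
# With one leaf switch per rail, the scale-out layer needs ~72 switches,
# in line with the "60-80 switches" estimate quoted earlier.
print(gpus_per_rack, "rails x", racks, "GPUs per rail")
```

Note that no two rails ever share an endpoint, which is the formal statement of the "zero contention" property.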
In-Network Computing (SHARP)
The Quantum-X800 switches include a critical InfiniBand feature: SHARP v4 (Scalable Hierarchical Aggregation and Reduction Protocol). Traditional All-Reduce requires each GPU to send data to a designated “reducer” GPU, which adds the values and sends the result back. SHARP moves this arithmetic into the switch itself.
When packets arrive at the Quantum-X800, an ALU (Arithmetic Logic Unit) inside the switch performs the addition as packets pass through. Instead of requiring N hops to aggregate data from N GPUs, SHARP reduces the operation to log(N) hops. For a cluster with 8 NVL72 racks (576 GPUs total), this cuts network traffic by roughly 50% and latency by 2-3x. The switch is no longer just routing packets—it’s computing.
Scaling Across the Data Center
When clusters scale to tens of thousands of GPUs—requiring hundreds of rack-scale super-nodes—they enter additional layers of switching. Here, NVIDIA uses additional Quantum-X800 switches configured as “Spine” switches to interconnect groups of racks distributed across the data center.
The physical challenge: Racks may be 50-100 meters apart, far beyond copper cable range. The solution is optical fiber with active transceivers in the switches. Each Quantum-X800 Spine switch converts electrical signals to light, routes the optical packets, and converts back to electricity at the destination.
The latency penalty: Each optical hop adds approximately 200 nanoseconds—not much in human terms, but significant when GPU clocks run near 2 GHz (one clock cycle every 0.5 nanoseconds). For operations that must traverse multiple spine switches, this latency accumulates. This is why NVIDIA’s roadmap focuses on expanding the rack-scale super-node—by keeping more GPUs within Domain 1 (72 → 144 → 572 GPUs), fewer operations need to traverse the higher-latency Domain 2 fabric.
The rail structure extends: Even at the spine level, the Rail-Optimized topology persists. Whether connecting 72-GPU racks, 144-GPU racks, or future 572-GPU super-nodes, the Rails remain physically separate all the way to the top of the hierarchy, ensuring zero contention even in 100,000-GPU clusters.
The Ethernet Alternative: Spectrum-X
While InfiniBand remains NVIDIA’s highest-performance option for dedicated AI supercomputers, they also offer Spectrum-X Ethernet for customers integrating AI into existing cloud infrastructures.
Spectrum-X uses standard Ethernet cabling but modifies protocol behavior to mimic InfiniBand’s lossless characteristics. It employs:
RoCE v2 (RDMA over Converged Ethernet) for direct memory access
Adaptive Routing that dynamically sprays packets across all available paths
This allows AI traffic to coexist with traditional cloud workloads while maintaining 95% effective bandwidth utilization—approaching InfiniBand performance while preserving Ethernet’s ubiquity and interoperability.
Software Orchestration: NCCL (Dynamic Discovery)
NVIDIA’s communication library, NCCL (NVIDIA Collective Communications Library), is designed for the flexibility that hierarchical networks require.
When an NVIDIA cluster powers on, NCCL performs dynamic discovery: it “looks around” to see which GPUs are available, measures the network topology, and builds an optimal communication tree on the fly. If one of the thousands of switches in a massive cluster fails, NCCL detects the failure and dynamically reroutes traffic around it. The cluster continues operating, potentially at slightly reduced performance, rather than halting entirely.
This flexibility comes at a cost: NCCL’s runtime overhead. Every communication operation requires negotiation—checking which path is optimal, managing buffers, handling unexpected congestion. For stable, predictable workloads, this overhead is pure inefficiency. But for real-world deployments where hardware fails, cables get unplugged, and maintenance windows require partial cluster shutdowns, NCCL’s resilience is essential.
4. Google’s Implementation: The Flat Optical Mesh
Google’s network architecture rejects hierarchy entirely—at least within a Pod. Instead of building layers of progressively slower networks, they group thousands of chips into a single, unified supercomputer called a “Pod.” Inside this Pod, a massive optical fabric ensures that every TPU has identical bandwidth and latency to every other TPU. By flattening the topology, Google creates a grid where thousands of chips can communicate as if they were all next to each other.
Domain 1: The Pod as Super-Node (ICI)
As introduced in Article 5, Google’s Inter-Chip Interconnect (ICI) creates a 3D Torus topology. Each TPU connects directly to six neighbors—north, south, east, west, up, and down—via dedicated 600 GB/s optical links.
The key property: Uniform structure. A 3D Torus is symmetric: every chip occupies an identical position in the topology, with the same six neighbors and the same wraparound links. Hop count still depends on how far apart two chips sit in the grid, but the wraparound links keep the worst case short, and the view from any chip is identical to the view from any other. There are no “fast” connections near the “center” and no “slow” connections at the “edge,” because there is no center or edge.
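The wraparound links are what keep distances short and make every chip's view identical. A minimal sketch of hop distance in a 3D torus (the 16 x 16 x 36 dimensions are an illustrative arrangement of 9,216 chips, not Google's published layout):

```python
def torus_hops(a, b, dims):
    """Minimum hop count between chips a and b in a torus of the given
    dimensions: per axis, take the shorter of the direct path and the
    wraparound path."""
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, dims))

dims = (16, 16, 36)  # illustrative 9,216-chip arrangement

# Wraparound: the "far corner" is actually 3 hops away, not 65.
assert torus_hops((0, 0, 0), (15, 15, 35), dims) == 3

# Symmetry: shifting both endpoints by the same offset never changes
# the distance -- no chip is privileged.
shift = lambda p, s: tuple((x + y) % d for x, y, d in zip(p, s, dims))
a, b, s = (2, 3, 4), (9, 1, 30), (5, 5, 5)
assert torus_hops(a, b, dims) == torus_hops(shift(a, s), shift(b, s), dims)
print("max hops per axis:", [d // 2 for d in dims])
```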
Circuit-switched, not packet-switched: Unlike NVIDIA’s InfiniBand network, which routes packets dynamically based on congestion, Google’s ICI uses optical circuit switching. At the start of a training run, the Optical Circuit Switches (OCS) physically rotate thousands of tiny MEMS mirrors to create dedicated optical paths between specific TPU pairs. Once these paths are established, data flows as continuous light beams without packet headers, routing decisions, or buffering.
The performance implication: Google’s 600 GB/s per-chip bandwidth is lower than NVIDIA’s 1.8 TB/s NVLink. However, the circuit-switched nature means there is zero packet overhead—no headers, no routing lookups, no buffering delays. The full 600 GB/s is available bandwidth, and latency is deterministic: exactly the same number of hops from any chip to any other.
The Optical Circuit Switch (OCS): Enabling the Flat Mesh
The physical device enabling Google’s flat mesh is the Optical Circuit Switch, internally codenamed “Palomar.” This is one of Google’s most closely guarded hardware innovations.
The technology: A Palomar OCS is a 136×136 port switch containing thousands of tiny MEMS (Micro-Electro-Mechanical Systems) mirrors. Each mirror is roughly the width of a human hair and can rotate on a microscopic gimbal. When a beam of light enters the OCS, a mirror physically redirects that beam to one of 136 output ports, creating a direct optical connection.
No electrical conversion: In the OCS, photons enter as light, bounce off mirrors, and exit as light. This eliminates the ~200 nanosecond penalty of optical-electrical-optical conversion and saves power (no lasers needed in the switch itself).
Reconfigurability: The MEMS mirrors can physically rotate to new positions in milliseconds. This means Google can reconfigure the network topology between training runs. If one experiment requires a regular Torus and another requires a twisted Torus variant, the mirrors rotate, and the network physically becomes that topology. The hardware is programmable at the optical layer.
The Apollo fabric layer: Multiple Palomar OCS units are arranged into the “Apollo” optical switching platform. Apollo acts as a building-scale reconfigurable patch panel, connecting thousands of TPU server trays into the desired mesh topology. For a 9,216-chip TPU v7 Pod, hundreds of OCS units work in concert to create the 3D Torus mesh.
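Functionally, an OCS behaves like a programmable permutation of its ports. A toy model of that behavior (the 136-port count comes from this section; the class and method names are our own invention):

```python
class OpticalCircuitSwitch:
    """Toy model of a 136x136 OCS: each input port's mirror aims at
    exactly one output port, forming a (partial) permutation.
    Reconfiguring means re-aiming mirrors; nothing inspects packets."""

    def __init__(self, ports: int = 136):
        self.ports = ports
        self.mirror = {}  # input port -> output port

    def configure(self, mapping: dict):
        # Two beams cannot land on one output port.
        assert len(set(mapping.values())) == len(mapping)
        self.mirror = dict(mapping)

    def route(self, in_port: int) -> int:
        """Light entering in_port exits at the mirrored port, unchanged."""
        return self.mirror[in_port]

ocs = OpticalCircuitSwitch()
ocs.configure({0: 135, 1: 7})   # training run A's topology
assert ocs.route(0) == 135
ocs.configure({0: 64, 1: 7})    # milliseconds later: run B's topology
assert ocs.route(0) == 64
```

The point of the sketch: a "route" is a fixed physical mapping established once, which is why there are no headers, lookups, or buffers on the data path.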
The Pod vs. The Rack: Optical Freedom vs. Copper Constraints
To understand Google’s architecture, one must distinguish between NVIDIA’s physically-based Domain 1 and Google’s optically-based Domain 1.
NVIDIA’s Domain 1: The Physical Rack
For NVIDIA, the fundamental building block of Domain 1 is the copper-based rack. This is a hard physical boundary defined by copper physics. The 72 GPUs inside are bound together by a copper backplane with a maximum range of ~1 meter. You cannot dynamically decide to make a “100-GPU Rack” or a “50-GPU Rack” without physically rewiring the hardware.
The rack is fixed. Once manufactured, its boundaries are immutable.
Google’s Domain 1: The Optical Pod
For Google, the fundamental building block of Domain 1 is the optically-based Pod. A Pod is the maximum ICI-connected configuration—up to 9,216 chips for TPU v7 (also called a “SuperPod” in some Google documentation).
This maximum is determined by the physical limits of the Optical Circuit Switch fabric: the reach of optical fibers and the port count of the OCS units. Because the cabling is optical (fiber) rather than electrical (copper), distance is not the limiting factor that constrains NVIDIA—but there is still a practical maximum to how many chips can be woven into a single deterministic mesh.
Physical flexibility: The Pod can span multiple rows of physical racks. The OCS mirrors define which TPUs connect to which, regardless of their physical location in the data center—as long as they’re within fiber range.
Configurable domains: Google can provision compute domains of various sizes from the Pod (64 chips, 512 chips, 4,096 chips, etc.) by programming the OCS mirrors. The same physical hardware can be reconfigured to serve different topologies.
Isolation: Each provisioned domain is optically isolated. Traffic within one domain never contends with traffic from other domains, even if they share the same physical racks.
Where NVIDIA scales by deploying more physical racks, Google scales by connecting more TPUs into larger optical Pods, then provisioning domains of appropriate sizes from that hardware pool.
Domain 2: Scaling Beyond the Pod (Multislice)
While the OCS allows for massive Pods, there is a physical limit to how many chips can be connected in a single low-latency ICI mesh—currently 9,216 chips for TPU v7. This limit is determined by the optical fiber reach and the port count of the Optical Circuit Switches. Within this boundary, every chip communicates via the ultra-fast, proprietary ICI mesh.
To scale beyond 9,216 chips, Google employs Multislice—a technology that connects multiple full Pods together. For example, connecting two TPU v7 Pods via Multislice creates an 18,432-chip system. The largest Multislice deployments (reportedly used for training Gemini Ultra) have spanned tens of thousands of chips, requiring multiple fully-populated Pods connected together.
This architecture bifurcates traffic into two domains:
Intra-Pod (Domain 1): Traffic remains on the deterministic ICI fabric (the 3D Torus). Here, XLA’s compile-time scheduling ensures perfect synchronization.
Inter-Pod (Domain 2): Traffic traverses the Jupiter Data Center Network (DCN). This is standard Ethernet with Google’s Swift protocol for enhanced congestion control.
This functionally mirrors NVIDIA’s spine layer. While local traffic within a Pod enjoys the “clockwork” precision of optical circuits, traffic between Pods enters a standard packet-switched network that behaves more like the traditional internet—dynamic and subject to minor jitter. This introduces the two-domain structure we saw earlier: absolute determinism within the Pod, and managed dynamism between Pods.
Jupiter’s role: Jupiter is Google’s unified data center network architecture. It connects not just TPU Pods to each other (in Multislice configurations), but also TPUs to Google’s massive storage infrastructure (Colossus), CPU fleets, and external internet gateways. By using standard Ethernet frames (with customized protocols), Jupiter enables interoperability across Google’s entire infrastructure.
Software Orchestration: XLA (Deterministic Clockwork)
Google’s compiler, XLA (Accelerated Linear Algebra), takes a radically different approach from NVIDIA’s NCCL. XLA assumes perfect knowledge of the network topology and perfect reliability.
Compile-time scheduling: When a JAX program is compiled, XLA analyzes the computation graph and the physical TPU mesh topology. It calculates the exact nanosecond when each piece of data will leave Chip A and arrive at Chip B. There is no runtime negotiation, no dynamic routing, no “checking if the path is clear.” The schedule is computed once, at compile time, and executed blindly.
No handshakes: In NCCL, a sender waits for acknowledgment before transmitting. In XLA, senders transmit without waiting. They know—mathematically—that the receiver will be ready because XLA scheduled the receiver’s computation to complete exactly when the data arrives. This “clockwork execution” eliminates all handshake overhead but requires absolute predictability.
The trade-off: If a TPU fails mid-computation, XLA cannot dynamically reroute. The entire training job typically must restart from the last checkpoint. Google accepts this trade-off because their infrastructure is designed for extremely high reliability (more on this in Section 6), and the performance gains from eliminating all runtime overhead outweigh the cost of occasional restarts.
5. The Physics of Data Movement: Massive Flows and Synchronization
To understand why these radically different topologies exist, we must examine the traffic patterns they’re designed to handle. AI training doesn’t generate random, bursty traffic like a web server. It generates synchronized, massive data transfers.
The All-Reduce Storm
The fundamental operation in data-parallel training is All-Reduce. At the end of each training step, every GPU has computed gradients (updates to the model weights) based on its batch of data. To calculate the true average gradient, every GPU must share its results with every other GPU.
The traffic pattern:
Volume: For a trillion-parameter model, each All-Reduce operation moves terabytes of data
Timing: It happens simultaneously across all chips—every GPU sends data at the exact same millisecond
Synchronization: Training cannot proceed to the next step until every GPU has received every other GPU’s gradients
This is fundamentally different from traditional networking workloads. There’s no “bursty” traffic—it’s a continuous drumbeat of synchronized massive flows. Every few milliseconds, the entire supercomputer pauses to perform All-Reduce, then resumes computation.
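The volumes involved are easy to estimate with the standard cost of a ring All-Reduce, in which each chip sends 2(N−1)/N times the gradient size per step. A back-of-envelope sketch (parameter count, precision, and cluster size are illustrative):

```python
def ring_all_reduce_bytes_per_chip(grad_bytes: float, n_chips: int) -> float:
    """Ring All-Reduce: each chip sends (and receives) 2*(n-1)/n times
    the full gradient size per step -- nearly 2x the gradient footprint."""
    return 2 * (n_chips - 1) / n_chips * grad_bytes

params = 1e12                # a trillion-parameter model
grad_bytes = params * 2      # bf16 gradients: 2 bytes each -> 2 TB
per_chip = ring_all_reduce_bytes_per_chip(grad_bytes, 100_000)
print(f"{per_chip / 1e12:.2f} TB sent per chip, per training step")
# ~4 TB per chip per step: every chip moves terabytes, simultaneously,
# on every drumbeat of the training loop.
```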
Tail Latency: The Straggler Problem
In a synchronized system, performance is determined by the slowest operation, not the average operation. If 99,999 packets arrive in 10 microseconds but a single packet takes 10 milliseconds because it got stuck in a switch buffer, the entire 100,000-GPU cluster halts for 10 milliseconds waiting for that straggler.
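The arithmetic of a synchronization barrier makes the straggler effect stark. A tiny numerical sketch using the figures from the paragraph above:

```python
# 100,000 flows arriving at a synchronization barrier:
# almost all fast, one stuck in a switch buffer.
latencies_us = [10.0] * 99_999 + [10_000.0]   # microseconds

step_delay = max(latencies_us)                # the barrier waits for the slowest
mean = sum(latencies_us) / len(latencies_us)

print(f"mean latency: {mean:.1f} us, step delay: {step_delay:.0f} us")
# One packet in 100,000 inflates the synchronized step by ~1000x,
# even though the mean barely moves.
```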
Congestion Control: Different Philosophies
Both companies have engineered solutions to eliminate tail latency, but through opposite approaches.
NVIDIA (Reactive): InfiniBand uses credit-based flow control. A sender cannot transmit until the receiver has explicitly signaled that buffer space is available. This makes InfiniBand “lossless by design”—packet drops are physically impossible. When congestion occurs, the network pushes back against senders, forcing them to slow down.
The Quantum-X800 adds adaptive routing on top of this: when one path gets congested, traffic automatically sprays across alternative paths. Individual packets may take different routes, but they all arrive reliably, and the receiving NIC reassembles them in order.
Result: NVIDIA’s approach handles congestion reactively. The network continuously monitors its own state and adjusts routing dynamically to avoid bottlenecks. This works even when traffic patterns are unpredictable or when hardware components are operating at different speeds due to thermal throttling or partial failures.
Google (Preventative): The ICI fabric doesn’t have “congestion” in the traditional sense because communication is scheduled at compile time. XLA knows exactly which chips will communicate when, and it schedules operations such that no two transfers ever collide on the same optical circuit.
If the compiler cannot find a valid schedule (because the requested communication pattern would cause congestion), the program fails to compile rather than failing at runtime. The developer must redesign the algorithm to fit the hardware’s communication capacity.
Result: Google’s approach prevents congestion through perfect planning. The network never encounters unexpected traffic because every data movement was calculated in advance. This requires predictable workloads and stable hardware but delivers maximum efficiency when those conditions are met.
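The compiler's job can be caricatured as a link-conflict check: given a static schedule of transfers, verify that no two occupy the same optical circuit in the same time slot. This is a drastic simplification of what XLA actually does, and the function and data shapes are invented for illustration:

```python
from collections import defaultdict

def validate_schedule(transfers):
    """transfers: list of (time_slot, circuit_id) pairs.
    Fail at 'compile time' if two transfers collide on one circuit."""
    occupied = defaultdict(set)
    for slot, circuit in transfers:
        if circuit in occupied[slot]:
            raise ValueError(f"slot {slot}: circuit {circuit} double-booked")
        occupied[slot].add(circuit)
    return True

# A valid clockwork schedule: a circuit is reused only in different slots.
assert validate_schedule([(0, "x+0"), (0, "y+0"), (1, "x+0")])

# An invalid one is rejected before the job ever runs.
try:
    validate_schedule([(0, "x+0"), (0, "x+0")])
except ValueError as e:
    print("rejected:", e)
```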
6. Two Network Operations Doctrines
The architectural differences between NVIDIA and Google’s networks create fundamentally different operational philosophies. How do you keep a supercomputer running when components inevitably fail?
NVIDIA: “Detect & Adapt”
NVIDIA’s approach is built on the assumption that hardware will fail, and the system must gracefully adapt.
The architecture enables flexibility: Because InfiniBand uses dynamic routing and NCCL performs runtime discovery, the system can route around failures. When a switch fails, NCCL detects it (usually within seconds) and rebuilds the communication tree using alternative paths.
Switch failure: If a Quantum-X800 switch fails, NCCL dynamically reroutes traffic through alternative switches. The cluster continues operating, potentially with reduced bandwidth on certain paths, but without halting.
Rack failure: If an entire NVL72 rack fails (power outage, cooling failure, etc.), the cluster can isolate that rack and continue training with the remaining racks. For a data-parallel workload trained across 1,000 racks, losing one rack means restarting from the last checkpoint with 999 racks—annoying but not catastrophic.
Cable failure: Individual cable failures are detected automatically. NCCL marks the failed path as unavailable and routes around it. Cables can be replaced during maintenance windows without shutting down the entire cluster.
The cost: Performance variability. With dynamic routing and rerouting, job-to-job performance varies based on which specific hardware components are currently operational. A training run might take 10% longer this week than last week because two switches are down for maintenance. NVIDIA accepts this variability in exchange for continuous operation.
Monitoring and telemetry: NVIDIA’s infrastructure relies heavily on runtime monitoring. Every switch, cable, and NIC continuously reports health metrics. When anomalies are detected (increased error rates, higher-than-expected latency), the system can proactively isolate potentially failing components before they cause job failures.
Google: “Predict & Purge”
Google’s approach assumes that with sufficient care, hardware won’t fail—and when it does, you remove it before it causes problems.
The architecture requires perfection: Because XLA schedules communication down to the nanosecond, a single “slow” chip (not even broken, just lagging due to thermal issues) breaks the global clockwork. All chips must operate in perfect synchrony, or the deterministic schedule collapses.
Aggressive telemetry: Google’s management software (Borg/GKE) constantly monitors error rates, thermal variance, and performance metrics. If a chip shows pre-failure symptoms—slightly elevated error rates, minor thermal throttling, inconsistent latency—the system proactively evicts the workload from that Pod or migrates it to healthy hardware.
Proactive replacement: Rather than waiting for components to fail, Google uses telemetry to predict failures. A TPU showing signs of degradation is removed from service during scheduled maintenance and replaced before it impacts production workloads.
Frequent checkpointing: Training jobs checkpoint every few minutes. When a failure occurs (or when a component is proactively removed), the job restarts from the most recent checkpoint, losing only minutes of work. The cost of restarting is low enough that dynamic rerouting is unnecessary.
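The "restart instead of reroute" economics follow from simple checkpoint math. A sketch using Young's classical approximation for the optimal checkpoint interval; the checkpoint cost and fleet failure rate below are illustrative, not Google's figures:

```python
import math

def optimal_checkpoint_interval_s(checkpoint_cost_s: float,
                                  mtbf_s: float) -> float:
    """Young's approximation: interval = sqrt(2 * checkpoint_cost * MTBF)."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

checkpoint_cost = 10   # seconds to write a checkpoint (illustrative)
mtbf = 3600            # one failure per hour across the fleet (illustrative)

interval = optimal_checkpoint_interval_s(checkpoint_cost, mtbf)
print(f"checkpoint every ~{interval / 60:.0f} minutes")
# Expected loss per failure is ~interval/2: a couple of minutes of work,
# cheap enough that restarting beats dynamic rerouting.
```

At these rates the optimal cadence lands at a few minutes, which is consistent with the checkpointing frequency described above.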
The benefit: Predictable performance. Every training run on a given Pod configuration achieves identical performance because there’s no dynamic routing introducing variability. This makes capacity planning straightforward and performance debugging easier—if a job is slower than expected, it’s a software problem, not a hardware configuration issue.
7. Which Network Topology Is Better?
Both architectures successfully train trillion-parameter models, but they optimize for different values and constraints.
NVIDIA’s hierarchical, resilient approach is ideal for:
Multi-tenant cloud environments where diverse customers run varied workloads with different scaling requirements
Organizations with varying operational capabilities that may not maintain Google-level infrastructure discipline
Workloads requiring flexibility where jobs must adapt to partial cluster availability or hardware heterogeneity
Incremental scaling where infrastructure grows gradually (72 → 144 → 572 GPUs) rather than in massive Pods
Google’s flat mesh, deterministic approach is ideal for:
Single-tenant research environments training frontier models where the entire cluster serves one purpose
Organizations with infrastructure maturity capable of maintaining ultra-high reliability through proactive management
Workloads demanding performance predictability where consistent iteration time accelerates research progress
Large-scale deployments where provisioning entire 9,216-chip Pods makes economic sense
Each architecture reflects different engineering values:
NVIDIA values resilience and flexibility—the network must work in messy, real-world conditions with imperfect hardware
Google values efficiency and predictability—the network operates as deterministic clockwork, assuming infrastructure excellence
Both approaches have successfully trained the world’s largest models. The choice depends not on technical superiority but on organizational fit.
8. Conclusion: The Two-Tier Reality
The network topologies engineered by NVIDIA and Google represent the two dominant philosophies of the AI era: NVIDIA’s resilient hierarchy versus Google’s deterministic mesh. Yet, as models scale beyond 100,000 chips, physics is forcing a structural similarity that neither company explicitly planned. Both companies have discovered that a trillion-parameter cluster requires two distinct domains:
Domain 1: The “Super-Node” Fabric - A massive, proprietary, ultra-low-latency island where compute is tightly coupled.
For Google, this is the Pod (up to 9,216 TPUs in TPU v7). Inside this boundary, the optical mesh creates a deterministic, flat “bubble” of perfect synchronization. Every chip is equidistant from every other, and XLA schedules communication with nanosecond precision.
For NVIDIA, this is the Rack (72 to 572 GPUs connected via NVLink). By moving to a copper backplane, NVIDIA has essentially turned the rack into a single giant GPU, mimicking the tight coupling of a Google Pod—just at smaller scale with different trade-offs.
Domain 2: The Scale-Out Fabric - A standard, packet-switched network to connect these islands.
For Google, this is Multislice (Jupiter). They have conceded that the flat mesh cannot scale infinitely. To grow beyond 9,216 chips, they must introduce hierarchy, connecting Pods via standard data center networking. Traffic between Pods uses packet-switched Ethernet, entering a world of dynamic routing and managed congestion—exactly what they avoided within the Pod.
For NVIDIA, this is the InfiniBand/Ethernet fabric with Rail-Optimized topology. They use this to bridge their massive-scale racks, employing SHARP in-network computing and adaptive routing to maintain high performance across building-scale distances.
Both have accepted the two-domain structure (super-node + scale-out). Google’s flat mesh uses a hierarchy to scale beyond the Pod (Multislice). NVIDIA’s hierarchy uses massive “flat” nodes to scale (NVL72/144/572). Zoom out, and we see huge islands of determinism connected by oceans of dynamic networking.
A fundamental philosophical divide persists: hierarchical versus flat, dynamic versus deterministic, resilient versus efficient.
NVIDIA believes the network should adapt to the hardware’s imperfections. Google believes the hardware should be perfectly configured by software.
These are not merely technical choices—they reflect different assumptions about how a supercomputer should be built and operated.
In the next article, “Parallelism: The Blueprint of Training”, we will leave the physical layer and move up the stack. We will examine how these topological choices dictate the specific parallelism strategies for training trillion-parameter models.