<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Gigawatt Machine]]></title><description><![CDATA[Massive-Scale AI infrastructure]]></description><link>https://www.gigawattmachine.com</link><image><url>https://substackcdn.com/image/fetch/$s_!vqus!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59867208-fae5-40b9-a570-54b2271d292a_1280x1280.png</url><title>The Gigawatt Machine</title><link>https://www.gigawattmachine.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 15 May 2026 14:27:53 GMT</lastBuildDate><atom:link href="https://www.gigawattmachine.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Tony Wan]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[gigawattmachine@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[gigawattmachine@substack.com]]></itunes:email><itunes:name><![CDATA[Tony Wan]]></itunes:name></itunes:owner><itunes:author><![CDATA[Tony Wan]]></itunes:author><googleplay:owner><![CDATA[gigawattmachine@substack.com]]></googleplay:owner><googleplay:email><![CDATA[gigawattmachine@substack.com]]></googleplay:email><googleplay:author><![CDATA[Tony Wan]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Two Network Topologies: Hierarchical vs. Flat ]]></title><description><![CDATA[The Gigawatt Machine: NVIDIA, Google, and the Engineering of Scale 6/12]]></description><link>https://www.gigawattmachine.com/p/two-network-topologies-hierarchical</link><guid isPermaLink="false">https://www.gigawattmachine.com/p/two-network-topologies-hierarchical</guid><dc:creator><![CDATA[Tony Wan]]></dc:creator><pubDate>Thu, 11 Dec 2025 00:50:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vqus!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59867208-fae5-40b9-a570-54b2271d292a_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>1. Introduction: Beyond the Node</h3><p>In the previous articles, we explored how NVIDIA and Google define the &#8220;compute node&#8221;&#8212;NVIDIA through progressive rack-scale integration (72 to 572 GPUs) and Google through configurable Pods (up to 8,960 TPUs). Now we must confront what happens when you leave the node.</p><p>A trillion-parameter model cannot fit within any single node, no matter how large. Training GPT-5 or Gemini requires tens of thousands of chips working in perfect synchrony across hundreds of racks distributed throughout a data center. The challenge is not computational&#8212;each chip knows how to multiply matrices&#8212;but communications: how do you move petabytes of data per second across a building-scale fabric without turning the network into a bottleneck?</p><p>The physics of this problem forced both companies to abandon the traditional &#8220;fat tree&#8221; network topology used in cloud data centers. 
Instead, they engineered fundamentally different solutions:</p><p><strong>NVIDIA&#8217;s approach: The Hierarchical Federation.</strong> A cluster is a federation of ultra-dense rack-scale supercomputers (NVL72, and in the future NVL144 and NVL572 nodes) connected by layers of electrical switches arranged in a Rail-Optimized topology. Data moves through a three-tier hierarchy: copper NVLink inside the rack, InfiniBand between racks within a pod, and inter-pod networking at the data center scale.</p><p><strong>Google&#8217;s approach: The Flat Optical Mesh.</strong> A cluster is a single, massive optical fabric where every TPU connects to its six neighbors through dedicated circuit-switched paths. There is no hierarchy&#8212;the &#8220;network&#8221; is just the 3D Torus mesh of ICI links, with Optical Circuit Switches (OCS) dynamically reconfiguring the topology as needed.</p><p>This article examines how these two topologies handle the traffic patterns of AI training, and the traffic storms that would collapse a conventional network&#8212;and what happens when hardware fails in a supercomputer.</p><h3>2. The Two-Domain Architecture: Super-Node vs. Scale-Out</h3><p>Before diving into each company&#8217;s specific implementation, we must understand a fundamental pattern that has emerged in both architectures: the bifurcation of networking into two distinct domains.</p><p>As models scale beyond 100,000 chips, physics forces a structural reality that neither company explicitly planned but both discovered independently. A trillion-parameter cluster requires two domains with fundamentally different characteristics.</p><h3>Domain 1: The &#8220;Super-Node&#8221; Fabric (Proprietary)</h3><p>This is the domain of maximum bandwidth and tightest coupling. Here, the &#8220;network&#8221; acts less like a cable and more like a motherboard bus. Chips in this domain communicate as if they were all on the same die.</p><p><strong>Defining characteristics:</strong></p><ul><li><p><strong>Proprietary technology:</strong> Custom interconnects optimized for minimum latency</p></li><li><p><strong>Tight coupling:</strong> Sub-microsecond latency, often with hardware-coherent memory</p></li><li><p><strong>Physical constraints:</strong> Limited by the physics of the interconnect medium (copper range or optical circuit-switch scale)</p></li><li><p><strong>Traffic type:</strong> Handles the most bandwidth-intensive operations that require instant synchronization</p></li></ul><p><strong>The strategic purpose:</strong> Keep the highest-traffic communication patterns&#8212;specifically operations like Tensor Parallelism, where a single matrix multiply is split across multiple chips&#8212;entirely within this domain. By doing so, these operations never touch the slower external network.</p><p><strong>NVIDIA&#8217;s implementation:</strong> The GB200 NVL72 rack, where 72 GPUs connect via a copper backplane using NVLink 5.0, creating a 130 TB/s shared-memory domain. The rack&#8217;s physical boundaries (roughly 1 meter) are defined by copper signal integrity limits.</p><p><strong>Google&#8217;s implementation:</strong> The TPU Pod, where thousands of chips (up to 9,216 in TPU v7) connect via the ICI optical mesh with Optical Circuit Switches. Unlike NVIDIA&#8217;s copper limitation, optical fibers allow the Pod to span multiple physical racks&#8212;though Google can provision domains of various sizes from the Pod. 
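</p><p>As a rough illustration of what provisioning a domain means in practice: a slice request is essentially a sub-box carved out of the larger torus. The sketch below is illustrative only (the pod dimensions and the request_slice helper are hypothetical, not a Google API):</p><pre><code># Illustrative sketch: carving smaller "slices" out of a larger TPU pod torus.
# The pod shape is a hypothetical example, not an official configuration.
POD_SHAPE = (16, 24, 24)   # x, y, z dimensions of the full pod mesh

def request_slice(shape):
    """Check that a requested slice is a sub-box that tiles the pod evenly."""
    for want, have in zip(shape, POD_SHAPE):
        if have % want != 0:
            raise ValueError(f"slice {shape} does not tile pod {POD_SHAPE}")
    chips = shape[0] * shape[1] * shape[2]
    return {"shape": shape, "chips": chips}

print(request_slice((4, 4, 4)))     # 64-chip domain
print(request_slice((8, 8, 8)))     # 512-chip domain
print(request_slice((16, 24, 24)))  # the full 9,216-chip pod
</code></pre><p>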
This flexibility comes at the cost of requiring compile-time scheduling rather than hardware coherence.</p><h3>Domain 2: The Scale-Out Fabric (Standard)</h3><p>When a workload exceeds the boundaries of Domain 1, it must cross into the external network. Scaling the proprietary super-node fabric across an entire building is impractical due to signal degradation, cabling complexity, and the operational risks of creating a single massive fault domain.</p><p><strong>Defining characteristics:</strong></p><ul><li><p><strong>Standard protocols:</strong> Based on industry standards (InfiniBand, Ethernet) for interoperability</p></li><li><p><strong>Packet-switched:</strong> Dynamic routing, buffering, and congestion control</p></li><li><p><strong>Building-scale:</strong> Can span hundreds of meters using optical transceivers</p></li><li><p><strong>Multi-purpose:</strong> Must also connect to storage, CPUs, and external networks</p></li></ul><p><strong>The strategic purpose:</strong> Connect the super-node islands together, bridge to storage systems, and provide access to the outside world. Performance here matters, but flexibility and interoperability matter more.</p><p><strong>NVIDIA&#8217;s implementation:</strong> InfiniBand (Quantum-X800) or Converged Ethernet (Spectrum-X), using Rail-Optimized topologies and in-network computing (SHARP) to maintain high performance while providing resilience through dynamic routing.</p><p><strong>Google&#8217;s implementation:</strong> Jupiter Data Center Network, using standard Ethernet frames with customized protocols (Swift) for precise timing. This layer connects TPU Slices in Multislice configurations and bridges to Google&#8217;s massive storage infrastructure.</p><h3>The Convergence Nobody Planned</h3><p>Both companies have arrived at a similar two-domain structure through completely different paths:</p><p><strong>Google started with a flat mesh</strong> (the ICI fabric) within a single Pod. To scale beyond the Pod maximum (9,216 chips for TPU v7), they developed Multislice&#8212;technology that connects multiple Pods via packet-switched Ethernet (Jupiter DCN). This created Domain 2.</p><p><strong>NVIDIA started with a hierarchy</strong> (discrete servers connected by networks) and is aggressively expanding Domain 1 to encompass more compute. The NVL72 rack (and future NVL144, NVL572) turns what used to require networking into a single tightly-coupled node.</p><p>Both topologies now have two domains. Google&#8217;s flat mesh uses scale-out networking to grow beyond the Pod. NVIDIA&#8217;s hierarchy uses massive &#8220;flat&#8221; nodes to reduce networking. Zoom out, and we see huge islands of determinism connected by oceans of dynamic networking.</p><p>The philosophical difference remains: NVIDIA believes the network should adapt to hardware imperfections. Google believes the hardware should be perfectly configured by software. But the two-domain structure&#8212;proprietary super-nodes plus standard scale-out&#8212;has become the inevitable architecture of the Gigawatt era.</p><p>With this framework established, we can now examine how each company implements these two domains in practice.</p><h3>3. NVIDIA&#8217;s Implementation: The Hierarchical Federation</h3><p>NVIDIA&#8217;s architecture treats the supercomputer as a hierarchy of networks, each layer optimized for different bandwidth requirements and traffic patterns. 
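</p><p>Before walking through the layers, it helps to see the whole hierarchy at a glance. The figures below are simply the round numbers used in this article (per-GPU NVLink bandwidth, per-port InfiniBand speed, rough per-hop optical latency); the snippet is an illustrative summary, not an official specification:</p><pre><code># Rough summary of NVIDIA's three-tier hierarchy, using the round numbers
# quoted in this article (illustrative, not an official specification).
TIERS = [
    {"tier": "NVLink (inside the rack)",    "medium": "passive copper backplane",
     "bandwidth": "1.8 TB/s per GPU",       "scope": "72 GPUs (one NVL72 rack)"},
    {"tier": "InfiniBand (between racks)",  "medium": "Quantum-X800 switches",
     "bandwidth": "800 Gb/s per port",      "scope": "hundreds of GPUs per pod"},
    {"tier": "Spine (across the building)", "medium": "optical fiber, ~200 ns per hop",
     "bandwidth": "800 Gb/s per port",      "scope": "tens of thousands of GPUs"},
]

for t in TIERS:
    print(f'{t["tier"]}: {t["bandwidth"]} over {t["medium"]} ({t["scope"]})')
</code></pre><p>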
Understanding this hierarchy requires examining each layer and how they interconnect.</p><h4>Domain 1: The Rack as Super-Node (NVLink)</h4><p>As established in Article 5, the GB200 NVL72 rack contains 72 GPUs connected by a massive copper backplane. This creates a 130 TB/s all-to-all fabric using NVLink 5.0, where every GPU can talk to every other GPU in the rack at full speed through passive copper traces.</p><p>NVIDIA&#8217;s roadmap expands this rack-scale integration: the Vera Rubin NVL144 (2026) will integrate 144 GPUs, and the Vera Rubin Ultra NVL572 (2027) will reach 572 GPUs&#8212;all within a single tightly-coupled super-node. This progressive expansion of Domain 1 is NVIDIA&#8217;s strategy for reducing dependency on the external network: the larger the rack-scale unit, the more computation stays within the ultra-low-latency domain. </p><p><strong>Why copper:</strong> At these bandwidths (1.8 TB/s per GPU), electrical signals degrade rapidly. The NVL72 rack is engineered to keep all 72 GPUs within the maximum distance that passive copper can handle&#8212;roughly 1 meter. This avoids the latency and power tax of optical transceivers. (Saves ~200 nanoseconds per hop converting between light and electricity. Saves ~20 kW per rack in power consumption.)</p><p><strong>The strategic implication:</strong> NVIDIA architected the rack-scale super-node to handle the most bandwidth-intensive operations&#8212;specifically Tensor Parallelism, where a single matrix operation is split across multiple GPUs&#8212;entirely within the copper domain. As the roadmap progresses from 72 &#8594; 144 &#8594; 572 GPUs, it&#8217;s possible to keep the highest-traffic operations for even trillion-parameter models  within Domain 1, avoiding the external network entirely for the most latency-sensitive work.</p><p><strong>The result:</strong> To the software, the entire rack looks like one giant GPU. NCCL (NVIDIA&#8217;s communication library) sees the rack as a single shared-memory domain where communication is instantaneous and lossless.</p><h4>Domain 2: Connecting the Super-Nodes (InfiniBand)</h4><p>When workloads require more than 72 GPUs, they must connect multiple NVL72 racks. Each rack maintains its own isolated NVLink domain&#8212;the copper backplane cannot extend beyond the physical cabinet. These separate NVLink domains must be bridged by a scale-out network: InfiniBand (or Ethernet).</p><p>As NVIDIA&#8217;s roadmap expands the rack-scale node (144 GPUs with Vera Rubin NVL144 in 2026, 572 GPUs with Vera Rubin Ultra NVL572 in 2027), the fundamental architecture remains the same. Domain 1 grows larger&#8212;keeping more computation within the low-latency copper or proprietary interconnect. Domain 2 (InfiniBand or Ethernet) provides the scale-out fabric connecting these progressively larger super-nodes.</p><p>For this scale-out layer, NVIDIA uses the Quantum-X800 InfiniBand switch&#8212;their fastest network switch for AI workloads.</p><p>The specifications:</p><ul><li><p>Bandwidth: 800 Gbps per port (InfiniBand XDR speed)</p></li><li><p>Latency: Sub-130 nanoseconds port-to-port</p></li><li><p>Radix: 144 ports per switch</p></li><li><p>Scale: To build a non-blocking network for 576 GPUs (8 NVL72 racks), approximately 60-80 switches are required</p></li></ul><h4>The Rail-Optimized Topology</h4><p>This is where NVIDIA&#8217;s architecture becomes radically different from conventional data center networking. 
Instead of building a single large network where any server can talk to any server (a &#8220;fat tree&#8221;), NVIDIA segregates the network into 72 parallel, independent networks called &#8220;Rails.&#8221;</p><p>To understand Rails, start with the scaling challenge: each NVL72 rack is an isolated 72-GPU NVLink domain. To build a cluster larger than 72 GPUs, you must deploy multiple racks and connect them via InfiniBand. Each GPU in a rack has a dedicated external network connection, allowing it to communicate with corresponding GPUs in other racks.</p><p><strong>How Rails work:</strong></p><ul><li><p>Rail 0 connects GPU #0 from every NVL72 rack in the cluster</p></li><li><p>Rail 1 connects GPU #1 from every NVL72 rack</p></li><li><p>Rail 2 connects GPU #2 from every NVL72 rack</p></li><li><p>...and so on through Rail 71</p></li></ul><p>GPU #0 never competes for bandwidth with GPU #1 (which uses Rail 1), GPU #2 (Rail 2), or any other GPU. Each GPU position within the rack has its own physically isolated network path to corresponding GPUs in all other racks.</p><p><strong>The result: Elimination of contention.</strong> By segregating traffic based on GPU position within the rack, the Rail-Optimized topology ensures that the massive, synchronized flows of AI training never cross paths. It effectively creates 72 separate, non-blocking supercomputers operating in parallel across all racks.</p><p><strong>As the rack-scale node grows</strong> (144 GPUs in NVL144, 572 GPUs in NVL572), the Rail principle scales proportionally. An NVL144 cluster would use 144 Rails; an NVL572 cluster would use 572 Rails. The larger the Domain 1 unit, the more Rails required&#8212;but the architecture remains the same: one Rail per GPU position, connecting corresponding GPUs across all super-nodes in the cluster.</p><h4>In-Network Computing (SHARP)</h4><p>The Quantum-X800 switches include a critical InfiniBand feature: SHARP v4 (Scalable Hierarchical Aggregation and Reduction Protocol). Traditional All-Reduce requires each GPU to send data to a designated &#8220;reducer&#8221; GPU, which adds the values and sends the result back. SHARP moves this arithmetic into the switch itself.</p><p>When packets arrive at the Quantum-X800, an ALU (Arithmetic Logic Unit) inside the switch performs the addition as packets pass through. Instead of requiring N hops to aggregate data from N GPUs, SHARP reduces the operation to log(N) hops. For a cluster with 8 NVL72 racks (576 GPUs total), this cuts network traffic by roughly 50% and latency by 2-3x. The switch is no longer just routing packets&#8212;it&#8217;s computing.</p><h4>Scaling Across the Data Center</h4><p>When clusters scale to tens of thousands of GPUs&#8212;requiring hundreds of rack-scale super-nodes&#8212;they enter additional layers of switching. Here, NVIDIA uses additional Quantum-X800 switches configured as &#8220;Spine&#8221; switches to interconnect groups of racks distributed across the data center.</p><p><strong>The physical challenge:</strong> Racks may be 50-100 meters apart, far beyond copper cable range. The solution is optical fiber with active transceivers in the switches. Each Quantum-X800 Spine switch converts electrical signals to light, routes the optical packets, and converts back to electricity at the destination.</p><p><strong>The latency penalty:</strong> Each optical hop adds approximately 200 nanoseconds&#8212;not much in human terms, but significant when GPUs operate at 4 GHz (one clock cycle every 0.25 nanoseconds). 
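</p><p>A quick back-of-the-envelope makes the cost concrete. The numbers below are the round figures quoted above (roughly 200 ns per optical hop, a 4 GHz clock); the sketch simply converts hops into idle clock cycles:</p><pre><code># Back-of-the-envelope: clock cycles spent waiting per optical spine hop.
# Uses the round figures quoted in the text (200 ns/hop, 4 GHz); illustrative only.
HOP_NS = 200        # added latency per optical hop
CLOCK_GHZ = 4.0     # clock cycles per nanosecond

for hops in (1, 2, 4):
    wait_ns = hops * HOP_NS
    cycles = wait_ns * CLOCK_GHZ
    print(f"{hops} hop(s): {wait_ns} ns of added latency = {cycles:,.0f} idle cycles")
</code></pre><p>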
For operations that must traverse multiple spine switches, this latency accumulates. This is why NVIDIA&#8217;s roadmap focuses on expanding the rack-scale super-node&#8212;by keeping more GPUs within Domain 1 (72 &#8594; 144 &#8594; 572 GPUs), fewer operations need to traverse the higher-latency Domain 2 fabric.</p><p><strong>The rail structure extends:</strong> Even at the spine level, the Rail-Optimized topology persists. Whether connecting 72-GPU racks, 144-GPU racks, or future 572-GPU super-nodes, the Rails remain physically separate all the way to the top of the hierarchy, ensuring zero contention even in 100,000-GPU clusters.</p><h4>The Ethernet Alternative: Spectrum-X</h4><p>While InfiniBand remains NVIDIA&#8217;s highest-performance option for dedicated AI supercomputers, they also offer Spectrum-X Ethernet for customers integrating AI into existing cloud infrastructures.</p><p>Spectrum-X uses standard Ethernet cabling but modifies protocol behavior to mimic InfiniBand&#8217;s lossless characteristics. It employs:</p><ul><li><p>RoCE v2 (RDMA over Converged Ethernet) for direct memory access</p></li><li><p>Adaptive Routing that dynamically sprays packets across all available paths</p></li></ul><p>This allows AI traffic to coexist with traditional cloud workloads while maintaining 95% effective bandwidth utilization&#8212;approaching InfiniBand performance while preserving Ethernet&#8217;s ubiquity and interoperability.</p><h4>Software Orchestration: NCCL (Dynamic Discovery)</h4><p>NVIDIA&#8217;s communication library, NCCL (NVIDIA Collective Communications Library), is designed for the flexibility that hierarchical networks require.</p><p>When an NVIDIA cluster powers on, NCCL performs dynamic discovery: it &#8220;looks around&#8221; to see which GPUs are available, measures the network topology, and builds an optimal communication tree on the fly. If one of the thousands of switches in a massive cluster fails, NCCL detects the failure and dynamically reroutes traffic around it. The cluster continues operating, potentially at slightly reduced performance, rather than halting entirely.</p><p>This flexibility comes at a cost: NCCL&#8217;s runtime overhead. Every communication operation requires negotiation&#8212;checking which path is optimal, managing buffers, handling unexpected congestion. For stable, predictable workloads, this overhead is pure inefficiency. But for real-world deployments where hardware fails, cables get unplugged, and maintenance windows require partial cluster shutdowns, NCCL&#8217;s resilience is essential.</p><h3>4. Google&#8217;s Implementation: The Flat Optical Mesh</h3><p>Google&#8217;s network architecture rejects hierarchy entirely&#8212;at least within a Pod. Instead of building layers of progressively slower networks, they group thousands of chips into a single, unified supercomputer called a &#8220;Pod.&#8221; Inside this Pod, a massive optical fabric ensures that every TPU has identical bandwidth and latency to every other TPU. By flattening the topology, Google creates a grid where thousands of chips can communicate as if they were all next to each other.</p><h4>Domain 1: The Pod as Super-Node (ICI)</h4><p>As introduced in Article 5, Google&#8217;s Inter-Chip Interconnect (ICI) creates a 3D Torus topology. 
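</p><p>A torus is simply a grid in which every axis wraps around, so a chip on one face of the mesh is directly adjacent to the chip on the opposite face. A minimal sketch of the neighbor relationship (the mesh dimensions here are hypothetical):</p><pre><code># Illustrative: the six ICI neighbors of a chip in a 3D torus.
# Mesh dimensions are hypothetical; each axis wraps around, so there are no edges.
DIMS = (16, 24, 24)   # x, y, z size of the torus

def neighbors(x, y, z):
    nx, ny, nz = DIMS
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),   # +x / -x
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),   # +y / -y
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),   # +z / -z
    ]

# Note the wraparound: (0, 0, 0) is adjacent to (15, 0, 0), (0, 23, 0) and (0, 0, 23).
print(neighbors(0, 0, 0))
</code></pre><p>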
Each TPU connects directly to six neighbors&#8212;north, south, east, west, up, and down&#8212;via dedicated 600 GB/s optical links.</p><p><strong>The key property: Uniform distance.</strong> In a 3D Torus, every chip is equidistant from every other chip in terms of network hops. A TPU in position (0,0,0) reaches a TPU in position (32,32,16) through exactly the same number of hops as it would reach (16,16,32). There are no &#8220;fast&#8221; connections near the &#8220;center&#8221; and no &#8220;slow&#8221; connections at the &#8220;edge&#8221;&#8212;because there is no center or edge.</p><p><strong>Circuit-switched, not packet-switched:</strong> Unlike NVIDIA&#8217;s InfiniBand network, which routes packets dynamically based on congestion, Google&#8217;s ICI uses optical circuit switching. At the start of a training run, the Optical Circuit Switches (OCS) physically rotate thousands of tiny MEMS mirrors to create dedicated optical paths between specific TPU pairs. Once these paths are established, data flows as continuous light beams without packet headers, routing decisions, or buffering.</p><p><strong>The performance implication:</strong> Google&#8217;s 600 GB/s per-chip bandwidth is lower than NVIDIA&#8217;s 1.8 TB/s NVLink. However, the circuit-switched nature means there is zero packet overhead&#8212;no headers, no routing lookups, no buffering delays. The full 600 GB/s is available bandwidth, and latency is deterministic: exactly the same number of hops from any chip to any other.</p><h4>The Optical Circuit Switch (OCS): Enabling the Flat Mesh</h4><p>The physical device enabling Google&#8217;s flat mesh is the <strong>O</strong>ptical Circuit Switch, internally codenamed &#8220;Palomar.&#8221; This is one of Google&#8217;s most closely guarded hardware innovations.</p><p>The technology: A Palomar OCS is a 136&#215;136 port switch containing thousands of tiny MEMS (Micro-Electro-Mechanical Systems) mirrors. Each mirror is roughly the width of a human hair and can rotate on a microscopic gimbal. When a beam of light enters the OCS, a mirror physically redirects that beam to one of 136 output ports, creating a direct optical connection.</p><p><strong>No electrical conversion:</strong> In the OCS, photons enter as light, bounce off mirrors, and exit as light. This eliminates the ~200 nanosecond penalty of optical-electrical-optical conversion and saves power (no lasers needed in the switch itself).</p><p><strong>Reconfigurability:</strong> The MEMS mirrors can physically rotate to new positions in milliseconds. This means Google can reconfigure the network topology between training runs. If one experiment requires a Torus topology and another requires a Dragonfly topology, the mirrors rotate, and the network physically becomes that topology. The hardware is programmable at the optical layer.</p><p><strong>The Apollo fabric layer:</strong> Multiple Palomar OCS units are arranged into the &#8220;Apollo&#8221; optical switching platform. Apollo acts as a building-scale reconfigurable patch panel, connecting thousands of TPU server trays into the desired mesh topology. For a 9,216-chip TPU v7 Pod, hundreds of OCS units work in concert to create the 3D Torus mesh.</p><h4>The Pod vs. The Rack: Optical Freedom vs. 
Copper Constraints</h4><p>To understand Google&#8217;s architecture, one must distinguish between NVIDIA&#8217;s physically-based Domain 1 and Google&#8217;s optically-based Domain 1.</p><p><strong>NVIDIA&#8217;s Domain 1: The Physical Rack</strong></p><p>For NVIDIA, the fundamental building block of Domain 1 is the copper-based rack. This is a hard physical boundary defined by copper physics. The 72 GPUs inside are bound together by a copper backplane with a maximum range of ~1 meter. You cannot dynamically decide to make a &#8220;100-GPU Rack&#8221; or a &#8220;50-GPU Rack&#8221; without physically rewiring the hardware.</p><p>The rack is fixed. Once manufactured, its boundaries are immutable.</p><p><strong>Google&#8217;s Domain 1: The Optical Pod</strong></p><p>For Google, the fundamental building block of Domain 1 is the optically-based Pod. A Pod is the maximum ICI-connected configuration&#8212;up to 9,216 chips for TPU v7 (also called a &#8220;SuperPod&#8221; in some Google documentation). </p><p>This maximum is determined by the physical limits of the Optical Circuit Switch fabric: the reach of optical fibers and the port count of the OCS units. Because the cabling is optical (fiber) rather than electrical (copper), distance is not the limiting factor that constrains NVIDIA&#8212;but there is still a practical maximum to how many chips can be woven into a single deterministic mesh.</p><p><strong>Physical flexibility:</strong> The Pod can span multiple rows of physical racks. The OCS mirrors define which TPUs connect to which, regardless of their physical location in the data center&#8212;as long as they&#8217;re within fiber range.</p><p><strong>Configurable domains:</strong> Google can provision compute domains of various sizes from the Pod (64 chips, 512 chips, 4,096 chips, etc.) by programming the OCS mirrors. The same physical hardware can be reconfigured to serve different topologies.</p><p><strong>Isolation:</strong> Each provisioned domain is optically isolated. Traffic within one domain never contends with traffic from other domains, even if they share the same physical racks.</p><p>Where NVIDIA scales by deploying more physical racks, Google scales by connecting more TPUs into larger optical Pods, then provisioning domains of appropriate sizes from that hardware pool.</p><h4>Domain 2: Scaling Beyond the Pod (Multislice)</h4><p>While the OCS allows for massive Pods, there is a physical limit to how many chips can be connected in a single low-latency ICI mesh&#8212;currently 9,216 chips for TPU v7. This limit is determined by the optical fiber reach and the port count of the Optical Circuit Switches. Within this boundary, every chip communicates via the ultra-fast, proprietary ICI mesh.</p><p>To scale beyond 9,216 chips, Google employs <strong>Multislice</strong>&#8212;a technology that connects multiple full Pods together. For example, connecting two TPU v7 Pods via Multislice creates an 18,432-chip system. The largest Multislice deployments (reportedly used for training Gemini Ultra) have spanned tens of thousands of chips, requiring multiple fully-populated Pods connected together.</p><p>This architecture bifurcates traffic into two domains:</p><p>Intra-Pod (Domain 1): Traffic remains on the deterministic ICI fabric (the 3D Torus). Here, XLA&#8217;s compile-time scheduling ensures perfect synchronization.</p><p>Inter-Pod (Domain 2): Traffic traverses the Jupiter Data Center Network (DCN). 
This is standard Ethernet with Google&#8217;s Swift protocol for enhanced congestion control.</p><p>This functionally mirrors NVIDIA&#8217;s spine layer. While local traffic within a Pod enjoys the &#8220;clockwork&#8221; precision of optical circuits, traffic between Pods enters a standard packet-switched network that behaves more like the traditional internet&#8212;dynamic and subject to minor jitter. This introduces the two-domain structure we saw earlier: absolute determinism within the Pod, and managed dynamism between Pods.</p><p><strong>Jupiter&#8217;s role:</strong> Jupiter is Google&#8217;s unified data center network architecture. It connects not just TPU Pods to each other (in Multislice configurations), but also TPUs to Google&#8217;s massive storage infrastructure (Colossus), CPU fleets, and external internet gateways. By using standard Ethernet frames (with customized protocols), Jupiter enables interoperability across Google&#8217;s entire infrastructure.</p><h4>Software Orchestration: XLA (Deterministic Clockwork)</h4><p>Google&#8217;s compiler, XLA (Accelerated Linear Algebra), takes a radically different approach from NVIDIA&#8217;s NCCL. XLA assumes perfect knowledge of the network topology and perfect reliability.</p><p>Compile-time scheduling: When a JAX program is compiled, XLA analyzes the computation graph and the physical TPU mesh topology. It calculates the exact nanosecond when each piece of data will leave Chip A and arrive at Chip B. There is no runtime negotiation, no dynamic routing, no &#8220;checking if the path is clear.&#8221; The schedule is computed once, at compile time, and executed blindly.</p><p>No handshakes: In NCCL, a sender waits for acknowledgment before transmitting. In XLA, senders transmit without waiting. They know&#8212;mathematically&#8212;that the receiver will be ready because XLA scheduled the receiver&#8217;s computation to complete exactly when the data arrives. This &#8220;clockwork execution&#8221; eliminates all handshake overhead but requires absolute predictability.</p><p>The trade-off: If a TPU fails mid-computation, XLA cannot dynamically reroute. The entire training job typically must restart from the last checkpoint. Google accepts this trade-off because their infrastructure is designed for extremely high reliability (more on this in Section 6), and the performance gains from eliminating all runtime overhead outweigh the cost of occasional restarts.</p><h3>5. The Physics of Data Movement: Massive Flows and Synchronization</h3><p>To understand why these radically different topologies exist, we must examine the traffic patterns they&#8217;re designed to handle. AI training doesn&#8217;t generate random, bursty traffic like a web server. It generates synchronized, massive data transfers.</p><h3>The All-Reduce Storm</h3><p>The fundamental operation in data-parallel training is All-Reduce. At the end of each training step, every GPU has computed gradients (updates to the model weights) based on its batch of data. 
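</p><p>Stripped of the hardware, the operation itself is simple: an element-wise sum of the gradient vectors from every worker, divided by the number of workers, with the identical result delivered back to all of them. A minimal sketch with toy values (illustrative only; real systems run ring or tree algorithms over NCCL or ICI rather than gathering everything in one place):</p><pre><code># Minimal, illustrative All-Reduce (mean) over toy per-GPU gradients.
# Real systems use ring/tree algorithms; this only shows WHAT is computed.
grads = [
    [0.10, 0.40, 0.20],   # gradients from GPU 0's batch
    [0.30, 0.00, 0.60],   # gradients from GPU 1's batch
    [0.20, 0.20, 0.40],   # gradients from GPU 2's batch
]

def all_reduce_mean(per_gpu):
    n = len(per_gpu)
    summed = [sum(vals) for vals in zip(*per_gpu)]
    avg = [s / n for s in summed]
    return [avg] * n          # every GPU ends up holding the same averaged gradient

print(all_reduce_mean(grads)[0])   # approximately [0.2, 0.2, 0.4]
</code></pre><p>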
To calculate the true average gradient, every GPU must share its results with every other GPU.</p><p><strong>The traffic pattern:</strong></p><ul><li><p>Volume: For a trillion-parameter model, each All-Reduce operation moves terabytes of data</p></li><li><p>Timing: It happens simultaneously across all chips&#8212;every GPU sends data at the exact same millisecond</p></li><li><p>Synchronization: Training cannot proceed to the next step until every GPU has received every other GPU&#8217;s gradients</p></li></ul><p>This is fundamentally different from traditional networking workloads. There&#8217;s no &#8220;bursty&#8221; traffic&#8212;it&#8217;s a continuous drumbeat of synchronized massive flows. Every few milliseconds, the entire supercomputer pauses to perform All-Reduce, then resumes computation.</p><h3>Tail Latency: The Straggler Problem</h3><p>In a synchronized system, performance is determined by the <strong>slowest operation</strong>, not the average operation. If 99,999 packets arrive in 10 microseconds but a single packet takes 10 milliseconds because it got stuck in a switch buffer, the entire 100,000-GPU cluster halts for 10 milliseconds waiting for that straggler.</p><h3>Congestion Control: Different Philosophies</h3><p>Both companies have engineered solutions to eliminate tail latency, but through opposite approaches.</p><p><strong>NVIDIA (Reactive):</strong> InfiniBand uses credit-based flow control. A sender cannot transmit until the receiver has explicitly signaled that buffer space is available. This makes InfiniBand &#8220;lossless by design&#8221;&#8212;packet drops are physically impossible. When congestion occurs, the network pushes back against senders, forcing them to slow down.</p><p>The Quantum-X800 adds adaptive routing on top of this: when one path gets congested, traffic automatically sprays across alternative paths. Individual packets may take different routes, but they all arrive reliably, and the receiving NIC reassembles them in order.</p><p>Result: NVIDIA&#8217;s approach handles congestion reactively. The network continuously monitors its own state and adjusts routing dynamically to avoid bottlenecks. This works even when traffic patterns are unpredictable or when hardware components are operating at different speeds due to thermal throttling or partial failures.</p><p><strong>Google (Preventative):</strong> The ICI fabric doesn&#8217;t have &#8220;congestion&#8221; in the traditional sense because communication is scheduled at compile time. XLA knows exactly which chips will communicate when, and it schedules operations such that no two transfers ever collide on the same optical circuit.</p><p>If the compiler cannot find a valid schedule (because the requested communication pattern would cause congestion), the program fails to compile rather than failing at runtime. The developer must redesign the algorithm to fit the hardware&#8217;s communication capacity.</p><p>Result: Google&#8217;s approach prevents congestion through perfect planning. The network never encounters unexpected traffic because every data movement was calculated in advance. This requires predictable workloads and stable hardware but delivers maximum efficiency when those conditions are met.</p><h3>6. Two Network Operations Doctrines</h3><p>The architectural differences between NVIDIA and Google&#8217;s networks create fundamentally different operational philosophies. 
How do you keep a supercomputer running when components inevitably fail?</p><h3>NVIDIA: &#8220;Detect &amp; Adapt&#8221;</h3><p>NVIDIA&#8217;s approach is built on the assumption that hardware will fail, and the system must gracefully adapt.</p><p><strong>The architecture enables flexibility:</strong> Because InfiniBand uses dynamic routing and NCCL performs runtime discovery, the system can route around failures. When a switch fails, NCCL detects it (usually within seconds) and rebuilds the communication tree using alternative paths.</p><p><strong>Switch failure:</strong> If a Quantum-X800 switch fails, NCCL dynamically reroutes traffic through alternative switches. The cluster continues operating, potentially with reduced bandwidth on certain paths, but without halting.</p><p><strong>Rack failure:</strong> If an entire NVL72 rack fails (power outage, cooling failure, etc.), the cluster can isolate that rack and continue training with the remaining racks. For a data-parallel workload trained across 1,000 racks, losing one rack means restarting from the last checkpoint with 999 racks&#8212;annoying but not catastrophic.</p><p><strong>Cable failure:</strong> Individual cable failures are detected automatically. NCCL marks the failed path as unavailable and routes around it. Cables can be replaced during maintenance windows without shutting down the entire cluster.</p><p><strong>The cost: Performance variability.</strong> With dynamic routing and rerouting, job-to-job performance varies based on which specific hardware components are currently operational. A training run might take 10% longer this week than last week because two switches are down for maintenance. NVIDIA accepts this variability in exchange for continuous operation.</p><p><strong>Monitoring and telemetry:</strong> NVIDIA&#8217;s infrastructure relies heavily on runtime monitoring. Every switch, cable, and NIC continuously reports health metrics. When anomalies are detected (increased error rates, higher-than-expected latency), the system can proactively isolate potentially failing components before they cause job failures.</p><h3>Google: &#8220;Predict &amp; Purge&#8221;</h3><p>Google&#8217;s approach assumes that with sufficient care, hardware won&#8217;t fail&#8212;and when it does, you remove it before it causes problems.</p><p><strong>The architecture requires perfection:</strong> Because XLA schedules communication down to the nanosecond, a single &#8220;slow&#8221; chip (not even broken, just lagging due to thermal issues) breaks the global clockwork. All chips must operate in perfect synchrony, or the deterministic schedule collapses.</p><p><strong>Aggressive telemetry:</strong> Google&#8217;s management software (Borg/GKE) constantly monitors error rates, thermal variance, and performance metrics. If a chip shows pre-failure symptoms&#8212;slightly elevated error rates, minor thermal throttling, inconsistent latency&#8212;the system proactively evicts the workload from that Pod or migrates it to healthy hardware.</p><p><strong>Proactive replacement:</strong> Rather than waiting for components to fail, Google uses telemetry to predict failures. A TPU showing signs of degradation is removed from service during scheduled maintenance and replaced before it impacts production workloads.</p><p><strong>Frequent checkpointing:</strong> Training jobs checkpoint every few minutes. When a failure occurs (or when a component is proactively removed), the job restarts from the most recent checkpoint, losing only minutes of work. 
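</p><p>A rough model shows why this is tolerable. The inputs below are hypothetical assumptions (not published Google figures): a checkpoint every five minutes, a few seconds to write it, a couple of minutes to restart, and a handful of interruptions per day:</p><pre><code># Rough, illustrative model of checkpoint/restart overhead.
# All inputs are hypothetical assumptions, not published figures.
checkpoint_interval_min = 5.0    # how often the job checkpoints
checkpoint_cost_min     = 0.25   # time spent writing each checkpoint
restart_cost_min        = 2.0    # time to reload state after an interruption
failures_per_day        = 4      # failures plus proactive evictions

minutes_per_day  = 24 * 60
steady_overhead  = checkpoint_cost_min / checkpoint_interval_min
lost_per_failure = restart_cost_min + checkpoint_interval_min / 2   # avg work lost
failure_overhead = failures_per_day * lost_per_failure / minutes_per_day

print(f"checkpoint-writing overhead: {steady_overhead:.1%}")
print(f"failure/restart overhead:    {failure_overhead:.1%}")
print(f"total lost time:             {steady_overhead + failure_overhead:.1%} of the day")
</code></pre><p>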
The cost of restarting is low enough that dynamic rerouting is unnecessary.</p><p><strong>The benefit: Predictable performance.</strong> Every training run on a given Pod configuration achieves identical performance because there&#8217;s no dynamic routing introducing variability. This makes capacity planning straightforward and performance debugging easier&#8212;if a job is slower than expected, it&#8217;s a software problem, not a hardware configuration issue.</p><h3>7. Which Network Topology Is Better?</h3><p>Both architectures successfully train trillion-parameter models, but they optimize for different values and constraints.</p><p><strong>NVIDIA&#8217;s hierarchical, resilient approach is ideal for:</strong></p><ul><li><p>Multi-tenant cloud environments where diverse customers run varied workloads with different scaling requirements</p></li><li><p>Organizations with varying operational capabilities that may not maintain Google-level infrastructure discipline</p></li><li><p>Workloads requiring flexibility where jobs must adapt to partial cluster availability or hardware heterogeneity</p></li><li><p>Incremental scaling where infrastructure grows gradually (72 &#8594; 144 &#8594; 572 GPUs) rather than in massive Pods</p></li></ul><p><strong>Google&#8217;s flat mesh, deterministic approach is ideal for:</strong></p><ul><li><p>Single-tenant research environments training frontier models where the entire cluster serves one purpose</p></li><li><p>Organizations with infrastructure maturity capable of maintaining ultra-high reliability through proactive management</p></li><li><p>Workloads demanding performance predictability where consistent iteration time accelerates research progress</p></li><li><p>Large-scale deployments where provisioning entire 9,216-chip Pods makes economic sense</p></li></ul><p>Each architecture reflects different engineering values:</p><ul><li><p>NVIDIA values resilience and flexibility&#8212;the network must work in messy, real-world conditions with imperfect hardware</p></li><li><p>Google values efficiency and predictability&#8212;the network operates as deterministic clockwork, assuming infrastructure excellence</p></li></ul><p>Both approaches have successfully trained the world&#8217;s largest models. The choice depends not on technical superiority but on organizational fit.</p><h3>8. Conclusion: The Two-Tier Reality</h3><p>The network topologies engineered by NVIDIA and Google represent the two dominant philosophies of the AI era: NVIDIA&#8217;s resilient hierarchy versus Google&#8217;s deterministic mesh. Yet, as models scale beyond 100,000 chips, physics is forcing a structural similarity that neither company explicitly planned. Both companies have discovered that a trillion-parameter cluster requires two distinct domains:</p><p><strong>Domain 1: The &#8220;Super-Node&#8221; Fabric</strong> - A massive, proprietary, ultra-low-latency island where compute is tightly coupled.</p><p>For Google, this is the Pod (up to 9,216 TPUs in TPU v7). Inside this boundary, the optical mesh creates a deterministic, flat &#8220;bubble&#8221; of perfect synchronization. Every chip is equidistant from every other, and XLA schedules communication with nanosecond precision.</p><p>For NVIDIA, this is the Rack (72 to 572 GPUs connected via NVLink). 
By moving to a copper backplane, NVIDIA has essentially turned the rack into a single giant GPU, mimicking the tight coupling of a Google Pod&#8212;just at smaller scale with different trade-offs.</p><p><strong>Domain 2: The Scale-Out Fabric</strong> - A standard, packet-switched network to connect these islands.</p><p>For Google, this is Multislice (Jupiter). They have conceded that the flat mesh cannot scale infinitely. To grow beyond 9,216 chips, they must introduce hierarchy, connecting Pods via standard data center networking. Traffic between Pods uses packet-switched Ethernet, entering a world of dynamic routing and managed congestion&#8212;exactly what they avoided within the Pod.</p><p>For NVIDIA, this is the InfiniBand/Ethernet fabric with Rail-Optimized topology. They use this to bridge their massive-scale racks, employing SHARP in-network computing and adaptive routing to maintain high performance across building-scale distances.</p><p>Both have accepted the two-domain structure (super-node + scale-out). Google&#8217;s flat mesh uses a hierarchy to scale beyond the Pod (Multislice). NVIDIA&#8217;s hierarchy uses massive &#8220;flat&#8221; nodes to scale (NVL72/144/572). Zoom out, and we see huge islands of determinism connected by oceans of dynamic networking.</p><p><strong>A fundamental philosophical divide persists</strong>: hierarchical versus flat, dynamic versus deterministic, resilient versus efficient. </p><p>NVIDIA believes the network should adapt to the hardware&#8217;s imperfections. Google believes the hardware should be perfectly configured by software.</p><p>These are not merely technical choices&#8212;they reflect different assumptions about how a supercomputer should be built and operated.</p><p>In the next article, <strong>&#8220;Parallelism: The Blueprint of Training&#8221;, </strong> we will leave the physical layer and move up the stack. We will examine how these topological choices dictate the specific parallelism strategies for training trillion-parameter models.</p>]]></content:encoded></item><item><title><![CDATA[Two Compute Nodes: Physical vs. Virtual]]></title><description><![CDATA[The Gigawatt Machine: NVIDIA, Google, and the Engineering of Scale 5/12]]></description><link>https://www.gigawattmachine.com/p/two-compute-nodes-physical-vs-virtual</link><guid isPermaLink="false">https://www.gigawattmachine.com/p/two-compute-nodes-physical-vs-virtual</guid><dc:creator><![CDATA[Tony Wan]]></dc:creator><pubDate>Thu, 11 Dec 2025 00:49:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vqus!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59867208-fae5-40b9-a570-54b2271d292a_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>1. Introduction</h3><p>In the previous articles, we examined the silicon engines (Article 2), the density strategies (Article 3), and the software ecosystems (Article 4) that power AI supercomputers. Now we turn to the fundamental question: What is the atomic unit of computation?</p><p>For the last decade, the industry agreed on a definition: a &#8220;node&#8221; was a discrete, physical server chassis housing 8 GPUs. This &#8220;supercomputer-in-a-box&#8221; (like the NVIDIA DGX) was the standard building block. 
You connected thousands of these nodes with high-speed networking to build a cluster.</p><p>However, as we enter the Gigawatt era, the definition of the &#8220;compute node&#8221; has bifurcated into two contrasting philosophies, each driven by the architectural choices we explored in Article 1:</p><p><strong>Path A (NVIDIA): The Physical Evolution.</strong> The node is getting bigger. NVIDIA has evolved from a server-scale node (8 GPUs) to a rack-scale node (72+ GPUs), turning the entire cabinet into a single, liquid-cooled monolith. The strategy is to compress maximum compute into a defined physical volume.</p><p><strong>Path B (Google): The Virtual Evolution.</strong> The node is disappearing. Google has abstracted away the physical hardware, treating the &#8220;node&#8221; as a virtual entry point into a massive, flat mesh of TPUs. The strategy is to make the physical packaging irrelevant so the fabric can grow without bounds.</p><p>This article deconstructs how the physics of copper, cooling, and fabric have driven NVIDIA to build bigger boxes, while Google has made the box disappear entirely.</p><h3>2. The NVIDIA Path: Physical Scale-Up</h3><p>NVIDIA&#8217;s philosophy is rooted in density. To make models run faster, compress the compute into the smallest possible physical space. This has driven a two-phase evolution: from the server to the rack.</p><h4>Phase 1: The Server-Scale Node (DGX B200)</h4><p>For enterprise deployments and standard clusters, the atomic unit is the 8-GPU server. The DGX B200 represents the current instantiation of this design philosophy&#8212;the most powerful air-cooled system in a server footprint.</p><p><strong>Why 8 GPUs? The Engineering Limits</strong></p><p>The number 8 results from two hard physical constraints:</p><ul><li><p>The Copper Limit: At NVLink speeds (1.8 TB/s bidirectional), electrical signals degrade rapidly. The 8-GPU chassis is the maximum size where you can connect all chips via passive copper traces on a motherboard without signal integrity issues.</p></li><li><p>Switch Radix: Eight GPUs saturate the port count of the internal NVSwitch chips. Adding a 9th GPU would break the non-blocking fabric architecture.</p></li></ul><p><strong>The Internal Fabric: A Single NVLink Domain</strong></p><p>Inside the DGX B200, the 8 GPUs function as one unified system:</p><ul><li><p>GPU-to-GPU (East-West): 18 NVLink 5.0 connections per GPU provide 1.8 TB/s of bidirectional bandwidth. This switching happens entirely inside the box.</p></li><li><p>CPU-to-GPU (North-South): The B200 GPUs connect to the dual x86 host CPUs over PCIe Gen5. (In NVIDIA&#8217;s Grace-based GB200 systems, NVLink-C2C replaces PCIe entirely, fusing the Grace CPU and Blackwell GPUs into a single memory space so the GPUs can use LPDDR5X system memory as their own.)</p></li></ul><p><strong>The Thermal Wall: Pushing Air Cooling to Its Limit</strong></p><ul><li><p>Form Factor: The DGX B200 expanded from 8U (DGX H100) to 10U to accommodate larger heatsinks.</p></li><li><p>Power Draw: 14.3 kW&#8212;a 40% increase over the 10.2 kW DGX H100.</p></li><li><p>Deployment Challenge: At this power density, operators must leave empty rack slots to prevent &#8220;hot spots,&#8221; and fans scream at 10,000+ RPM.</p></li></ul><p>The DGX B200 delivers 72 PetaFLOPS of FP8 training compute (144 PetaFLOPS at FP4 for inference) per node, but it represents the absolute limit of what air cooling can achieve.</p><h4>Phase 2: The Rack-Scale Node (GB200 NVL72)</h4><p>To break the 8-GPU limit, NVIDIA redefined the physical boundaries of the node. 
The node is no longer a server; the node is the entire rack.</p><p>The Monolith: 72 GPUs as One Computer</p><p>The GB200 NVL72 is a 120 kW, liquid-cooled, 42U rack. It is not a collection of servers&#8212;it&#8217;s a single computer:</p><ul><li><p>The Architecture: 72 B200 GPUs and 36 Grace CPUs are hard-wired together with a massive copper backplane (the &#8220;spine&#8221;) that runs vertically through the cabinet.</p></li><li><p>The Copper Advantage: Because the GPUs are stacked vertically in one cabinet, NVIDIA can connect all 72 using passive copper cabling. This avoids power-hungry optical transceivers, saving approximately 20kW per rack.</p></li><li><p>Unified Memory: The entire system functions as one giant accelerator with 31 TB of unified memory (13.8 TB HBM3e + 17.3 TB LPDDR5X).</p></li></ul><p><strong>Performance: The Generational Leap</strong></p><p>The evolution from Hopper through Blackwell and into the roadmapped Vera Rubin platforms reveals the trajectory of rack-scale computing:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sKHF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e37bd83-ddcb-487c-bcbb-a92924f61c92_1510x764.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sKHF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e37bd83-ddcb-487c-bcbb-a92924f61c92_1510x764.png 424w, https://substackcdn.com/image/fetch/$s_!sKHF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e37bd83-ddcb-487c-bcbb-a92924f61c92_1510x764.png 848w, https://substackcdn.com/image/fetch/$s_!sKHF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e37bd83-ddcb-487c-bcbb-a92924f61c92_1510x764.png 1272w, https://substackcdn.com/image/fetch/$s_!sKHF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e37bd83-ddcb-487c-bcbb-a92924f61c92_1510x764.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sKHF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e37bd83-ddcb-487c-bcbb-a92924f61c92_1510x764.png" width="1456" height="737" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e37bd83-ddcb-487c-bcbb-a92924f61c92_1510x764.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:737,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:274295,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.gigawattmachine.com/i/181195531?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e37bd83-ddcb-487c-bcbb-a92924f61c92_1510x764.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!sKHF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e37bd83-ddcb-487c-bcbb-a92924f61c92_1510x764.png 424w, https://substackcdn.com/image/fetch/$s_!sKHF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e37bd83-ddcb-487c-bcbb-a92924f61c92_1510x764.png 848w, https://substackcdn.com/image/fetch/$s_!sKHF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e37bd83-ddcb-487c-bcbb-a92924f61c92_1510x764.png 1272w, https://substackcdn.com/image/fetch/$s_!sKHF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e37bd83-ddcb-487c-bcbb-a92924f61c92_1510x764.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The GB200 NVL72 achieves 81x the raw compute of the DGX H100, while simultaneously improving power efficiency by 6.4x (from ~1.57 to ~10 PetaFLOPS per kilowatt). The roadmap shows NVIDIA&#8217;s commitment to continuing this rack-scale expansion, with the Vera Rubin Ultra NVL572 targeting nearly 4x the performance of today&#8217;s GB200 NVL72.</p><p><strong>Liquid Cooling: The Infrastructure Mandate</strong></p><p>At 120 kW per rack, air cooling is physically impossible. 
The GB200 NVL72 uses direct-to-chip (DTC) liquid cooling, where cold plates are mounted on every GPU and CPU:</p><ul><li><p>Water Requirement: Standard facility water (inlet temperature up to 25&#176;C / 77&#176;F).</p></li><li><p>Infrastructure Impact: Data centers must install industrial-scale hydraulic systems, circulating millions of gallons of coolant.</p></li></ul><p>As rack-scale nodes continue to grow (144 GPUs, then 572 GPUs), the power density and cooling requirements will intensify further, driving data centers toward becoming industrial power plants with integrated hydraulic infrastructure.</p><p><strong>The Network: Pushing the Bottleneck</strong></p><p>The critical innovation is where the &#8220;cluster&#8221; begins:</p><ul><li><p>DGX B200 (Server-Scale): The cluster begins immediately outside the 8-GPU chassis. Even communication between servers in the same rack must traverse the external InfiniBand network (800 Gb/s).</p></li><li><p>GB200 NVL72 (Rack-Scale): The cluster begins only when connecting one rack to another. All 72 GPUs communicate over the internal NVLink fabric, avoiding the external network entirely. This pushes the network bottleneck from every 8 GPUs to every 72 GPUs&#8212;a 9x reduction in inter-node traffic.</p></li></ul><p><strong>Scaling Further: The Multi-Generation Roadmap</strong></p><p>The GB200 NVL72 is not the endpoint of NVIDIA&#8217;s rack-scale vision. Using the 5th-generation and future NVLink Switch Systems, NVIDIA is systematically expanding the maximum size of a single NVLink Domain:</p><p>Today - Blackwell (2024-2025):</p><ul><li><p>GB200 NVL72: 72 GPUs in a single rack</p></li><li><p>GB200 NVL72 Pods: Up to 8 racks = 576 GPUs in one continuous NVLink Domain</p></li></ul><p>Near Future - Vera Rubin (2026):</p><ul><li><p>Rubin NVL144: 144 GPUs in a multi-rack configuration</p></li><li><p>This effectively doubles the size of the rack-scale node, allowing massive models to remain within a single, high-speed memory space without crossing into slower inter-rack networking.</p></li></ul><p>Far Future - Vera Rubin Ultra (2027):</p><ul><li><p>Rubin Ultra NVL572: 572 GPUs in a single NVLink Domain</p></li><li><p>At this scale, the &#8220;Super-Node&#8221; effectively becomes a small supercomputer, with nearly 600 GPUs addressable as one giant, coherently-cached brain.</p></li><li><p>The 5,000 PetaFLOPS of compute in a single domain approaches the entire computational capacity of early supercomputing facilities.</p></li></ul><p>This roadmap reveals NVIDIA&#8217;s long-term strategy: expand the physical node to encompass what was once an entire cluster. By 2027, the rack-scale node will be 8x larger than today (572 vs 72 GPUs), effectively turning multiple aisles of racks into one massive, unified logic gate.</p><p>Implication: The &#8220;network bottleneck&#8221; that limits distributed training will be pushed from every 72 GPUs (today) to every 572 GPUs (2027)&#8212;nearly eliminating it for most trillion-parameter model training workloads.</p><h3>3. The Google Path: Virtual Scale-Out</h3><p>Google&#8217;s philosophy is rooted in fabric scale. They prioritize the size of the network over the density of the box. 
Consequently, their &#8220;node&#8221; story is not about building a bigger box, but about making the box irrelevant so the matrix can grow.</p><h4>The &#8220;Invisible&#8221; Physical Node</h4><p>If you walked into a Google data center and pulled out a TPU server tray, you would hold a relatively modest piece of hardware: a standard-sized board with 4 or 8 TPU chips.</p><p>Unlike the NVIDIA DGX, this physical node is architecturally transparent:</p><ul><li><p>No Local Switch: There is no equivalent to NVSwitch inside the tray.</p></li><li><p>Passthrough Design: Every single TPU chip has optical ports that bypass the board and connect directly to the massive data center fabric (the Optical Circuit Switches).</p></li></ul><p>The philosophy: The physical server tray is just a holder for the silicon. The software treats the tray as a simple socket.</p><h4>The Virtual Node: The TPU Pod</h4><p>Because the physical node is abstracted, the &#8220;Atomic Unit&#8221; for a Google engineer becomes the TPU Pod&#8212;a massive, flat mesh of thousands of chips.</p><p>Scaling the Matrix</p><p>Instead of building denser racks (Scale-Up), Google scales by adding more TPUs to the flat fabric (Scale-Out):</p><ul><li><p>TPU v4 Pod: 4,096 chips in a 3D Torus mesh</p></li><li><p>TPU v5p Pod: 8,960 chips in a 3D Torus mesh</p></li><li><p>TPU v7 &#8220;Ironwood&#8221;: 9,216 chips</p></li></ul><p>This represents a Pod size 16x larger than NVIDIA&#8217;s current maximum NVLink domain (576 GPUs in a GB200 NVL72 Pod). </p><p>However, with NVIDIA&#8217;s roadmap targeting 572 GPUs in a single Rubin Ultra NVL572 domain, the gap is narrowing&#8212;NVIDIA is converging on Google&#8217;s Pod-scale integration through the inverse path of extreme rack-scale density.</p><h4><strong>The Interconnect: ICI (Inter-Chip Interconnect)</strong></h4><p>This virtual node relies on Google&#8217;s proprietary ICI fabric:</p><ul><li><p>Bandwidth: Each TPU v5p has 600 GB/s of ICI bandwidth (compared to NVIDIA&#8217;s 1.8 TB/s).</p></li><li><p>Topology: These links connect each chip directly to its 6 neighbors in the mesh (north, south, east, west, up, down).</p></li><li><p>The Trade-off: Google accepts lower per-chip bandwidth in exchange for a massive, flat fabric that requires no hierarchy.</p></li></ul><h4><strong>The Optical Circuit Switch (OCS): The Physical Fabric</strong></h4><p>Unlike NVIDIA&#8217;s electrical NVLink switches, Google&#8217;s ICI relies on Optical Circuit Switching (OCS)&#8212;a fundamentally different networking technology:</p><ul><li><p>Physical Mechanism: The OCS uses arrays of tiny, steerable mirrors (MEMS mirrors) to redirect beams of light. Each mirror can physically rotate to point a laser beam from any input fiber to any output fiber, creating a direct optical path between two TPU chips.</p></li><li><p>Circuit-Switched Architecture: Unlike traditional packet-switched networks (like InfiniBand), the OCS does not forward packets. Instead, it establishes dedicated optical circuits&#8212;direct &#8220;light pipes&#8221; between TPU pairs. Once a circuit is established, data flows at full speed with zero switching overhead.</p></li><li><p>Static Topology: The 3D Torus topology is programmed into the OCS at the start of a training job. The mirrors are set to specific angles, creating the mesh pattern, and they remain in that configuration for the duration of the workload. 
This is why Google&#8217;s approach is called &#8220;deterministic&#8221;&#8212;the network topology is fixed and known in advance.</p></li><li><p>The Reconfiguration Trade-off: Physically rotating MEMS mirrors takes milliseconds&#8212;far too slow for dynamic packet routing. However, because AI training workloads are predictable and long-running (days to weeks), this static topology is ideal. The XLA compiler knows the exact mesh structure and schedules all data movement accordingly.</p></li></ul><p>This is the critical innovation that enables Google&#8217;s &#8220;virtual node&#8221; concept: the physical server tray is irrelevant because every TPU is wired directly into the fabric through pre-configured optical paths to its neighbors. There is no hierarchy, no packet-switch hops, no routing decisions&#8212;just dedicated optical circuits forming a massive, flat mesh.</p><h4><strong>Why Lower Bandwidth Works</strong></h4><p>At first glance, 600 GB/s (TPU) versus 1.8 TB/s (GPU) seems like a massive disadvantage. However, the OCS architecture delivers advantages that offset the lower per-chip bandwidth:</p><ol><li><p>Uniform Connectivity: In a 3D Torus, every chip sees an identical neighborhood: six direct links with wraparound, so no chip sits at an &#8220;edge&#8221; of the mesh. There are no &#8220;fast&#8221; chips and &#8220;slow&#8221; chips, and the hop count between any pair of chips is bounded and known in advance, creating predictable latency.</p></li><li><p>Zero Switch Overhead: Because the OCS establishes dedicated optical circuits, there is no packet switching, no routing lookups, no buffer management, and no congestion control. The full 600 GB/s bandwidth is available without protocol overhead.</p></li><li><p>Deterministic Routing: Because the topology is fixed and known at compile time, XLA can schedule data movement with perfect precision, eliminating congestion. The compiler calculates the exact nanosecond when each data transfer will occur, ensuring no two transfers ever collide on the same optical path.</p></li><li><p>Scale Advantage: The ability to keep 9,216 chips in a single domain means that massive workloads never leave the optical fabric&#8212;they don&#8217;t hit the slower Tier-2 network that even NVIDIA&#8217;s largest future clusters will require beyond 576 GPUs.</p></li></ol><p>The result: Google&#8217;s lower per-chip bandwidth delivers higher effective utilization because the OCS eliminates the inefficiencies of packet-switched networks.</p><h3>4. Liquid Cooling the Chip</h3><p>While NVIDIA positioned liquid cooling as a recent breakthrough necessary for the Blackwell generation, Google has been using liquid cooling in production data centers since 2017. However, their approach differs significantly from NVIDIA&#8217;s integrated rack-scale design.</p><p><strong>TPU v2/v3 (2017-2018): The Liquid Cooling Transition</strong></p><p>With TPU v2, Google adopted liquid cooling at the board level&#8212;each TPU server tray includes integrated cold plates, but unlike NVIDIA&#8217;s GB200 NVL72, the cooling infrastructure is not monolithically integrated into the rack. The TPU trays remain modular with quick-disconnect fittings for coolant lines, while cooling distribution units (CDUs) are rack-mounted but separate from compute trays.</p><p><strong>TPU v4/v5p/v7 (2021-Present): Advanced Liquid Cooling</strong></p><p>The current generation continues with liquid cooling at higher power densities. 
TPU v5p chips consume approximately 275W, while TPU v7 &#8220;Ironwood&#8221; is estimated at 400-600W per chip given its 10x performance leap.</p><p><strong>The Key Difference: Modular vs. Monolithic</strong></p><p>The fundamental distinction reflects their broader architectural strategies:</p><p><strong>Google&#8217;s Approach (Modular):</strong></p><ul><li><p>Each TPU tray is an independent, liquid-cooled unit with quick-disconnect fittings</p></li><li><p>The rack is just a physical shelf; cooling infrastructure is separate from compute hardware</p></li><li><p>Failed trays can be hot-swapped without draining the rack</p></li><li><p>Philosophy: Liquid cooling is a utility provided by data center infrastructure</p></li></ul><p><strong>NVIDIA&#8217;s Approach (Integrated):</strong></p><ul><li><p>GB200 NVL72: Cooling system built directly into the 42U rack</p></li><li><p>Entire rack ships as one pre-integrated unit with coolant distribution, pumps, and manifolds</p></li><li><p>Philosophy: Liquid cooling is part of the product</p></li></ul><h3><strong>5. The Engineering Trade-offs</strong></h3><h4><strong>NVIDIA: Granularity and Modularity (The Rack-Scale Node)</strong></h4><p>The NVL72 / NVL144 / NVL576 architectures prioritize Composable Scale. Even though the &#8220;node&#8221; has grown from a server to a rack (or row), it remains a flexible building block.</p><ul><li><p>Elastic Deployment: A data center can deploy a single GB200 NVL72 (72 GPUs) for inference, or connect eight of them into a 576-GPU NVLink pod for training. You are not forced to deploy a warehouse-sized mesh to get started.</p></li><li><p>Physical Segmentation: The system respects physical boundaries. A 72-GPU rack is a discrete thermal and power domain. If one rack fails or needs maintenance, it can be isolated without destabilizing a 10,000-chip cluster.</p></li><li><p>Multi-Tenancy: This architecture is ideal for cloud providers who need to serve diverse workloads&#8212;allocating one rack to Team A for Llama 3 training, and another to Team B for Claude inference.</p></li></ul><p>Verdict: Ideal for Enterprise AI and Cloud, where flexibility and incremental scaling are critical.</p><h4><strong>Google: Integration and Unity (The TPU Pod)</strong></h4><p>The TPU Pod architecture prioritizes Maximum Integration. It treats the entire warehouse as a single device.</p><ul><li><p>The &#8220;Zero-Boundary&#8221; Fabric: In a 9,216-chip Ironwood Pod, the workload stays entirely within the optical ICI fabric. There is no &#8220;performance cliff&#8221; every 72 or 576 chips; the mesh is continuous.</p></li><li><p>Simplified Programming: To the XLA compiler, the Pod looks like one giant 9,000-chip processor. Developers don&#8217;t need to manually partition the model across separate racks; the compiler handles the data flow across the entire mesh.</p></li><li><p>Stability via Rigidity: Because the topology is fixed and uniform, the system is less flexible but more predictable. 
A broken chip is treated as a &#8220;bad sector&#8221; on a hard drive&#8212;mapped out by software without disrupting the massive job.</p></li></ul><p>Verdict: Ideal for Frontier Model Training (Gemini, Claude), where a single massive workload monopolizes the entire cluster for months.</p><h4><strong>The Convergence</strong></h4><p>Interestingly, both approaches are converging toward similar goals through different paths, now that both companies are doing their version of both scaling up and scaling out:</p><ul><li><p>NVIDIA: The GB200 NVL72 effectively creates a &#8220;mini-Pod&#8221; of 72 GPUs. By 2027, the Rubin Ultra NVL576 will expand this to 576 GPUs&#8212;closing the gap with Google&#8217;s integration. It mimics the TPU Pod&#8217;s unity but retains the rack-based modularity.</p></li><li><p>Google: The TPU v7 &#8220;Ironwood&#8221; represents a 10x leap in per-chip performance. By chasing NVIDIA-class density, Google is reducing the <em>number</em> of chips needed to do the same work, effectively making their massive mesh more potent per square meter.</p></li></ul><p>The Key Difference: From a networking perspective, NVIDIA&#8217;s approach remains Modular (the Rack is the unit), while Google&#8217;s approach remains Monolithic (the Pod is the unit).</p><h3>6. Conclusion: Two Definitions of the Node</h3><p>The &#8220;compute node&#8221; has bifurcated to meet different strategic goals:</p><p><strong>NVIDIA defines the node by physics.</strong> They compress more power into a defined physical volume (Server &#8594; Rack &#8594; Multi-Rack), using copper and liquid cooling to maximize density. The result is the Super-Node:</p><ul><li><p>Today: GB200 NVL72 (72 GPUs, 31 TB unified memory)</p></li><li><p>2026: Vera Rubin NVL144 (144 GPUs)</p></li><li><p>2027: Vera Rubin Ultra NVL576 (576 GPUs, 5,000 PetaFLOPS)</p></li></ul><p><strong>Google defines the node by fabric.</strong> They expand the network to encompass more chips, rendering the physical packaging architecturally irrelevant. The result is the Super-Matrix (TPU Pod)&#8212;a 9,216-chip optical mesh in which every chip occupies an identical position in the torus.</p><p>Both philosophies solve the same problem&#8212;training trillion-parameter models&#8212;through inverse strategies:</p><ul><li><p>NVIDIA says: &#8220;Build ultra-dense islands, then connect them.&#8221;</p></li><li><p>Google says: &#8220;Build a massive, seamless fabric that eliminates the islands.&#8221;</p></li></ul><p>By 2027, NVIDIA&#8217;s largest &#8220;island&#8221; (Vera Rubin Ultra NVL576 with 576 GPUs) will approach the scale of a small supercomputer, while Google&#8217;s &#8220;fabric&#8221; will continue to seamlessly connect over 9,000 chips. 
The two approaches will still differ in granularity and deployment philosophy, but both will deliver multi-exaFLOP capability within a single coherent memory domain.</p><p>Regardless of whether the node is a physical monolith or a virtual matrix, these massive systems share a common vulnerability: they are composed of millions of parts, and parts break.</p><p>In the next article, <strong>&#8220;Article 6: Two Network Topologies,&#8221;</strong> we will examine how NVIDIA&#8217;s hierarchical NVLink fabric and Google&#8217;s flat ICI mesh handle the physics of data movement at gigawatt scale.</p>]]></content:encoded></item><item><title><![CDATA[Two Software Foundations for Scale]]></title><description><![CDATA[The Gigawatt Machine: NVIDIA, Google, and the Engineering of Scale 4/12]]></description><link>https://www.gigawattmachine.com/p/two-software-foundations-for-scale</link><guid isPermaLink="false">https://www.gigawattmachine.com/p/two-software-foundations-for-scale</guid><dc:creator><![CDATA[Tony Wan]]></dc:creator><pubDate>Tue, 09 Dec 2025 13:38:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vqus!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59867208-fae5-40b9-a570-54b2271d292a_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>1. Introduction</h3><p>The B200 and TPU v7 we examined in Article 3 represent the pinnacle of silicon density. But to unlock their trillion-FLOP potential, they require fundamentally different software philosophies: Kernel-Centric (NVIDIA) and Compiler-Centric (Google).</p><p>To understand the difference, we must look at the hierarchy of how software talks to hardware.</p><p>In any computer, there is a gap between the high-level code a human writes (Python) and the low-level electrical signals a chip understands. There are two competing software foundations to bridge this gap.</p><h4>The Kernel-Centric Approach (NVIDIA) </h4><p>In this model, the software controls the hardware step-by-step.</p><ul><li><p>How it works: The high-level software breaks the AI model down into a sequence of individual operations (e.g., &#8220;Multiply Matrix A by Matrix B,&#8221; then &#8220;Save to Memory&#8221;).</p></li><li><p>The &#8220;Kernel&#8221;: For each operation, the system calls a specific, pre-written mini-program called a &#8220;Kernel&#8221; (from libraries like cuDNN). The GPU executes this kernel, reports back, and waits for the next command.</p></li><li><p>The Result: This provides maximum flexibility. Because the software manages every step individually, researchers can easily change the model architecture on the fly without breaking the system.</p></li></ul><h4>The Compiler-Centric Approach (Google) </h4><p>In this model, the software controls the hardware by creating a comprehensive plan upfront.</p><ul><li><p>How it works: Instead of sending commands one by one, the high-level software sends the entire mathematical graph of the AI model to a Compiler (XLA).</p></li><li><p>The Optimization: The compiler analyzes the whole program at once. It looks for efficiencies&#8212;like combining three separate math steps into a single hardware action&#8212;and generates a single, monolithic binary file.</p></li><li><p>The Result: This provides maximum efficiency. 
By planning the entire route before the data moves, the compiler eliminates the pause between steps, but it makes the system more rigid and harder to change during runtime.</p></li></ul><p>Understanding these two stacks is critical because they dictate how an AI Supercomputer is architected, how models are coded, and how performance is unlocked at the silicon level.</p><h3>2. The Foundation: CUDA vs. XLA</h3><p>The deepest point of divergence lies in how the developer interacts with the silicon.</p><h4>NVIDIA: The CUDA &#8220;Kernel&#8221; Approach</h4><p>The foundation of the NVIDIA stack is CUDA (Compute Unified Device Architecture). It allows developers to write &#8220;kernels&#8221;&#8212;C++ functions that execute directly on the GPU&#8217;s parallel cores. This model is imperative: the developer (or library author) explicitly manages memory allocation, thread synchronization, and data movement.</p><ul><li><p>The Mechanism: Most researchers never write raw CUDA. Instead, they leverage libraries like cuDNN (CUDA Deep Neural Network library). These contain hand-written, assembly-optimized kernels for specific operations (e.g., FlashAttention-3 or FP8 Matrix Multiplication). When a framework like PyTorch executes <code>torch.matmul</code>, it dispatches these pre-compiled binaries to the GPU.</p></li><li><p>The Trade-off: This requires significant engineering effort to hand-tune kernels for each new GPU architecture (e.g., Hopper vs. Blackwell), but it offers maximum flexibility for dynamic workloads.</p></li></ul><h4>Google: The XLA &#8220;Compiler&#8221; Approach</h4><p>Google&#8217;s TPUs (Tensor Processing Units) leverage XLA (Accelerated Linear Algebra). Unlike CUDA, XLA is not a programming language but a domain-specific compiler. The developer does not write kernel code for a TPU.</p><ul><li><p>The Mechanism: Frameworks like JAX emit a computational graph. XLA analyzes the entire graph at once. It performs whole-program optimization, fusing multiple operations (e.g., &#8220;Add + Multiply + Activation&#8221;) into a single hardware instruction to minimize memory access.</p></li><li><p>The Trade-off: This approach is deterministic. The compiler calculates the exact clock cycle for every memory movement <em>before</em> the program runs. This eliminates the runtime overhead of managing threads but makes the system rigid; dynamic shapes or conditional branching can trigger costly re-compilations.</p></li></ul><h4>Verdict: When is each approach superior?</h4><ul><li><p>CUDA Wins: When Algorithm Velocity is the priority. If you are inventing new layer types daily (e.g., sparse mixture-of-experts, state-space models), CUDA allows you to write a custom kernel and run it immediately without waiting for a compiler to understand the new math.</p></li><li><p>XLA Wins: When Production Efficiency is the priority. If your model architecture is stable (e.g., a standard Transformer), XLA can squeeze more performance out of the silicon by fusing operations that a human developer might miss.</p></li></ul><h4><strong>Practical Implication for Trillion-Parameter Models</strong> </h4><p>For frontier model training, the &#8220;Kernel&#8221; approach (NVIDIA) currently dominates because research velocity is paramount. Debugging a compilation error on a 10,000-chip XLA cluster can take days. </p><p>However, once a model architecture is frozen, the XLA approach offers a theoretical path to lower training costs by maximizing Model Flops Utilization (MFU).</p><h3>3. 
How Close to the Metal is CUDA vs XLA?</h3><p>To understand the difference between these ecosystems, we must visualize the layers of abstraction between the programmer&#8217;s Python code and the silicon transistors.</p><h4><strong>The NVIDIA Stack (The Layered Hierarchy)</strong> </h4><p>CUDA is not the absolute lowest level of access to an NVIDIA GPU, but it is the lowest level accessible to software developers. It relies on a stack of layers to translate code into action.</p><ul><li><p>Bare Metal: The physical hardware (Silicon, Transistors, Memory).</p></li><li><p>Microcode/Firmware: Internal low-level instructions that control voltage and basic operations. (Inaccessible to humans).</p></li><li><p>PTX &amp; SASS: The assembly languages the GPU actually reads. The driver translates CUDA into these machine instructions.</p></li><li><p>CUDA: The programming layer. It sits on top of the driver, allowing developers to write C++ code that explicitly manages the GPU&#8217;s memory and processor threads.</p></li><li><p>Libraries (cuDNN): Collections of pre-written, highly optimized CUDA code for specific math tasks.</p></li><li><p>Frameworks (PyTorch): The user-friendly layer that calls the libraries.</p></li></ul><h4><strong>The Google Stack (The Direct Compilation)</strong> </h4><p>The XLA approach is flatter. It removes the intermediate &#8220;programming layer&#8221; (CUDA) entirely.</p><ul><li><p>Bare Metal: The physical TPU hardware.</p></li><li><p>VLIW Machine Code: The TPU requires massive, complex instructions called &#8220;Very Long Instruction Words&#8221; that control hundreds of chip components simultaneously. This code is too complex for humans to write manually.</p></li><li><p>XLA (The Compiler): Instead of a human writing code to manage the chip, the XLA compiler translates the high-level math from the framework directly into the machine code required by the hardware.</p></li></ul><p><strong>The Comparison:</strong></p><ul><li><p>CUDA is &#8220;Close to the Metal&#8221; via Control: It allows the developer to manually manage the hardware resources. You explicitly define how to split the work across the chip&#8217;s cores and which memory banks to use.</p></li><li><p>XLA is &#8220;Close to the Metal&#8221; via Optimization: It allows the software to generate a monolithic machine-code program that perfectly matches the hardware&#8217;s physical layout and timing, without human intervention.</p></li></ul><h3>4. Multi-Chip Communication: NCCL vs. ICI</h3><p>Training large models requires distributing the workload across thousands of chips. The software that orchestrates this data movement differs fundamentally between the two ecosystems.</p><h4>NVIDIA: NCCL (Hierarchical &amp; Dynamic)</h4><p>NCCL (NVIDIA Collective Communication Library), pronounced &#8220;nickel&#8221;, is a topology-aware library designed to navigate the complex, two-tier hierarchy of GPU clusters. It performs runtime discovery of the network topology to route data.</p><ol><li><p>Tier 1 (Intra-Node/Rack): Inside a GB200 NVL72 rack, NCCL utilizes NVLink 5.0 to move data at 1.8 TB/s (bidirectional).</p></li><li><p>Tier 2 (Inter-Node): Between racks, NCCL switches protocols to use InfiniBand or Ethernet (RoCEv2) via ConnectX-7/8 NICs.</p></li></ol><p>In-Network Reduction (SHARP): For the Blackwell generation, NCCL leverages SHARP. This offloads collective operations (like <code>AllReduce</code>) to the NVSwitch silicon itself. 
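</p><p>To make the collective concrete, here is a minimal sketch of the operation that SHARP offloads: an <code>AllReduce</code> issued through PyTorch&#8217;s <code>torch.distributed</code> API with the NCCL backend. The launch command, tensor shape, and script name are illustrative assumptions, not details taken from NVIDIA&#8217;s documentation.</p><pre><code># allreduce_sketch.py (hypothetical file name)
# Launch with: torchrun --nproc_per_node=8 allreduce_sketch.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank holds a "gradient"; all_reduce sums it across every GPU,
    # so all ranks end up with the same synchronized tensor.
    grad = torch.ones(1024, 1024, device="cuda") * dist.get_rank()
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    # NCCL chooses the transport underneath this single call: NVLink inside
    # the rack, InfiniBand or Ethernet between racks.
    if dist.get_rank() == 0:
        print(grad[0, 0].item())  # equals world_size * (world_size - 1) / 2

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
</code></pre><p>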
Instead of GPUs exchanging data and performing summation locally, the switch performs the math as the data passes through it, dramatically reducing latency.</p><h4>Google: ICI (Flat &amp; Static)</h4><p>Google TPUs are connected via a proprietary ICI (Inter-Chip Interconnect). Unlike NVIDIA&#8217;s hierarchical tree, ICI connects TPUs in a fixed, flat 3D Torus mesh.</p><ul><li><p>Compiler-Managed Routing: Because the hardware topology is static and known at compile time, XLA manages the communication. There is no dynamic routing protocol or packet headers. The compiler schedules data to move from Chip A to Chip B at specific clock cycles.</p></li><li><p>Performance: While ICI often has lower peak bandwidth than NVLink, the elimination of networking overhead results in extremely high utilization for predictable workloads.</p></li></ul><h4>When is each approach superior?</h4><ul><li><p>In cloud and rental environments (renting GPUs from Azure or CoreWeave, for example), you may not get a physically contiguous block of racks. NCCL can dynamically detect the topology and route around the fragmented allocation.</p></li><li><p>If you own the data center (like Google) and can guarantee a perfect, contiguous block of chips (such as a 16x16x16 cube), ICI eliminates the massive cost and power overhead of Ethernet/InfiniBand switches.</p></li></ul><h4><strong>Practical Implication for Trillion-Parameter Models</strong></h4><p>The &#8220;Flat Mesh&#8221; (ICI) is technically superior for 3D Parallelism (splitting a model across Data, Tensor, and Pipeline dimensions) because neighbors are always equidistant.</p><p>However, NVIDIA has closed this gap with the NVL72, which effectively creates a &#8220;mini-mesh&#8221; of 72 GPUs that mimics the TPU&#8217;s advantage while retaining the flexibility of NCCL for the broader cluster.</p><h3>5. NVIDIA&#8217;s NCCL Performance from Hopper to Blackwell</h3><p>NCCL&#8217;s performance is directly tied to the hardware it runs on. The advancements from the H100 to the B200 and the new GB200 NVL72 architecture showcase a massive leap in communication speed and efficiency.</p><h4>DGX H100 (Hopper)</h4><p>The H100 generation set the baseline for modern AI clusters:</p><ul><li><p>Intra-Node (within the 8-GPU server): NCCL utilizes 900 GB/s of total bidirectional bandwidth from 4th-generation NVLink.</p></li><li><p>Inter-Node (server-to-server): In multi-node scenarios using NDR InfiniBand networking, NCCL sustains communication speeds up to 400 Gb/s across nodes.</p></li></ul><h4>DGX B200 (Blackwell &#8220;Scale-Out&#8221;)</h4><p>The DGX B200 is the 8-GPU &#8220;scale-out&#8221; server successor. It doubles the performance on both network tiers:</p><ul><li><p>Intra-Node: NCCL leverages the 5th-generation NVLink, doubling the total bidirectional bandwidth to 1.8 TB/s (1800 GB/s) within the server.</p></li><li><p>Inter-Node: The DGX B200 uses 800 Gb/s networking (via InfiniBand Quantum-X800 or Ethernet Spectrum-X800). NCCL can sustain communication speeds up to 800 Gb/s across nodes, again doubling the previous capability.</p></li></ul><h4>GB200 NVL72 (Blackwell &#8220;Scale-Up&#8221;)</h4><p>The GB200 NVL72 rack-scale system represents a fundamental architectural shift, and NCCL has been redesigned to exploit it.</p><ul><li><p>A Single, Massive Fabric: NCCL no longer sees a small 8-GPU node. 
It sees a massive 72-GPU NVLink domain (which can scale to a 576-GPU pod) where <em>all</em> communication runs on 5th-gen NVLink, providing 1.8 TB/s of bandwidth to every GPU in the entire system.</p></li><li><p>In-Network Reduction (INR): This is the most significant performance innovation. The 5th-generation NVSwitch itself has compute engines. NCCL is able to offload the &#8220;reduce&#8221; (summation) part of the all-reduce operation directly into the network switch. Instead of GPUs waiting to receive data, perform math, and send it on, the switch performs the math as the data is in transit.</p></li></ul><p>This INR feature dramatically accelerates gradient synchronization, freeing up the B200 GPUs&#8217; compute cores to continue working on the next training step, which is a crucial factor in shortening training times for trillion-parameter models.</p><p>These hardware advancements, unlocked by NCCL, directly impact the most critical metric: total training time.</p><p>Note: Although NCCL is open-source, it is primarily developed and optimized by NVIDIA for their GPUs and networking hardware. Optimal functionality, especially for advanced features like 5th-Gen NVLink, In-Network Reduction, or GPUDirect RDMA, relies on proprietary NVIDIA drivers and libraries.</p><h3>6. The High-Level Interface: PyTorch vs. JAX vs. TensorFlow</h3><p>While CUDA and XLA handle the heavy lifting, researchers work in Python. The choice of framework dictates not just the syntax, but how the software interacts with the underlying &#8220;Kernel&#8221; or &#8220;Compiler&#8221; philosophy.</p><h4>PyTorch (The NVIDIA Standard)</h4><p>PyTorch is the dominant framework for AI research and the training of most large language models (LLMs) today (including GPT-4 and Llama 3).</p><ul><li><p>Philosophy: &#8220;Eager Execution.&#8221; PyTorch runs code line-by-line, immediately executing the math on the GPU. This makes it feel like standard Python&#8212;easy to debug and flexible.</p></li><li><p>The Hardware Connection: It is deeply integrated with the Kernel-Centric (NVIDIA) stack. When you write <code>torch.matmul</code>, it immediately calls a specific pre-compiled CUDA kernel.</p></li><li><p>Evolution: To compete with compiler efficiencies, PyTorch 2.0 introduced <code>torch.compile</code>. This captures the model graph and uses TorchInductor to generate optimized kernels, blending its dynamic flexibility with compiler-like speed.</p></li></ul><h4>TensorFlow (The Hybrid)</h4><p>TensorFlow (TF), developed by Google, is the mature incumbent. It sits awkwardly between the two philosophies.</p><ul><li><p>Philosophy: Originally rigid and graph-based (TF 1.x), it pivoted to &#8220;Eager Execution&#8221; (TF 2.x) to match PyTorch&#8217;s ease of use. It now uses Keras as its high-level API, prioritizing user-friendliness.</p></li><li><p>The Hardware Connection: TensorFlow is a hybrid. It works excellently on NVIDIA GPUs (dispatching kernels), but because it is a Google product, it also integrates tightly with XLA. This allows it to run on TPUs, though with more overhead than JAX.</p></li><li><p>The Role: While less popular for <em>training</em> new frontier models today, TensorFlow remains massive in <em>production</em> environments due to its robust serving ecosystem (TFLite, TFServing).</p></li></ul><h4>JAX (The Compiler Native)</h4><p>JAX is the modern successor to TensorFlow&#8217;s original vision, stripped of the bloat. 
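</p><p>As a concrete taste of the workflow this subsection describes, here is a minimal JAX sketch (an illustrative example assuming only that the <code>jax</code> library is installed, not code taken from Google): a pure function is traced once, handed to XLA as a whole graph, and then reused as a single compiled binary.</p><pre><code>import jax
import jax.numpy as jnp

@jax.jit  # trace the function once and compile the whole graph with XLA
def layer(w, x):
    # Matmul followed by an activation; because XLA sees the entire graph,
    # it is free to fuse these operations into a single kernel.
    return jax.nn.relu(jnp.dot(x, w))

key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (512, 512))
x = jax.random.normal(key, (8, 512))

y = layer(w, x)   # first call triggers tracing and XLA compilation
y = layer(w, x)   # later calls reuse the compiled binary
print(y.shape)    # (8, 512)
# The same function can be spread across many chips with jax.pmap or
# shard_map, discussed below.
</code></pre><p>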
It is the &#8220;purest&#8221; expression of the Compiler-Centric philosophy.</p><ul><li><p>Philosophy: &#8220;Function Transformations.&#8221; JAX is not a neural network library; it is a math library that supports hardware acceleration. It forces a functional programming style (stateless, pure math).</p></li><li><p>The Hardware Connection: JAX is designed specifically to feed the XLA compiler. It uses Just-In-Time (JIT) compilation: it traces your Python function, compiles it into a single binary, and executes it on the TPU (or GPU).</p></li><li><p>The Superpower: Because it is pure math, JAX makes parallelization trivial. APIs like <code>pmap</code> (parallel map) allow researchers to split a model across thousands of chips with a single line of code&#8212;a task that is notoriously difficult in PyTorch and TensorFlow.</p></li></ul><h4>How do they relate?</h4><ul><li><p>PyTorch vs. TensorFlow: These are direct competitors. Both offer a flexible, &#8220;Python-first&#8221; experience. PyTorch won the research war because its debugging is superior; TensorFlow holds the enterprise ground because of its deployment tools.</p></li><li><p>TensorFlow vs. JAX: This is a generational shift. JAX is effectively &#8220;TensorFlow done right&#8221; for high-performance computing. It discards the baggage of Keras and data loaders to focus entirely on generating the fastest possible XLA graph.</p></li></ul><h4>When is each approach superior?</h4><ul><li><p>PyTorch Wins: For Researcher Velocity. If a training run crashes, PyTorch points you to the exact line of Python code that failed. This is invaluable when debugging complex, billion-dollar training runs.</p></li><li><p>JAX Wins: For Scaling Elegance. JAX&#8217;s <code>pmap</code> (parallel map) and <code>shard_map</code> APIs allow you to describe how to split a model across 10,000 chips in just a few lines of code, whereas PyTorch often requires complex add-ons (like Megatron-LM) to handle distributed training state.</p></li></ul><h4><strong>Practical Implication for Trillion-Parameter Models</strong></h4><ul><li><p>Use PyTorch: If you want access to the largest ecosystem of open-source models, tutorials, and engineers. It is the safe, flexible choice for NVIDIA hardware.</p></li><li><p>Use JAX: If you are building a custom supercomputer (using TPUs) or if you need absolute maximum mathematical efficiency at extreme scale (50,000+ chips).</p></li><li><p>Use TensorFlow: Generally avoided for <em>new</em> frontier model training, but critical if you are integrating into a legacy enterprise pipeline that requires robust mobile or edge deployment.</p></li></ul><p>The industry has largely voted for PyTorch (used by OpenAI, Meta, xAI) simply because the talent pool of engineers who know it is vastly larger. </p><p>However, teams that are willing to pay the &#8220;JAX Tax&#8221; (learning a difficult new language) often report superior stability and efficiency at the extreme scale of 50,000+ chips.</p><h3>7. 
Summary: The Engineering Trade-Off</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r_Wk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e96720-dd33-4ee8-a66d-14bb2e4f1d06_928x872.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r_Wk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e96720-dd33-4ee8-a66d-14bb2e4f1d06_928x872.png 424w, https://substackcdn.com/image/fetch/$s_!r_Wk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e96720-dd33-4ee8-a66d-14bb2e4f1d06_928x872.png 848w, https://substackcdn.com/image/fetch/$s_!r_Wk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e96720-dd33-4ee8-a66d-14bb2e4f1d06_928x872.png 1272w, https://substackcdn.com/image/fetch/$s_!r_Wk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e96720-dd33-4ee8-a66d-14bb2e4f1d06_928x872.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r_Wk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e96720-dd33-4ee8-a66d-14bb2e4f1d06_928x872.png" width="928" height="872" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17e96720-dd33-4ee8-a66d-14bb2e4f1d06_928x872.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:872,&quot;width&quot;:928,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:245243,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://bigcompute.substack.com/i/148544210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e96720-dd33-4ee8-a66d-14bb2e4f1d06_928x872.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r_Wk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e96720-dd33-4ee8-a66d-14bb2e4f1d06_928x872.png 424w, https://substackcdn.com/image/fetch/$s_!r_Wk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e96720-dd33-4ee8-a66d-14bb2e4f1d06_928x872.png 848w, https://substackcdn.com/image/fetch/$s_!r_Wk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e96720-dd33-4ee8-a66d-14bb2e4f1d06_928x872.png 1272w, https://substackcdn.com/image/fetch/$s_!r_Wk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e96720-dd33-4ee8-a66d-14bb2e4f1d06_928x872.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>8. Conclusion</h3><p>Ultimately, the choice between these ecosystems represents a fundamental architectural decision rather than a binary right-or-wrong answer. It is a choice of strategic priority.</p><p>The NVIDIA path prioritizes agility. It allows research teams to iterate rapidly, debug easily, and leverage the massive community support of PyTorch. It relies on raw hardware power to mask software inefficiencies, making it the ideal choice for teams exploring the unknown frontiers of model architecture.</p><p>The Google path prioritizes efficiency. It demands a more rigorous, structured approach to coding upfront, but rewards engineers with a system that executes with deterministic precision at scale. It transforms the data center into a predictable &#8220;math factory,&#8221; ideal for stable, massive-scale production workloads.</p><p>As we move forward, the lines are blurring&#8212;PyTorch is adopting compiler techniques, and XLA is becoming more flexible&#8212;but understanding the distinct DNA of these two stacks remains essential for any infrastructure engineer.</p><p>In our next article, <strong>Article 5, "The Compute Node",</strong> we will examine the physical manifestation of the NVIDIA and Google ecosystem: the compute node.</p>]]></content:encoded></item><item><title><![CDATA[Two Strategies for Maximum Density]]></title><description><![CDATA[The Gigawatt Machine: NVIDIA, Google, and the Engineering of Scale 3/12]]></description><link>https://www.gigawattmachine.com/p/two-strategies-for-maximum-density</link><guid isPermaLink="false">https://www.gigawattmachine.com/p/two-strategies-for-maximum-density</guid><dc:creator><![CDATA[Tony Wan]]></dc:creator><pubDate>Tue, 09 Dec 2025 13:34:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vqus!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59867208-fae5-40b9-a570-54b2271d292a_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3><strong>1. 
Introduction</strong></h3><p>In Article 1, we established that the &#8220;Gigawatt Machine&#8221; is defined by two competing architectural philosophies: NVIDIA&#8217;s hierarchical &#8220;Scale-Up&#8221; approach and Google&#8217;s flat &#8220;Scale-Out&#8221; approach.</p><p>These macroscopic architectures are not arbitrary; they are the direct expression of the microscopic silicon engines that power them. To understand why a data center looks the way it does, one must understand the limitations and capabilities of the processor at its core.</p><p>This article deconstructs the two primary engines of the AI era: the NVIDIA Blackwell B200 GPU and the Google TPU v7 &#8220;Ironwood.&#8221;</p><p>While both chips are purpose-built engines for massive matrix multiplication, they represent divergent engineering paths.</p><h3><strong>2. The Physics of the Problem: Three Levers of Density</strong></h3><p>For decades, Moore&#8217;s Law allowed engineers to double performance simply by shrinking transistors. That era has ended. Modern photolithography has hit a hard physical barrier known as the &#8220;reticle limit&#8221;: the maximum size of a chip that can be etched in a single exposure (roughly 858 mm&#178;).</p><p>Since engineers are nearing the limit of how many transistors they can squeeze onto a chip, they have three remaining levers to increase performance:</p><ol><li><p>The Process Lever (Transistor Density): The manufacturing process technology, measured in nanometers (nm). The &#8220;nm&#8221; label in a process node is shorthand for the generation of chipmaking technology. While historically it measured the physical size of a transistor, today it signifies a technology tier. Each step down the &#8220;node ladder&#8221; (e.g., 5nm &#8594; 4nm &#8594; 3nm) means increased compute density and higher performance.</p></li><li><p>The Packaging Lever (Silicon Area): Stitching multiple dies together to create a &#8220;Superchip&#8221; that physically exceeds the reticle limit. TSMC&#8217;s CoWoS (Chip-on-Wafer-on-Substrate) places multiple compute dies (GPU/TPU) and memory stacks (HBM) side-by-side on a massive silicon interposer containing thousands of ultra-dense wires (far denser than a standard circuit board).</p></li><li><p>The Precision Lever (Computational Density): Reducing the number of bits required for each calculation (e.g., 8-bit &#8594; 4-bit). Traditional High-Performance Computing (like weather simulation) requires 64-bit precision. However, neural nets are more noise-resilient; deciding whether the next word is &#8220;the&#8221; or &#8220;cat&#8221; doesn&#8217;t require seven decimal places. The industry is moving to FP8 for training frontier models, while FP4 is being used for inference.</p></li></ol><p>The Strategic Split:</p><p>In this generation, the two giants have pulled different levers.</p><ul><li><p>NVIDIA paused the Process lever to go &#8220;all-in&#8221; on packaging and precision.</p></li><li><p>Google pulled the process lever.</p></li></ul><h3><strong>3. 
The NVIDIA Strategy</strong></h3><p>NVIDIA&#8217;s Blackwell B200 represents a strategy of &#8220;architectural density.&#8221; Rather than moving to the 3nm process node, NVIDIA optimized the mature 4NP (4-nanometer) process to ensure high yields and reliability for a mass-market product.</p><h4><strong>The Dual-Die Design</strong></h4><p>The B200 is not a single chip. NVIDIA manufactured two massive dies and stitched them together into a single GPU using advanced CoWoS-L packaging. This &#8220;chiplet&#8221; approach allowed them to pack 208 billion transistors (vs. 80 billion in Hopper) into one unit, effectively doubling the silicon area available for compute without the yield risks of a larger monolithic die.</p><h4><strong>The GB200 Superchip</strong></h4><p>NVIDIA&#8217;s packaging innovation extends beyond just making the GPU bigger. With the GB200 Grace Blackwell Superchip, they have fundamentally altered the definition of a &#8220;processor.&#8221;</p><ul><li><p>The Innovation: Instead of plugging a GPU into a motherboard slot miles away (electronically speaking) from the CPU, NVIDIA packages two B200 GPUs and one Grace CPU onto a single board, fused by NVLink-C2C.</p></li><li><p>The Result: This is not just a &#8220;CPU next to a GPU&#8221;; it is a unified engine where the boundary between the two dissolves. By packaging them this tightly, NVIDIA eliminates the PCIe bottleneck entirely, allowing the GPUs to access the CPU&#8217;s massive LPDDR5X memory pool as their own.</p></li></ul><h4><strong>The Precision Shift: FP4</strong></h4><p>The most significant leap in Blackwell is the sheer density of math operations it can perform.</p><ul><li><p>H100 (Hopper): Delivered 4,000 TFLOPS of FP8 compute.</p></li><li><p>B200 (Blackwell): Doubles this to 8,000 TFLOPS of FP8 compute per GPU.</p></li><li><p>GB200 (Superchip): By fusing two B200s, the superchip delivers a staggering 16,000 TFLOPS of FP8 performance.</p></li></ul><p>Crucially, the B200 architecture introduces native support for FP4 (4-bit Floating Point) precision. By processing data with half the bits of the previous generation, the chip can process twice as much data per clock cycle, doubling throughput again for inference workloads.</p><h4><strong>The Future of Density: Rubin and Rubin Ultra</strong></h4><p>NVIDIA&#8217;s ambition doesn&#8217;t stop at the chip. The goal is to make the &#8220;GPU&#8221; synonymous with the &#8220;Data Center.&#8221;</p><ul><li><p>Today (Blackwell): The maximum size of a single NVLink domain is 72 GPUs (the GB200 NVL72). Beyond this, you must use a slower Tier-2 network.</p></li><li><p>Tomorrow (Rubin): With the upcoming Rubin (R100) architecture in 2026, NVIDIA plans to expand this domain to 144 GPUs.</p></li><li><p>The Horizon (Rubin Ultra): By 2027, the Rubin Ultra platform aims to connect 576 GPUs into a single, coherently addressed NVLink domain. This would effectively turn an entire aisle of racks into one massive, unified logic gate.</p></li></ul><h3><strong>4. The Google Strategy</strong></h3><p>With the TPU v7 &#8220;Ironwood,&#8221; instead of relying solely on fabric scale, Google aggressively chased per-chip density by moving to TSMC&#8217;s cutting-edge 3nm (N3P) process. This allowed them to pack significantly more transistors into a single compute die (~700mm&#178;) than was possible with 4nm technology.</p><ul><li><p>The Economics of Yield: Why use a more expensive, lower-yield process than NVIDIA? Vertical Integration. 
Because Google does not sell the TPU chip (they sell the <em>service</em>), they can tolerate higher manufacturing costs and lower yields per wafer. They don&#8217;t need to protect hardware margins in the same way a merchant supplier like NVIDIA does.</p></li></ul><p>Like Blackwell, the TPU v7 has moved beyond the monolithic die. It employs a dual-chiplet design, stitching together two massive compute dies and HBM memory stacks using advanced CoWoS packaging.</p><h3><strong>5. Head-to-Head: B200 vs. TPU v7</strong></h3><p>In an apples-to-apples comparison at FP8 precision, the two engines are remarkably close.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HEgZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb392d52c-b784-4846-9d0a-b0d6eeeb37c1_978x500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HEgZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb392d52c-b784-4846-9d0a-b0d6eeeb37c1_978x500.png 424w, https://substackcdn.com/image/fetch/$s_!HEgZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb392d52c-b784-4846-9d0a-b0d6eeeb37c1_978x500.png 848w, https://substackcdn.com/image/fetch/$s_!HEgZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb392d52c-b784-4846-9d0a-b0d6eeeb37c1_978x500.png 1272w, https://substackcdn.com/image/fetch/$s_!HEgZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb392d52c-b784-4846-9d0a-b0d6eeeb37c1_978x500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HEgZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb392d52c-b784-4846-9d0a-b0d6eeeb37c1_978x500.png" width="978" height="500" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b392d52c-b784-4846-9d0a-b0d6eeeb37c1_978x500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:978,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:133534,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.gigawattmachine.com/i/181026984?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb392d52c-b784-4846-9d0a-b0d6eeeb37c1_978x500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HEgZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb392d52c-b784-4846-9d0a-b0d6eeeb37c1_978x500.png 424w, https://substackcdn.com/image/fetch/$s_!HEgZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb392d52c-b784-4846-9d0a-b0d6eeeb37c1_978x500.png 848w, 
https://substackcdn.com/image/fetch/$s_!HEgZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb392d52c-b784-4846-9d0a-b0d6eeeb37c1_978x500.png 1272w, https://substackcdn.com/image/fetch/$s_!HEgZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb392d52c-b784-4846-9d0a-b0d6eeeb37c1_978x500.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>6. Conclusion</strong></h3><p>The B200 and TPU v7 represent two valid solutions to the end of Moore&#8217;s Law. While their per-chip performance is now effectively neck-and-neck (~4.6 PFLOPS), they differ radically in how they cluster that power.</p><p>At first glance, Google&#8217;s ability to connect 9,216 chips into a single domain seems vastly superior to NVIDIA&#8217;s limit of 72 GPUs (Blackwell) or 576 GPUs (Rubin Ultra). But this is not a case of &#8220;more is better&#8221;; it is a trade-off between Bandwidth Intensity and Fabric Extent.</p><h4>NVIDIA: Scale Up</h4><p>NVIDIA builds and connects ultra-dense islands.</p><ul><li><p>The Domain: A single NVLink domain (72 or 576 GPUs) is a &#8220;tight&#8221; cluster. Inside this island, bandwidth is massive (1.8 TB/s per chip) and memory is fully unified. It is effectively one giant, coherently caching brain.</p></li><li><p>The Scale: To build a 100,000-chip supercomputer, NVIDIA connects ~1,400 of these islands together using InfiniBand.</p></li><li><p>The Advantage: This modularity allows for extreme flexibility. You can build a supercomputer of <em>any</em> size by simply adding more racks.</p></li></ul><h4>Google: Scale Out</h4><p>Google builds massive, seamless meshes.</p><ul><li><p>The Domain: A single TPU Pod (9,216 chips) is a &#8220;wide&#8221; cluster. 
The bandwidth per chip is lower (~1.2 TB/s), but the fabric extends much further without hitting a switch.</p></li><li><p>The Scale: To build a 100,000-chip supercomputer, Google connects ~11 of these massive pods together using Optical Circuit Switching.</p></li><li><p>The Advantage: This minimizes the &#8220;performance cliff&#8221; for massive models, as nearly the entire training run can stay within the optical fabric.</p></li></ul><p>The Verdict:</p><p>They are neck-and-neck because they arrive at the same destination&#8212;the Gigawatt Machine&#8212;via different granularities.</p><ul><li><p>NVIDIA stacks thousands of ultra-dense blocks.</p></li><li><p>Google weaves a few massive sheets.</p></li></ul><p>Both architectures have successfully continued to scale even though the reticle limit has been reached.</p><p>But raw silicon power is useless without software to orchestrate it. In <strong>Article 4, &#8220;Two Software Foundations for Scale&#8221;</strong>, we examine how NVIDIA&#8217;s CUDA and Google&#8217;s XLA translate human intent into hardware action.</p><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[Two Silicon Foundations for Scale]]></title><description><![CDATA[The Gigawatt Machine: NVIDIA, Google, and the Engineering of Scale 2/12]]></description><link>https://www.gigawattmachine.com/p/two-silicon-foundations-for-scale</link><guid isPermaLink="false">https://www.gigawattmachine.com/p/two-silicon-foundations-for-scale</guid><dc:creator><![CDATA[Tony Wan]]></dc:creator><pubDate>Tue, 09 Dec 2025 13:33:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d7d63a5d-6e10-4fb9-a096-620d41842ee1_2048x2048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>1. Introduction</h3><p>To build a machine capable of training a trillion-parameter models, you need silicon designed for one specific purpose: massive, parallel matrix multiplication, collectively known as AI Accelerators.</p><p>While the industry often uses &#8220;GPU&#8221; as a catch-all term for these chips, the reality is a tale of two distinct lineages. To understand the Gigawatt Machine, we must trace the evolution of the two dominant families that power it:</p><ol><li><p>The GPU (NVIDIA): A flexible, general-purpose parallel processor that evolved from rendering video games to training AI.</p></li><li><p>The TPU (Google): A specialized, domain-specific ASIC (Application-Specific Integrated Circuit) built from a blank sheet of paper to optimize matrix math.</p></li></ol><p>This article traces the architectural history of these two families, defining the vocabulary&#8212;Tensor Cores, HBM, and Systolic Arrays&#8212;that you will need to understand the engineering deep dives in the rest of this series.</p><h3><strong>2. Key Performance Definitions</strong></h3><p>Regardless of the architecture (GPU or TPU), three metrics define the power of an accelerator. We will reference these throughout the series:</p><p><strong>1. FLOPS (Compute)</strong> How much math the chip can do per second.</p><ul><li><p>Example (NVIDIA B200): Delivers ~4,500 TeraFLOPS (FP8).</p></li><li><p>Example (Google TPU v7): Delivers ~4,614 TeraFLOPS (FP8).</p></li><li><p>Note how closely matched these two engines are when running the same 8-bit math.</p></li></ul><p><strong>2. 
HBM (Memory Capacity)</strong> How big a model fits on the chip.</p><ul><li><p>Example (NVIDIA B200): Packs 192 GB of HBM3e.</p></li><li><p>Example (Google TPU v7): Matches this with 192 GB of HBM3e.</p></li><li><p>Why it matters: If the model doesn&#8217;t fit in HBM, it must be split across chips. Since both chips have the same capacity, they can hold the same size model slices, making the <em>network</em> the deciding factor for performance.</p></li></ul><p><strong>3. Memory Bandwidth (Speed)</strong> How fast data moves from memory to the compute cores.</p><ul><li><p>Example (NVIDIA B200): 8 TB/s.</p></li><li><p>Example (Google TPU v7): ~7.4 TB/s.</p></li><li><p>Why it matters: This is often the true bottleneck in AI training. A fast chip with slow memory spends most of its time idling.</p></li></ul><h3>3. Evolution to <em>Lower</em> Precision</h3><p>In traditional High-Performance Computing (like weather simulation), precision is everything (64-bit). In AI, the rules are inverted. Neural networks are surprisingly resilient to noise; they do not need 7 decimal places to decide if an image is a &#8220;Cat.&#8221; They just need &#8220;directional accuracy.&#8221;</p><p>By using fewer bits to represent a number (Quantization), we gain two massive advantages:</p><ol><li><p>Memory Bandwidth: Sending a 4-bit number moves 8x faster over the wire than a 32-bit number.</p></li><li><p>Compute Density: You can pack 4x as many 4-bit calculators into the same silicon area.</p></li></ol><h3><strong>4. The NVIDIA GPU: The &#8220;Simultaneous&#8221; Machine</strong></h3><p>The NVIDIA GPU architecture is defined by SIMT (Single Instruction, Multiple Threads).</p><p>In a traditional CPU (Sequential Processing), one core executes one instruction on one piece of data at a time. In a GPU, a single instruction controller drives thousands of cores simultaneously.</p><p>The Mechanism: The GPU groups threads into bundles that execute in lockstep. A single instruction drives thousands of active data lanes simultaneously, allowing the vast majority of the silicon budget to be spent on raw math.</p><p>The Legacy: This architecture was originally engineered for graphics to calculate the color of millions of independent pixels on a screen. However, this massive data parallelism proved mathematically indistinguishable from the needs of Deep Learning: performing the same matrix operation on millions of floating-point numbers simultaneously.</p><p>Over the last decade, NVIDIA has evolved this architecture through three defining eras, shifting the focus from &#8220;Graphics&#8221; to &#8220;AI First.&#8221;</p><h4><strong>The Ampere Era (A100)</strong></h4><p>The A100 was the chip that industrialized deep learning. It introduced the third-generation Tensor Core&#8212;a specialized sub-unit inside the GPU designed specifically to accelerate dense matrix math.</p><ul><li><p>It established HBM2e as the standard for memory, delivering 1.5 TB/s of bandwidth.</p></li><li><p>The A100 remains the backbone of many inference fleets today, but its lack of native FP8 support limits its efficiency for modern Large Language Models (LLMs).</p></li></ul><h4><strong>The Hopper Era (H100 &amp; H200)</strong></h4><p>With the H100, NVIDIA realized that AI models were becoming resilient enough to run on lower precision. They introduced the Transformer Engine, which dynamically adjusts calculations to 8-bit (FP8) precision.</p><ul><li><p>Lower precision effectively doubled the throughput for LLMs without increasing the chip size. 
The H100 delivered 4,000 TFLOPS of FP8 compute (vs. ~600 TFLOPS of FP16 on the A100).</p></li><li><p>The Memory Upgrade: The H100 utilized HBM3 (3 TB/s bandwidth). Its mid-cycle refresh, the H200, upgraded this to 141 GB of HBM3e running at 4.8 TB/s, allowing larger models to fit on a single chip.</p></li></ul><h4><strong>The Blackwell Era (B200)</strong></h4><p>Blackwell represents the current frontier. It is not just a bigger chip; it is a platform designed to be stitched together.</p><ul><li><p>Precision: It introduces native FP4 (4-bit) support, doubling the raw throughput again to 9,000+ TFLOPS (FP4).</p></li><li><p>Bandwidth: It features 8 TB/s of memory bandwidth and 1.8 TB/s of NVLink interconnect speed, essential for the rack-scale architectures we will discuss later.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S_72!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f00fc9-8f83-4387-99c6-6ed9fb44c741_1288x824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S_72!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f00fc9-8f83-4387-99c6-6ed9fb44c741_1288x824.png 424w, https://substackcdn.com/image/fetch/$s_!S_72!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f00fc9-8f83-4387-99c6-6ed9fb44c741_1288x824.png 848w, https://substackcdn.com/image/fetch/$s_!S_72!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f00fc9-8f83-4387-99c6-6ed9fb44c741_1288x824.png 1272w, https://substackcdn.com/image/fetch/$s_!S_72!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f00fc9-8f83-4387-99c6-6ed9fb44c741_1288x824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S_72!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f00fc9-8f83-4387-99c6-6ed9fb44c741_1288x824.png" width="1288" height="824" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62f00fc9-8f83-4387-99c6-6ed9fb44c741_1288x824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:824,&quot;width&quot;:1288,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:297694,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://bigcompute.substack.com/i/148373366?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f00fc9-8f83-4387-99c6-6ed9fb44c741_1288x824.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S_72!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f00fc9-8f83-4387-99c6-6ed9fb44c741_1288x824.png 424w, 
<h3><strong>5. The Google TPU: The &#8220;Assembly Line&#8221; Machine</strong></h3><p>While NVIDIA GPUs dominate the merchant market, Google charted a different course starting in 2015. They realized that standard GPUs, carrying the legacy baggage of graphics rendering, were too inefficient for their scale. </p><p>The result is the Tensor Processing Unit (TPU)&#8212;an ASIC designed from the ground up for one specific workload: matrix multiplication.</p><h4><strong>The Heart of the TPU: The Systolic Array</strong></h4><p>The defining feature of the TPU is the <strong>Systolic Array</strong>.</p><ul><li><p>How it works: In a standard GPU, data is constantly moved from memory to registers for every calculation. In a Systolic Array, data flows through a massive grid of processing units like a &#8220;heartbeat&#8221; (systole). The output of one unit flows directly into the input of the next without writing back to memory.</p></li><li><p>The Advantage: This drastically reduces register access and power consumption, making the TPU inherently more power-efficient per operation than a general-purpose GPU.</p></li></ul>
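<p>The dataflow is easier to see in code than in prose. Below is a toy, cycle-by-cycle NumPy simulation of a <em>weight-stationary</em> systolic array: the weights sit still in the grid, activations stream in from the left, and partial sums trickle down and out of the bottom edge. The array size, variable names, and the weight-stationary choice are illustrative assumptions (real TPUs use large fixed-size arrays and far more sophisticated scheduling); the point is that intermediate results only ever move to a neighboring processing element, never back to memory.</p><pre><code>import numpy as np

def systolic_matmul(A, B):
    """Toy weight-stationary systolic array computing C = A @ B.

    B (K x N) is pre-loaded into a K x N grid of processing elements (PEs).
    Rows of A (M x K) stream in from the left edge, skewed in time; partial
    sums flow downward and drip out of the bottom edge, one diagonal per cycle.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    act = np.zeros((K, N))    # activation currently held by each PE
    psum = np.zeros((K, N))   # partial sum currently passing through each PE
    C = np.zeros((M, N))
    for t in range(M + N + K - 1):              # cycles until the array drains
        # 1. Bottom-row sums leave the array: column n finishes C[t-n-K, n].
        for n in range(N):
            m = t - n - K
            if M > m >= 0:
                C[m, n] = psum[K - 1, n]
        # 2. Partial sums advance one row down; zeros enter at the top.
        psum[1:] = psum[:-1].copy()
        psum[0] = 0.0
        # 3. Activations advance one column to the right.
        act[:, 1:] = act[:, :-1].copy()
        # 4. The left edge is fed a time-skewed row of A.
        for k in range(K):
            m = t - k
            act[k, 0] = A[m, k] if M > m >= 0 else 0.0
        # 5. Every PE multiplies its activation by its resident weight and
        #    accumulates into the partial sum flowing through it.
        psum += act * B
    return C

rng = np.random.default_rng(0)
A, B = rng.normal(size=(5, 4)), rng.normal(size=(4, 3))
assert np.allclose(systolic_matmul(A, B), A @ B)
</code></pre><p>In silicon the same five steps happen in parallel on every clock: the multiply-accumulate occurs as operands pass through each cell, which is where the power-efficiency claim above comes from.</p>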
<h4><strong>The Lineage of the TPU</strong></h4><p>Google&#8217;s silicon has evolved through three distinct eras, each solving a different bottleneck:</p><ul><li><p>v1 (2015): The Inference Engine. A simple, air-cooled chip designed solely to run search queries and AlphaGo. It could <em>run</em> models, but it could not <em>train</em> them.</p></li><li><p>v2/v3 (2017-2018): The Training Pivot. Google added High Bandwidth Memory (HBM) and floating-point capability, allowing TPUs to train models. TPU v3 introduced liquid cooling to the data center years before it was common in the merchant market.</p></li><li><p>v4 (2021): The Optical Era. This generation introduced Optical Circuit Switches (OCS), allowing 4,096 chips to be connected in a reconfigurable 3D Torus mesh. This architecture defined the modern &#8220;Pod&#8221; structure that challenges NVIDIA&#8217;s clusters today.</p></li><li><p>TPU v5p (2023): Expanded the pod size to 8,960 chips, doubling the per-chip bandwidth to support larger models.</p></li></ul><p>Google realized that &#8220;one size fits all&#8221; was inefficient, so they split their silicon strategy.</p><ul><li><p>TPU v5e / Trillium (v6e): Designed for high-volume inference (Search, YouTube) where performance-per-dollar matters most.</p></li><li><p>TPU v7 &#8220;Ironwood&#8221;: A massive, high-density chip designed for training and inference of frontier models like Gemini.</p></li></ul><h3>6. Conclusion</h3><p>Both NVIDIA and Google have pushed their respective architectures&#8212;the flexible GPU and the efficient TPU&#8212;to the absolute limit of what a single piece of silicon can do.</p><p>In the next article, <strong>&#8220;Article 3: Two Strategies for Maximum Density,&#8221;</strong> we will look at how engineers are breaking this physical barrier&#8212;using 3nm process nodes, Chiplet packaging, and 4-bit precision to build the monsters that power the Gigawatt Machine.</p>]]></content:encoded></item><item><title><![CDATA[Two Paths to Gigawatt Machines]]></title><description><![CDATA[The Gigawatt Machine: NVIDIA, Google, and the Engineering of Scale 1/12]]></description><link>https://www.gigawattmachine.com/p/two-anatomies-of-scale</link><guid isPermaLink="false">https://www.gigawattmachine.com/p/two-anatomies-of-scale</guid><dc:creator><![CDATA[Tony Wan]]></dc:creator><pubDate>Wed, 03 Dec 2025 03:45:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vqus!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59867208-fae5-40b9-a570-54b2271d292a_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>1. Introduction</h3><p>When AI Labs like OpenAI, Microsoft, Meta, and xAI announce gigawatt-scale data centers such as &#8220;Stargate,&#8221; &#8220;Fairwater,&#8221; &#8220;Prometheus,&#8221; and &#8220;Colossus,&#8221; they are describing a new class of computing: the AI Supercomputer. </p><p>These are not just large data centers; they are city-sized machines, purpose-built with hundreds of thousands of accelerators to train a single, massive AI model.</p><p>But what does an AI Supercomputer look like? There is no single answer. 
The industry has split into two competing philosophies on how to address the physics of scaling, driving two completely different physical anatomies.</p><ol><li><p><strong>Strategy A (NVIDIA):</strong> <strong>&#8220;Scale-Up the Node.&#8221;</strong> Increase the density of the chip and compress the computer into a dense monolith to make the wires shorter.</p></li><li><p><strong>Strategy B (Google):</strong> <strong>&#8220;Scale-Out the Fabric.&#8221;</strong> Connect thousands of simpler chips into a massive, uniform optical mesh.</p></li></ol><p>This 12-part series is a comprehensive guide to deconstructing these systems. </p><p>We will primarily examine the Google (TPU) and NVIDIA (GPU) ecosystems. This article explores the fundamental engineering trade-offs that define these two anatomies. </p><h3><strong>2. The Physics of the Problem: Density vs. Fabric</strong></h3><p>To train a trillion-parameter model, you must distribute the workload across thousands of chips. However, sending data between chips takes time (latency). To train faster, you must minimize this latency. Engineers have two levers to pull:</p><h4><strong>Lever 1: Extreme Density (The NVIDIA Strategy)</strong></h4><p>If you can pack more compute power into a smaller physical space, the electrons don&#8217;t have to travel as far.</p><ul><li><p>The Tactic: NVIDIA pushed the limits of physics to increase per-chip density. By moving to 4-bit precision (FP4), the Blackwell B200 GPU delivers a massive 9x generational leap in performance.</p></li><li><p>The Consequence: A chip this powerful becomes a &#8220;gravity well&#8221; that demands massive amounts of data instantly. It requires a hierarchical, ultra-fast network to feed it.</p></li></ul><h4><strong>Lever 2: Massive Fabric (The Google Strategy)</strong></h4><p>If you cannot compress the computer, you must build a faster network.</p><ul><li><p>The Tactic: Historically, Google accepted lower per-chip density (using standard precision), choosing instead to spread the workload across a much larger physical footprint.</p></li><li><p>The Consequence: To make this &#8220;power sprawl&#8221; work, they built a massive, flat, optical network (ICI) that connects thousands of chips directly, making the distance between them virtually irrelevant.</p></li></ul><h3><strong>3. Anatomy A: The &#8220;Super-Node&#8221; (NVIDIA)</strong></h3><p>NVIDIA&#8217;s pursuit of extreme density created the Hierarchical Architecture. The defining characteristic is the &#8220;Scale-Up Node&#8221;&#8212;a system where the atomic unit of the data center is no longer a server, but an entire rack.</p><h4><strong>The Physical Node: The GB200 NVL72</strong></h4><p>This is the new building block. It is not a server; it is a 120 kW, liquid-cooled, 72-GPU rack.</p><ul><li><p>The Density: By compressing 72 GPUs into a single cabinet, NVIDIA keeps them close enough to connect via Copper Cables. This passive copper backplane saves ~20kW of power per rack compared to optical transceivers.</p></li><li><p>The Topology: The rack functions as one single, massive accelerator with a 31 TB unified memory pool.</p></li></ul>
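<p>The &#8220;31 TB&#8221; figure is easy to sanity-check. The sketch below uses publicly quoted per-device numbers: 72 GPUs with 192 GB of HBM each (the figure used in Article 2), plus 36 Grace CPUs with 480 GB of LPDDR each. The CPU-side figures are assumptions drawn from public spec sheets rather than from this article.</p><pre><code># Where the "31 TB unified memory pool" roughly comes from.
GPUS, HBM_PER_GPU_GB = 72, 192       # B200-class GPUs and their HBM3e
CPUS, LPDDR_PER_CPU_GB = 36, 480     # Grace CPUs and their LPDDR (assumed)

hbm_tb = GPUS * HBM_PER_GPU_GB / 1000       # about 13.8 TB of HBM
lpddr_tb = CPUS * LPDDR_PER_CPU_GB / 1000   # about 17.3 TB of CPU-attached memory
print(f"unified pool: {hbm_tb + lpddr_tb:.1f} TB")   # ~31.1 TB

# Aggregate NVLink bandwidth inside the rack at 1.8 TB/s per GPU:
print(f"NVLink aggregate: {GPUS * 1.8:.0f} TB/s")    # ~130 TB/s
</code></pre>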
<h4><strong>The Network: Two-Tier Hierarchy</strong></h4><p>Because the node is so dense, the network must be hierarchical.</p><ul><li><p>Tier 1 (Intra-Pod): The NVLink Fabric. A proprietary, copper-based, packet-switched network that connects up to 576 GPUs (8 racks) into a single &#8220;Pod&#8221; (NVLink Domain). Inside this Pod, bandwidth is massive (1.8 TB/s), allowing for efficient Tensor Parallelism (splitting a model across chips).</p></li><li><p>Tier 2 (Inter-Pod): The Scale-Out Fabric. To build a supercomputer (like &#8220;Stargate&#8221;), you connect hundreds of these Pods using a standard InfiniBand or Ethernet network.</p></li></ul><p>The Look: A dense forest of &#8220;monoliths.&#8221; Extremely tall, heavy (3,700 lbs), liquid-cooled cabinets that require reinforced concrete floors and industrial-scale plumbing.</p><h3><strong>4. Anatomy B: The &#8220;Flat-Mesh&#8221; (Google)</strong></h3><p>Google and Amazon&#8217;s pursuit of fabric scale created the Flat-Fabric Architecture. The defining characteristic is the &#8220;Mesh&#8221;&#8212;a massive, continuous grid of accelerators.</p><h4><strong>The Physical Node: Abstracted</strong></h4><p>The physical node (a server tray with 4-8 chips) is small and architecturally irrelevant. The true unit of scale is the Pod.</p><h4><strong>The Network: Single-Tier Mesh</strong></h4><p>Instead of a hierarchy, the system is designed as one massive, uniform web.</p><ul><li><p>Tier 1 (Intra-Pod): The ICI Fabric. Google uses Optical Circuit Switches (OCS) to connect 8,960 TPUs (in a v5p Pod) into a 3D Torus Mesh.</p></li><li><p>The Difference: In this mesh, every chip connects directly to its neighbors using optical fibers. There is no central packet switch. The &#8220;Pod&#8221; is 15x larger than NVIDIA&#8217;s, meaning massive workloads can run without ever hitting a slower Tier-2 network.</p></li></ul><p>The Look: A sprawling &#8220;field.&#8221; Rows and rows of standard-height racks connected by a visible canopy of yellow optical fibers (the OCS fabric).</p>
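<p>What &#8220;every chip connects directly to its neighbors&#8221; means in practice is easiest to see with a toy address map. In a 3D torus, each chip sits at an (x, y, z) coordinate, has exactly six neighbors, and the edges wrap around, which keeps worst-case hop counts low without any central packet switch. The 16x16x16 shape below (4,096 chips) is an illustrative assumption, not an actual pod configuration.</p><pre><code># Toy 3D-torus address map: six neighbors per chip, wrap-around edges.
SHAPE = (16, 16, 16)   # 4,096 chips -- illustrative, not a real pod shape

def neighbors(x, y, z, shape=SHAPE):
    X, Y, Z = shape
    return [((x + 1) % X, y, z), ((x - 1) % X, y, z),
            (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
            (x, y, (z + 1) % Z), (x, y, (z - 1) % Z)]

def hops(a, b, shape=SHAPE):
    # Shortest path per axis: the direct route or the wrap-around route.
    return sum(min(abs(p - q), s - abs(p - q)) for p, q, s in zip(a, b, shape))

print(neighbors(0, 0, 0))             # corner chip wraps to index 15 on each axis
print(hops((0, 0, 0), (15, 15, 15)))  # 3 hops, thanks to the wrap-around links
print(hops((0, 0, 0), (8, 8, 8)))     # worst case in this shape: 24 hops
</code></pre><p>The wrap-around links are also what the OCS layer can re-patch when it carves the mesh into differently sized slices for different jobs.</p>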
<h3><strong>5. The &#8220;Brains&#8221; of the Anatomy: Software Stacks</strong></h3><p>The software must match the shape of the hardware. The divergence in network topology (Hierarchy vs. Mesh) forces a divergence in software strategy.</p><h4><strong>NVIDIA: Dynamic Runtime Orchestration (Library-Based)</strong></h4><p>NVIDIA&#8217;s network is packet-switched and dynamic. Data traffic is unpredictable and bursty.</p><ul><li><p>The Software: CUDA / cuDNN.</p></li><li><p>The Strategy: Runtime Flexibility. NVIDIA uses a &#8220;toolbox&#8221; of pre-compiled libraries. When congestion happens, smart switches and software (NCCL) adapt in real-time, routing packets around traffic jams. This &#8220;eager execution&#8221; model offers maximum flexibility for researchers.</p></li></ul><h4><strong>Google: Deterministic Static Scheduling (Compiler-Based)</strong></h4><p>Google&#8217;s network is circuit-switched and static. The OCS mirrors must be physically pointed to the right destination.</p><ul><li><p>The Software: XLA (Compiler).</p></li><li><p>The Strategy: Compile-Time Scheduling. The XLA compiler analyzes the <em>entire</em> AI model before it runs. It pre-calculates the exact path of every data packet and orchestrates a perfect, collision-free flow. It doesn&#8217;t react to traffic; it prevents it. This offers maximum efficiency for known, massive workloads.</p></li></ul><h3><strong>6. The Industrial Reality: Gigawatt Infrastructure</strong></h3><p>Regardless of the anatomy chosen, the sheer scale of these systems has forced a transition from &#8220;Data Center&#8221; to &#8220;Industrial Plant.&#8221;</p><ul><li><p>Power: A 100,000-chip cluster consumes hundreds of megawatts, and the largest announced build-outs push into gigawatts&#8212;roughly the output of a nuclear reactor powering a city. The primary engineering challenge shifts from IT administration to grid-scale energy logistics.</p></li><li><p>Cooling: Managing 120kW per rack (NVIDIA) or massive mesh density (Google) makes air cooling physically impossible. The facility becomes a massive hydraulic system, circulating millions of gallons of coolant to manage thermal loads.</p></li></ul><h3><strong>7. Conclusion: The Asymmetric Shift</strong></h3><p>The story of the next generation is one of divergence: NVIDIA stays the course, while Google pivots.</p><p>NVIDIA has remained consistent: build the most powerful, dense node possible, and then arrange those nodes in a hierarchy.</p><ul><li><p>The Evolution: They haven&#8217;t changed their philosophy; they&#8217;ve just scaled the physics. They went from an 8-GPU node (DGX) to a 72-GPU node (GB200 NVL72), creating a &#8220;Super-Node&#8221; that is 9x more powerful. They accept the complexity of a two-tier network (NVLink + InfiniBand) as the necessary cost of this extreme density.</p></li></ul><p>Google, however, has altered its silicon strategy.</p><ul><li><p>The Old Way: For years, Google relied on a &#8220;Fabric-First&#8221; approach&#8212;using massive meshes of moderately powerful chips (TPU v4/v5).</p></li><li><p>The Pivot: With the TPU v7 (&#8220;Ironwood&#8221;), Google effectively admitted that fabric scale alone is no longer sufficient. By driving a 10x leap in per-chip performance, they are also chasing density now, attempting to combine both strategies: NVIDIA-class per-chip density deployed on a Google-class flat optical mesh.</p></li></ul><p>As we enter the Gigawatt era, the architectural battle lines are drawn.</p><ul><li><p>NVIDIA bets that the hierarchical super-node (the 120kW Rack) is the ultimate building block.</p></li><li><p>Google bets that a dense flat mesh (9,000+ high-power chips) can eliminate the hierarchy entirely.</p></li></ul><p>In the next article, <strong>&#8220;Article 2: Two Silicon Foundations for Scale,&#8221;</strong> we will zoom into the silicon die itself, exploring the engines that power these competing visions.</p>
]]></content:encoded></item><item><title><![CDATA[Welcome to The Gigawatt Machine Series]]></title><description><![CDATA[A 12-Part Guide to NVIDIA, Google, and the Engineering of AI Infrastructure Scale]]></description><link>https://www.gigawattmachine.com/p/welcome-to-the-gigawatt-machine-series</link><guid isPermaLink="false">https://www.gigawattmachine.com/p/welcome-to-the-gigawatt-machine-series</guid><dc:creator><![CDATA[Tony Wan]]></dc:creator><pubDate>Wed, 03 Dec 2025 03:16:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c47a4ae7-edea-4183-b9b3-8db4ccbb0a1f_2048x2048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3><strong>The Era of the Gigawatt Machine</strong></h3><p>When companies like OpenAI, Microsoft, Meta, and xAI announce their next-generation infrastructure, they are no longer describing data centers. They are describing AI Supercomputers&#8212;city-scale industrial machines that function as a single, unified brain.</p><p>We are witnessing a profound shift in industrial engineering. The &#8220;Gigawatt-scale&#8221; machine is not just a larger version of a traditional server farm; it is a fundamentally different class of system. To build one, engineers must contend with immutable laws of physics&#8212;latency, heat, and yield.</p><p>The result is an industrial arms race to build AI Supercomputers. But there is no single blueprint for how to build one. The industry has split into two competing philosophies, largely driven by two dominant companies in AI hardware: NVIDIA and Google.</p><h3><strong>Why We Study the Rivalry</strong></h3><p>We are using the rivalry between NVIDIA and Google not just to compare specs, but as a lens to better understand the AI Supercomputer at a system level.</p><p>Engineers at both companies are solving for the same things&#8212;scale, latency, heat, and yield&#8212;but they have contrasting business needs and different starting points.</p><ul><li><p>NVIDIA: Solving for Flexibility. They must build a general-purpose platform that works for everyone (OpenAI, Meta, Microsoft).</p></li><li><p>Google: Solving for Efficiency. They build for themselves (Search, Gemini) and select strategic partners. They can prioritize cost-efficiency over flexibility.</p></li></ul><p>This produces two very different architectures.</p><ul><li><p>The NVIDIA Way: A hierarchical architecture designed for flexibility. It scales up by building massive, powerful nodes that act as &#8220;super-chips,&#8221; designed to serve the diverse needs of the global AI market.</p></li><li><p>The Google Way: A flat architecture designed for efficiency. It scales out by building a massive, uniform mesh of specialized chips, optimized to run specific internal workloads with ruthless cost-efficiency.</p></li></ul><p>Understanding these two distinct approaches elucidates the fundamental concepts of the system itself.</p><h3><strong>What is This Series?</strong></h3><p>Our mission is to deconstruct these machines at a systems level. 
We treat the training of trillion-parameter models not as a software problem, but as a systems engineering challenge that spans from the silicon transistor to the facility&#8217;s cooling towers.</p><p>This is a 12-article master class that builds an AI Supercomputer from first principles. Rather than getting lost in the battle of raw specifications, we focus on the interplay of hardware and software:</p><ul><li><p>The Building Blocks: How individual accelerators are fused into Nodes, how Nodes are networked into Pods, and how Pods are interconnected into Supercomputers.</p></li><li><p>The Software Synergy: How the choice of software, whether a flexible library or a predictive compiler, dictates the physical design of the network fabric itself.</p></li></ul><h3><strong>Who Is This For?</strong></h3><p>In the current landscape, technical information is predominantly found at two extremes, creating a critical knowledge gap:</p><ul><li><p>Too High-Level: Marketing decks that use buzzwords without explaining the mechanics.</p></li><li><p>Too Low-Level: Dense vendor documentation or academic papers lost in minutiae.</p></li></ul><p>This series bridges that gap. It is written for professionals such as Executives, Marketing and Sales Leaders, Investors, Architects, Program Managers, Supply Chain Managers, and Engineers: people who need to understand the &#8220;Why&#8221; and &#8220;How&#8221; of system architecture.</p><h3><strong>The Syllabus: A 12-Part Comparative Journey</strong></h3><p>We have consolidated our curriculum into four logical phases. Each phase explores a layer of the stack, highlighting how NVIDIA and Google diverged to solve the same problem.</p><h4><strong>Phase I: The Hardware (Building the Machine)</strong></h4><p><em>Focus: Deconstructing the physical systems from the chip to the rack.</em></p><ul><li><p><strong>Article 1: The Two Anatomies of Scale</strong></p><ul><li><p>NVIDIA&#8217;s Hierarchical &#8220;Super-Node&#8221; vs. Google&#8217;s Flat &#8220;Optical Mesh.&#8221;</p></li></ul></li><li><p><strong>Article 2: The Silicon Engine</strong></p><ul><li><p>NVIDIA Blackwell vs. Google TPU v7.</p></li></ul></li><li><p><strong>Article 3: The Compute Node</strong></p><ul><li><p>The shift from the 8-GPU Server (DGX) to the 72-GPU Rack (NVL72) vs. the Virtual Pod (TPU).</p></li></ul></li><li><p><strong>Article 4: The Laws of Physics</strong></p><ul><li><p>Why both architectures&#8212;despite their differences&#8212;have been forced to adopt liquid cooling at the Gigawatt scale.</p></li></ul></li></ul><h4><strong>Phase II: The Fabric (Connecting the Machine)</strong></h4><p><em>Focus: The network topologies that turn isolated racks into a supercomputer.</em></p><ul><li><p><strong>Article 5: The Tier-1 Fabric (Inside the Pod)</strong></p><ul><li><p>NVIDIA&#8217;s NVLink (Copper/Electrical) vs. Google&#8217;s ICI (Optical/Circuit-Switched).</p></li></ul></li><li><p><strong>Article 6: The Tier-2 Fabric (The Scale-Out Layer)</strong></p><ul><li><p>NVIDIA&#8217;s InfiniBand/Ethernet vs. Google&#8217;s Jupiter Data Center Network.</p></li></ul></li><li><p><strong>Article 7: Traversing the Fabric</strong></p><ul><li><p>Following data as it traverses the hierarchical NVIDIA network vs. 
the flat Google mesh.</p></li></ul></li></ul><h4><strong>Phase III: The Workload (Animating the Machine)</strong></h4><p><em>Focus: How software bridges the gap between math and silicon.</em></p><ul><li><p><strong>Article 8: The Software Ecosystem</strong></p><ul><li><p>NVIDIA&#8217;s Library-Based stack (CUDA/cuDNN) for flexibility vs. Google&#8217;s Compiler-Based stack (XLA/JAX) for efficiency.</p></li></ul></li><li><p><strong>Article 9: Parallelism</strong></p><ul><li><p>How Tensor, Pipeline, and Data Parallelism map differently to hierarchical vs. flat hardware.</p></li></ul></li><li><p><strong>Article 10: Orchestration &amp; Storage</strong></p><ul><li><p>Managing data loading at the exabyte scale.</p></li></ul></li></ul><h4><strong>Phase IV: The Facility (Housing the Machine)</strong></h4><p><em>Focus: The industrial reality of the Gigawatt era.</em></p><ul><li><p><strong>Article 11: The Gigawatt Facility</strong></p><ul><li><p>Power, piping, and concrete. How the data center building itself must change to support 120kW racks.</p></li></ul></li><li><p><strong>Article 12: Conclusion: The Road to Zettaflops</strong></p><ul><li><p>Where do we go when we hit the limits of copper, optics, and the power grid?</p></li></ul></li></ul><h2><strong>Ready to Build?</strong></h2><p>Each article is designed to be a concise, system-level deep dive. We avoid getting bogged down in minutiae to keep our eyes on the big picture: the systems engineering of intelligence.</p><p>Let&#8217;s begin with <a href="https://gigawattmachine.substack.com/p/two-anatomies-of-scale">Article 1, &#8220;The Two Anatomies of Scale.&#8221;</a></p>]]></content:encoded></item></channel></rss>