
Google unveils Virgo Network for next-gen AI workloads

Thu, 23rd Apr 2026

Google has introduced Virgo Network, a new data centre fabric for large-scale AI workloads. It is designed to connect AI accelerators across pods and sites as part of the company's AI Hypercomputer infrastructure.

Virgo Network targets the scale-out side of AI networking, handling east-west traffic between accelerators during model training and serving. Its design reflects a shift away from general-purpose data centre networks as larger AI models place heavier demands on bandwidth, latency and fault tolerance.

The architecture has three layers: a scale-up domain for communication within a pod, a separate scale-out fabric for accelerator-to-accelerator RDMA traffic across pods, and the Jupiter front-end network for north-south traffic to storage and general-purpose compute resources.

This separation allows each layer to be updated independently. It also reserves dedicated bandwidth for training traffic while keeping access to storage and compute on a distinct front-end fabric.
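As a rough illustration of that separation (hypothetical names, not Google's implementation), the decision of which fabric carries a given flow can be sketched as a simple classifier:

```python
from enum import Enum, auto

class Fabric(Enum):
    SCALE_UP = auto()    # intra-pod accelerator links
    SCALE_OUT = auto()   # cross-pod accelerator RDMA fabric
    FRONT_END = auto()   # Jupiter: storage and general-purpose compute

def pick_fabric(src_pod: int, dst_pod: int, is_accelerator_traffic: bool) -> Fabric:
    """Hypothetical sketch of the three-layer traffic split."""
    if not is_accelerator_traffic:
        return Fabric.FRONT_END   # north-south traffic stays on the front end
    if src_pod == dst_pod:
        return Fabric.SCALE_UP    # within a pod: scale-up domain
    return Fabric.SCALE_OUT       # across pods: dedicated RDMA fabric
```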

Flat Design

At the centre of the system is a flat, two-layer non-blocking topology built on high-radix switches. This reduces the number of network tiers compared with conventional data centre designs, cutting latency and supporting a larger unified compute domain.
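The scaling benefit of high-radix switches can be seen with generic leaf-spine arithmetic (standard Clos maths, not Virgo's published parameters): a non-blocking two-tier fabric built from R-port switches supports up to R²/2 endpoints, so radix growth expands the compute domain quadratically.

```python
def two_tier_endpoints(radix: int) -> int:
    """Max endpoints in a non-blocking two-tier leaf-spine fabric of radix-R switches."""
    hosts_per_leaf = radix // 2   # half the ports face hosts, half face spines
    max_leaves = radix            # each spine port attaches to one leaf
    return hosts_per_leaf * max_leaves

print(two_tier_endpoints(64))    # 2,048 endpoints
print(two_tier_endpoints(128))   # 8,192 endpoints: doubling radix quadruples scale
```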

Virgo Network uses a multi-planar design with independent control domains. Accelerator racks also connect to the Jupiter network, which provides access to distributed storage and compute services and can be used across multiple sites for very large training runs.

The fabric can link 134,000 TPU 8t chips in a single network and deliver up to 47 petabits a second of non-blocking bisection bandwidth.
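Dividing the two headline figures suggests roughly 350 Gb/s of bisection bandwidth per chip:

```python
bisection_bps = 47e15    # 47 petabits/s of non-blocking bisection bandwidth
chips = 134_000          # chips in a single network
per_chip_gbps = bisection_bps / chips / 1e9
print(f"{per_chip_gbps:.0f} Gb/s per chip")   # ~351 Gb/s
```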

Google also said the new fabric provides up to four times the bandwidth per accelerator of the previous generation, and that unloaded fabric latency for TPUs is 40% lower.

Scale Pressure

The launch reflects the pressure AI workloads are placing on data centre infrastructure. Training jobs now extend beyond the power and space limits of a single data centre, while traffic patterns include intense bursts at millisecond intervals that can create congestion and expose weak points in older network designs.

Low latency has also become more important for AI inference. Serving workloads need consistent response times, and training clusters can be slowed by a single underperforming node if the network cannot manage synchronised traffic effectively.

Virgo Network was built to improve “goodput” - a measure of useful work completed by a machine learning workload - rather than focusing only on raw throughput. Fault isolation, observability, and faster mitigation of hangs and stragglers are central to that goal.
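A minimal way to see the distinction between goodput and raw throughput (an illustrative calculation, not Google's metric definition) is to subtract time lost to restarts, checkpoint replay and stalls from wall-clock time:

```python
def goodput_fraction(useful_hours: float, wall_clock_hours: float) -> float:
    """Fraction of wall-clock time spent on useful training work."""
    return useful_hours / wall_clock_hours

# e.g. a 100-hour run that loses 12 hours to restarts, replay and stalls
print(goodput_fraction(useful_hours=88.0, wall_clock_hours=100.0))  # 0.88
```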

Reliability Focus

As systems now span hundreds of thousands of chips, hardware faults are treated as unavoidable rather than exceptional. Independent switching planes in Virgo Network are intended to isolate local failures so they do not degrade performance across an entire cluster.

The software and orchestration stack is designed to increase mean time between interruptions and reduce mean time to recovery. Google uses sub-millisecond telemetry to monitor network behaviour, detect transient congestion, manage buffers, and trace the source of slowdowns across hardware and software.
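Those two metrics combine into an effective availability figure. As a back-of-the-envelope sketch (standard reliability arithmetic, with assumed numbers rather than figures from Google), halving recovery time buys nearly as much useful uptime as doubling the interval between interruptions:

```python
def effective_availability(mtbi_hours: float, mttr_minutes: float) -> float:
    """Fraction of time a cluster makes progress, given mean time between
    interruptions (MTBI) and mean time to recovery (MTTR)."""
    mttr_hours = mttr_minutes / 60.0
    return mtbi_hours / (mtbi_hours + mttr_hours)

print(effective_availability(mtbi_hours=24, mttr_minutes=30))  # ~0.980
print(effective_availability(mtbi_hours=24, mttr_minutes=15))  # ~0.990
```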

Google also highlighted automated tools for detecting stragglers and hangs. These identify nodes that are either running more slowly than expected or have stopped responding, allowing bottlenecks to be isolated more quickly during training jobs.
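One common way to implement this kind of detection, offered here as a hedged sketch rather than Google's actual tooling, is to compare each node's step time and last heartbeat against the fleet median:

```python
import time
from statistics import median

def detect_stragglers_and_hangs(step_times_s, last_heartbeat_s,
                                slack=1.25, hang_timeout_s=30.0):
    """step_times_s: node -> latest training-step duration (seconds);
    last_heartbeat_s: node -> timestamp of last progress report."""
    now = time.time()
    typical = median(step_times_s.values())
    stragglers = [n for n, t in step_times_s.items() if t > slack * typical]
    hangs = [n for n, ts in last_heartbeat_s.items() if now - ts > hang_timeout_s]
    return stragglers, hangs
```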

The network forms part of Google's broader effort to tailor infrastructure to successive generations of AI accelerators. Its machine learning systems and network architecture are developed together so the fabric matches the hardware it supports.

Virgo Network is positioned as the networking base for Google's next generation of accelerator systems, with the company arguing that traditional designs are no longer well suited to the scale, traffic patterns and latency demands of modern AI workloads.