NVIDIA & xAI build supercomputer to power AI training
NVIDIA has announced that the xAI Colossus supercomputer cluster, equipped with 100,000 NVIDIA Hopper Tensor Core GPUs, has utilised the NVIDIA Spectrum-X Ethernet networking platform to achieve massive scale.
The Colossus supercomputer, located in Memphis, Tennessee, is used to train xAI's Grok large language models, which are offered as a feature to X Premium subscribers. NVIDIA and xAI completed construction of the supercomputer in 122 days, significantly faster than is typical for systems of this kind, which can take many months or even years. Remarkably, just 19 days passed between the installation of the first server rack and the start of training.
While training the Grok model, Colossus has demonstrated exceptional network performance. According to NVIDIA, it has shown zero application latency degradation and no packet loss from flow collisions. With Spectrum-X congestion control, the system has maintained 95% data throughput. By contrast, standard Ethernet solutions typically suffer numerous flow collisions and deliver only 60% data throughput.
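To put the 95% and 60% figures in perspective, the short Python sketch below converts them into usable bandwidth on a single link. It is only an illustration: the 400 Gb/s nominal link rate and the `effective_gbps` helper are assumptions made for this example and are not figures or tools from NVIDIA or xAI.

```python
def effective_gbps(nominal_gbps: float, utilisation: float) -> float:
    """Return usable throughput for a link, given its nominal rate
    and the fraction of that rate actually delivered (0.0 to 1.0)."""
    if not 0.0 <= utilisation <= 1.0:
        raise ValueError("utilisation must be between 0 and 1")
    return nominal_gbps * utilisation


# Hypothetical nominal link rate, chosen only for illustration.
NOMINAL_GBPS = 400.0

# Throughput fractions quoted in the article.
spectrum_x = effective_gbps(NOMINAL_GBPS, 0.95)
standard_ethernet = effective_gbps(NOMINAL_GBPS, 0.60)

# How much more usable bandwidth the higher fraction yields.
ratio = spectrum_x / standard_ethernet

print(f"Spectrum-X:        {spectrum_x:.1f} Gb/s usable")
print(f"Standard Ethernet: {standard_ethernet:.1f} Gb/s usable")
print(f"Ratio:             {ratio:.2f}x")
```

Whatever nominal rate is assumed, the ratio depends only on the two percentages: 95% against 60% works out to roughly 1.6 times the usable bandwidth.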
Elaborating on the technological advancements, Gilad Shainer, Senior Vice President of Networking at NVIDIA, stated, "AI is becoming mission-critical and requires increased performance, security, scalability, and cost-efficiency. The NVIDIA Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions."
Elon Musk commented on the achievement via X, describing Colossus as "the most powerful training system in the world" and acknowledging the collaborative effort: "Nice work by xAI team, NVIDIA and our many partners/suppliers."
A spokesperson for xAI remarked on the capabilities of their new infrastructure, saying, "xAI has built the world's largest, most-powerful supercomputer. NVIDIA's Hopper GPUs and Spectrum-X allow us to push the boundaries of training AI models at a massive-scale, creating a super-accelerated and optimised AI factory based on the Ethernet standard."
At the centre of the Spectrum-X platform is the Spectrum SN5600 Ethernet switch, which supports port speeds of up to 800Gb/s and is built on the Spectrum-4 switch ASIC. For enhanced performance, xAI chose to pair the Spectrum-X SN5600 switch with NVIDIA BlueField-3 SuperNICs.
Additionally, Spectrum-X Ethernet networking introduces advanced features such as adaptive routing with NVIDIA Direct Data Placement technology, congestion control, and improved AI fabric visibility and performance isolation. These capabilities, previously available only with InfiniBand, are essential for supporting multi-tenant generative AI clouds and large enterprise environments.