Google’s scalable supercomputers now publicly available

Wed, 8th May 2019


In what it says is a bid to accelerate the largest-scale machine learning (ML) applications deployed today, Google has opened up its supercomputers.

The global tech giant has created silicon chips called Tensor Processing Units (TPUs). When assembled into multi-rack ML supercomputers called Cloud TPU Pods, they can complete in minutes or hours ML workloads that previously took days or weeks on other systems.

Now, Google Cloud TPU v2 Pods and Cloud TPU v3 Pods are publicly available in beta to help ML researchers, engineers, and data scientists iterate faster and train more capable machine learning models.

“Google Cloud is committed to providing a full spectrum of ML accelerators, including both Cloud GPUs and Cloud TPUs. Cloud TPUs offer highly competitive performance and cost, often training cutting-edge deep learning models faster while delivering significant savings,” says Google Brain Team Cloud TPUs senior product manager Zak Stone.

The benefits for ML teams building complex models and training on large data sets, Stone says, include shorter time to insight, higher accuracy, frequent model updates, and rapid prototyping.

“While some custom silicon chips can only perform a single function, TPUs are fully programmable, which means that Cloud TPU Pods can accelerate a wide range of state-of-the-art ML workloads, including many of the most popular deep learning models,” says Stone.

“Cloud TPU customers see significant speed-ups in workloads spanning visual product search, financial modeling, energy production, and other areas. In a recent case study, Recursion Pharmaceuticals iteratively tests the viability of synthesized molecules to treat rare illnesses. What took over 24 hours to train on their on-prem cluster completed in only 15 minutes on a Cloud TPU Pod.”

According to Stone, a single Cloud TPU Pod can contain more than 1,000 individual TPU chips, which are connected by an ultra-fast, two-dimensional toroidal mesh network. The TPU software stack then uses this mesh network to enable many racks of machines to be programmed as a single, giant ML supercomputer via a variety of flexible, high-level APIs.
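For readers unfamiliar with the term, a two-dimensional toroidal mesh is a grid whose edges wrap around, so every chip has exactly four direct neighbors and no chip sits on a boundary. The short Python sketch below illustrates the idea only. The 32 x 32 grid size and the function names are assumptions chosen for illustration (32 x 32 gives 1,024 positions, consistent with "more than 1,000" chips); the article does not describe the actual Pod layout or any Google API.

```python
# Illustrative model of a 2D toroidal mesh; not Google's implementation.
# WIDTH and HEIGHT are hypothetical values picked for the example.
WIDTH, HEIGHT = 32, 32


def torus_neighbors(x, y, width=WIDTH, height=HEIGHT):
    """Return the four direct neighbors of the chip at (x, y).

    The modulo arithmetic makes the grid wrap around at its edges,
    which is what turns a plain mesh into a torus.
    """
    return [
        ((x - 1) % width, y),   # left (wraps to the last column)
        ((x + 1) % width, y),   # right (wraps to the first column)
        (x, (y - 1) % height),  # up (wraps to the last row)
        (x, (y + 1) % height),  # down (wraps to the first row)
    ]


def torus_hops(a, b, width=WIDTH, height=HEIGHT):
    """Minimum number of link hops between chips a and b on the torus."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    # In each dimension, traffic can go the short way around the ring.
    return min(dx, width - dx) + min(dy, height - dy)
```

In this model, opposite corners of the grid are only two hops apart because of the wrap-around links, whereas a non-wrapping 32 x 32 mesh would need 62 hops for the same pair.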

“The latest-generation Cloud TPU v3 Pods are liquid-cooled for maximum performance, and each one delivers more than 100 petaFLOPs of computing power. In terms of raw mathematical operations per second, a Cloud TPU v3 Pod is comparable with a top 5 supercomputer worldwide (though it operates at lower numerical precision),” says Stone.

“It's also possible to use smaller sections of Cloud TPU Pods called ‘slices.’ We often see ML teams develop their initial models on individual Cloud TPU devices (which are generally available) and then expand to progressively larger Cloud TPU Pod slices via both data parallelism and model parallelism to achieve greater training speed and model scale.”
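Data parallelism, one of the two techniques Stone mentions, can be sketched in a few lines of plain Python. Each worker receives a shard of the training batch, computes a gradient on that shard, and the gradients are averaged before the shared model is updated. Everything in this sketch is an assumption made for illustration: the one-parameter model, the function names, and the simulated workers are not taken from the Cloud TPU software stack.

```python
# Toy illustration of data parallelism; workers are simulated in a loop.
# The model is y = w * x with a single weight w and squared-error loss.

def local_gradient(w, shard):
    """Gradient of mean squared error with respect to w on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, batch, num_workers, lr=0.01):
    """Run one training step with the batch split across num_workers.

    Assumes len(batch) is evenly divisible by num_workers.
    """
    shard_size = len(batch) // num_workers
    shards = [
        batch[i * shard_size:(i + 1) * shard_size]
        for i in range(num_workers)
    ]
    # Each (simulated) worker computes a gradient on its own shard.
    grads = [local_gradient(w, shard) for shard in shards]
    # Averaging the per-worker gradients plays the role of an all-reduce.
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_one_worker = data_parallel_step(0.0, batch, num_workers=1)
w_two_workers = data_parallel_step(0.0, batch, num_workers=2)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so the two calls produce the same updated weight. The work per worker shrinks as workers are added, which is the source of the speed-up. Model parallelism, by contrast, splits the model itself across devices and is not shown here.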