Google Cloud TPU machine learning accelerators now available in beta

13 Feb 18

Google has made its Cloud TPUs available in beta on Google Cloud Platform (GCP) to help machine learning experts train and run their ML models faster.

Google defines its Cloud TPUs (tensor processing units) as hardware accelerators optimised to speed up and scale up specific ML workloads programmed with TensorFlow.

Each Cloud TPU is built with four custom ASICs and packs up to 180 teraflops of floating-point performance and 64 GB of high-bandwidth memory onto a single board.
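To put those board-level figures in per-chip terms, the short sketch below divides them across the four ASICs. It assumes the performance and memory are split evenly between the chips, which the announcement does not state explicitly; the constant and function names are illustrative only.

```python
# Board-level figures quoted in the article for a single Cloud TPU.
BOARD_PEAK_TERAFLOPS = 180  # "up to" figure, i.e. a peak rather than sustained rate
BOARD_MEMORY_GB = 64        # high-bandwidth memory on the board
ASICS_PER_BOARD = 4         # custom ASICs per board


def per_asic_share(board_total, asics=ASICS_PER_BOARD):
    """Return one ASIC's share of a board-level total.

    Assumption: the total is divided evenly across the ASICs.
    """
    return board_total / asics


if __name__ == "__main__":
    print(per_asic_share(BOARD_PEAK_TERAFLOPS), "teraflops per ASIC (peak, assumed even split)")
    print(per_asic_share(BOARD_MEMORY_GB), "GB of memory per ASIC (assumed even split)")
```

Under that assumption, each ASIC would account for roughly 45 teraflops of peak performance and 16 GB of memory.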

The boards can be used alone or connected via an ultra-fast, dedicated network to form multi-petaflop ML supercomputers called “TPU pods”, Google explained in a blog post yesterday.

Google stated that it will offer these larger supercomputers on GCP later in the year.

“We designed Cloud TPUs to deliver differentiated performance per dollar for targeted TensorFlow workloads and to enable ML engineers and researchers to iterate more quickly,” Google said on its blog. The company elaborated on this with three examples:

  • Instead of waiting for a job to schedule on a shared compute cluster, you can have interactive, exclusive access to a network-attached Cloud TPU via a Google Compute Engine VM that you control and can customise
  • Rather than waiting days or weeks to train a business-critical ML model, you can train several variants of the same model overnight on a fleet of Cloud TPUs and deploy the most accurate trained model in production the next day
  • Using a single Cloud TPU and following this tutorial, you can train ResNet-50 to the expected accuracy on the ImageNet benchmark challenge in less than a day, all for well under $200

ML model training

Google’s Cloud TPUs can be programmed with high-level TensorFlow APIs, and the company has open-sourced a set of reference high-performance Cloud TPU model implementations.

Google plans to open-source additional model implementations over time.

“Adventurous ML experts may be able to optimise other TensorFlow models for Cloud TPUs on their own using the documentation and tools we provide,” Google added.

Google will introduce TPU pods later this year, which are expected to further improve time-to-accuracy over a single Cloud TPU.

“Both ResNet-50 and Transformer training times drop from the better part of a day to under 30 minutes on a full TPU pod, no code changes required,” the blog detailed.

Two Sigma chief technology officer and former senior Google engineer Alfred Spector comments, “We made a decision to focus our deep learning research on the cloud for many reasons, but mostly to gain access to the latest machine learning infrastructure.”

“Google Cloud TPUs are an example of innovative, rapidly evolving technology to support deep learning, and we found that moving TensorFlow workloads to TPUs has boosted our productivity by greatly reducing both the complexity of programming new models and the time required to train them.”

Spector concludes, “Using Cloud TPUs instead of clusters of other accelerators has allowed us to focus on building our models without being distracted by the need to manage the complexity of cluster communication patterns.”
