Datadog launches GPU Monitoring to cut AI compute costs

Thu, 23rd Apr 2026

Datadog has launched GPU Monitoring worldwide for organisations trying to manage rising AI computing costs.

The tool gives developers and machine learning engineers a single view of GPU use across infrastructure, workloads and application performance. It is intended to help teams identify idle resources, trace bottlenecks affecting model training, and relate hardware metrics to application behaviour.

Growth in AI development has driven a surge in demand for graphics processing units, which are used to train and run large models. That demand has also exposed operational issues for companies that have assembled a patchwork of GPU resources without clear oversight of utilisation, cost allocation or performance problems.

Datadog says GPU instances account for 14 per cent of compute costs, making them a significant expense for businesses expanding their AI work. Many organisations can see costs rising but still struggle to charge spending back to business units, understand workload context, or identify where efficiency gains can be made.

Existing GPU management tools often show device-level health data but do not explain why training and inference jobs slow down or fail, or which devices are sitting idle. As a result, engineering teams may overprovision hardware as a precaution, pushing spending higher.

Cost pressure

The new product aims to address those gaps by linking telemetry from GPU fleets to the workloads using those resources. This shared view is intended to help platform engineering and machine learning teams investigate problems together instead of relying on separate dashboards and data sources.

Users can forecast demand from fleet usage patterns and assess whether they need to buy additional GPUs or can free up existing capacity instead. The product can also help detect unhealthy GPUs before failures spread across clusters and disrupt training or inference jobs.

Yanbing Li, Chief Product Officer at Datadog, said the economics of AI infrastructure have become a broader management issue. "GPU instances account for 14 percent of compute costs, which is a huge issue as companies are struggling to build AI-first technology in scalable and smart ways. While these companies can see their costs climbing, they can't chargeback GPU spend across business units, see workload context or identify clear next steps for improvement. As a result, it is very challenging to budget and plan in thoughtful ways," Li said.

Li said fragmented visibility remains a central problem for many businesses. "Smartly managing AI spend becomes a board-level conversation when capacity is misallocated, training and inference workloads stall, and costs escalate. We all know managing GPU costs is a huge problem we need to solve, but most companies are experimenting with solutions and it is still very difficult to get a single view of what is happening across the stack. GPU Monitoring fixes that with efficiency and reliability that we haven't seen before," she said.

Customer view

Hyperbolic, which operates multi-tenant GPU infrastructure, is among the users Datadog cited. Hyperbolic says the product gives it visibility at both instance and device level across utilisation, memory, power and thermal metrics.

Kai Huang, Head of Product at Hyperbolic, described the operational benefits of consolidating monitoring in one system. "Datadog GPU Monitoring has made it easy for us to stay on top of our multi-tenant GPU infrastructure. We get per-instance, per-device visibility into core utilisation, memory, power and thermals right out of the box with no extra setup. The dashboards are rich out of the gate and simple to customise, and standing up isolated views per customer takes minutes," Huang said.

He added that combining infrastructure data with model-level observability reduces the need to switch between tools during incident investigations. "Layering on LLM Observability ties it all together. We can go from a model latency spike straight to the underlying GPU metrics without switching tools. Full stack AI observability in one platform means both our team and our customers can move faster with confidence."

The launch adds to a growing market for software that tracks the cost and performance of AI infrastructure as companies try to turn experimental model development into repeatable production systems. With GPUs still expensive and supply unevenly distributed, tools that help reclaim underused capacity are becoming as important to AI operations teams as the chips themselves.