DataCenterNews Asia
Specialist data center news for Asia
Partner content

New performance evaluation method helps realize computing power for hyperscale cloud clusters

By Contributor
Tue 27 Apr 2021

Article by Intel general manager of data center management solutions Jeff Klaus.
 

Today’s cloud-based internet services are commonly hosted in hyperscale data centers with a large fleet of computers. A hardware upgrade, a software update, or even a system configuration change can be costly in these types of environments. 

For this reason, the potential performance impact needs to be thoroughly evaluated so that IT staff can decide whether to accept or reject the change. Understanding the evaluation results, including the root cause of any performance change, is important for further system optimization and customization.

Evaluating performance at the scale of a cluster has long been a difficult challenge. Traditional benchmark-based performance analysis conducts load testing in an isolated environment. However, this type of analysis cannot represent the behavior of a variety of co-located workloads running at varying intensities in the field.

Alibaba, a global leader in cloud computing and artificial intelligence, providing services to thousands of enterprises, developers, and government organizations in more than 200 countries, developed the System Performance Estimation, Evaluation and Decision (SPEED) platform to address this very challenge. 

SPEED is a data center performance analysis platform that handles a mix of businesses, including eCommerce, Big Data, and colocation in production. It has been used for years to support software and hardware upgrades at scale and to maintain cluster resource stability. It also provides the data-driven foundation for critical decision-making involving technology evaluation and capacity planning.

Intel Platform Resource Manager (Intel PRM) is a suite of software packages that helps monitor and analyze performance in a large-scale cluster. It supports the collection of system performance data and platform performance counters at the core level, the container level, and the virtual machine (VM) level.

The suite contains an agent (eris agent) that monitors and controls platform resources (CPU cycles, last-level cache, memory bandwidth, etc.) on each node, and an analysis tool (analyze tool) that builds a model for detecting platform resource contention. Some derived metrics depend on performance counters that are unique to Intel Xeon Scalable processors.

A key metric of SPEED, resource usage effectiveness (RUE), measures the resources consumed for each unit of work done, for example, a transaction completed in an eCommerce workload. Since the RUE metric is monitored for each workload instance, the resources consumed are preferably measured from system metrics through a low-overhead mechanism.
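As a rough illustration of the definition above, RUE can be sketched as resources consumed divided by work completed. The `Sample` type, its field names, and the choice of CPU seconds as the resource are assumptions made for this example; they are not taken from SPEED or Intel PRM.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One monitoring interval for one workload instance (hypothetical shape)."""
    cpu_seconds: float   # resource consumed in the interval (assumed: CPU time)
    transactions: int    # units of work completed in the interval


def rue(sample: Sample) -> float:
    """Resource consumed per unit of work done; lower means more effective."""
    if sample.transactions <= 0:
        raise ValueError("RUE is undefined when no work was completed")
    return sample.cpu_seconds / sample.transactions


def mean_rue(samples: List[Sample]) -> float:
    """Plain average of per-sample RUE values for one workload instance."""
    values = [rue(s) for s in samples]
    return sum(values) / len(values)


if __name__ == "__main__":
    samples = [Sample(cpu_seconds=12.0, transactions=400),
               Sample(cpu_seconds=10.0, transactions=200)]
    print(mean_rue(samples))
```

A plain average like this ignores the effect of workload intensity on RUE, which is the bias the method described later removes.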

As Jianmei Guo, an Alibaba staff engineer and the leader of the SPEED platform, explained, “RUE requires reliable measurements for resource utilization, and as a server provider, Intel brings in tools and methodology to work with Alibaba on the hyperscale cluster performance evaluation. Intel and Alibaba worked together on strengthening the data collection and analysis in SPEED.” 
 

Solving the challenges of performance evaluation 

We start by sampling RUE per instance in the cluster, then aggregate the samples into a workload-level metric after removing the bias from varying workload intensities, and finally derive the performance results.

To better understand the performance speedup, we speculate about the potential cause of the performance change and select auxiliary metrics derived from the platform performance counters or the system statistics. Further analysis of the auxiliary metrics strengthens the empirical justification of the performance change, in addition to the primary analysis of RUE.

In traditional offline performance analysis, a small-scale benchmark workload is often stressed to its peak or to a fixed intensity by a load generator in an isolated environment. This type of benchmark is simple, reproducible, and useful for offline performance profiling.

However, in a cloud computing environment with a large number of workloads co-located together, the workload behaviors are quite diverse due to the varying intensities and possible micro-architectural interferences between workloads, making this load-testing approach less practical.
 

To address these shortcomings, a new performance evaluation method follows this step-by-step procedure:

1. Propose a system change for performance improvement and speculate why and how the system change impacts the performance from a certain perspective.

2. Monitor multiple instances of important workloads in the cluster for workload throughput and resource utilization before and after applying the system change, and calculate the baseline average RUE (primary metric) and the new average RUE for each job. During the calculation, the bias of RUE with respect to the varying workload intensity is identified and eliminated through a regression approach.

3. Calculate the job-level performance speed-up using the baseline average RUE and the new average RUE of the job. Where each high-importance job is assigned an importance weight, a cluster-level speed-up can be aggregated as a weighted average across the jobs.

4. Based on the speculation in Step 1, define the relevant auxiliary metrics that can serve as evidence to strengthen the speed-up results from Step 3. Monitor and calculate the auxiliary metrics for a sampled set of instances before and after applying the system change.

5. Calculate the job-level metric change using the baseline auxiliary metrics and the new auxiliary metrics. If the metric change agrees with the speculation, the performance speed-up is considered to be strengthened from the specific perspective speculated in Step 1.
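Steps 2 and 3 can be sketched in a few lines. The article does not specify the regression model or the speed-up formula, so this example makes loudly stated assumptions: RUE is fitted as a linear function of workload intensity, the baseline and new fits are compared at a common reference intensity, and speed-up is taken as baseline RUE divided by new RUE. All function names are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # All samples share one intensity: fall back to the mean RUE.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def adjusted_rue(intensities: Sequence[float], rues: Sequence[float],
                 reference_intensity: float) -> float:
    """RUE predicted at a common intensity (assumed linear bias model)."""
    slope, intercept = fit_line(intensities, rues)
    return slope * reference_intensity + intercept


def job_speedup(baseline_rue: float, new_rue: float) -> float:
    """Assumed definition: values above 1.0 mean less resource per unit of work."""
    return baseline_rue / new_rue


def cluster_speedup(speedups: Dict[str, float],
                    weights: Dict[str, float]) -> float:
    """Importance-weighted average of job-level speed-ups."""
    total_weight = sum(weights[job] for job in speedups)
    return sum(speedups[job] * weights[job] for job in speedups) / total_weight


if __name__ == "__main__":
    # Made-up samples: (intensity, RUE) before and after a system change.
    base = adjusted_rue([100, 200, 300], [0.050, 0.040, 0.030], 200)
    new = adjusted_rue([150, 250, 350], [0.040, 0.030, 0.020], 200)
    print(job_speedup(base, new))
    print(cluster_speedup({"job_a": 1.2, "job_b": 1.0},
                          {"job_a": 3.0, "job_b": 1.0}))
```

Comparing both fits at the same reference intensity is what keeps a change in traffic between the two measurement windows from being mistaken for a performance change.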
 

Intel recently released a white paper featuring two case studies on this new performance evaluation method. The first case study evaluates the performance improvement from a change to the core pinning setting. Before the change, two containers running instances of a job are not pinned to the processors. Instead, their CPU resource usage is simply limited by the CPU quota assignment.

At run time, the Linux CPU scheduler moves the worker threads of a workload across the processor boundary. With the current NUMA architecture and the default first-touch NUMA policy, such movement introduces higher memory access latency.

The change applied is to pin the two containers to the two processors, respectively. After applying the change, we speculate that workload performance will improve because of a micro-architectural gain from reducing the amount of remote memory access.
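The auxiliary-metric check for a speculation like this one can be sketched as follows. The metric used here, a per-instance share of memory accesses that are remote, is a hypothetical stand-in; the article does not name the auxiliary metrics used in the white paper, and the numbers are made up for illustration.

```python
from typing import Sequence


def metric_change(before: Sequence[float], after: Sequence[float]) -> float:
    """Relative change of the mean auxiliary metric across sampled instances."""
    mean_before = sum(before) / len(before)
    mean_after = sum(after) / len(after)
    if mean_before == 0:
        raise ValueError("baseline metric is zero; relative change is undefined")
    return (mean_after - mean_before) / mean_before


def supports_speculation(change: float, expected: str) -> bool:
    """True when the sign of the change matches the expected direction."""
    if expected == "decrease":
        return change < 0
    if expected == "increase":
        return change > 0
    raise ValueError("expected must be 'decrease' or 'increase'")


if __name__ == "__main__":
    # Hypothetical remote-access shares per sampled instance.
    before_pinning = [0.40, 0.50]
    after_pinning = [0.20, 0.25]
    change = metric_change(before_pinning, after_pinning)
    print(change, supports_speculation(change, "decrease"))
```

A sign check alone is a weak test; in practice the change would also need to be large relative to the sample-to-sample variation before it counts as supporting evidence.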

The second case study evaluates the performance impact of replacing Pouch, an open source container engine created by Alibaba, with the Kata container in a workload colocation context. A container created from Pouch is a standard Linux container from the performance perspective.

Kata is a secure container based on a lightweight VM mechanism that offers a better security context. After replacing Pouch with Kata, we speculate that performance may be affected by virtualization overheads, such as the translation of system calls and an additional layer of page table translation in the guest OS.

As Alibaba’s Jianmei Guo said, “The reason we concern ourselves with performance in Alibaba is that we aim at cost reduction while continuously improving performance through constant technological innovation. 

“We’re happy to work with Intel to support the performance monitoring and analysis challenge in Alibaba SPEED with Intel PRM.”
