New performance evaluation method helps realize computing power for hyperscale cloud clusters
Today's cloud-based internet services are commonly hosted in hyperscale data centers with a large fleet of computers. A hardware upgrade, a software update, or even a system configuration change can be costly in these types of environments.
For this reason, the potential performance impact needs to be thoroughly evaluated so that IT staff can decide whether to accept or reject the change. Understanding the evaluation results, including the root cause of any performance change, is important for further system optimization and customization.
Evaluating performance at the scale of a cluster has long been a difficult challenge. Traditional benchmark-based performance analysis conducts load testing in an isolated environment. However, this type of analysis cannot represent the behavior of the many co-located workloads running at varying intensities in the field.
Alibaba, a global leader in cloud computing and artificial intelligence, providing services to thousands of enterprises, developers, and government organizations in more than 200 countries, developed the System Performance Estimation, Evaluation and Decision (SPEED) platform to address this very challenge.
SPEED is Alibaba's data center performance analysis platform, handling a mix of businesses in production, including eCommerce, Big Data, and colocation. SPEED has been used for years to validate software and hardware upgrades at scale and to keep cluster resources stable. It also provides the data-driven foundation for critical decision-making involving technology evaluation and capacity planning.
Intel Platform Resource Manager (Intel PRM) is a suite of software packages that helps monitor and analyze performance in a large-scale cluster. The suite supports the collection of system performance data and platform performance counters at the core level, the container level, and the virtual machine (VM) level.
The suite contains an agent (eris agent) that monitors and controls platform resources (CPU cycles, last-level cache, memory bandwidth, etc.) on each node, and an analysis tool (analyze tool) that builds a model for detecting platform resource contention. Some derived metrics depend on performance counters unique to Intel Xeon Scalable processors.
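To give a flavor of the per-container system metrics such an agent samples, here is a minimal sketch that reads each container's cumulative CPU time from the cgroup filesystem. It assumes a Linux host with cgroup v1 and Docker-style cgroup directories, and it is only an illustration, not the eris agent's actual implementation.

    import os
    import time

    # Assumed cgroup v1 location for per-container CPU accounting; the real
    # PRM agent collects much more (last-level cache occupancy, memory
    # bandwidth, platform counters, ...).
    CPUACCT_ROOT = "/sys/fs/cgroup/cpuacct/docker"

    def read_cpu_usage_ns(container_id):
        """Cumulative CPU time (nanoseconds) consumed by one container cgroup."""
        path = os.path.join(CPUACCT_ROOT, container_id, "cpuacct.usage")
        with open(path) as f:
            return int(f.read().strip())

    def sample_cpu_seconds(container_ids, interval_s=60):
        """CPU-seconds consumed by each container over one sampling interval."""
        before = {cid: read_cpu_usage_ns(cid) for cid in container_ids}
        time.sleep(interval_s)
        after = {cid: read_cpu_usage_ns(cid) for cid in container_ids}
        return {cid: (after[cid] - before[cid]) / 1e9 for cid in container_ids}

Because the data comes straight from files the kernel already maintains, this style of collection keeps the monitoring overhead low on every node.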
A key metric of SPEED, resource usage effectiveness (RUE), measures the resources consumed per unit of work done, for example, per transaction completed in an eCommerce workload. Since RUE is monitored for every workload instance, the resources consumed are preferably measured from system metrics through a low-overhead mechanism.
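Put simply, RUE over a sampling window is the resource consumed divided by the work completed in that window. The sketch below uses hypothetical sample fields (CPU-seconds and a transaction count) purely for illustration; the white paper defines the exact resource and work measures SPEED uses.

    from dataclasses import dataclass

    @dataclass
    class InstanceSample:
        cpu_seconds: float   # resources consumed in the window (CPU only, for simplicity)
        transactions: int    # work completed in the window, e.g. eCommerce transactions

    def rue(sample: InstanceSample) -> float:
        """Resource usage effectiveness: resources consumed per unit of work."""
        return sample.cpu_seconds / sample.transactions

    # Example: 1,800 CPU-seconds spent completing 600,000 transactions
    print(rue(InstanceSample(cpu_seconds=1800.0, transactions=600_000)))  # 0.003 CPU-s per transaction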
As Jianmei Guo, an Alibaba staff engineer and the leader of the SPEED platform, explained, "RUE requires reliable measurements for resource utilization, and as a server provider, Intel brings in tools and methodology to work with Alibaba on the hyperscale cluster performance evaluation. Intel and Alibaba worked together on strengthening the data collection and analysis in SPEED."
Solving the challenges of performance evaluation
We start by sampling RUE per instance in the cluster, then aggregate the samples into a workload-level metric after removing the bias introduced by varying workload intensities, and eventually derive the performance results.
To better understand the speedup, we speculate on the potential cause of the performance change and select auxiliary metrics derived from platform performance counters or system statistics. Analysis of these auxiliary metrics strengthens the empirical justification of the performance change, complementing the primary analysis of RUE.
In traditional offline performance analysis, a small-scale benchmark workload is often driven to peak or to a fixed intensity by a load generator in an isolated environment. This type of benchmark is simple, reproducible, and useful for offline performance profiling.
However, in a cloud computing environment with a large number of co-located workloads, workload behaviors are quite diverse due to varying intensities and possible micro-architectural interference between workloads, making this load-testing approach less practical.
To address these shortcomings, the new performance evaluation method follows a step-by-step procedure:
1. Propose a system change for performance improvement and speculate why and how the system change impacts the performance from a certain perspective.
2. Monitor multiple instances of important workloads in the cluster for throughput and resource utilization before and after applying the system change, and calculate the baseline average RUE (the primary metric) and the new average RUE for each job. During the calculation, the bias of RUE with respect to varying workload intensity is identified and eliminated through a regression approach (one possible form is sketched after this list).
3. Calculate the job-level performance speedup from the baseline average RUE and the new average RUE of the job. Where each high-importance job is assigned an importance weight, a cluster-level speedup can be aggregated as a weighted average across the jobs.
4. Based on the speculation in Step 1, define the relevant auxiliary metrics that can be used as evidence to strengthen the speedup results of Step 3. Monitor and calculate the auxiliary metrics for a sampled set of instances before and after applying the system change.
5. Calculate the job-level metric change using the baseline and new auxiliary metrics. If the change is consistent with the speculation, the performance speedup is regarded as strengthened from the specific perspective speculated in Step 1.
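The sketch below illustrates Steps 2 and 3 under simplifying assumptions: the bias of RUE with respect to workload intensity is removed with a plain linear least-squares fit (the method does not prescribe this exact regression form), and the job-level speedup is the ratio of debiased baseline RUE to debiased new RUE, aggregated across jobs with hypothetical importance weights.

    import numpy as np

    def debiased_mean_rue(rue, intensity, ref_intensity):
        """Average RUE after removing its linear bias w.r.t. workload intensity.

        rue, intensity: per-instance samples of one job (before OR after the change).
        ref_intensity: a common reference intensity so both sides stay comparable.
        Assumes a simple linear RUE-vs-intensity relationship, for illustration only.
        """
        slope, _intercept = np.polyfit(intensity, rue, deg=1)
        adjusted = np.asarray(rue, dtype=float) - slope * (
            np.asarray(intensity, dtype=float) - ref_intensity)
        return float(np.mean(adjusted))

    def job_speedup(base_rue, base_intensity, new_rue, new_intensity):
        """Step 3: a value above 1 means less resource per unit of work after the change."""
        ref = float(np.median(np.concatenate([base_intensity, new_intensity])))
        return (debiased_mean_rue(base_rue, base_intensity, ref)
                / debiased_mean_rue(new_rue, new_intensity, ref))

    def cluster_speedup(job_speedups, importance_weights):
        """Aggregate job-level speedups into a cluster-level weighted average."""
        w = np.asarray(importance_weights, dtype=float)
        return float(np.dot(job_speedups, w) / w.sum())

A job-level speedup above 1 means the job consumes less resource per unit of work after the change; the cluster-level number weights the more important jobs more heavily.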
Just recently, Intel released a white paper featuring two case studies on this new performance evaluation method. The first case study evaluates the performance improvement from a change to the core pinning setting. Before the change, two containers running instances of a job are not pinned to processors; instead, their CPU usage is simply limited by the CPU quota assignment.
At run time, the Linux CPU scheduler may move a workload's worker threads across the processor boundary. On current NUMA architectures with the default first-touch NUMA policy, such movement introduces higher memory access latency.
The change applied is to pin each of the two containers to one of the two processors. After applying the change, we speculate that workload performance will improve because reducing remote memory accesses yields a micro-architectural performance gain.
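As a rough, process-level illustration of the pinning idea (the production change was applied through the containers' cpuset settings, and the core layout below is hypothetical), a PID can be restricted to the CPUs of a single socket:

    import os

    # Hypothetical topology: a 2-socket host with 32 logical CPUs,
    # cores 0-15 on socket 0 and cores 16-31 on socket 1.
    SOCKET_CPUS = {0: set(range(0, 16)), 1: set(range(16, 32))}

    def pin_to_socket(pid: int, socket: int) -> None:
        """Bind a PID (threads it spawns later inherit the mask) to one socket's CPUs.

        With the default first-touch NUMA policy, memory that the pinned threads
        allocate afterwards lands on the local node, reducing remote accesses.
        Container runtimes expose the same idea via cpuset (cpuset.cpus / cpuset.mems).
        """
        os.sched_setaffinity(pid, SOCKET_CPUS[socket])

    # Example: pin the current process to socket 0 (assumes the host has these CPUs).
    pin_to_socket(os.getpid(), 0)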
The second case study evaluates the performance impact of replacing Pouch, an open source container engine created by Alibaba, with Kata Containers in a workload colocation context. From the performance perspective, a container created by Pouch is a standard Linux container.
Kata is a secure container runtime based on a lightweight VM mechanism that offers a stronger security context. After replacing Pouch with Kata, we speculated that performance might be impacted by virtualization overheads, such as the translation of system calls under virtualization and an additional layer of page table translation in the guest OS.
As Alibaba's Jianmei Guo said, "The reason we concern ourselves with performance in Alibaba is that we aim at cost reduction while continuously improving performance through constant technological innovation.
"We're happy to work with Intel to support the performance monitoring and analysis challenge in Alibaba SPEED with Intel PRM.