
Maintaining uptime in the data center is no game of checkers

13 Feb 2020

Article by Intel Data Center Management Solutions general manager Jeff Klaus.

While popular notions of artificial intelligence (AI) may once have conjured up images of HAL 9000 from 2001: A Space Odyssey, The Terminator’s Skynet, and Ava from Ex Machina, in reality AI and its subset, machine learning, had more benign origins.

Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed on where to look for them.

Arthur Samuel, one of the pioneers of machine learning, taught a computer program to play checkers, a skill he could not have programmed explicitly. In 1962, Samuel’s machine learning program not only bested him at checkers but ultimately went on to defeat the Connecticut state champion.

AI enters the data center

Today, Gartner estimates that 37% of enterprise organizations are already implementing AI in some form. Associated technologies such as machine learning and deep learning promise to save organizations billions of dollars over the next few decades as financial services, healthcare, oil and gas, and retail companies build data science applications, recommendation engines, large-scale analytics, and other new applications driven by high-performance computing (HPC) environments.

AI and machine learning technologies can also be leveraged to improve the efficiency of IT operations in the data center. In the operation of enterprise and cloud service provider (CSP) data centers, IT equipment is commonly managed and operated in a passive manner.

That is, IT operators can do little or nothing before server, network, and storage equipment failures happen. Afterward, they invariably ask their equipment vendors to repair the devices or take reactive measures themselves, such as bringing up a standby environment or redeploying the business workload.

This method can be adequate for managing small-scale server clusters. However, today’s large enterprises and CSPs can have thousands of servers, and their IT teams would be under tremendous operational and maintenance pressure if they continued to manage equipment this way.

Enterprises and CSPs require high reliability in data center operations, and public cloud service providers in particular face high availability and stability demands. For them, the inherent risks of this approach could prove damaging to both the balance sheet and the organization’s brand reputation.

A 2019 global survey of enterprise organizations by Statista found that for one out of four companies worldwide, the average cost of server downtime was between $301,000 and $400,000 per hour. With these stakes in play, maintaining uptime in the data center is no game of checkers.

Today, many types of IT equipment provide logs to help diagnose and analyze problems. Data can be collected through out-of-band channels or operating system agents, machine learning algorithms can learn the error patterns in those logs, and the resulting models can be used to identify abnormal behavior.

Furthermore, analyzing equipment operating status makes it possible to produce fine-grained predictions of equipment health. By reviewing the results or trends, the operator or the software system can take action before a failure happens, for example by adjusting the load on a server or migrating it elsewhere.
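To make the idea of spotting a trend before failure concrete, here is a minimal Python sketch. It is an illustration only: a simple statistical baseline stands in for a trained machine learning model, and the function name, the hourly error counts, the 24-hour window, and the threshold factor are all assumptions chosen for the example rather than anything taken from a real product.

```python
from statistics import mean, pstdev


def is_anomalous(hourly_error_counts, window=24, k=3.0):
    """Return True if the latest hourly error count is far above the recent baseline.

    Hypothetical helper: the baseline is the `window` hours before the latest
    reading, and "far above" means more than k standard deviations over the mean.
    """
    if len(hourly_error_counts) <= window:
        return False  # not enough history to form a baseline

    baseline = hourly_error_counts[-window - 1:-1]
    mu = mean(baseline)
    # Floor the deviation at 1.0 so a perfectly flat baseline does not
    # flag a change of a single error as an anomaly.
    sigma = max(pstdev(baseline), 1.0)
    return hourly_error_counts[-1] > mu + k * sigma


if __name__ == "__main__":
    steady = [2] * 24 + [3]    # small fluctuation
    spiking = [2] * 24 + [40]  # sudden burst of logged errors
    print(is_anomalous(steady))
    print(is_anomalous(spiking))
```

A monitoring system built along these lines would run such a check per server and treat a positive result as a cue to shift or migrate load, leaving the final decision to the operator or to a separate policy.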

Machine learning in service of uptime

Memory failures are among the top three hardware failures that occur in data centers today. Using machine learning to analyze real-time memory health data makes it possible to predict such failures ahead of time, which ultimately translates to a better experience for end users of the application.

Intel Memory Failure Prediction (MFP) is an AI-based technology that improves memory reliability through predictions based on analysis of micro-level memory failure logs. It’s an ideal solution for enterprise businesses and CSPs that rely heavily on server hardware reliability, availability and serviceability. Intel MFP helps to significantly reduce memory failure events by analyzing data and then predicting catastrophic events before they happen.

Intel MFP uses machine learning to analyze server memory errors down to the Dual Inline Memory Module (DIMM), bank, column, row, and cell levels to generate a memory health score, which can be used to predict potential failures. By analyzing memory errors and predicting potential memory failures before they happen, Intel MFP can help improve decisions about which DIMMs to discard and which to purchase.
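As a rough illustration of how error records at the DIMM, bank, row, and column levels could be rolled up into a per-DIMM score, consider the Python sketch below. It is not Intel MFP’s algorithm, which the article does not describe: the `CorrectedError` record, the `health_scores` function, and the penalty weights are all hypothetical, and a hand-written heuristic takes the place of a learned model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectedError:
    """Hypothetical log record locating one corrected memory error."""
    dimm: str
    bank: int
    row: int
    column: int


def health_scores(errors):
    """Return a 0-100 score per DIMM that appears in the error log.

    Assumed heuristic: errors confined to one cell cost little, while
    errors spread across many rows or banks cost more, since spread
    suggests a wider fault. DIMMs with no logged errors are not listed.
    """
    faulty_cells = defaultdict(set)
    for e in errors:
        faulty_cells[e.dimm].add((e.bank, e.row, e.column))

    scores = {}
    for dimm, cells in faulty_cells.items():
        rows = {(bank, row) for bank, row, _ in cells}
        banks = {bank for bank, _, _ in cells}
        # Illustrative weights only: 2 per cell, 5 per row, 10 per bank.
        penalty = 2 * len(cells) + 5 * len(rows) + 10 * len(banks)
        scores[dimm] = max(0, 100 - penalty)
    return scores


if __name__ == "__main__":
    log = [
        CorrectedError("DIMM_A", 0, 10, 3),
        CorrectedError("DIMM_A", 0, 10, 3),  # same cell again
        CorrectedError("DIMM_B", 0, 1, 1),
        CorrectedError("DIMM_B", 0, 2, 1),
        CorrectedError("DIMM_B", 1, 5, 7),
    ]
    print(health_scores(log))
```

In this toy log, the DIMM whose errors repeat in a single cell keeps a higher score than the one whose errors are scattered across several rows and two banks.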

Additionally, Intel MFP allows data center staff to migrate workloads before catastrophic memory failures happen, use page offlining policies to isolate unreliable memory cells or pages, or replace failing DIMMs before they reach a terminal stage, thus reducing downtime by responding appropriately before server failure occurs.
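The three responses described above (migrating workloads, offlining pages, and replacing DIMMs) can be pictured as a small decision policy. The Python sketch below is a hypothetical example: the `choose_actions` function, the action names, and the score thresholds of 40 and 70 are assumptions made for illustration, not settings documented for Intel MFP.

```python
def choose_actions(health_score, faulty_pages):
    """Map a 0-100 memory health score to an ordered list of responses.

    Hypothetical policy: the thresholds are illustrative and would need
    tuning against real failure data in any actual deployment.
    """
    if health_score < 40:
        # Failure looks likely: move work away first, then swap the module.
        return ["migrate_workloads", "replace_dimm"]
    if health_score < 70:
        # Degrading but not critical: protect the workload.
        return ["migrate_workloads"]
    if faulty_pages > 0:
        # Healthy overall, with a few bad pages worth isolating.
        return ["offline_pages"]
    return ["monitor"]


if __name__ == "__main__":
    for score, pages in [(25, 4), (55, 1), (90, 2), (98, 0)]:
        print(score, pages, choose_actions(score, pages))
```

Ordering the actions from least to most disruptive keeps the server in service for as long as the predicted risk allows.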

Recently, a Beijing-based company integrated Intel MFP into its existing data center management solution to monitor the health of its servers’ memory modules. The company’s online platform and applications connect consumers with local businesses for everything from food delivery and hotel bookings to health and fitness products and services.

The initial test deployment of Intel MFP indicated that if the company deployed the solution across its full server network, server crashes caused by hardware failures could be reduced by up to 40 percent, delivering a better experience for hundreds of millions of its customers and local vendors.

