Hardware failures are all too common in large-scale data centers and cloud service infrastructure, and these failures can cause service level agreement (SLA) violations and severe loss of revenue.
Memory failures are among the most critical hardware failures that occur in data centers today, notorious for severely impacting system reliability, availability, and serviceability (RAS). These failures can be caused by a wide range of factors beyond normal use, including manufacturing defects and extreme environmental or operating conditions.
While commonly accepted techniques such as error correcting code (ECC) and predictive failure analysis (PFA) based on correctable-error thresholds help mitigate some correctable errors in dual inline memory modules (DIMMs), they have cost, reliability, coverage, and performance implications.
A burst of correctable errors can degrade server performance and even cause denial of service. Furthermore, ECC and threshold-based PFA cannot overcome uncorrectable errors, the catastrophic failures that typically result in crashes.
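To make the threshold-based approach concrete, here is a minimal sketch of a sliding-window correctable-error counter for a single DIMM. The class name, the threshold of 10 errors, and the 24-hour window are illustrative assumptions, not values used by any particular vendor's PFA implementation.

```python
from collections import deque


class CorrectableErrorMonitor:
    """Sliding-window correctable-error counter for one DIMM (illustrative only)."""

    def __init__(self, threshold: int = 10, window_seconds: float = 24 * 3600):
        # Both defaults are hypothetical placeholders, not vendor values.
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_error(self, timestamp: float) -> bool:
        """Record one correctable error and return True if the DIMM should be flagged.

        Timestamps are assumed to arrive in non-decreasing order.
        """
        self._timestamps.append(timestamp)
        # Drop errors that have aged out of the observation window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

A counter like this only reacts once correctable errors have already accumulated, and it says nothing about uncorrectable errors, which is the coverage gap described above.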
Predicting memory failures before they occur has become critical for today's data centers. Intel Memory Failure Prediction (Intel MFP) is designed for organizations that rely heavily on server reliability, availability, and serviceability: it analyzes historical memory error data to predict potential catastrophic failure events before they happen.
The solution features several innovative and original capabilities. It predicts micro-level failures in rows, columns and cells based on historical data, using a low-overhead online learning method to improve its prediction accuracy and avoid interfering with critical compute tasks.
This also enables Intel MFP to generate an estimated memory health score for proactive memory failure management, so that users can take action accordingly. Intel MFP is vendor-agnostic and works in conjunction with other data center management solutions, including Intel Data Center Manager (Intel DCM).
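As a rough illustration of what row-, column-, and cell-level analysis can mean, the sketch below groups correctable-error locations by bank and labels the spatial pattern seen in each one. This is not the Intel MFP model, which is described above as an online learning method; it only shows the kind of spatial grouping that could plausibly feed a predictor. The ErrorLocation fields and the label names are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class ErrorLocation(NamedTuple):
    """Hypothetical decoded location of one correctable error."""
    bank: int
    row: int
    column: int


def _label(cells: Set[Tuple[int, int]]) -> str:
    """Label the pattern formed by the distinct (row, column) pairs in one bank."""
    rows = {row for row, _ in cells}
    columns = {column for _, column in cells}
    if len(cells) == 1:
        return "cell"       # every error hit the same cell
    if len(rows) == 1:
        return "row"        # several columns along one row
    if len(columns) == 1:
        return "column"     # several rows along one column
    return "scattered"      # no single row or column explains the errors


def classify_errors(errors: Iterable[ErrorLocation]) -> Dict[int, str]:
    """Return a pattern label for every bank that reported at least one error."""
    by_bank: Dict[int, Set[Tuple[int, int]]] = defaultdict(set)
    for error in errors:
        by_bank[error.bank].add((error.row, error.column))
    return {bank: _label(cells) for bank, cells in by_bank.items()}
```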
Reduce memory failure-related server crashes by 40%
In a case study with Tencent, initial collaborative testing of the Intel MFP algorithm showed quick results with a five-fold reduction in memory failure and system downtime. Tencent also extended this support by having the operating system avoid failing memory until the affected memory module was replaced.
In a similar case study with Meituan, the company saw a 40% reduction in server crashes caused by memory errors. The company monitored the health of the memory modules in its servers by integrating Intel MFP into its existing data center management solution. By analyzing data that was previously collected by its data center management software, it was able to generate prediction scores for each DRAM module, and then take appropriate action to maintain its SLAs and maximize service uptime.
Armed with this new capability, Intel set out to extend the support to the rest of the industry and worked with AMI, a global leader in powering, managing and securing the world's connected digital infrastructure through its BIOS, BMC and security solutions.
Because capturing and analyzing memory errors requires close cooperation between the UEFI and BMC firmware, AMI worked to make Intel MFP easy to adopt into existing and future server platforms.
As errors are captured, they are recorded by the BIOS, and metadata about each error is passed to the BMC firmware. The BMC firmware runs this metadata through the Intel MFP engine to calculate a health score for the memory module. As new errors are detected, the AMI solution tracks the health score of each memory module and exposes the result for analysis by system administrators.
AMI's default implementation provides the current memory module health score information in the Web UI for the BMC and exposes the same memory health score information via RESTful APIs following DMTF Redfish standards.
The RESTful APIs allow for easy integration with existing data center management software. However, for those data centers less inclined to integrate with their own software, AMI offers a data management tool called AMI Composer, developed to be fully compliant with the Intel Rack Scale Design and DMTF Redfish standards, which will aggregate all information and provide it through a single web-based dashboard.
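For teams integrating the Redfish API directly, the sketch below shows how a management tool might pull a health score out of a Redfish Memory resource once the JSON has been fetched. The standard fields (Id, Status) follow the DMTF Redfish Memory schema, but the location of the score under Oem and the property name MemoryHealthScore are hypothetical; the actual name is defined by the BMC firmware's Redfish schema and should be taken from its documentation. The sample payload is invented for illustration.

```python
import json
from typing import Any, Optional, Sequence

# Hypothetical location of the score inside the Memory resource (an assumption).
ASSUMED_SCORE_PATH = ("Oem", "Ami", "MemoryHealthScore")

# Invented sample payload; a real one would come from an HTTPS GET to the BMC.
SAMPLE_RESPONSE = """
{
  "@odata.id": "/redfish/v1/Systems/1/Memory/DIMM1",
  "Id": "DIMM1",
  "Status": {"State": "Enabled", "Health": "OK"},
  "Oem": {"Ami": {"MemoryHealthScore": 87}}
}
"""


def extract_health_score(
    memory_resource: dict, path: Sequence[str] = ASSUMED_SCORE_PATH
) -> Optional[float]:
    """Walk the nested keys in `path` and return the score, or None if absent."""
    node: Any = memory_resource
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    # Accept only plain numbers; booleans and strings are treated as missing.
    if isinstance(node, bool) or not isinstance(node, (int, float)):
        return None
    return float(node)


resource = json.loads(SAMPLE_RESPONSE)
score = extract_health_score(resource)
```

Returning None when the property is missing lets the caller tell "no score available" apart from a low score, which matters when some BMCs in the fleet have not been updated.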
Immediate benefits for data centers and cloud service providers
Of course, a machine learning model is never truly complete. The current Intel MFP model supports DDR4 memory modules running on platforms with Intel Xeon Scalable processors, and Intel continues to collect more information about memory errors and failing memory modules to improve its models.
Additionally, when new memory module types are introduced to the industry or improvements to existing technologies are rolled out, Intel MFP will support them.
Most importantly, all such changes will be analyzed for inclusion in the MFP model, and as Intel updates the model, AMI will provide easy-to-implement updates to the technologies it has delivered to industry partners.
For data centers and cloud service providers, the benefits of adding Intel MFP support in Aptio V UEFI Firmware and MegaRAC BMC Firmware are clear and immediate. Data center SLAs are improved. DIMM failure rates are reduced through proactive memory health evaluation and enhanced memory page offlining policies.
Finally, higher DIMM performance and reliability improve workload and virtual machine (VM) migration decision-making, boosting efficiency and flexibility while reducing total cost of ownership.
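One way a data center could turn health scores into the proactive actions mentioned above (page offlining, workload migration, module replacement) is a simple tiered policy, sketched here. The 0 to 100 scale, the assumption that a higher score means a healthier module, and both thresholds are hypothetical; real policies would be tuned to the score semantics documented for the deployed firmware.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    OFFLINE_PAGES = "offline_pages"
    MIGRATE_AND_REPLACE = "migrate_and_replace"


def choose_action(
    health_score: float,
    watch_below: float = 70.0,    # hypothetical threshold
    replace_below: float = 40.0,  # hypothetical threshold
) -> Action:
    """Map a health score to a remediation tier.

    Assumes scores range from 0 to 100 and that higher means healthier.
    """
    if not 0.0 <= health_score <= 100.0:
        raise ValueError(f"health score out of range: {health_score}")
    if health_score < replace_below:
        # Module looks likely to fail: move workloads away and schedule replacement.
        return Action.MIGRATE_AND_REPLACE
    if health_score < watch_below:
        # Module is degrading: retire the affected pages but keep the server in service.
        return Action.OFFLINE_PAGES
    return Action.NONE
```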
Companies looking to take advantage of Intel MFP on systems with AMI Aptio V UEFI BIOS and MegaRAC BMC firmware are advised to ask their system manufacturer to include the AMI with Intel MFP option pack for MegaRAC BMC Firmware and the AMI with Intel Memory Failure Prediction eModule for Aptio UEFI Firmware.