The limitations of an outdated remote data center monitoring system
Data center monitoring services have been around for more than 10 years. Over that time, many of these systems have not been updated to reflect changing data center technologies.
As a result, the lives of systems administrators have become more complex and maintaining data center uptime has become more of a challenge.
When compared to the systems of 10 years ago, modern data center power and cooling infrastructure has become more intelligent.
With more built-in data points, these systems produce, on average, 300% more alarm notifications than they did in the past. As a result, data center staff have to deal with far more alarm-related "busy work".
The whole point of monitoring data centers is to reduce the risk of downtime by identifying and addressing a state change before an uptime-threatening incident occurs.
This becomes a challenge when alarm fatigue overwhelms the staff, when no unified monitoring platform exists (i.e., individual power and cooling devices have their own native management solution), and when administrators find themselves having to contact various vendor customer support lines for help.
Traditional remote monitoring is not an online service, so it cannot provide real-time monitoring. Instead, these older systems produce intermittent status updates, oftentimes via email. New digital remote monitoring systems are connected to a data center, usually through a gateway.
Therefore, these new systems can employ IT services such as cloud storage and data analytics to help system administrators cope with the vast increase of equipment performance data.
Simplicity = efficiency
New online monitoring systems simplify system administrator work because they employ big data analytics and machine learning techniques.
Big data analytics are supported by software tools that process the monitoring system data so that decisions can be made on which actions to take. Big data analytics are required when data volumes increase, when data becomes unstructured (e.g., varied data such as emails, free-form text fields, or trouble tickets), and when data is processed in real time.
Machine learning is related to data analytics in that it uses data to make predictions. However, it also improves the overall support model by factoring in results from previous learning. That means the monitoring system gets smarter over time.
These tools also streamline how data center operators manage systems uptime. In the case of a data center remote monitoring service, event processing and prioritization of alarms can be managed much more efficiently. Network Operations Center (NOC) experts can notify and guide systems operators during an event that triggers multiple alarms. Alarm consolidation can convert multiple alarms from the same device into a single incident.
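As a rough illustration of alarm consolidation, the Python sketch below folds alarms from the same device into one incident when they arrive close together in time. It is only a minimal sketch under stated assumptions: the `Alarm` and `Incident` types, the device names, and the five-minute window are hypothetical and are not taken from any particular monitoring product.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    device_id: str    # hypothetical identifier, e.g. "ups-01"
    timestamp: float  # seconds since an arbitrary reference point
    message: str


@dataclass
class Incident:
    device_id: str
    first_seen: float
    last_seen: float
    messages: list = field(default_factory=list)


def consolidate(alarms, window_seconds=300.0):
    """Group alarms from the same device into incidents.

    An alarm joins the device's most recent incident if it arrives within
    window_seconds of that incident's latest alarm; otherwise it starts a
    new incident. The window length is an assumed, tunable value.
    """
    incidents = []
    latest = {}  # device_id -> most recent Incident for that device
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        current = latest.get(alarm.device_id)
        if current is not None and alarm.timestamp - current.last_seen <= window_seconds:
            # Same device, close in time: fold into the existing incident.
            current.last_seen = alarm.timestamp
            current.messages.append(alarm.message)
        else:
            # First alarm for this device, or the previous incident is stale.
            current = Incident(alarm.device_id, alarm.timestamp,
                               alarm.timestamp, [alarm.message])
            latest[alarm.device_id] = current
            incidents.append(current)
    return incidents


alarms = [
    Alarm("ups-01", 0, "on battery"),
    Alarm("ups-01", 30, "low battery"),
    Alarm("crac-02", 40, "high temperature"),
    Alarm("ups-01", 1000, "on battery"),
]
incidents = consolidate(alarms)
```

With this sample input, the four alarms collapse into three incidents: the first two `ups-01` alarms share one incident, while the later `ups-01` alarm falls outside the window and opens a new one.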
Since so many data center operators now use mobile devices as a common interface into systems, automatic trouble ticket generation can be provided through a mobile app which can track incidents via live chats and instant messages. Contextual alarms can provide administrators with useful information like the origin of the problem (e.g. data center X, data hall Y, rack 15C), who's involved, the number of alarms generated, and what to check first.
Event correlation and root cause analysis can also be performed: multiple alarms are evaluated, possible causes are deduced, and possible solutions are proposed. This correlation process, performed by domain experts in a NOC, can be combined with machine learning so that future downtime incidents can be avoided.
Data centers are on a path to become more reliable and efficient through the use of digital remote monitoring. However, this can only happen with platforms that interpret and leverage the data generated by the physical infrastructure in a data center.
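One simple way to picture event correlation is a rule that looks for a shared upstream unit behind several alarming devices. The Python sketch below is an illustration only, not how a NOC or any specific platform performs root cause analysis: the `UPSTREAM` topology map, the device names, and the `min_children` threshold are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical topology: each device maps to the upstream unit that feeds it.
UPSTREAM = {
    "rack-15C-pdu": "ups-01",
    "rack-15D-pdu": "ups-01",
    "crac-02": "chiller-01",
}


def probable_root_causes(alarming_devices, upstream=UPSTREAM, min_children=2):
    """Map each suspected root device to the sorted list of affected devices.

    If at least min_children alarming devices share the same upstream unit,
    that unit is flagged as the probable common cause. Otherwise each
    alarming device is treated as its own root.
    """
    by_parent = defaultdict(list)
    for device in alarming_devices:
        # Devices with no known upstream unit are grouped under themselves.
        by_parent[upstream.get(device, device)].append(device)

    causes = {}
    for parent, children in by_parent.items():
        if len(children) >= min_children:
            causes[parent] = sorted(children)
        else:
            for child in children:
                causes[child] = [child]
    return causes


result = probable_root_causes(["rack-15C-pdu", "rack-15D-pdu", "crac-02"])
```

Here the two rack PDUs point to the same assumed UPS, so the UPS is suggested as the first thing to check, while the lone cooling alarm is left as its own incident. A production system would likely weigh timing, alarm type, and past incidents as well.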
Article by Victor Avelar, Schneider Electric Data Center Blog