Top 6 data center infrastructure management best practices
“Data center infrastructure management (DCIM) tools monitor, measure, manage and/or control data center utilization and energy consumption of all IT-related equipment (such as servers, storage and network switches) and facility infrastructure components (such as power distribution units [PDUs] and computer room air conditioners [CRACs]).”
—Gartner IT Glossary
The meaning of DCIM remains consistent, but the definition of infrastructure is evolving.
Historically, it referred to on-premises hardware. With increasing reliance on the cloud, the bounds of traditional infrastructure are expanding. But no matter how you scope it, infrastructure management covers the full array of management practices, including:
- Knowing what assets you have
- Determining what the measured values mean (What is the baseline? What is good? What is anomalous? What is bad?)
- Ensuring uptime
These practices are also reflected in the Discover, Monitor, Support, Optimize (DMSO) model, which essentially boils down to monitoring, triage, remediation, and optimization once the contents of the technology infrastructure have been determined.
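To make the baseline question concrete, here is a minimal sketch of one way to label a reading as good, anomalous, or bad. It assumes a simple mean and standard deviation baseline; the function name, the thresholds, and the sample temperatures are hypothetical, and a real DCIM tool may use a very different method.

```python
from statistics import mean, stdev


def classify_reading(history, value, warn_sigma=2.0, bad_sigma=3.0):
    """Label a reading relative to a baseline built from past readings.

    Assumption: the baseline is the mean of `history`, and distance from
    it is measured in sample standard deviations. Both thresholds are
    illustrative defaults, not industry standards.
    """
    baseline = mean(history)
    spread = stdev(history)  # needs at least two historical readings
    if spread == 0:
        # Perfectly flat history: any departure from it is suspicious.
        return "good" if value == baseline else "bad"
    deviation = abs(value - baseline) / spread
    if deviation >= bad_sigma:
        return "bad"
    if deviation >= warn_sigma:
        return "anomalous"
    return "good"


# Hypothetical rack inlet temperatures in degrees Celsius.
inlet_temps_c = [22.0, 22.5, 21.8, 22.2, 22.4, 21.9, 22.1]

for reading in (22.3, 22.7, 25.0):
    print(reading, classify_reading(inlet_temps_c, reading))
```

The point of the sketch is only that "good" and "bad" have to be defined against a recorded baseline before monitoring can say anything useful.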
Connecting DCIM with organizational business goals
Here's the challenge with DCIM: Most organizations don't think of it when they're thinking of business goals. Instead, it's more likely organizations consider DCIM to be a necessary evil that costs money and is possibly not aligned with business goals at all.
More specifically, many organizations choose to use a third-party provider for DCIM because it's not a business differentiator. Organizations will not perform better because they have a robust infrastructure management practice — but if they have a weak one, they will absolutely perform worse.
Furthermore, data centers are costly to operate, from HVAC and power to space (even with consolidation in power). Their inventory is complex to maintain, often resulting in accumulations of haphazard documentation rather than accurate representations of reality. Plus, the best and brightest in corporate IT usually aren't flocking to be on the patching team, making proper maintenance challenging.
Maintaining data center uptime and health
DCIM isn't just about keeping documentation up to date all the time; it's also about monitoring what the documentation represents. Anyone who operates a data center can confirm that most catastrophic events trace back to things turning out to be different than expected. It's difficult to remedy a problem when you're uncertain how things are configured.
Two concepts are foundational here: downtime and loss of uptime.
- Downtime is caused by failing to maintain data center infrastructure management properly.
- Loss of uptime is either planned or unplanned. When planned, the loss of uptime can be controlled by scheduling it when it's most acceptable instead of experiencing an outage during a critical period of operations.
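As a rough illustration of tracking planned and unplanned loss of uptime separately, the sketch below totals outage minutes by type and derives an availability percentage for a reporting period. The `Outage` record, the field names, and the sample figures are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float   # how long the service was unavailable
    planned: bool    # True for scheduled maintenance windows


def availability_report(outages, period_minutes):
    """Summarize lost uptime over a reporting period of `period_minutes`."""
    planned = sum(o.minutes for o in outages if o.planned)
    unplanned = sum(o.minutes for o in outages if not o.planned)
    up_minutes = period_minutes - planned - unplanned
    return {
        "planned_minutes": planned,
        "unplanned_minutes": unplanned,
        "availability_pct": 100.0 * up_minutes / period_minutes,
    }


# Hypothetical 30-day month: one scheduled window and one surprise outage.
month_minutes = 30 * 24 * 60
report = availability_report(
    [Outage(minutes=120, planned=True), Outage(minutes=45, planned=False)],
    month_minutes,
)
print(report)
```

Reporting the two totals side by side shows how much of the lost uptime was scheduled at an acceptable time and how much arrived as a surprise.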
Top 6 DCIM best practices that are essential to a successful operation
DCIM is a complex, multifaceted process. While it's not usually considered exciting, it is necessary, and there is a right way to do it. To prevent damaging outages, consider the following six best practices for keeping up with a DCIM routine.
1. Know what assets you have.
It is imperative to have complete documentation of all data center assets. If you don't know what you have, you cannot manage it. Don't try to oversimplify with a spreadsheet; doing so will actually make your practice more difficult because it will demand a near-religious level of upkeep and diligence.
2. Determine whether you have the resources to keep eyes on your data center infrastructure
Start by asking: can you instrument all of that? Do you have the internal resources to watch over your assets? How do you measure whether it's available and performant?
Most organizations, big and small, need a partner to execute on DCIM because even well-equipped companies simply don't have the time to dedicate skilled resources to this activity.
It's best to make the investment and bring in an expert to execute a physical and logical audit. Whatever you do, your unmanaged inventory will not go away, and if left unaddressed, it will grow more complex and less manageable.
3. Always work off the same system of record
This applies to everyone involved. Different political entities within the company cannot use different systems. Everyone needs a single source of truth because data center assets are all interrelated; thus, the infrastructure management record must reflect the contents of the ITSM environment precisely. Make sure that change control processes use that system of record as well.
Pro tip: Records include physical records. It might sound extreme, but take a picture of every rack and ensure it stays the way it was intended.
4. Make sure that you have ongoing monitoring in place
If there is an unplanned change, find out why. Unplanned changes will happen, but it's vital to detect them and adapt your processes to correct them before they result in outages.
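One simple way to picture this kind of monitoring is a periodic comparison between the system of record and what is actually observed in the data center. The sketch below is an assumption-laden illustration: the asset IDs, field names, and the `find_drift` helper are all hypothetical, and a real deployment would pull both sides from its own inventory and discovery tools.

```python
def find_drift(recorded, observed):
    """Compare the system of record with observed state.

    Both arguments map an asset ID to a dict of attributes. Returns a
    list of (asset_id, what, recorded_value, observed_value) tuples.
    """
    drift = []
    for asset_id in sorted(set(recorded) | set(observed)):
        if asset_id not in observed:
            # Documented, but not found during discovery.
            drift.append((asset_id, "missing", recorded[asset_id], None))
        elif asset_id not in recorded:
            # Found during discovery, but never documented.
            drift.append((asset_id, "undocumented", None, observed[asset_id]))
        else:
            fields = set(recorded[asset_id]) | set(observed[asset_id])
            for field in sorted(fields):
                want = recorded[asset_id].get(field)
                have = observed[asset_id].get(field)
                if want != have:
                    drift.append((asset_id, field, want, have))
    return drift


# Hypothetical records keyed by rack position.
recorded = {
    "rack12-u04": {"hostname": "db01", "firmware": "2.1"},
    "rack12-u06": {"hostname": "app03", "firmware": "1.8"},
}
observed = {
    "rack12-u04": {"hostname": "db01", "firmware": "2.3"},
    "rack12-u08": {"hostname": "unknown", "firmware": "1.0"},
}

for item in find_drift(recorded, observed):
    print(item)
```

Each reported difference is a prompt to ask why the change happened: either the record is stale, or someone changed the environment outside of change control.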
5. Understand what you're working for
It's not uncommon for a systems engineer to craft a beautiful plan with triple redundancies and all sorts of great features when the organization does not actually require that level of redundancy. The point is, it's important to understand what you're actually aiming for in DCIM.
6. Patch your servers.
That's it. That's the best practice.