Equinix: Creating a distributed data repository at the edge

Thu, 16th Nov 2017

Historically, growing volumes of large-scale datasets (petabytes and above) have been concentrated in centralized corporate data centers, which can create data concentration risk.

However, the complexities of these environments and the interactions between data, people, systems, applications, analytics and clouds that lie outside of these data centers can have a detrimental impact on performance and quality.

What's going on with all this data?

First, data is growing exponentially, from gigabytes to terabytes to petabytes, and is now approaching zettabytes. Proportionally, more data is being generated each year than already existed in storage.

Second, data is no longer centralized.

Its creation and processing have shifted to the edge, where distributed data growth and consumption occur.

This increasing data gravity at the edge is attracting associated applications and services, such as analytics tools drawing real-time insights from multiple, distributed data sources.

The physics of latency, along with ever-increasing bandwidth costs, make backhauling all this data to a centralized data center, far from these applications and services, unsustainable.
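To put the backhaul problem in rough numbers, here is a back-of-the-envelope sketch. The 10 Gbps link and the one-petabyte dataset are illustrative assumptions, not figures from this article, and the calculation ignores latency, protocol overhead and contention, all of which make the real result worse.

```python
def transfer_days(size_bytes: float, link_bits_per_sec: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move size_bytes over a link running at the given
    usable fraction (efficiency) of its line rate."""
    seconds = (size_bytes * 8) / (link_bits_per_sec * efficiency)
    return seconds / 86_400  # seconds per day

PETABYTE = 10**15        # decimal petabyte, in bytes
TEN_GBPS = 10 * 10**9    # hypothetical dedicated 10 Gbps WAN link

print(f"{transfer_days(PETABYTE, TEN_GBPS):.1f} days")  # 9.3 days
```

Even with a dedicated link running flat out, a single petabyte takes more than a week to move, which is why processing data close to where it is created becomes attractive.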

Finally, businesses that were once product-driven are becoming increasingly data-driven, as data becomes more valuable to businesses, customers and competitors, and a more desirable target for bad actors.

As more and more data is created, accumulated, updated and accessed at the edge, proper tools must be available that can manage data segmentation logically and geographically to prevent unmanageable data sprawl at the edge.

What's impeding optimal data access and management at the edge?

A number of obstacles prevent digital businesses from fully leveraging their data capabilities, including:

Centralized data architectures address localized data requirements at a medium scale (by today's standards). They were not designed for large-scale, geographically distributed data access. This demand is well beyond current IT infrastructure and network design limitations.
Data sovereignty and privacy regulations force the geographical localization of some data types. These same concerns also make storing data in the cloud a controversial, if not impossible, alternative.
Corporate policy and industry compliance requirements can also limit what can be done with data.
As businesses and applications struggle to shift to the edge, the data management and security capabilities needed to control access are fragmented or, worse, absent.
New data is continuously being generated at the edge. It either creates more sprawl or has to be backhauled, which brings bandwidth and latency problems, ongoing capacity and quality of experience (QoE) issues, and inherent risks.

There has to be a better solution

In our last article in this series, Data is Pouring Over the Edge, we described using a digital edge node, as part of an Interconnection Oriented Architecture (IOA) strategy, to act as an interconnection hub, tailored for local or shared data services at specific geographic locations.

An IOA framework leverages distributed digital edge nodes to create distributed data repositories that can be controlled, improving performance and security in multiple edge locations, while optimizing wide area networking.

Using distributed storage platforms that apply erasure coding to deploy a single namespace data service (optimized for high availability and data protection) in all edge node locations, you can make your data immutable, essentially shielding it against human error and data corruption, and dramatically increase availability.
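To illustrate the idea behind erasure coding, here is a minimal sketch using a single XOR parity fragment. It is a teaching example only: production platforms typically use stronger codes (for example Reed-Solomon style codes) that survive several simultaneous losses, and the function names below are hypothetical.

```python
from functools import reduce
from typing import List, Optional

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal fragments (zero-padded) plus one XOR parity fragment."""
    frag_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(frag_len * k, b"\0")
    fragments = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    return fragments + [reduce(xor_bytes, fragments)]

def decode(fragments: List[Optional[bytes]], original_len: int) -> bytes:
    """Rebuild the data even if one fragment (marked None) has been lost."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one lost fragment")
    repaired = list(fragments)
    if missing:
        survivors = [f for f in fragments if f is not None]
        # XOR of all surviving fragments equals the missing one.
        repaired[missing[0]] = reduce(xor_bytes, survivors)
    return b"".join(repaired[:-1])[:original_len]  # drop parity and padding

message = b"edge data repository"
stored = encode(message, k=3)   # 3 data fragments + 1 parity, one per node
stored[1] = None                # simulate losing one node
print(decode(stored, len(message)))
```

Each fragment would live on a different data node, so losing any one node still leaves enough information to rebuild the original.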

The data service can use policy-based controls to address logical and geographical data segmentation, and can support cloud providers' large-scale, multi-tenant data services (e.g., private cloud storage, object storage).
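As a rough sketch of what a policy-based placement control might look like, the example below filters data nodes by region and by whether cloud storage is permitted. The class names, node names and regions are all hypothetical; a real data service would expose its own policy model.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

@dataclass(frozen=True)
class DataNode:
    name: str
    region: str       # where the node physically sits
    in_cloud: bool    # True if backed by a cloud provider

@dataclass(frozen=True)
class PlacementPolicy:
    allowed_regions: FrozenSet[str]  # geographical segmentation
    allow_cloud: bool                # whether cloud-hosted nodes may hold the data

def eligible_nodes(policy: PlacementPolicy, nodes: List[DataNode]) -> List[DataNode]:
    """Return only the nodes on which this policy allows data to be stored."""
    return [
        n for n in nodes
        if n.region in policy.allowed_regions and (policy.allow_cloud or not n.in_cloud)
    ]

# Hypothetical inventory of data nodes.
NODES = [
    DataNode("edge-fra", "EU", in_cloud=False),
    DataNode("edge-ams", "EU", in_cloud=False),
    DataNode("cloud-eu", "EU", in_cloud=True),
    DataNode("edge-nyc", "US", in_cloud=False),
]

# Example policy: regulated data must stay in the EU and off public cloud.
EU_REGULATED = PlacementPolicy(frozenset({"EU"}), allow_cloud=False)

print([n.name for n in eligible_nodes(EU_REGULATED, NODES)])
```

The same check can run whenever a new node joins the namespace, so sovereignty rules are enforced by the service rather than by convention.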

You can geographically place data nodes (Data Hubs) in each edge node, as well as in cloud environments.

Built-in algorithms interpret policies and store the actual data in a way that protects it from device, location or even regional failures and breaches, without losing data or access capabilities.

This strategy offers far more protection than a data “copy” approach and uses much less storage. Data services can also be optimized for integration, supporting multiple interfaces (e.g., web, APIs, file system) as part of the data abstraction layer across multiple underlying technologies realized with the IOA framework (see diagram below).
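The storage saving can be shown with simple arithmetic. The sketch below compares raw storage per byte of user data for full copies and for an erasure-coded layout. The 12+4 split is an assumed example, and the failure tolerance quoted in the comments holds for codes in which any 12 of the 16 fragments are enough to rebuild the data (as with Reed-Solomon style codes).

```python
def copy_overhead(copies: int) -> float:
    """Raw bytes stored per byte of user data when keeping full copies."""
    return float(copies)

def erasure_overhead(data_fragments: int, parity_fragments: int) -> float:
    """Raw bytes stored per byte of user data with erasure coding."""
    return (data_fragments + parity_fragments) / data_fragments

# Three full copies survive the loss of any two copies.
print(copy_overhead(3))                    # 3.0
# 12 data + 4 parity fragments survive the loss of any four fragments.
print(round(erasure_overhead(12, 4), 2))   # 1.33
```

Under these assumptions the erasure-coded layout tolerates more failures while storing less than half as much raw data.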

A Distributed Data Repository Infrastructure with Global Namespace

To create a distributed data repository at the edge, follow these steps:

Design the geographic placement of data nodes. Even though implementations can vary, the best practice is to have a minimum of four locations (e.g., cloud environment, number of virtual machines). For optimal protection, deploy 16 nodes (four locations each with four nodes).
Place the first data node at an edge location and start the namespace data service. As more data nodes are added, your private data cloud/namespace will expand in size and the service will update itself.
Integrate the service with boundary control, inspection zone(s), policy enforcement and application API management.
Apply event processing and monitoring.
Establish logical capacity buckets and access groups with protection and placement policies.
Continue to add more data nodes to scale namespace capacity. Data nodes deployed in the cloud can be backed by cloud-provided storage (e.g., AWS S3), and control policies will determine cloud use.
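The 16-node layout in the first step can be sanity-checked with a short sketch. It assumes one fragment per node and an erasure code in which any K fragments rebuild the data; K = 12 is an illustrative choice, not a value given in this article.

```python
LOCATIONS = 4
NODES_PER_LOCATION = 4
K = 12  # assumed number of fragments needed to rebuild the data

def layout() -> dict:
    """Map each fragment index to the location that holds it (one fragment per node)."""
    total = LOCATIONS * NODES_PER_LOCATION
    return {frag: frag // NODES_PER_LOCATION for frag in range(total)}

def survives(failed_locations: set, k: int = K) -> bool:
    """True if enough fragments remain outside the failed locations."""
    remaining = sum(1 for loc in layout().values() if loc not in failed_locations)
    return remaining >= k

print(survives({0}))            # True: 12 fragments remain
print(survives({0, 1}))         # False: only 8 remain
print(survives({0, 1}, k=8))    # True: a lower k tolerates two lost locations
```

Lowering K buys tolerance for more simultaneous location failures at the cost of more raw storage, which is the trade-off the protection policies in step five would express.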

The benefits of a distributed data repository at the edge

A distributed data service with centralized management addresses the following constraints:

It's designed to handle zettabyte-scale data sizes.
Its data accessibility significantly reduces the volume of data moved and copied over long distances and centrally stored, which reduces WAN bandwidth and storage costs.

It also enables the following benefits:

All data can be automatically encrypted at the data layer and identity keys can also be dispersed (not stored in any single place).
WAN bandwidth and storage costs are reduced.
Greater security and policy management integration ensures compliance consistency in all regions.
Any non-latency-sensitive data requirements can be addressed using the data repository as primary storage and/or an archive. An eventual consistency model can also be applied to achieve high availability that informally guarantees that, if no new updates are made to a given data item, all accesses to that item will eventually return the last updated value.
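The eventual consistency guarantee described above can be illustrated with a small last-writer-wins sketch. This is one common way to achieve convergence, not necessarily the mechanism any particular platform uses; the Replica class and its logical timestamps are hypothetical, and ties between equal timestamps are ignored here for brevity.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    value: object = None
    timestamp: int = 0  # logical timestamp of the last accepted write

    def write(self, value: object, timestamp: int) -> None:
        """Accept a write only if it is newer than what this replica already holds."""
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp

    def sync_from(self, other: "Replica") -> None:
        """Pull the other replica's state; the newer write wins."""
        self.write(other.value, other.timestamp)

# Two edge locations receive different updates to the same item.
a, b = Replica(), Replica()
a.write("v1", timestamp=1)
b.write("v2", timestamp=2)
print(a.value, b.value)   # v1 v2 (replica a is temporarily stale)

# Once updates stop and the replicas exchange state, both converge.
a.sync_from(b)
b.sync_from(a)
print(a.value, b.value)   # v2 v2
```

Reads from replica a may return a stale value until the sync runs, which is why this model suits data that is not latency-sensitive.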

Example uses of a distributed data repository include a shared drive, package distribution for applications and containers, a logging repository and a staging area for analytics.

Article by Olu Rowaiye, Equinix blog network