How to stop data lakes from getting swamped

30 Apr 2018

A “data lake” sure sounds inviting.

Cool flows of structured and unstructured data, all streaming into a vast repository, where companies are free to fish out awesome new insights all day long.

But without the right approach, that data lake isn’t as welcoming as it looks on the surface.

The sheer volume of data, for instance, can easily overwhelm companies that aren’t discerning about what is filling the lake, and why.

The weight of all this data can also clog things up unless companies are committed to using the latest technology to integrate it and process it for maximum insight.

The data also needs to be secure, and fast and easy to access, so companies can get value from it while ensuring it isn’t misused or compromised.

In short, it doesn’t take much for a data lake to start looking like a data swamp: a stagnant, murky place, where when you stick in a net, you can’t be sure what will come up.

Avoiding data swamps is a must to truly capitalize on increasing volumes of data and generate new business intelligence that propels growth.

Fortunately, there are ways to keep data lakes dynamic, pristine and viable business assets.

Save the lakes 

The rise of data lakes is the result of the sheer amount of information available today.

Technologies like the Internet of Things (IoT), with its billions of sensors worldwide, stream out data that’s never been collected before. That promises the discovery of insights that just a few years ago weren’t knowable, and the monetization of data flows we didn’t imagine existed.

Today, for instance, agriculture companies can crunch centuries of crop data to better predict weather patterns and yields.

Transportation firms can turn to big data to optimize traffic routes by combining past and current records about vehicle speeds, weather, road conditions and fuel consumption. It’s exciting, but this kind of information must live somewhere where it’s useful, accessible and safe.

Data lakes that can’t offer those things are a waste of money and a lost opportunity to capitalize on today’s unbelievably rich data resources. Here are a few quick tips for companies looking to avoid data swamps:

  • Be selective

Information overload isn’t a new problem, but it takes on new dimensions for data lakes in an age when Cisco says global big data volumes are soaring toward 402 exabytes (1 exabyte = 1 billion gigabytes) by 2021, an eight-fold increase from 2016.

In the face of all that information, companies need to resist the temptation to over-collect data just because it’s available.

Companies need to know exactly what business problem they are trying to address and precisely what they hope to achieve with the data they’re gathering.

This can help them avoid filling data lakes with volumes of information that do nothing but bury them in the muck and prevent them from taking advantage of what their data offers.
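To put that Cisco figure in perspective, a quick back-of-the-envelope sketch (purely illustrative, derived only from the two numbers cited above) shows what an eight-fold jump in five years means in annual terms:

```python
# Sanity-check the growth figures cited above: 402 exabytes by 2021,
# an eight-fold increase from 2016. (Illustrative arithmetic only.)
volume_2021_eb = 402
growth_factor = 8

volume_2016_eb = volume_2021_eb / growth_factor   # implied 2016 volume, ~50 EB
years = 2021 - 2016
cagr = growth_factor ** (1 / years) - 1           # compound annual growth rate

print(f"Implied 2016 volume: {volume_2016_eb:.1f} EB")
print(f"Implied annual growth: {cagr:.0%}")
```

An eight-fold increase over five years works out to data volumes growing by roughly half again every year, which is why "collect everything" quickly stops being a viable strategy.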

  • Automate

To truly make sense of the data filling their data lakes, companies need to take advantage of emerging technologies like artificial intelligence (AI) and machine learning that can help them sort, analyze and learn from the data with superhuman efficiency.

These capabilities help companies spot patterns, form hypotheses and find value in their data lakes that might otherwise go unnoticed.

Companies are increasingly learning this. In NewVantage Partners’ annual executive survey, 76.5% of executives indicate that the proliferation and greater availability of data is empowering AI and cognitive initiatives in their organizations.

“The survey results make clear that executives now see a direct correlation between big data capabilities and AI initiatives,” according to the MIT Sloan Management Review.

In short, more automation means fewer data swamps.

  • Keep it close

Distance matters because it delays many of the functions that prevent data lakes from devolving into data swamps.

The further away data lakes are from where data is created, accessed and analyzed, the greater the chance that latency will slow analytics engines or the processes that drive AI, such as interconnection between cloud apps, data sources and users.

Creating data lakes in proximity to where data is stored, produced or needed by users and applications maximizes security and optimizes the functions powered by the data the lakes contain, which keeps the lakes fresh and productive.
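The physics behind "distance matters" can be sketched with a simple lower-bound estimate. The figures below assume a signal speed in optical fiber of roughly 200,000 km/s (about two-thirds the speed of light); real-world latency is higher still, once routing, switching and queuing are added:

```python
# Illustrative sketch: a physical lower bound on round-trip latency.
# Assumes ~200,000 km/s signal speed in fiber (an approximation);
# actual network latency is always higher than this floor.
FIBER_KM_PER_MS = 200.0  # 200,000 km/s -> 200 km per millisecond

def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber for a given distance."""
    return 2 * distance_km / FIBER_KM_PER_MS

for km in (50, 1000, 8000):  # metro, cross-country, intercontinental
    print(f"{km:>5} km -> at least {min_round_trip_ms(km):.1f} ms round trip")
```

Because analytics workloads often make many round trips per query, even a few milliseconds of per-request floor multiplies quickly, which is the case for keeping the lake close to its sources and consumers.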

Data lakes thrive here

A global interconnection platform is a place where a data lake can thrive.

It provides the proximity to various sources, data stores, analytics, and cloud and network partners that’s so crucial to keeping data lakes healthy.

Platform Equinix spans 48 markets on five continents, so companies can create data lakes close to almost anywhere.

The network- and cloud-density on Platform Equinix (1,700+ networks, 2,900+ cloud and IT service providers) is also a huge benefit because it enables interconnection to the cloud and network services needed to fully exploit a company’s data assets.

In addition, Equinix Data Hub is a solution deployed on Platform Equinix that’s designed to enable companies worldwide to store vast amounts of data at a local level, for quick access by the people and applications that need it.

That’s a data swamp preventative if there ever was one.

Article by Jim Poole, Equinix Blog Network
