
Building a Smart Data Lake While Avoiding the ‘Dump’


A data lake needs to be fed and governed properly before analytics can discover kernels of insight.

Written by Joel Hans
Jan 2, 2017

Data lakes are all the rage right now, and adoption will only grow in 2017, but they should be much more than a dumping ground for unmodeled, unverified data of all types. Companies need to approach them strategically, with a solid understanding of current best practices, to keep management overhead at a minimum and give their analytics tools the best shot at extracting meaningful insights.

In a recent webinar from TDWI and Pentaho, Philip Russom, the senior research director of data management at TDWI, said, “You can’t just plan your lake as a data repository. You also need to plan the toolage around it.”

The data lake, in brief

Data lakes are a response to companies collecting more data than ever before, then demanding that technical teams extract new insights from it. Data is persisted in its raw state, which lets the lake absorb large volumes of diverse data, ingest it quickly, and leave analysts plenty of opportunities to attack it with new technology.

Most data lakes are built using Hadoop, an open-source framework. Hadoop isn’t necessarily required, but it is where most companies are headed. Russom praises Hadoop’s benefits, such as the ability to manage multi-structured and unstructured data, and a relatively small cost compared to relational databases like MySQL. Russom says, “Hadoop is not just storage. Equally important is that it’s a powerful processing platform for a wide range of analytics, both set-based and algorithmic.”

Companies are using data lakes for analytics, reporting, marketing, sales, and more. Best of all, a data lake helps companies get business value from both old and new data.


The lake is hungry—how to feed it right

If you simply launch a Hadoop-powered data lake and throw everything into it, without some smart management of the incoming data, you're going to end up with a "toxic dump," according to Chuck Yarbrough, the senior director of solutions marketing and management at Pentaho, who also presented during the webinar.

The challenge is that incoming data varies in volume, diversity, and type, and may or may not carry metadata. It's a lot to think about, but a disciplined ingestion process is essential if you want a variety of users to actually take advantage of the lake.

“You need a mechanism that can be controlled, governed as a process, to deliver exactly what you need into the data lake. And not just dump information in there,” Yarbrough says.

Russom says companies should be relying on data integration tools and infrastructure to make that controlled, governed process possible. That includes metadata management, and strong integration with any data warehouses that already exist within an enterprise. He also suggests developing metadata as data is ingested and doing more on-the-fly data modeling. Essentially, it's less about abandoning older data management techniques than about refining them for the data lake's particular nuances.
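The webinar doesn't prescribe specific tooling, but the idea of developing metadata at ingestion time can be sketched in a few lines. The following is a minimal illustration, not a real pipeline; the envelope fields (`lake_id`, `source`, `ingested_at`) are hypothetical names chosen for the example:

```python
import json
import time
import uuid


def ingest(raw_record: dict, source: str) -> dict:
    """Wrap a raw record in ingestion metadata while leaving the payload untouched."""
    return {
        "lake_id": str(uuid.uuid4()),   # stable key for later lookup
        "source": source,               # provenance: which system produced it
        "ingested_at": time.time(),     # when it entered the lake
        "payload": raw_record,          # the raw, unmodeled data itself
    }


envelope = ingest({"user": "alice", "event": "click"}, source="web-clickstream")
print(json.dumps(envelope["payload"]))  # the raw data survives unchanged
```

The point of the design is that governance metadata travels with every record while the payload stays raw, so analysts can still attack the original data later.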


Data can’t just be left to fester

Once data is ingested into the data lake, it can’t just be left there, even though data lakes most often preserve data in its original, detailed state. Russom says that unlike data warehousing, data lakes turn the data modeling practice “on its head.” He says, “Instead of preparing the heck out of data before it goes into the database, we’re looking at getting the data in there in its raw state, and then that’s when we make another pass for modeling, data quality, and a number of other practices.”
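What Russom describes is often called schema-on-read: the model is applied when the data is consumed, not when it lands. As a small sketch under assumed data shapes (the row fields here are invented for illustration):

```python
from datetime import datetime

# Rows land in the lake exactly as they arrived: strings, mixed precision, no model.
raw_rows = [
    {"ts": "2017-01-02T10:00:00", "amount": "19.99"},
    {"ts": "2017-01-02T10:05:00", "amount": "5"},
]


def read_with_model(rows):
    """Apply types and basic quality checks at read time, not at ingest time."""
    for row in rows:
        yield {
            "ts": datetime.fromisoformat(row["ts"]),
            "amount": float(row["amount"]),
        }


clean = list(read_with_model(raw_rows))
```

Because the raw rows are never modified, a different team can make "another pass" with a different model over the same data.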

Yarbrough agrees, and insists that companies need to ensure governance, be that for dealing with sensitive customer information, or for having technical standards that make the entire data lake more productive. “If you have data integration technology today, it probably has some ability to do big data. What you need to be sure is that it’s going to be able to support the use cases you have.”
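Neither speaker details how to handle sensitive customer information, but one common governance tactic is to mask sensitive fields before they ever reach the lake. The sketch below is a simplified assumption of that approach; the policy list and salt are hypothetical, and a production system would manage the secret properly:

```python
import hashlib

SENSITIVE_FIELDS = {"email", "ssn"}  # hypothetical governance policy


def mask(record: dict, salt: str = "per-lake-secret") -> dict:
    """Hash sensitive fields before they land in the lake; pass the rest through raw."""
    masked = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            masked[key] = hashlib.sha256((salt + str(value)).encode()).hexdigest()
        else:
            masked[key] = value
    return masked


row = mask({"email": "alice@example.com", "premium": 42.0})
```

Hashing rather than deleting keeps the field joinable across records, so analytics on the lake still work without exposing the underlying value.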


Getting ready to extract insights

Many companies are already taking advantage of data lakes, but Yarbrough highlighted IMS, an insurance company collecting telematics/IoT data from its customers to offer better premiums. The company already handles 1.6 billion data points per day and can create detailed reports on the fly from a number of different data sources. It isn't using Hadoop right now, but plans to migrate in the near future.

Yarbrough says that perhaps the most useful element about their data lake is that they’re able to use different technologies to analyze it without having to re-architect their solution. The well-managed data lake is, in many ways, a future-proofed strategy.

For companies that want a foundation for new, exciting analytics in the coming year, there's probably no better place to look than the well-managed data lake. It's shaping up to be the hot topic in big data.



Joel Hans

Joel Hans is a copywriter and technical content creator for open source, B2B, and SaaS companies at Commit Copy, bringing experience in infrastructure monitoring, time-series databases, blockchain, streaming analytics, and more. Find him on Twitter @joelhans.
