4 Principles for Enabling a Pristine Data Lake

What are the challenges you need to tackle to keep your data lake from becoming an unwieldy “data swamp”?

When data lakes were first introduced in the 2000s, it was with the promise of providing enterprises with a simple way to store their growing data volumes and various data formats for analytical purposes. They were designed to be an easier method for storing structured and unstructured data at scale – an alternative to traditional enterprise data warehouses that were becoming too expensive and failing to meet the rapidly changing needs of analytics consumers.

Because data warehouse reports were typically time-consuming to create and distribute, much of the data’s useful half-life had passed by the time they arrived. Data lakes emerged as a flexible, quick, and cost-effective alternative. Not only did they allow all types of data – relational, semi-structured, unstructured, or binary – to live in one place, but they could also store raw copies of source system data for analytics, reporting, and visualization. Enterprises that wanted to leverage the power of cloud, mobile, and Internet of Things (IoT) applications could power them with real-time data delivery from the data lake.

See also: 6 guiding principles to an effective data lake

You may recall the cries of “The data warehouse is dead!” as enterprises started their shift to data lakes. But while the technology offered great flexibility, IT teams faced new challenges with data lakes, including a lack of data management, governance, and skills, causing many to fail. These data lakes – often referred to as v1.0 – are giving way to updated versions that leverage the cloud and eliminate many of the challenges through automation and built-in services.

Challenges to Building the Right Data Lake

As enterprises rolled out these first-generation data lakes, three main challenges emerged that soured some technologists on the concept as the lakes devolved into unwieldy “data swamps.”

#1: Complex technology and lack of best practices

As with the launch of any new technology concept or platform, this was new ground, with no playbook, standards, or successful examples to imitate. For data lakes, there seemed to be an infinite array of technology components and “expert” advice to choose from. Each component needed to implement a Hadoop-based data lake – whether for data ingestion, processing, governance, or storage – required specialized expertise and ran on competing architectures. IT administrators were left scratching their heads trying to determine whether they should deploy Apache HDFS, Apache HBase, streaming applications, or something else entirely.

And even once those initial decisions were made, managers had to rethink everything as their data lakes grew, deciding which of the constant stream of new technologies was the right one for their organization and its future goals. IT admins faced – and still face – constant pressure to keep all the data in the lake fresh and timely, because executives depend on it for critical business analytics.

#2: Low data trust

With enterprises storing every bit of business data, data lakes became unwieldy, and a lack of metadata left the important questions unanswered: Where did the data come from? When was it moved? How was it transformed, enriched, or modified? Without that detail, confidence in the data faltered.

These first-generation data lakes and their ingestion processes failed to account for data lineage and metadata, which are needed to guide governance, inform data provenance, and ensure compliance. So, while data was delivered faster, trust in that data fell drastically.

#3: Market changes

These first-generation data lakes were initially deployed on-premises as part of a single-vendor stack. Over time, public cloud providers began to offer managed versions of Hadoop and other data services – Amazon EMR, Microsoft Azure HDInsight, and Google Cloud Dataproc among them – that were often more cost-efficient thanks to pay-as-you-go models. Enterprises saw the cloud as a way to take better advantage of data lake platforms: it offered greater security, availability, and elasticity, along with faster deployment times and costs tied to actual use.

Yet with the move to the cloud, IT was left struggling to meld its on-premises data lakes with those in the cloud while still providing the freshest, most accurate data to analysts, executives, and data scientists.

Building the Next-Generation Data Lake – Pristinely

It is these challenges that caused technologists to rethink the way data lakes should be implemented. The unexpected complexity, rising costs, and confusion became counterproductive.

But all is not lost! The underlying promise of data lakes still holds true: they are a great platform for handling massive data stores and enabling real-time analytics. To make them as great as enterprises expect, however, IT needs to rethink its approach and apply the painful lessons learned by early practitioners. Instead of manual coding and time-intensive data rework, teams need to leverage the latest technology to employ automation and real-time data streaming.

By following these four key principles, IT leaders can make the right choices and take advantage of the right data at the right time.

#1: Align initial deployment with key business priorities

Successful data lake implementations typically start with an identified set of business priorities that are not being met by traditional solutions. IT must understand what business teams want to get out of the data – what is essential for managing business operations – and provision the right data sets for those specific needs.

Outlining specific use cases, holding use case discovery workshops, or creating value-definition frameworks can help to bridge this knowledge gap. When these two groups become aligned, the biggest barrier to a productive data lake is removed.

#2: Address the skills gap

From Pig to Scala to Spark, an ever-growing set of technologies and languages surrounds the data lake landscape. Data pipeline automation is essential to overcoming the specialized skills otherwise required for data ingestion, integration, transformation, and refinement for analytics. This technology delivers data directly into the data lake or analytics platforms and, when used, can increase IT productivity by eliminating repetitive, manual ingestion and data-set provisioning tasks. As a result, IT has more time to work directly with the analytics team and stay aligned with business needs.
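To make the idea concrete, here is a minimal, illustrative sketch of config-driven ingestion in Python. The source database, table names, and landing-zone path are all hypothetical; the point is that one generic routine, driven by configuration rather than code, replaces hand-coded, per-table jobs.

```python
"""Illustrative sketch: config-driven ingestion into a landing zone.
All names (database, tables, paths) are hypothetical."""
import csv
import sqlite3
from pathlib import Path

SOURCE_DB = "operational.db"                    # hypothetical source system
LANDING_ZONE = Path("datalake/landing")         # hypothetical lake location
TABLES = ["orders", "customers", "shipments"]   # driven by config, not by code

def ingest_table(conn: sqlite3.Connection, table: str) -> None:
    """Dump one source table into the landing zone as a CSV file."""
    out_dir = LANDING_ZONE / table
    out_dir.mkdir(parents=True, exist_ok=True)
    cursor = conn.execute(f"SELECT * FROM {table}")
    headers = [col[0] for col in cursor.description]
    with open(out_dir / f"{table}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(cursor)

if __name__ == "__main__":
    with sqlite3.connect(SOURCE_DB) as conn:
        # The same routine handles every table: adding a new source means
        # editing the TABLES list, not writing another one-off script.
        for table in TABLES:
            ingest_table(conn, table)
```

Commercial pipeline-automation tools go much further – handling change data capture, schema drift, and scheduling – but the principle is the same: configuration over repetitive hand coding.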

#3: Operationalize processes

To avoid the dreaded “data swamp,” where data is simply dumped and left to languish, IT can put structure around how data flows into the data lake.

Establishing a multi-zone data pipeline methodology creates analytics-ready data in the lake, accelerates time to insight, provides greater security, and enables data compliance and governance. Forward-thinking enterprises opt for three main zones in an automated pipeline (a minimal sketch follows the list):

  • Landing zone – where raw, continuously changing data is ingested into the data lake from different sources using change data capture (CDC) streams
  • Assembling zone – where data is standardized, merged, and repartitioned into a consistent form, ready for further refinement
  • Provisioning zone – where data subsets are made available in a variety of formats – from data marts to operational and historical data stores – giving data analysts and scientists access to curated data sets for better analytical insights
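
The sketch below illustrates the three zones with plain Python over local folders. The zone paths, standardization rules, and curated subset are assumptions chosen for brevity; in practice each zone would typically live in cloud object storage and the transformations would run on a distributed engine.

```python
"""Illustrative sketch of a three-zone pipeline: landing -> assembling -> provisioning.
Paths, column names, and transformation rules are hypothetical."""
import csv
from pathlib import Path

LAKE = Path("datalake")
LANDING = LAKE / "landing"            # raw, continuously changing data lands here
ASSEMBLING = LAKE / "assembling"      # standardized, merged, consistent form
PROVISIONING = LAKE / "provisioning"  # curated subsets for analysts and scientists

def write_csv(path: Path, rows: list[dict]) -> None:
    """Write a list of row dicts to a CSV file, creating parent folders as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    if not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

def assemble(table: str) -> list[dict]:
    """Merge all raw landed files for a table and standardize them into one consistent form."""
    merged = []
    for raw_file in sorted((LANDING / table).glob("*.csv")):
        with open(raw_file, newline="") as f:
            for row in csv.DictReader(f):
                # Standardization: lower-case column names, trim whitespace.
                merged.append({k.lower().strip(): v.strip() for k, v in row.items()})
    write_csv(ASSEMBLING / table / f"{table}.csv", merged)
    return merged

def provision(name: str, rows: list[dict], columns: list[str]) -> None:
    """Publish a curated subset (a small 'data mart') to the provisioning zone."""
    subset = [{c: row.get(c, "") for c in columns} for row in rows]
    write_csv(PROVISIONING / name / f"{name}.csv", subset)

if __name__ == "__main__":
    orders = assemble("orders")  # landing -> assembling
    provision("orders_mart", orders, ["order_id", "customer_id", "amount"])  # -> provisioning
```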

#4: Trust through metadata

Data trust was a challenge with first-generation data lakes. In version 2.0, metadata is a top priority. From data ingestion to transformation and provisioning, metadata is captured and stored at every step. Data sets are cataloged along with their complete lineage, so every user can find analytics-ready data and have confidence in it as they make business-critical decisions.

Automating metadata management and keeping complete audit logs also help with compliance, as they allow IT to better understand how the data is used, by whom, and for what purpose.
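
As a hedged illustration of what “metadata at every step” can mean, the sketch below records a lineage entry for each pipeline step in a simple JSON-lines catalog. The field names and catalog location are assumptions; dedicated catalog and lineage tools capture the same information with far richer search and governance features.

```python
"""Illustrative sketch: capturing lineage metadata at every pipeline step.
The catalog path and entry fields are hypothetical."""
import json
from datetime import datetime, timezone
from pathlib import Path

CATALOG = Path("datalake/catalog/lineage.jsonl")  # hypothetical catalog location

def record_step(dataset: str, source: str, target: str,
                transformation: str, actor: str) -> None:
    """Append one lineage entry: where the data came from, when it moved,
    how it was transformed, and who moved it."""
    CATALOG.parent.mkdir(parents=True, exist_ok=True)
    entry = {
        "dataset": dataset,
        "source": source,
        "target": target,
        "transformation": transformation,
        "actor": actor,  # who ran the step, for audit and compliance
        "moved_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(CATALOG, "a") as f:
        f.write(json.dumps(entry) + "\n")

if __name__ == "__main__":
    # Example: a step that moves curated order data into the provisioning zone.
    record_step(
        dataset="orders",
        source="datalake/assembling/orders",
        target="datalake/provisioning/orders_mart",
        transformation="selected order_id, customer_id, amount",
        actor="pipeline@nightly-batch",
    )
```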

Fulfilling the Data Lake’s Promise

IT professionals always have to assess and understand how the data landscape is changing. Those who adhere to the four principles above will find it far easier to build an architecture – whether on-premises, in the cloud, or via a hybrid model – that meets the business’s historical and real-time analytics requirements.

These four steps help organizations avoid the challenges previously associated with data lakes, making them an attractive complement or modern substitute for traditional data warehouses.

About Dan Potter

A 20-year marketing veteran, Dan Potter is VP Product Management and Marketing at Attunity. In this role, Dan is responsible for product roadmap management, marketing, and go-to-market strategies. He previously held roles at Datawatch, where he was CMO, and at IBM, where he led the go-to-market strategy for IBM’s personal and workgroup analytics products. Dan has also held senior roles at Oracle and Progress Software, where he was responsible for identifying and launching solutions across a variety of emerging markets, including cloud computing, real-time data streaming, federated data, and e-commerce.
