The technology world is full of innovations that take useful aspects of two separate technologies and create a whole new category of products. Clock radios, fax machines, and smartphones stand as popular combinations that changed the lives of many.
A “data lakehouse” has been pitched as one of the newest examples of this type of innovation.
Backers describe it as a cross between a big, hard-to-access data lake and a costly, limited-functionality data warehouse. They say that data lakehouses combine the best features of data lakes and data warehouses: the flexibility and relatively low cost of a data lake, coupled with the ease of access and support for enterprise analytics capabilities found in data warehouses.
It’s a reasonable argument based on the needs in the marketplace and the shortcomings displayed in the age of unstructured (or semi-structured) data. But are data lakehouses really poised to become the market drivers proponents say they will? Or are they just another passing fad that’s making noise today but will be replaced by a new, more targeted innovation tomorrow?
The answer will impact the strategies of large numbers of enterprises looking for solutions to manage data in a variety of formats, including those that could potentially be analyzed by artificial intelligence (AI) and machine learning (ML) tools, such as text, images, video, and audio.
It’s a bird! It’s a plane! It’s …
Today’s rapidly expanding data landscape is served not only by data lakes and data warehouses but also by data hubs and analytics hubs, whose functionality is generally absent from data warehouses and data lakes. What are all of these mechanisms? And how do they relate to each other?
Let’s start with a data lake. A data lake is the upstream location into which all of the organization’s data flows. Data lives there in its raw state – structured or unstructured, in image files, PDFs, databases, and other formats. Data lakes can typically ingest and manage almost any type of data. As exemplified by Hadoop (historically the most popular type of data lake) and, more recently, by object stores such as S3, ADLS, and Google Cloud Storage, they provide tools for enriching, querying, and analyzing the data they hold.
Data lakes have historically been used to explore new ways of mining, combining, and analyzing data that was thrown out or not used as part of day-to-day business processes. In other words, they have been applied either to operational data that is no longer in service or to data that may be considered for operational use in the future but is currently still in exploratory mode.
A data warehouse tends to support long-standing datasets that represent the fundamental, core data that runs the business: customer records, supply chain bills of materials, and so forth. Most of this data is highly structured, though it increasingly has semi-structured elements, and it is built incrementally over time from multiple downstream data source silos. Changes to how the data is used can be time-consuming – not because of the data itself but because of the intricacies of how, where, and by whom it’s being used. New datasets – possibly after exploratory phases of work in the data lake – are made available for more regular, routine analytics in the data warehouse, provided it can accommodate the size and structure of that data.
Data warehouses are increasingly incorporating data streams and advanced analytics on both historical batch and real-time data streams. In general, data warehouses also differ from data lakes in that they require some sort of data hub technology to prepare the data for ingestion.
But how do hubs come into play? A data hub is a gateway through which virtual or physical data can be merged, transformed, and enriched for passage to another destination. That destination might be an application or a database or some other kind of repository (such as a data lake or data warehouse) either for use by applications as a part of their ongoing business/operational process or by an analytics platform as a feedback loop on the process – automated or human decision support, exception handling, etc.
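The gateway role described above can be sketched in a few lines of Python. This is a minimal illustration only, not any vendor's implementation: the function names (merge, transform, enrich, run_hub), the customer_id key, and the sample records are all hypothetical. Note that the hub holds no data of its own; it hands each record on to a destination.

```python
from typing import Callable, Iterable

Record = dict  # hypothetical record shape: a plain dictionary


def merge(*sources: Iterable[Record]) -> list[Record]:
    """Combine records from several siloed sources into one list."""
    merged: list[Record] = []
    for source in sources:
        merged.extend(dict(r) for r in source)
    return merged


def transform(records: list[Record], key: str = "customer_id") -> list[Record]:
    """Normalize the shared key to a trimmed string so silos line up."""
    return [{**r, key: str(r[key]).strip()} for r in records]


def enrich(records: list[Record], lookup: dict, key: str = "customer_id") -> list[Record]:
    """Attach a region from a reference table; default to 'unknown'."""
    return [{**r, "region": lookup.get(r[key], "unknown")} for r in records]


def run_hub(sources, lookup: dict, destination: Callable[[Record], None]) -> int:
    """Merge, transform, and enrich, then pass each record downstream."""
    records = enrich(transform(merge(*sources)), lookup)
    for record in records:
        destination(record)  # e.g. a lake, a warehouse, or an application
    return len(records)


# Two hypothetical silos with inconsistent key formats.
crm = [{"customer_id": 1, "name": "Acme"}]
billing = [{"customer_id": " 2", "amount": 40.0}]
regions = {"1": "EMEA"}

delivered: list[Record] = []  # stands in for the destination repository
count = run_hub([crm, billing], regions, delivered.append)
```

In practice the destination callable would write to a repository or message queue; a list is used here only to keep the sketch self-contained.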
Historically, data hubs have been used by IT data integration specialists who work with separate, siloed groups from across the enterprise to create integrations where none naturally existed. These integration specialists support the development and ongoing management of a data model and of the data pipeline that moves data through the data hub. A data hub is complementary to both data warehouses and data lakes, as it essentially supports the stateful use of data against the data model at each stage of the data pipeline.
An analytics hub, on the other hand, serves a different need in the market. Cloud vendors are starting to recognize the problem of getting data to analysts and data scientists, as well as to the citizen or power-user analogs of these two roles – namely, getting fresh and complete datasets to the key stakeholders responsible for answering the organization’s business questions.
Analytics hubs, like data hubs, operate more like point-to-point switches than actual hubs, and, as a result, there’s not a lot written about them. There is no persistence of data at the point of unification, and analytics hubs depend on an external cloud data warehouse or data lake – or any source repository with standard interfacing – to store input data. There is no effort to curate data in stored views across multiple projects and users or for long-term use; instead, the focus is on the execution of a series of analytics projects by non-IT super users (analysts, data scientists, and line-of-business users with BI and reporting tool skillsets). Consequently, analytics hubs focus on simple drop-down menus, avoid coding for access to data, and allow for self-service, particularly for pick-up files that are largely under the control of the end user anyway.
Without the ability to easily pull data from multiple sources and tie together composite elements of that data for presentation to analytics tools, you don’t really have an analytics hub, chiefly because you don’t have a data hub.
Today’s data management solutions have fallen short
While all four solutions serve certain functions, none fully meets the needs of business end users without significant support from IT. Put another way, the single shared virtual repository of data that data lakes, data warehouses, and data hubs promised – accessible by a multi-disciplinary set of users and tied to their analytics tools of choice – doesn’t exist. A true analytics hub – either on-prem or in the cloud – has yet to be realized.
While data lakehouses solve some issues, they are not a universal remedy. They really are the next generation of data lakes, incorporating some features and functionality found in data warehouses but with an eye toward data science. Data science is becoming increasingly important, and the role of data scientists is one of exploration: pushing the boundaries of which questions are asked, which data is used, and how it will be incorporated into operational analytics, process automation, and orchestration. But as data use increases, so does the need for the functionality found in a data hub and a data warehouse.
The real solution – particularly given the rapid cycle of continuous development and continuous improvement – is to blend elements from all of these technologies into one.
Next-generation innovation: the rise of the data analytics hub
Instead of just a data hub usable only by IT or an analytics hub usable by non-IT users but lacking comprehensive data integration, what’s really needed is a data analytics hub that can be used by a broad array of IT and business users across multiple datasets. A data analytics hub draws elements from the four technologies: data hubs, analytics hubs, data lakes, and data warehouses.
Like a data hub, a data analytics hub provides connectivity to disparate data sources in both batch and streaming modes. However, unlike a data hub, a data analytics hub provides persistence in a cloud repository. Further, it provides curation for a diverse set of disparate data types that may be ingested in both batch and streaming modes with self-service, low-to-zero-code options through drop-down menus for non-IT users.
Like a data lake, a data analytics hub’s cloud storage repository can handle all data types and leverage industry standards for data movement and analysis (a la Kafka and Spark). However, unlike today’s typical data lake, a data analytics hub also provides structure and support for end-user-facing BI and advanced analytics workloads through the use of SQL (more in the manner that a cloud data warehouse does). In essence, it’s a bi-directional hub, supporting multiple inputs and outputs, solving for all permutations of input data and output tools used by a diverse set of non-IT users.
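To make the combination of batch and streaming ingestion, persistence, and SQL access more concrete, here is a small Python sketch. It rests on loud assumptions: an in-memory SQLite database stands in for the cloud repository, a generator stands in for a streaming feed such as a Kafka topic, and the sales table, its columns, and all function names are hypothetical.

```python
import sqlite3
from typing import Iterable, Iterator, Tuple

Row = Tuple[str, float]  # hypothetical shape: (region, amount)


def open_repository() -> sqlite3.Connection:
    """Create the persistent store (in-memory SQLite as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (mode TEXT, region TEXT, amount REAL)")
    return conn


def ingest_batch(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Load a whole batch in one transaction."""
    conn.executemany("INSERT INTO sales VALUES ('batch', ?, ?)", rows)
    conn.commit()


def ingest_stream(conn: sqlite3.Connection, events: Iterator[Row]) -> None:
    """Persist events one at a time as they arrive."""
    for region, amount in events:
        conn.execute("INSERT INTO sales VALUES ('stream', ?, ?)", (region, amount))
        conn.commit()


def event_feed() -> Iterator[Row]:
    """Hypothetical stand-in for a streaming source."""
    yield ("EMEA", 5.0)
    yield ("APAC", 7.5)


repo = open_repository()
ingest_batch(repo, [("EMEA", 10.0), ("APAC", 20.0)])
ingest_stream(repo, event_feed())

# A BI tool would issue SQL like this against the persisted data.
totals = repo.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

The point of the sketch is only that both ingestion paths land in the same persisted table, so a single SQL query serves downstream tools regardless of how the data arrived.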
Indeed, a data analytics hub provides downstream – meaning in the direction of the end user – support for the most popular BI, reporting, visualization, and advanced analytics tools. However, unlike today’s data hubs, data lakes, and data warehouses, a data analytics hub provides user-friendly self-service tools that enable non-technical users to link any data source to any end-user tool – without the need for IT intervention, on either a one-off or day-to-day basis.
In short, a data analytics hub gives organizations the ability to store, manage, and analyze data on a holistic basis.
It combines the critical data collection and analytical features of these well-known solutions but exposes all those features in ways that key business users can access easily and incorporate into programs and processes. It’s a platform that takes the optimal operational and analytical elements of data hubs, lakes, and warehouses and combines them with cloud features and functionality to directly address the real-time operational and self-serve needs of business users rather than exclusively IT users.