From Data Warehouse to Data Mesh: Usable Data is Still Key

A data mesh flips the script on centralization and having a monolithic data structure by decentralizing data management to the various business domains across the enterprise.

Using data is hard (clearly illustrated in the annual Big Data and AI Executive Survey and other similar studies). Businesses know this to be true and have spent the last three decades trying to make it easier, eagerly gravitating towards the next “platform du jour” that promises to enable greater data access and analytic insights. First came enterprise data warehouses (EDWs), then the various warehouses and lakes designed for the cloud, and now data mesh is all the rage.

Time and again, these approaches have produced competing schools of thought, each with its own branded frontrunners and vocal industry advocates, on how businesses should manage and act on their data. Separating the hype from what should actually be adopted can prove tiring. And we can expect to keep cycling through emerging solutions until organizations tackle a fundamental and overlooked challenge within their data stacks: data usability.

The centralized school of thought

The traditional EDW introduced the idea of integrating structured data into a single place that would make it easy to access for Business Intelligence (BI) reporting. The data would be highly curated, meaning organizations would only populate their EDWs with data deemed necessary for specific BI reports. While this helped to conserve resources and costs, it also meant dropping other valuable, related data that could provide deeper, beneficial insights.

Looking to aggregate and leverage even more of their data, companies migrated the concept of the EDW to the cloud. Largely, these businesses saw and wanted to emulate the successes of the digital-native FAANGs of the world, which leapfrogged the competition by using comprehensive data in the cloud to guide business decisions and hyper-personalize products and services for customers. But cloud-based EDWs were still limited to structured data, leaving out the vast riches of unstructured data across the modern business. Most organizations, therefore, ended up just replicating their existing on-premise BI reports rather than achieving anything transformational.

In 2010, data lakes emerged as a promising solution, where organizations would consolidate all raw, unstructured, semi-structured, and structured data into one central location, available for use in analytics, predictive modeling, machine learning, and more. However, data lakes were soon likened to “data swamps” because poor setup, governance, and management often turned them into expensive dumping grounds for any and all data. The data would be far from usable, creating mistrust in its quality and in the resulting insights or solutions.

Anyone who has experienced limitations in BI reporting or a data swamp won’t be surprised, then, to learn that a TDWI Research study of 244 companies using a cloud data warehouse or lake found that 76 percent experienced most if not all of the same challenges they had faced in their on-premise environments.

Decentralizing with a data mesh

Originally proposed by Zhamak Dehghani of ThoughtWorks, the data mesh flips the script on centralization and monolithic data structures by decentralizing data management to the various business domains across the enterprise. The goal of a data mesh is for each business domain to treat its data as a product that it can transform, use, and make available to users in other domains.

The thinking is that your business domain experts would know best if the information is up-to-date, accurate, and reliable and can better provision the right data at the right time. In a fully centralized approach, they would be reliant on data teams, who are often limited in available resources and need to juggle numerous competing requests from other business units, which can cause delays. With data mesh, however, there’s no more querying data from a massive data lake, so users can act on the data closer to where it resides, accelerating the time to insights and value. Weaving the mesh together is federated computational governance—core organization-wide standards, rules, and regulations to ensure interoperability among the domain units and data as a product.
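One way to picture the "computational" part of federated governance is as shared rules expressed in code and applied to every domain's data product. The snippet below is a minimal sketch under that assumption. The `DataProduct` class, the `REQUIRED_FIELDS` standard, and the `governance_violations` function are hypothetical names invented for illustration, not part of any data mesh specification or product.

```python
from dataclasses import dataclass, field

# Hypothetical organization-wide standard: every data product must
# document these metadata fields before other domains may consume it.
REQUIRED_FIELDS = {"owner", "description", "schema"}


@dataclass
class DataProduct:
    """A dataset published by one business domain (illustrative only)."""
    name: str
    domain: str
    metadata: dict = field(default_factory=dict)


def governance_violations(product: DataProduct) -> list[str]:
    """Return one message per required metadata field the product lacks."""
    missing = sorted(REQUIRED_FIELDS - set(product.metadata))
    return [f"{product.domain}/{product.name}: missing '{key}'" for key in missing]


# Usage: the sales domain publishes a product with incomplete metadata.
orders = DataProduct(name="orders", domain="sales", metadata={"owner": "sales-team"})
for message in governance_violations(orders):
    print(message)
```

Each domain keeps ownership of its own product, while the same check runs everywhere, which is the balance between autonomy and interoperability that the approach aims for.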

It’s important to note that data mesh is not a single out-of-the-box solution but rather an organizational approach that can involve multiple technologies and may even include a data lake. Since the approach is radically different from what organizations are used to, change management is required, including obtaining buy-in from domain experts who are used to consuming reports rather than doing the data engineering and data science work themselves. Data upskilling within the domain units will therefore be necessary for this decentralized model to succeed.

Data usability is still a prevailing issue

While data mesh may sound fundamentally different from the cloud data warehouses and lakes that have long dominated the industry, all of these approaches present similar challenges that underscore the need for data usability.

The fundamental issue is that data in its raw form is unusable. You have vast bits of data filled with errors, duplicate information, inconsistencies, and various formats all floating around in isolation across disparate systems. With cloud data warehouses and lakes, these bits are typically just moved with their existing problems from their on-premise environments to the cloud—warts and all. In turn, the data is still isolated and siloed, except it’s now all in one place. This is why people end up experiencing the same on-premise challenges in the cloud. These floating bits ultimately need to be ingested, integrated, and enriched to become usable.
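To make the problem concrete, here is a minimal sketch of what turning raw bits into something usable can involve: normalizing inconsistent formats, then removing the duplicates that only become visible after normalization. The record fields (`name`, `email`, `phone`) and the function names are assumptions chosen for illustration. Real pipelines deal with far messier data and many more rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerRecord:
    """A cleaned customer record (hypothetical schema for this sketch)."""
    name: str
    email: str
    phone: str


def normalize(raw: dict) -> CustomerRecord:
    """Bring one raw record into a consistent format."""
    # Collapse repeated whitespace and use consistent capitalization.
    name = " ".join(raw.get("name", "").split()).title()
    # Emails are compared case-insensitively, without stray spaces.
    email = raw.get("email", "").strip().lower()
    # Keep only digits so "(555) 010-1234" and "555-010-1234" match.
    phone = "".join(ch for ch in raw.get("phone", "") if ch.isdigit())
    return CustomerRecord(name, email, phone)


def deduplicate(rows: list[dict]) -> list[CustomerRecord]:
    """Normalize every row and keep the first copy of each distinct record."""
    seen: set[CustomerRecord] = set()
    cleaned: list[CustomerRecord] = []
    for raw in rows:
        record = normalize(raw)
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned


raw_rows = [
    {"name": " jane  DOE", "email": "Jane@Example.com ", "phone": "(555) 010-1234"},
    {"name": "Jane Doe", "email": "jane@example.com", "phone": "555-010-1234"},
]
print(deduplicate(raw_rows))
```

The two raw rows look different but describe the same person, which is exactly the kind of problem that travels unchanged from on-premise systems to the cloud unless someone cleans it up.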

The same transformation needs to happen with a data mesh. The difference is that each business domain, rather than a central data team, becomes responsible for its own data. The decentralized nature of a data mesh may introduce new complexities too. For instance, it can lead to business domains duplicating efforts and resources on the same datasets. Moreover, data products from one business domain can be, and often are, beneficial to other domains. So beyond uncovering relationships between datasets, users must also reconcile entities across the data products of different domains, such as when piecing together data from different systems to form a complete picture of a customer.
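Reconciling entities across domains can be sketched in a few lines, assuming each domain's data product exposes a common identifier. Here that identifier is an email address, matched exactly after light normalization. The domain names, record fields, and the `reconcile` function are hypothetical. Production entity resolution typically needs fuzzy matching and conflict handling that this sketch leaves out.

```python
from collections import defaultdict


def match_key(record: dict) -> str:
    """Derive the key used to decide that two records are the same customer."""
    return record["email"].strip().lower()


def reconcile(*data_products: tuple[str, list[dict]]) -> dict[str, dict[str, dict]]:
    """Group records from several domains under one entry per customer.

    Each argument is a (domain_name, records) pair. If a domain holds
    several records for one customer, the last one wins in this sketch.
    """
    customers: dict[str, dict[str, dict]] = defaultdict(dict)
    for domain_name, records in data_products:
        for record in records:
            customers[match_key(record)][domain_name] = record
    return dict(customers)


sales = ("sales", [{"email": "Jane@Example.com", "last_order": "2023-01-05"}])
support = ("support", [{"email": "jane@example.com ", "open_tickets": 2}])

view = reconcile(sales, support)
print(view["jane@example.com"])
```

The result holds what each domain knows about the same customer side by side, a small-scale version of the complete picture described above.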

We’ve touched on the need to upskill business users within a data mesh. A shift towards more citizen data scientists may be needed, even among businesses that don’t adopt data mesh, simply because of the rampant shortage of data scientists, with the latest estimates indicating a gap of 250,000 between job openings and job searches. The shortage of talent, coupled with the proliferating amount of data across modern businesses, has left few organizations able to use their data effectively at scale.

Establishing a data usability layer

Whether your organization chooses to take a centralized or decentralized approach to enterprise data management, you ultimately need a way to link, integrate and make sense of all the bits of information from across your business. If you don’t have the talent available to do this critical work and the volume of data is overwhelming, then automation is something to consider.

Today, AI can be applied to automate the ingestion, enrichment, and distribution of data from all sources, managing every step needed to attain usable data assets. You go from having fragmented, floating bits of information to having them linked and fused within a metadata layer, or data usability layer, in your data stack—readying the data for use in reporting, analytics, products, and services by any user.

A data usability layer sits alongside any cloud data warehouse, data lake, or data mesh environment. It lets businesses get the most out of whatever strategy they choose by enabling them to understand, use, and monetize every last bit of data at scale.

Koert Kuipers

About Koert Kuipers

Koert Kuipers is Chief Technology Officer of Tresata. He has been with the company since 2011. He possesses skills and expertise in data mining, quantitative development, and actuarial analysis. Kuipers continues to evolve the Tresata Digital Business Platform and help businesses use 100% of their data to personalize products and services, augment human intelligence, and automate the discovery of knowledge.
