Build Better Data Products with Data Reliability
As businesses make greater use of analytics throughout their organizations, care must be taken to ensure the right data is being used at the right time. This has elevated the need for data reliability to new levels. In this video, RTInsights’ Helen Beal discusses the topic with John Morrell, senior director of product marketing at Acceldata.
Helen Beal: Welcome to this RTInsights conversation with Acceldata on data reliability and how to deliver data on time at the highest possible quality. I’m Helen Beal, and today I’m talking with John Morrell, senior director of product marketing at Acceldata. So, John, with data volumes proliferating at such a dizzying rate, what are the characteristics of cloud data platforms that help people access the value in their data and make the better decisions that we all want?
John Morrell: Well, the real value of cloud data platforms is that they provide organizations with a much more agile approach to managing their data, as well as delivering it to analytics and business teams. You can spin up resources on demand with cloud data platforms; you only need to use the resources as they are needed. So, this helps deliver agility.
The result of using a cloud data platform, or even migrating to a cloud data platform, is that you get much more agile data teams and processes. You can deliver new analytics projects much more quickly. And that facilitates greater consumption and use of data within the business. That’s the whole data democracy thing, right? Cloud data platforms give you the ability to facilitate much more data-driven decisions, and the ability to proliferate data throughout the business.
What is data democracy?
Helen Beal: You mentioned data democracy. Could you explain for the audience what that means?
John Morrell: Data democracy is about trying to let more people use data. And it also lets people access more data. Now, it doesn’t have to be data in its rawest form. It can be data that’s been curated for them. The point of data democracy is to make sure that when somebody makes a decision or decides to take a certain action, they’re actually backing it up with data. And that they have all the right data in front of them so that they can move quickly.
What is a data pipeline and why does it matter?
Helen Beal: So, we’ve got all this wonderful data, and then we’ve got all these wonderful people that want the data, how do we get the data from where it is to those people?
John Morrell: Well, that’s typically something called a data pipeline. At its basic level, a data pipeline works its way across what we call a modern data stack. I’m sure you’re going to ask me in a minute what a modern data stack is, right?
But it (a data pipeline) consists of components that take the raw data from all these applications and other sources and try to land it in either a cloud data warehouse or a lake house, you know, one of the many cloud data platforms. And then, the pipeline transforms that data into some form that’s useful for analytics.
I used the word curated just a few minutes ago. Curated data is the data that an analytics person can use and work with, and then stop and say, “Oh, I know how to find the answer.” And then, the pipeline delivers this data to the actual analytics tools or other applications so it can actually be consumed.
Within a data pipeline, data assets come in multiple forms. There’s data at rest, data in motion, as well as data to be consumed. And it all comes from the raw data, from the different data sources, and it’s data that’s moving through pipelines, it’s data in the cloud data warehouse that’s transformed already to the final analytic dataset. So, data assets get proliferated in this process, because they’re going through many different forms within a data pipeline.
What’s the difference between a modern data stack and a traditional data stack?
Helen Beal: Wonderful. And you did mention the modern data stack, so maybe you should define what a traditional data stack looks like?
John Morrell: Right. So, a traditional data stack had to do with data on-premises and focused on the source data. You would have a data integration platform, or ETL platform. It would make certain stops along the way to get that data into data warehouses or databases. And then finally, the data would get delivered down to either a data warehouse or a data mart.
In the more modern cloud world, you would have what are known as data loaders, which take data in its raw form from the applications and deliver it into a cloud data warehouse. Then, once it’s in the cloud data warehouse, it will get transformed with transformation tools. Once it’s transformed, it’s in a consumable form. Then from there, the analytics tools can access it and work with it.
There are also things like orchestration tools that can orchestrate your data pipeline. There are also data observability tools, like Acceldata, that help you monitor and manage everything that’s going on.
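As an illustration of the load-then-transform order described above, here is a minimal Python sketch. It is only an assumption-laden toy: an in-memory SQLite database stands in for the cloud data warehouse, and the table names (`raw_orders`, `curated_sales`), the columns, and the `run_elt` function are hypothetical, not anything named in the conversation.

```python
import sqlite3


def run_elt(raw_rows):
    """Load raw rows untouched, then transform them inside the 'warehouse'."""
    # Assumption: in-memory SQLite stands in for a cloud data warehouse.
    conn = sqlite3.connect(":memory:")

    # Load: land the raw data as-is, with no cleanup on the way in.
    conn.execute("CREATE TABLE raw_orders (region TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", raw_rows)

    # Transform: build the curated, analytics-ready table inside the warehouse.
    conn.execute(
        """
        CREATE TABLE curated_sales AS
        SELECT UPPER(TRIM(region)) AS region,
               SUM(CAST(amount AS REAL)) AS total_amount
        FROM raw_orders
        GROUP BY UPPER(TRIM(region))
        """
    )

    # Consume: this is what an analytics tool would query.
    rows = conn.execute(
        "SELECT region, total_amount FROM curated_sales ORDER BY region"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    raw = [(" emea ", "10.5"), ("EMEA", "4.5"), ("amer", "20")]
    print(run_elt(raw))
```

In a traditional ETL flow, the cleanup in the `SELECT` would instead run before the data reached the warehouse.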
What’s driving the need for data observability tools?
Helen Beal: I do always like to make sure we clarify what acronyms are that we use, and you mentioned ELT, which is extract, load, transform. And I think one of the fundamental differences you’ve described with the old and the new is that we used to ETL, extract, transform, load, and now we ELT, extract, load, transform. Is that the right way round?
John Morrell: Correct. A modern data pipeline is really more of ELT, as you mentioned. And businesses use these pieces of the modern data stack to work their way through that process, with the data hopping between a variety of platforms and tools as it goes through the pipeline.
This is what adds a little bit of complexity to the modern data stack. In modern data pipelines, you’re really now hopping around between more places and more tools. And it’s difficult to get visibility into what’s going on with all of that. There are some tools that come with your cloud data provider or other things. But really, what you need and what people are looking for from data observability tools is visibility across that entire set of components within the modern data stack.
Going to things like multi-cloud, or even a hybrid cloud, further exacerbates this because again, the data is more spread out, it’s in different places, it’s in different clouds, and even on-premises. You need to be able to get a single view of everything across your entire modern data stack.
How data observability platforms work
Helen Beal: You’ve mentioned observability a couple of times. Let’s talk about unknown unknowns and silent failures. What are the causes and the nature of these things? And how can teams prevent, avoid, and remediate these things happening to them?
John Morrell: Silent failures is a great term. In our world, it’s when data problems occur, but nobody knows about them. The data team doesn’t know about it, the data engineers don’t know about it. The only time they find out that something bad happened is when somebody in the business is screaming saying, “Where’s my data?”
The problem could be the result of a number of different issues. It could be things like issues with data quality. It could be issues with the data pipelines because they’re just not executing properly or taking too long to execute. You could have data schema problems, or even what we like to call data drift, which is when data goes beyond the normal bounds of what you’d be expecting. And then, if your data pipelines are having problems, you have got data timeliness or data freshness issues.
Data teams can use tools and platforms, such as data observability ones like ours, to help prevent and avoid these problems because they get this insight into the problems. They can get alerts when problems occur. But then, they can also dig into the data and remediate what’s going on. They can immediately fix the problem.
And so, a data observability platform provides this complete visibility into the assets and the data pipelines. It allows data teams to apply data quality policies at scale across all of their data assets. The data teams get alerts to problems when they happen, so there are no silent failures. They know when problems occur.
A good data observability platform will also give you a deep multilayer set of data. Because it is not enough to just know that a problem occurred. You need to be able to dig into the data and say why the problem occurred and help resolve the problem. If a problem occurred because a data pipeline was taking too long, well, you need to understand why that happened.
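To make the idea of turning a silent failure into an alert concrete, here is a minimal Python sketch of a data freshness check. It is a toy under stated assumptions: the `check_freshness` function, the `last_loaded_at` timestamp, and the `max_age` threshold are hypothetical names for illustration and do not describe how any particular observability platform works.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(last_loaded_at, max_age, now=None):
    """Return an alert message if the data is older than max_age, else None."""
    # Assumption: the caller knows when the dataset was last loaded.
    now = now or datetime.now(timezone.utc)
    age = now - last_loaded_at
    if age > max_age:
        # In a real setup this would go to a pager or chat channel.
        return f"ALERT: data is {age} old, allowed maximum is {max_age}"
    return None


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    last_load = now - timedelta(hours=5)
    print(check_freshness(last_load, timedelta(hours=2), now=now))
```

The point of the sketch is only that the check runs on a schedule and raises the alert itself, rather than waiting for someone in the business to ask where the data is.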
Perhaps you didn’t have enough compute resources allocated to the data pipeline at the time, or some of your compute resource went away, or there were too many other things banging away at the cloud data warehouse at the time. You need to be able to stop and correlate data and compute with failures, and understand, “Oh, this is why it happened. I’ll make sure this never happens again.” And then, you can also facilitate automated responses to everything to just let the problems get fixed by themselves.
Helen Beal: These are exciting times. It’s always a trade-off, isn’t it? On one side we’ve got this really rich data, these rich platforms and architectures, but with it we’ve got volumes and complexity. So, for people that are coming into this new, where do teams typically have blind spots where it comes to data? And what are the big gotchas that they need to watch out for when they adopt these new cloud data platforms?
John Morrell: We see probably six key places where they have these blind spots. One is performance. It’s in throughput and latency. Do I have enough resource allocated to make sure that this pipeline gets executed properly? And, is it taking too long? This is very typically a blind spot.
Second is in what we like to call data quality. Certainly, data quality’s been around for a long time, and there have been tools around to help with it. But, with more and more data, more and more complex data formats, and more movement of data, data quality takes on a whole different aspect.
We have something that we like to call shift left, which means you can shift all the way left to the source, the set of data, and figure out whether or not that data is of the highest degree of quality or meets the quality standards. And if there’s a problem, then we can isolate it immediately, before it even starts to go through the rest of the process and before it gets into the data warehouse. And you can then remediate it and fix it, and then keep things moving from there.
There are also issues around what I would call data reconciliation, which is when, as data moves, you stop and say, “Hey, wait, this data isn’t matching up with that data. The data I’ve got in this particular column doesn’t match with the column where it came from.” Or something like that. And so, you need to be able to stop and reconcile what’s going on with the data.
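The reconciliation idea, checking that a column still matches the column it came from, can be sketched in a few lines of Python. This is a minimal sketch, assuming rows are plain dictionaries that share a key column; the `reconcile` function and the `MISSING` marker are hypothetical names used only for illustration.

```python
MISSING = "<missing>"  # hypothetical marker for a row absent on one side


def reconcile(source_rows, target_rows, key, column):
    """Return {key: (source_value, target_value)} for every mismatch."""
    source = {row[key]: row[column] for row in source_rows}
    target = {row[key]: row[column] for row in target_rows}
    mismatches = {}
    # Walk every key seen on either side so dropped and extra rows show up too.
    for k in sorted(source.keys() | target.keys()):
        src_val = source.get(k, MISSING)
        tgt_val = target.get(k, MISSING)
        if src_val != tgt_val:
            mismatches[k] = (src_val, tgt_val)
    return mismatches


if __name__ == "__main__":
    source = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
    target = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]
    print(reconcile(source, target, key="id", column="amount"))
```

At warehouse scale, the same comparison would more likely be done with row counts or per-column checksums than row by row.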
Another blind spot is in the health and execution of your data pipelines. Are they executing properly? Are they executing in a timely manner so that the data gets delivered?
There’s also what we like to call schema and data drift. Schema drift is when the upstream schemas change. That has a ripple effect all the way through the data pipeline.
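A schema drift check can be as simple as comparing the columns a pipeline expects with the columns that actually arrive. The Python sketch below assumes a schema is represented as a dictionary of column name to type name; that representation and the `detect_schema_drift` function are illustrative assumptions only.

```python
def detect_schema_drift(expected, observed):
    """Compare two {column: type} mappings and report what changed upstream."""
    return {
        # Columns that appeared upstream but the pipeline does not know about.
        "added": sorted(observed.keys() - expected.keys()),
        # Columns the pipeline relies on that have disappeared.
        "removed": sorted(expected.keys() - observed.keys()),
        # Columns present on both sides whose type has changed.
        "retyped": sorted(
            col
            for col in expected.keys() & observed.keys()
            if expected[col] != observed[col]
        ),
    }


if __name__ == "__main__":
    expected = {"order_id": "int", "region": "str", "amount": "float"}
    observed = {"order_id": "str", "region": "str", "currency": "str"}
    print(detect_schema_drift(expected, observed))
```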
And then there’s data drift, which I’ve mentioned before, which is where data suddenly goes completely out of the bounds that you would typically expect or out of the distribution.
If you’ve got a dataset that’s got a region field in it, but all of a sudden during one pipeline execution there’s nothing from Europe, that means that data’s gone way out of the bounds and something went wrong.
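The region example can be expressed as a small check that runs after each pipeline execution. This Python sketch covers only the simplest form of data drift, an expected category vanishing entirely; the `missing_categories` function and the region names are assumptions made for illustration.

```python
from collections import Counter


def missing_categories(values, expected_categories):
    """Return the expected categories that never appear in this batch."""
    seen = Counter(values)  # a Counter returns 0 for values it has not seen
    return sorted(c for c in expected_categories if seen[c] == 0)


if __name__ == "__main__":
    batch_regions = ["Americas", "Asia", "Americas", "Asia", "Asia"]
    expected = {"Americas", "Asia", "Europe"}
    # Europe is expected but absent, so this batch would be flagged.
    print(missing_categories(batch_regions, expected))
```

A fuller check would also compare each region's share of rows against its usual range, since drift can show up as a shifted distribution rather than a missing value.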
And then the other key aspect, the other key blind spot many people have, is in the performance and costs of their cloud data platforms. There have been a number of instances we’ve heard of where people just let their cloud data warehouse loose without really knowing what the costs are going to be. The data teams, business teams, and analytics teams start pounding away at it, and they don’t know what’s going on with cost. And then they get the bill at the end of the month, and they look at it and they’re like, “Wow, how did this happen?”
Because cloud data warehouses are somewhat new, and you have the ability to spin up resources on demand or scale as you need to, the problem you have is you can end up just going too elastic and applying too much resource at a particular point in time. You want to be able to understand what’s going on there to eliminate that blind spot.
Helen Beal: Perfect. So to recap for the audience, the top six things to keep your eye out for. One, latency. Two, quality. Three, reconciliation. Four, pipeline health. Five, schema and data drift. And finally, performance and cost at number six.
We’ve got time for one more question. The world is increasingly regulated. I think organizations are feeling this all the time. So, what do data governance teams need to do to ensure that their organization is compliant with these ever-changing regulations? And, how can they ensure adaptability because they’re constantly changing?
Data governance in an age of constant regulatory changes
John Morrell: And these regulations constantly change as well. I’ve been looking at data governance for the past probably eight years of my career, both here at Acceldata and at my previous companies. And the issue of figuring out how to govern your data just continually gets more and more complex.
Many people associate data governance with making sure data is secure. But there’s another key aspect to data governance, which is the ability to monitor the data. That means, how is it used? Where is it sent? Where does it move to? Who is consuming it? Is the data up to date? Does it have a high level of quality?
So, data governance is getting more and more complex because there are a lot of other things that data governance teams need to look at besides just, “Hey, is the data over in this one place? And is it secure, and are only the right people seeing it?” These days, it’s really more about where’s the data moving to, how is it being consumed, and how is it being used?
And if you’re lacking any of these aspects, then the data can easily fall out of regulatory compliance and be at risk. That’s, again, another place where data observability doesn’t give you a full data governance view of things, but it helps with data governance by giving you the ability to understand where your data is moving to, how it’s being used, and how it’s being consumed.
Helen Beal: Perfect. Well, this has been an extremely interesting and very pertinent conversation. I particularly enjoyed talking about lake houses. They’re a nice new word I think that’s just emerging at the moment. Thank you very much for your time, John. Thank you everyone for listening and see you again some other time.