Using Data Observability to Control Costs & Increase Data Reliability in Cloud Data Platforms
Businesses are moving data to the cloud to make it more accessible and available throughout the organization for analysis. Unfortunately, many issues can arise with traditional data management approaches. Costs can quickly get out of hand. There is no insight into data quality or control. And it is often difficult to make the data available for today’s modern data and analytics pipelines. In this video interview between RTInsights and Tristan Spaulding, Head of Product at Acceldata, we look at the issues businesses face today and the role data observability can play to control costs and increase data reliability.
0:50: Data Trends and Optimizing DataOps
3:15: The Modern Data Supply Chain
9:07: Cloud Migration Challenges and Opportunities
15:24: What is a Data Pipeline and Who Manages It?
17:07: Building Data Pipelines: Getting Started
Helen Beal: Hello, and welcome to this RTInsights conversation with myself, Helen Beal, and Tristan Spalding of Acceldata. Today we’re going to be talking about data pipelines.
Tristan, would you like to tell the audience a little bit more about yourself?
Tristan Spalding: Sure, and thanks so much, Helen. So I’m Tristan Spalding. I’m Head of Product at Acceldata, where we focus on data observability for the modern data stack and data pipelines running on that.
So prior to Acceldata, I was one of the early product managers at DataRobot, really working on productionizing these machine learning pipelines for some really high-stakes applications. And so it’s fascinating today, what’s out there and some of the challenges, but also some of the opportunities to really make use of this technology to drive new business outcomes.
Data Trends and Optimizing DataOps
Helen Beal: Amazing. And data is such a big, excuse the pun, topic, isn’t it?
We’ve got big data exponentially exploding all of the time. We get all these different things, we’ve been talking about data warehouses and lakes in the past. Increasingly, I’m hearing people talk about lakehouses. We’ve got data fabric and mesh. So how do all of these things impact people’s data environments and how should they be used to optimize data operations?
Tristan Spalding: Yeah, I mean I think the growth of so many terms and so many different angles, I think, speaks to two things. One, the importance of data today. So it used to be that there might be a couple of patterns that you follow, whether that’s transaction processing or OLAP. But now, there are so many different ways to do it. It’s so much a part of the lifeblood of organizations that there are a number of different approaches.
I think the second part that really comes out is that there are so many tools out there, there are so many options now to optimize this in different ways. And so it’s a much more complex choice, especially for those organizations that didn’t start their data journey in 2010. Insurance companies, financial companies, retail companies, all across the board, have been using data forever, and they have years and years of investments in a very complex stack.
And so thinking about how you move these things together, and how you optimize for the next phase, I think is a huge concern. But also a huge opportunity.
The other aspect that I think comes into this is all of these terms really provide a different angle on this, but all of them really speak to different skillsets that people are developing. So people that, in the past, might have done sort of drag and drop ETL pipelines, well, now you can really recruit and bring on data engineers who write Python code or write Java code or even Scala code.
And so I think all of these different terms are really ways for how do we use that technology that’s out there, the skills that are out there to solve some of these newer problems in a way that’s there. So I understand where it’s tough to make sense, is this a warehouse, is it a lakehouse, is it a mesh, is it a fabric? And to some extent people are putting their own angles on each of these, but I think what it speaks to is just how important this is and how big the choices are.
The Modern Data Supply Chain
Helen Beal: So it’s very complex isn’t it? There are lots of tools, there are lots of concepts, patterns, practices, there are lots of sources of data and different types of data. But what are the components of a modern data supply chain?
Tristan Spalding: Yeah, so I think the first… There are really several here. So I think the first one is, that we can’t forget about, is where the data actually comes from. So the systems and applications and devices even these days, that are producing data and all of this stuff upstream to the left, if you follow the typical diagram, that’s a really critical component, and oftentimes, that’s getting streamed from where it originates into some of the next phases we’ll talk about.
But I think that’s a really key piece of this, that sometimes we tend to focus on the last phase, which we’ll get to in a second. But those upstream components and the mechanisms that ferry them quickly or slowly or better or worse or expensively or cheaply, from where they originate to through the warehouse processing layer, I think those are quite critical.
I think the second component that’s quite important, and this is where you see a lot of increasing overlap between some of the major players, let’s say between Databricks and Snowflake specifically, is kind of a generic code-centric, data wrangling and data processing layer. So something where things aren’t quite ready to be queried with SQL or done in reporting ways, you might be doing some more complex pivoting or sort of aggregation or connection between different data sets.
And this is something that often people are expressing in Python or Java, and often Spark is used for this, but increasingly Snowpark, out of Snowflake, is something that also tries to solve this problem.
So I think there’s this layer where you’ve got things emitting data, you’ve got things moving that data at speed, you’ve got a layer where you’re writing code essentially to really structure this, and then you’ve got, of course, the famous warehouses or the lakehouses. And where do you draw the line on this? Increasingly these companies are kind of blurring the line on this, on whether it’s a lake, where the data is raw or it’s a warehouse where you’re querying it.
But certainly there’s a huge number of options now for cloud-based data warehouses that really have a lot of the same traits and some excel in some ways, some in others, but they all have the traits generally of pay as you go pricing, of doing a lot of the optimization and scaling for you behind the scenes. So it’s a lot easier to author.
And so I think that takes us to the last layer, which at Acceldata we think quite a bit about, which is what’s the authoring layer? So someone on top of this, you’re working across different systems usually, how are you authoring and orchestrating those transformations? And so in many cases people are using… Well, one, they have pipelines that might have existed for quite some time, for decades even, in classical ETL tools. Then you have orchestrators like Airflow and Dagster and Prefect, where people are orchestrating these more complex code-based data pipelines.
Then of course, you have tools like dbt that are making it really easy to author a lot of queries. So in one sense that’s great, you can express things a lot faster as an analyst than in the past. On the other hand, from the finance side you might get a little more curious about, hey, what’s going on underneath these? How optimized are these? Things like that. So I think those are all the components that kind of take you from a data generating system, let’s say, which could be a device, could be an application, all the way to, I’ve got data, I know what it’s doing, it’s out there and it’s available.
The only last piece I would then throw out there, is how is that data being used? And so I think this is one of the things that is evolving quite rapidly and offers one of the biggest opportunities, at least that we see in our customer base.
And that’s where, in the past, 80% of the workload might have been around basically dashboards, reporting, and analytics. Today, I think you see growth in two more tracks. Obviously, that’s still important, that’s still present.
But what you start to see now that you didn’t necessarily see 10 years ago or 20 years ago, is A) the use of machine learning models. So actually, taking this data, and this is something we saw a lot of at DataRobot as well, but you see people taking these models and really applying them in a competitive situation.
So to offer the right price to a given customer based on their history, to prevent churn, to forecast demand. These are operational use cases where the quality of the model and the quality of the data really make a huge difference. And so we see that more and more out there.
The other thing we see, and this is something that’s being facilitated quite a bit by some of the big players here, is data marketplaces. So people actually monetizing the data that they have as a source of revenue in its own right. And so there are a number of ways to do this, and obviously it introduces its own level of concern around how are we protecting this data? Are we anonymizing it correctly? When something goes out that shouldn’t, can we retract it? I think it also introduces some concerns around quality.
So in both of these cases, the machine learning case and that data monetization piece, when you’re pushing data outside your organization, whether directly or through predictions, now there’s a different level of care and a different level of concern that you need. Because when that data is not good, when it’s late, when it’s broken, when it’s incorrect, it’s not going to be someone coming down and telling you, “Hey, can you check this number again? Can you check this dashboard? I think it might be wrong,” it’s going to be a customer leaving you for a competitor that gets it right or someone… I mean, there have been some infamous regulatory cases around data not being used well, too.
So I think those are some of the big… Anyway, you can tell from the answer, there’s a lot going on in this space and certainly a lot of elements to keep track of.
Cloud Migration Challenges and Opportunities
Helen Beal: Yeah, loads going on. Of course, that’s mainly because of the digital transformation. We live in the digital economy now and data is very much a competitive differentiator, isn’t it? As you say, people are building whole companies around the monetization of their data as well. So a lot of compliance and regulation concerns alongside it.
And you mentioned kind of cloud-based data warehouses and, of course, the two key drivers of digital transformation really are DevOps and cloud. Let’s talk about the cloud piece a little bit more. So when people move their data into a cloud environment, so outside of their traditional data centers, what changes for them and can we do the good and the bad? So what’s good about moving and what’s dangerous or needs to be looked after carefully?
Tristan Spalding: Yeah, for sure. And this is the big question facing everyone. And I think one of the challenges that the companies in this situation have is that, in many cases, they’re fighting against basically cloud native incumbents or the cloud native challengers that don’t have the history of on-premise data warehouses and data platforms essentially. And so these groups, of course they started on AWS, they started on GCP, they started on Azure and they’ve grown with that. Whereas, I think a lot of organizations that are more mature and have a lot more data assets in some ways, now face this challenge A) of doing the migration itself, which takes years, it can take years. And the difference between that taking one year or two years and three years, I think is critical.
So there’s a ton that goes in just to that migration process. And one of the things we help people with is having kind of a foot on both sides. So we have visibility to on-premise, we also have visibility to cloud systems. And so I think accelerating that migration is something that’s top of mind for a lot of people.
But when it comes to cloud data warehouses, I would generalize it to cloud data platforms since some of these tools, the Confluents, the Databricks, do a little more than a warehouse does. And likewise, Snowflake is starting to expand beyond that. I think the great thing about it is that it’s so much easier to use. I mean, it’s just the computing power available, and then, increasingly, the amount of abstraction and serverless behavior that these vendors offer is just amazing. You’re able to do so much more so much faster than in the past. I think the other thing is how quickly this is upgraded. So every two weeks these companies are releasing upgrades on really any dimension of this. And you don’t need to worry about maintaining that. You don’t need to ask central IT to upgrade that. I mean these are amazing. There’s an amazing reduction in friction on top of all of this. And I think you see the proof is sort of in people’s usage. You see how quickly the usage of these things grows. That’s because it truly is a much better experience for you as basically the person using the data.
Now I think some of the cons come back to those same elements. So I think moving quickly and generating a lot of workload and not worrying about the details is great from a user experience level. But it also means that you can quite quickly rack up some pretty significant bills. So we’ve worked with a company that had a single query cost $270,000 on one of these, I won’t name which platform, but on one of the platforms. It’s not a problem with the platform, it did what it was told, it did what it was supposed to. It’s just the controls weren’t there to kill that and there wasn’t a DBA looking at it in the way there might have been on-premise. And there wasn’t contention with other jobs in the way there would’ve been on-premise. And so how quickly this stuff can grow, I think, is one challenge.
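The missing control described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any platform's API: `QueryBudgetGuard`, `run_with_guard`, the per-second rate, and the spend cap are all hypothetical names and numbers, and in practice you would lean on the platform's own timeout or resource-limit settings.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a running query has gone past its spend limit."""


@dataclass
class QueryBudgetGuard:
    # Hypothetical figures: dollars billed per second of compute, and the cap per query.
    dollars_per_second: float
    max_dollars: float

    def cost_so_far(self, elapsed_seconds: float) -> float:
        return elapsed_seconds * self.dollars_per_second

    def check(self, elapsed_seconds: float) -> None:
        cost = self.cost_so_far(elapsed_seconds)
        if cost > self.max_dollars:
            raise BudgetExceeded(
                f"query has cost ${cost:,.2f}, limit is ${self.max_dollars:,.2f}"
            )


def run_with_guard(guard, progress_seconds):
    """Simulate polling a long-running query at the given elapsed times.

    Stops at the first poll where the spend is over budget, which is where
    a real monitor would cancel the query.
    """
    for elapsed in progress_seconds:
        try:
            guard.check(elapsed)
        except BudgetExceeded:
            return ("killed", elapsed)
    return ("completed", progress_seconds[-1])


guard = QueryBudgetGuard(dollars_per_second=0.5, max_dollars=100.0)
outcome = run_with_guard(guard, [60, 120, 180, 240])
```

With these made-up numbers the query is stopped at the 240-second poll, once the running cost passes $100. The point is only that someone, or something, has to be watching the meter while the query runs.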
I think there’s also kind of a negative side to how quickly things improve, and that’s that mastering these systems, staying on top of the latest and using the optimizations is quite difficult to do. Even if you were a Snowflake expert or a Databricks expert six months ago, not many people have extra time to keep on top of all of the changes that come through the release and upgrade process every couple of weeks. So I think that visibility and staying on top of things are a couple of new challenges that people have with the cloud data platforms that they didn’t necessarily have before. So, that’s a couple.
I think the really big one, and this is really where we’ve been going with Acceldata the last year or so that I’ve been here, the number one thing that’s different about this is really what happens to data reliability. So data reliability, just meaning is the data correct? Is the data on time? Is it accurate? Is it within range? All of these checks, has it drifted? That’s a major annoyance on-premise. I don’t think there have been many people that say, “Hey, I love my data. I know that it’s always accurate, it’s great, it’s easy to maintain,” so that problem is still there in the cloud world.
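The reliability questions listed here (is it on time, is it within range, has it drifted) map onto very small checks. The Python below is a minimal sketch, not Acceldata's implementation: the function names, the three-hour freshness window, the value range, and the 20% drift tolerance are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean


def is_fresh(last_loaded_at, now, max_age):
    """On time: the latest load is no older than max_age."""
    return now - last_loaded_at <= max_age


def out_of_range(values, low, high):
    """Within range: return the values that fall outside [low, high]."""
    return [v for v in values if not (low <= v <= high)]


def has_drifted(baseline, current, tolerance):
    """Drift: the mean moved by more than `tolerance`, as a fraction of the baseline mean."""
    base = mean(baseline)
    return abs(mean(current) - base) > tolerance * abs(base)


# Hypothetical batch: loaded two hours ago, with two suspicious values.
fresh = is_fresh(
    last_loaded_at=datetime(2024, 1, 1, 6, 0),
    now=datetime(2024, 1, 1, 8, 0),
    max_age=timedelta(hours=3),
)
bad_values = out_of_range([10, -5, 250, 40], low=0, high=100)
drifted = has_drifted(baseline=[100, 100, 100], current=[130, 130, 130], tolerance=0.2)
```

Comparing means is a deliberately crude drift test. It stands in for whatever statistical comparison a real monitoring tool would run.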
But there’s a new problem in the cloud world, which is that all of these systems basically have pay as you go pricing. So when you run a query that costs you money or burns down credits or whatever the abstraction is, it burns it down regardless of whether the data in that query was good or bad. I mean, bad queries cost the same as good ones. Actually, they can cost more, typically, since there might be problems that cause you to scan more data or blow up joins and things like that. And so that’s new.
In the on-premise world, that slows you down because you need to rerun the query. In the cloud world, it slows you down because you need to rerun the query and it costs you extra, it multiplies the cost. And so a lot of the things that we try to recommend for our customers are: catch the bad data upstream, instrument your pipeline so that you’re catching it before it goes through this whole process.
And that way you save yourself, not just time and angst around bad data, but actually money that you might have been spending on queries that were ultimately not going to be effective.
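The "it multiplies the cost" point is simple arithmetic, sketched below in Python. The function name and every dollar figure are hypothetical. The model assumes each bad batch is only discovered after the expensive query has been billed, unless a cheap upstream check stops it first.

```python
def cost_to_good_result(query_cost, check_cost, bad_attempts, gate_upstream):
    """Total spend to get one correct result when the first `bad_attempts` batches are bad."""
    attempts = bad_attempts + 1  # the bad batches plus the final good one
    if gate_upstream:
        # Every batch pays for the cheap check; only the good batch reaches the query.
        return attempts * check_cost + query_cost
    # Without a gate, every batch, good or bad, is billed for the full query.
    return attempts * query_cost


# Hypothetical figures: a $500 query, a $5 upstream check, one bad batch before the good one.
ungated = cost_to_good_result(500.0, 5.0, bad_attempts=1, gate_upstream=False)
gated = cost_to_good_result(500.0, 5.0, bad_attempts=1, gate_upstream=True)
```

Under these assumptions the ungated pipeline spends $1,000 and the gated one $510. When the data is never bad the gate is pure overhead ($505 against $500), so the check only pays off if it is much cheaper than the query it protects.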
So I think that’s an interesting dynamic that people are experiencing now as they move more workloads to cloud data platforms.
What is a Data Pipeline and Who Manages It?
Helen Beal: Yeah, there’s always a bit of a trade-off, isn’t there? That of all the beauty and joy of the new technology, but some gotchas to watch out for as well. You’ve mentioned pipelines a couple of times, I want to dig into that a little bit more. So could you describe what a data pipeline looks like? And also, who’s accountable for it, who builds it and manages it?
Tristan Spalding: Yeah. So I think, a data pipeline takes data from, usually, multiple sources and goes through a sequence of transformations on that data and ultimately loads it in a destination. And so this could be as simple as going from one source to one database with little transformations, to as complex as hundreds or thousands of steps across hundreds or thousands of databases where these things are coming together. So I think, classically, this was the ETL developer or BI developer’s role.
I think, increasingly, this is the domain of the data engineer, for the most part, who is the person who’s writing these pipelines and making sure they’re running effectively. And I think, it’s a tough job. I mean that’s a really scarce… If you look at hiring numbers and open job numbers, data engineer is one of the most in demand fields. Precisely because this data is critical and it’s really, really hard to have mastered all of the systems that are involved in each of these data pipelines. So I think that’s a person that underneath all of… There’s data scientists, there’s all this fancy machine learning and things like that, which I love and we all love, but to get the data from where it started into those models and back out to the real world is quite difficult. And that’s sort of what a data engineer does.
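The definition given in this answer (multiple sources, a sequence of transformations, a load into a destination) can be shown as a toy Python sketch. Nothing here comes from a real orchestrator: `run_pipeline`, the two source functions, the transforms, and the in-memory `warehouse` list are all made up for illustration.

```python
def run_pipeline(sources, transforms, destination):
    """Extract rows from every source, apply each transform in order, load the result."""
    rows = [row for source in sources for row in source()]
    for transform in transforms:
        rows = transform(rows)
    destination.extend(rows)
    return len(rows)


# Two hypothetical sources producing order records.
def orders_from_app():
    return [{"id": 1, "amount_cents": 1250}, {"id": 2, "amount_cents": None}]


def orders_from_store():
    return [{"id": 3, "amount_cents": 800}]


# Two hypothetical transformation steps.
def drop_missing_amounts(rows):
    return [r for r in rows if r["amount_cents"] is not None]


def add_dollars(rows):
    return [{**r, "amount": r["amount_cents"] / 100} for r in rows]


warehouse = []  # stands in for the destination table
loaded = run_pipeline(
    [orders_from_app, orders_from_store],
    [drop_missing_amounts, add_dollars],
    warehouse,
)
```

A production pipeline differs mostly in scale and failure handling: hundreds of steps, many systems, retries, and scheduling. That operational work is what makes the data engineer's job hard.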
Building Data Pipelines: Getting Started
Helen Beal: Perfect. Thank you for clarifying that. I mean it’s such an interesting space, so fast moving. And of course, there’s so much treasure to be found within the data as well. So I’m sure the audience is probably pretty keen to be trying out some tools and figuring out what their data pipelines look like. So what sort of advice would you have for listeners about how to get started with data pipelines and cloud environments for data?
Tristan Spalding: Yeah, I think… For sure. I mean, I think there are, obviously, lots of capabilities that many people might have explored on non-cloud vendors, things like that. I think, at Acceldata, what we really look at is people that have some significant Snowflake usage or Databricks usage or some Airflow pipelines.
And what we’ve been doing basically, is offering a free trial here where anyone can sign up for several weeks here, connect these sources and just start to understand and monitor what’s going on with these pipelines as they come in. And with these queries, so which data is actually generating some value here? How much are we spending on some of these and how do we bend that curve in a way that’s going to help us stay within our budgets?
So I think, definitely, if people haven’t been looking at data observability, I think it’s a great time to do it. Especially, if you’re, as we look into next year, planning and budgeting, really having a handle on that, I find is a critical piece for organizations that are really the champions of these new cloud data platforms.
Helen Beal: Wonderful. So head over to your website for that free trial?
Tristan Spalding: Absolutely. Yep.
Helen Beal: Acceldata.com, I assume?
Tristan Spalding: Acceldata.io, actually.
Helen Beal: Oh, io. Perfect.
Tristan Spalding: We’re a startup, so we have the “io” on there.
Helen Beal: Nice. Good. Well, it’s been great talking to you, Tristan. Thanks so much for your time today. And good luck, audience, with your data travels.
Tristan Spalding: Great. Thanks.