Sponsored by Cloudera

Data Immediacy’s Next Step


Data immediacy goes beyond the concepts of streaming data and data-in-motion. It adds the dimension that data has a very specific time value. The less recent the data is, the less valuable it is. Here are the issues you should consider to implement data immediacy and realize its benefits.

Organizations are finding they need to go beyond data in motion to support data immediacy, which refers to the ability to take action on data as it is generated in real time. However, there are numerous challenges to achieving data immediacy.

RTInsights recently sat down with David Dichmann, VP of Product Marketing at Cloudera, to talk about why there is such great interest in data immediacy, some common obstacles organizations encounter, the need for robust hybrid architectures and open data formats, and techniques to operate on data wherever it resides, minimizing the need to move data.

Here is a summary of our conversation:

RTInsights: Let’s catch everyone up by talking about why there is a need to go beyond what folks have been calling data in motion to support data immediacy.

Dichmann: Data in motion is about how data moves throughout the enterprise, but it doesn’t necessarily mean that this data has strict timeliness with it. I can move data from the edge, from a sensor on a device, and move that through some sort of intermediary stage where some analysis is performed, and I can take action on the derived insights. For example, I might determine that a piece of equipment is approaching a failure point and do predictive maintenance.

Ultimately, that data will move to a centralized environment where historical analysis can take place, or it can be used to train models for AI. I might develop predictive models for determining when the next out-of-calibration issue or near-to-failure condition occurs that we can proactively correct.

So, when we think about data in motion, we’re talking about all kinds of data flowing throughout our enterprise, going from one place where it’s born to a place where it’s used to where it ends up being stored at rest.

However, when we think about data immediacy, we add the dimension that this data has a very specific time value. The less recent the data is, the less valuable it is. The most common analogy for this is the stock market. If I gave you stock prices today, that’s useful. But if I gave you a stock price, right this very second, that’s more useful. And if I gave you information that informed you what the stock price will likely be tomorrow, that’s extremely useful.

Suppose we can view the data as it’s happening and respond to the data immediately or as close to immediately as possible. In that case, we can move from being reactive to being proactive. That’s where data immediacy comes in.

RTInsights: Does embracing data immediacy require a higher level of data maturity?

Dichmann: Yes, it does. Whenever we’re trying to move from a reactive to a proactive state, we need better-quality data.

Let’s take an example from cybersecurity. One very large organization we’ve worked with told us that a single breach can cost up to $1 billion in impact. Before looking at solutions to improve their data immediacy, it took them up to seven minutes to determine whether an interaction with their computer systems was potentially fraudulent. That means it could take them seven minutes to detect a breach. Seven minutes is a lot of time. They cannot always easily roll that back and fix things; the damage is done, and that’s where those impact costs come from.

We can’t wait seven minutes to commit to every interaction on our network; that would cause a massive delay in our business continuity. Whether that’s a financial transaction, windmill or solar panel telemetry from an energy supplier, or life sciences data from a patient hooked up to monitors, the longer we wait to determine what the signals mean, the greater the risk that we’ll miss the insights that data is meant to provide.

To become proactive, we need three big things.

First, we need to have large quantities of historical data at higher levels of granularity to build AI models that find patterns and correlations that predict the future, which means we need to have a certain level of data maturity in our current estate. We need to ask ourselves what both our data management maturity and our metadata maturity look like, so we can have that high-quality data and the knowledge of how to use it available to train these models.

Second, we also need to have maturity around our data in motion ecosystem because we want to move the detection point as close to the point of action as possible. For example, we’re moving AI models directly into the devices at the edge. In that way, as the device is generating data, an AI model can run and determine what that data means before it even gets to any kind of central analysis. That allows us to be more proactive, detecting and reacting in real time.

The third thing we need is to be able to do this anywhere and everywhere our data may be. One of the things we’ve seen, especially around the advent of cloud, is we’ve broken down our data estate into multiple silos. We gave up control in favor of the freedom that the cloud provides. The challenge today is that we want to regain that control so we can better use all of our data without constraining the valuable freedom provided by the cloud.

To do that, we need to look at hybrid architectures and end-to-end data lineage.

This is all about moving from detection to prevention. Our customer was able to move the time to detect from those seven minutes down to milliseconds. They were able to look at every interaction with their network, make a judgment call in milliseconds, and then commit the transactions in real time. They can now block cyber attack attempts before they’re allowed to be committed to their systems and carried further down the line. That’s a game changer for them, and a lot of other organizations that are looking to be able to protect their businesses better. Another example is how some of our life sciences partners can detect instances of potential sepsis in hospitals faster and save lives.
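The shift from detection to prevention described above can be illustrated with a minimal sketch: each interaction is scored before it is committed, and anything above a risk threshold is blocked rather than rolled back later. This is not Cloudera's or the customer's implementation. The `Interaction` and `Ledger` types, the `risk_score` rules, and the `0.8` threshold are all hypothetical stand-ins for a trained model and a real transaction system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Interaction:
    source: str
    action: str
    bytes_sent: int


@dataclass
class Ledger:
    """Hypothetical system of record: tracks what was committed or blocked."""
    committed: List[Interaction] = field(default_factory=list)
    blocked: List[Interaction] = field(default_factory=list)


def risk_score(event: Interaction) -> float:
    """Hypothetical stand-in for a trained model; returns a score in [0, 1]."""
    score = 0.0
    if event.action == "bulk_export":
        score += 0.6
    if event.bytes_sent > 1_000_000:
        score += 0.3
    return min(score, 1.0)


def screen_and_commit(
    event: Interaction,
    ledger: Ledger,
    scorer: Callable[[Interaction], float] = risk_score,
    threshold: float = 0.8,
) -> bool:
    """Score the event first; commit it only if it falls below the threshold."""
    if scorer(event) >= threshold:
        ledger.blocked.append(event)  # prevented: never reaches the system
        return False
    ledger.committed.append(event)
    return True


ledger = Ledger()
screen_and_commit(Interaction("10.0.0.5", "login", 200), ledger)
screen_and_commit(Interaction("10.0.0.9", "bulk_export", 5_000_000), ledger)
```

The design point is the ordering: the judgment call happens on the path to the commit, so a suspicious interaction is stopped instead of being discovered minutes after the fact.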

RTInsights: What data architecture issues are critical for data immediacy?

Dichmann: The first is to start thinking about a true hybrid architecture. Let me explain what I mean by true in this context.

We have a lot of larger organizations that are highly regulated, where on-premises data will continue to be an important part of the ecosystem. We continue to have data born, analyzed, and stored on-premises, even while significant data use in the cloud is growing.

Hybrid data estates already exist: organizations do some things in the data center and different things in the cloud, and they’ve figured out how to make those environments work together, yet in a delicate balance.

That balance can break easily if you make changes in the data center that are no longer compatible with the cloud, or changes in the cloud that are no longer working with the data center, causing breakages, delays, and downtime.

True hybrid is doing the same thing in your data center as you are doing in the cloud, ensuring the entire data estate operates fully compatible with itself, all of the time. Newer technologies are delivered across the estate in lockstep, so that you don’t miss a beat.

The second architectural dimension is being able to work as close to the data as possible. We need to think about what technologies we apply to our data, wherever that data may be. Things like query federation technologies allow us to operate on the data where it lies. That means taking whatever analytics we want and bringing them close to the storage, which reduces the amount of data we need to move and duplicate throughout our enterprise.

We need to look at cloud-native technologies and bring those into our data center, such as containerized services that scale up and down, and take better advantage of our hardware investments. Also, separate compute and storage will allow greater flexibility to use our data for multiple different types of tasks.

I’d also like to highlight the Apache Iceberg table format, the de facto standard for large data sets in the cloud, allowing for our data in motion to interoperate with traditional data preparation and analytics on the same data. We don’t have to move and transform the data, and it’s all in the same format.

So fundamentally, think hybrid, think universal, and think open when updating your architectural standards. Think about open table formats that allow you to work on the same data at the same time. These architectural considerations will get you closer and closer to data immediacy.

RTInsights: What challenges do companies face in their data immediacy efforts?

Dichmann: Organizational challenges are the first to come to mind. We’ve been talking a lot about technology until now, but there’s an important people dimension. Are we well organized to share across departments to be able to make the right data available to the right part of the business at the right time? That’s usually the first thing we need to face.

It’s now critical to be able to break down some of the organizational silos that have been formed around the company and create an environment where data literacy and data sharing are more present. This will certainly lead to better sharing, use, and insight from all of our data.

The other challenge is, of course, the data quality. One thing we must remember when we think about data immediacy is that it’s not just whether we got the data fast enough, but whether we got it in a form that we can use. Did we prepare it for our use case? Did we get it properly shaped and sized for structured analytics in the form of a warehouse, for example? Or maybe it’s in vectors because it’s informing the RAG component of a generative AI model that we’re using for prediction or generated reaction steps.

Whatever it may be, the data must be ingested, processed, and transformed as close to real time as possible. It then needs to be analyzed with the right tools in real time to achieve data immediacy.

RTInsights: How does Cloudera help?

Dichmann: First and foremost, Cloudera is built hybrid first. We’re the only open hybrid platform for data, analytics, and AI.

The first thing we bring to the table is our approach to “true hybrid.” We deliver the same capabilities to any cloud platform as we do for the data center. This means that if you write business applications against Cloudera, you can run them on any cloud or on-premises, and you can move them seamlessly from one cloud to the other and from the cloud to the data center. That means you’re doing the same thing everywhere on the same data formats using the same tools and techniques.

With that comes a single hybrid control plane: one view of security, governance, and lineage, and one view of your entire data estate, wherever that data may be. That gives you control without limiting you to having to deploy data management all in one place. You can increase your use of multiple clouds and incorporate your data center without segregating and segmenting your data estate into multiple different siloed deployments. That’s true hybrid.

The second thing we do is provide a unified data fabric. This is where our shared data experience comes into play. We provide a single view of all data across all clouds and our data center for security, governance, lineage, and profiling. As data is moved, transformed, and changed, we’ll keep the original security and access profiles in place along the way.

For example, if I generate tables from another table, those tables will inherit the security tags and the appropriate permissions that go with them. That means I can have confidence that as my data flows and moves throughout the system, it’s properly secured. That increases my data liberty. It seems counterintuitive, yet it is true that the more secure I am, the more data I can confidently share across the enterprise.
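To make the inheritance idea concrete, here is a minimal sketch of a catalog in which a derived table automatically carries the union of its source tables' security tags. It is an illustration only and does not represent Cloudera's shared data experience or its API. The `Catalog` class, its methods, and the tag names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, Tuple


@dataclass
class Catalog:
    """Hypothetical metadata catalog that tracks tags and lineage per table."""
    tags: Dict[str, FrozenSet[str]] = field(default_factory=dict)
    lineage: Dict[str, Tuple[str, ...]] = field(default_factory=dict)

    def register(self, table: str, tags: Iterable[str] = ()) -> None:
        """Register a source table with its own security tags."""
        self.tags[table] = frozenset(tags)
        self.lineage[table] = ()

    def derive(self, new_table: str, sources: Iterable[str]) -> None:
        """Create a derived table that inherits every tag of every source."""
        sources = tuple(sources)
        inherited = frozenset().union(*(self.tags[s] for s in sources))
        self.tags[new_table] = inherited
        self.lineage[new_table] = sources

    def can_read(self, clearances: Iterable[str], table: str) -> bool:
        """A reader needs a clearance for every tag on the table."""
        return self.tags[table] <= frozenset(clearances)


catalog = Catalog()
catalog.register("customers", tags={"pii"})
catalog.register("orders", tags={"finance"})
catalog.derive("customer_orders", sources=["customers", "orders"])
```

Because the derived table can never end up with fewer restrictions than its inputs, teams can share it more freely, which is the point made above about security increasing the amount of data that can be shared with confidence.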

There is a third key dimension of how Cloudera helps. It is a combination of our data in motion stack, our Open Data Lakehouse, and Cloudera AI. Cloudera brings all the core tools and services you need for the end-to-end data lifecycle all together in a single integrated data ecosystem.

Cloudera data in motion is about everything in real time, from ingesting the data and transforming the data in the data flow to analyzing it on the flow. Combine that with the Open Data Lakehouse and the Iceberg table format, providing a richness of data formats and variety in one place, under one roof.

Also seamlessly integrated, Cloudera AI helps you build AI with Cloudera and run AI on Cloudera, including running AI at the edge. You can now deploy advanced analytics, data science, machine learning, AI, and generative AI on the same data and in the same infrastructure. So, end-to-end, you get to treat your entire data ecosystem as one environment, and bring all that data to bear for the most advanced analytics possible, in real time.


RTInsights: What are the benefits of these capabilities from Cloudera?

Dichmann: The most obvious benefit is that you have data immediacy. You can use all of the data in your enterprise exactly when it happens, as events occur and data is streamed in. For instance, as telemetry comes in from your devices, you can determine how to optimize your shipping and delivery routes. As health data comes across the line, you can automate and react more quickly to medical emergencies in hospitals or be more responsive to your patients’ needs. It makes things run better, work better, and you can finally move from being reactive to being proactive. That’s what it means to be able to adopt a data immediacy approach.

Cloudera overcomes all of the barriers that get in your way, allowing you to take advantage of the data happening throughout your enterprise. You can now see where all that data is, bring it together logically or physically as needed, and incorporate a real-time dimension to using it.

It’s not just about what is happening right now; it’s about why it is happening. What does this immediate data mean to you? And how should you respond to it? AI is one of the best tools in the market today, and it allows you to take advantage of that.

You can meet your goals, whether that’s reducing operational costs and saving money, increasing innovation and entering new markets, or defending yourself against cyber fraud and cyber threats. When you have Cloudera and Cloudera’s advantages, you’ll have data immediacy. And if you have data immediacy, you’ll be able to achieve all of these goals.

RTInsights: Anything else we should consider?

Dichmann: One thing we didn’t touch on is relevant to today’s marketplace, and it’s about Retrieval Augmented Generation, also known as RAG, and its importance in real-time and streaming AI. RAG allows us to add additional information to any AI prompt or interaction. For real-time and streaming AI, we need that RAG vector database to be filled with the absolute latest information possible. There’s a superpower in NiFi 2.0 that’s designed to help you stream directly into a RAG environment by directly writing into a vector database. You want to do that because your AI is not just responding to what it knew the last time it was trained. AI can now take the most immediate information into account as it provides an immediate response to things happening in your enterprise. AI that knows what’s happening in real time will be the standard, not the exception, as we move towards agentic AI.
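The pattern of streaming fresh events into a retrieval store so that prompts reflect the latest information can be sketched in a few lines. The following is a toy illustration, not NiFi's API or any vector database's client. The bag-of-words `embed` function stands in for a real embedding model, and `InMemoryVectorStore` and `build_prompt` are hypothetical names.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Hypothetical vector store: holds (vector, text) rows in a list."""

    def __init__(self) -> None:
        self._rows: List[Tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        """Called once per streamed event, as soon as it arrives."""
        self._rows.append((embed(text), text))

    def search(self, query: str, k: int = 1) -> List[str]:
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def build_prompt(question: str, store: InMemoryVectorStore, k: int = 1) -> str:
    """Augment the question with the most relevant stored context."""
    context = "\n".join(store.search(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


store = InMemoryVectorStore()
for event in ["pump 7 vibration is normal", "truck 12 delayed at depot"]:
    store.add(event)

# A new event arrives on the stream and is indexed immediately.
store.add("pump 7 temperature exceeded threshold")
prompt = build_prompt("What is the pump 7 temperature status?", store)
```

In this sketch, the same question asked before the last event was indexed would have retrieved only the older vibration reading, which is why the freshness of the vector store matters as much as the model itself.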

Salvatore Salamone

About Salvatore Salamone

Salvatore Salamone is a physicist by training who has been writing about science and information technology for more than 30 years. During that time, he has been a senior or executive editor at many industry-leading publications including High Technology, Network World, Byte Magazine, Data Communications, LAN Times, InternetWeek, Bio-IT World, and Lightwave, The Journal of Fiber Optics. He also is the author of three business technology books.
