Do You Need to Process Data “In Motion” to Operate in Real Time?

Quick answer: No, you don’t always need data in motion to operate in real time. However, some high-volume, low-latency real-time systems do use data in motion in the sense that data is processed before being stored.

Organizations tend to be more successful when certain of their processes run in real time. They can put better offers in front of potential customers, spot fraud before it happens, re-route trucks sooner to save fuel and provide better delivery service, fix machines before they finally break, and increase revenue or reduce cost and risk in a myriad of other ways.

Networks and computers have been essential ingredients in the movement toward the “real-time enterprise” for more than four decades.

Now, many analysts and vendors point to “data in motion” as the best way to build real-time systems. In some cases, they’re right, but not always.

But what makes a system a real-time system? And do real-time systems need to process data in motion? Here, we answer those two questions so that business people outlining the requirements for new systems and architects and software engineers designing and building those systems understand the issues.

What makes a system “real time”?

First, we need to define the words. Real time has at least two major definitions:

1) Engineering real time, also called machine real time or hard real time, means guaranteed latency. An appropriate response is always executed within a fixed period of time following the arrival of new relevant information. In other words, the system always acts within a service level target, which is typically measured in microseconds, milliseconds, or fractions of a second (although the target can be shorter or much longer depending on the circumstances). Hard real-time systems avoid sources of unpredictable delay, such as Java garbage collection pauses and operating system functions that run at unpredictable times in non-real-time operating systems. People are never included within hard real-time systems because people are too slow and erratic.

2) Business real time, also called near real time or soft real time, simply means fast. It means that action is taken in response to current data that arrived in the past few seconds or minutes. A different response might well be generated if things change a few seconds or minutes later. Business-real-time decisions almost always also use older information alongside the new (real-time) information. We sometimes say that anything that is based entirely on data that is more than 15 minutes old is no longer real time because it doesn’t reflect current conditions, but this is an arbitrary limit. A person is often in the loop, either making the decision or carrying out the decision, in a business-real-time process.
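The difference between the two definitions can be made concrete with a minimal Python sketch. It is illustrative only: the `Response` record, the function names, and the sample timings are assumptions, and the 15-minute default simply reuses the arbitrary limit mentioned above. Hard real time is modeled as a guarantee that every response meets the deadline, while business real time is modeled as a freshness check on the input data.

```python
from dataclasses import dataclass


@dataclass
class Response:
    arrived_at: float    # seconds: when the new information arrived
    responded_at: float  # seconds: when the response was executed


def meets_hard_deadline(responses, deadline_s):
    """Hard real time: every single response must land within the deadline."""
    return all(r.responded_at - r.arrived_at <= deadline_s for r in responses)


def is_fresh(data_timestamp, now, max_age_s=15 * 60):
    """Business real time: the input must still reflect current conditions.

    The 15-minute default is the article's arbitrary rule of thumb.
    """
    return now - data_timestamp <= max_age_s


# Hypothetical measurements: two fast responses and one that took about 50 ms.
responses = [
    Response(0.000, 0.004),
    Response(1.000, 1.003),
    Response(2.000, 2.050),
]

print(meets_hard_deadline(responses, deadline_s=0.010))  # the slow response breaks the guarantee
print(is_fresh(data_timestamp=0.0, now=600.0))           # data is 10 minutes old
print(is_fresh(data_timestamp=0.0, now=1200.0))          # data is 20 minutes old
```

Note that a single late response is enough to fail the hard real-time check, whereas the business real-time check only asks whether the data is recent enough to act on.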

See also: Introducing the Data-in-Motion Ecosystem Map

So, there is no universal timetable for real time. Real time is whatever you need it to be, but with an eye toward speed and using some reasonable constraints. In both engineering and business, real time is all about acting in the “right time” for that particular process. You need to identify the point of diminishing returns, where going faster no longer improves the results or where the added cost outweighs the added benefit.

Real time refers to the duration of the end-to-end process from the observation of new information to the execution of the response. It’s often useful to analyze a process using the observe-orient-decide-act (OODA) loop.

  • Current events from the environment are ingested (observe);
  • The new data is put into context by correlating it with other current data and historical knowledge, including historical data and known patterns (orient);
  • An appropriate response action is calculated (decide);
  • The relevant follow-up action is executed (act).
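The four steps above can be sketched as a small Python pipeline. This is a hedged illustration, not a reference design: the truck-delay scenario echoes the delivery example used elsewhere in this article, and the field names (`truck_id`, `delay_min`), the 20-minute threshold, and the `history` lookup are all assumptions made for the example. The end-to-end latency is measured from observation to action, matching the definition of real time given above.

```python
import time


def observe(event):
    """Ingest a current event and stamp it with its arrival time."""
    return {**event, "observed_at": time.monotonic()}


def orient(event, history):
    """Put the event in context using historical knowledge (a typical delay)."""
    baseline = history.get(event["truck_id"], event["delay_min"])
    return {**event, "excess_delay_min": event["delay_min"] - baseline}


def decide(context, threshold_min=20):
    """Calculate a response; the threshold is an illustrative assumption."""
    if context["excess_delay_min"] > threshold_min:
        return "reassign_delivery"
    return "no_action"


def act(decision, context):
    """Execute the response and report end-to-end latency in seconds."""
    latency_s = time.monotonic() - context["observed_at"]
    return {
        "action": decision,
        "truck_id": context["truck_id"],
        "latency_s": latency_s,
    }


def ooda(event, history):
    context = orient(observe(event), history)
    return act(decide(context), context)


history = {"truck-7": 5}  # hypothetical: truck-7 normally runs 5 minutes late
result = ooda({"truck_id": "truck-7", "delay_min": 40}, history)
print(result["action"], result["latency_s"])
```

In a real system each step would typically be a separate component, and the latency budget would be divided among them.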

Real time applies almost exclusively to operational processes, as opposed to tactical or strategic processes. However, it encompasses a wide range of operational processes with varying latencies:

  • Ultra-low latency processes, such as online ad-tech auctions and algorithmic trading for financial exchanges, collect data continuously and may calculate a new decision and trigger the response in less than a millisecond.
  • Routine online interactions and business transactions, such as presenting “next best content” web pages or “next best offers” to potential customers or checking for fraud in online financial transactions, typically execute in a few hundred milliseconds or a second or two (although the orient and decide parts of the process only take a few tens of milliseconds).
  • Real-time management control decisions include redirecting incoming customer phone calls to a backup call center if the local call center is overloaded or reassigning deliveries to an alternative truck and driver if a truck breaks down or a traffic jam develops. The end-to-end latency of these processes is typically measured in multiple minutes.

Notice that all of these examples depend on the presence of fresh, “real-time” data that represents recent events or the current state of the world.

Disclaimer: People sometimes use the term real time to describe the fast analysis of historical data – for example, getting an answer in a few seconds to a new ad hoc query against a billion rows of data from last month. Fast analysis of old data can be impressive and useful for other kinds of decisions, but it is not “real time” for the purposes of this discussion on data in motion.

See also: Unified Real-Time Platforms

Do real-time systems need to process data in motion?

Again, we need to start by defining “data in motion” because this term is also overloaded. Data in motion has at least three definitions:

1) Network engineers and network security practitioners use the term data in motion to mean data that is literally in transit on the wire or temporarily held in a network component (e.g., in a router, switch, modem, network interface card) or on a server in the network stack below the application layer. All real-time data arrives “in motion” in this sense. Historical data that was generated long ago is also shipped around through networks, so it is also sometimes “in motion” in this sense, although it is not real time.

2) Event processing experts use the term “in motion” to mean streaming data that is incrementally processed in (near) real time as it arrives, before it is stored in a file or database. In this case, the data is in the application layer rather than in the network, held in a memory buffer in an application, event stream processing (ESP) platform, or other tool. Each message can be processed individually (statelessly), or groups of messages can be processed together statefully, for example, in a time window.

3) Industry analysts and vendors sometimes apply the term “data in motion” even to data that is stored before it is processed, as long as it is processed in (near) real time. This can be the same kind of streaming data as in the previous definition (i.e., the input can be a continuous sequence of records from a sensor or other device, business application, the web, or another source). After it is stored, the data is retrieved in order of arrival or by using a log offset number, index key, or some other attribute. If the data is stored in a Kafka topic or similar append-only, immutable message log and used in (near) real time, it is reasonable to consider it data in motion (even though it has been stored before it is processed, introducing a minor amount of latency). However, if the same data is stored in a file or database instead of in a message log before it is processed, it is less clear whether it should be called data in motion because it is organized and retrieved differently from a topic or other message log. Nevertheless, if the data is used in (near) real time, the label matters little, because the system may still support very low-latency applications. Conversely, if a data stream is stored in a Kafka topic, another message log, a file, or a database and then used days, weeks, or months later for non-real-time applications, it is no longer “data in motion” even if it is a sequence of immutable data processed in order of long-ago arrival (call it a historical data stream or event log).
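The contrast between the second and third definitions can be illustrated with a short Python sketch. Everything here is a toy under stated assumptions: `tumbling_window_counts` stands in for stateful processing in a time window before anything is stored, and `AppendOnlyLog` is a hypothetical in-memory stand-in for a message log read by offset. It is not the Kafka API, and the sample events are invented.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_s):
    """Definition 2: process each event as it arrives, before storing it.

    The only state kept is a running count per fixed (tumbling) time window.
    """
    counts = defaultdict(int)
    for timestamp, _payload in events:
        window_start = int(timestamp // window_s) * window_s
        counts[window_start] += 1
    return dict(counts)


class AppendOnlyLog:
    """Definition 3: store first, then read in arrival order by offset.

    A toy stand-in for an append-only message log (hypothetical class).
    """

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the record just written

    def read_from(self, offset):
        return self._records[offset:]


# Hypothetical events: (timestamp in seconds, payload).
events = [(0.5, "a"), (1.2, "b"), (1.9, "c"), (3.1, "d")]

# Process before storing: count events per 2-second window.
print(tumbling_window_counts(events, window_s=2))

# Store, then process: a consumer resumes reading at offset 2.
log = AppendOnlyLog()
for event in events:
    log.append(event)
print(log.read_from(2))
```

Either approach can serve a (near) real-time application; what differs is whether the data passes through storage before the processing logic sees it.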

See also: Beyond Kafka: Capturing the Data-in-motion Industry Pulse

In summary, no, you don’t always need data in motion to operate in real time. However, some high-volume, low-latency real-time systems do use data in motion in the sense that data is processed before being stored in a file or database. Other scalable, low-latency real-time systems store the data before processing it. Architects and software engineers must consider the specific characteristics of the business problem to determine how to design their real-time application and what tools to use. In Part 2 (Four Kinds of Software to Process Streaming Data in Real Time), we’ll look at four categories of software tools and when to use each.

Roy Schulte

About Roy Schulte

Roy Schulte is a former Gartner Fellow and co-author of the book “Event Processing: Designing IT Systems for Agile Companies”. He holds a BS and MS from MIT, and his recent work focuses on stream processing, real-time analytics, and decision intelligence.
