Streaming Analytics: To Stream or Not to Stream?

Streaming analytics products should be able to handle huge amounts of data in motion.

According to last summer’s Bloor Market Report on Streaming Analytics, it’s “no longer a question. Streaming analytics is here to stay.”

Everyone, of course, is now claiming to do streaming analytics. There are a dozen or more open-source streaming offerings, and the entire Complex Event Processing market now falls under the label of streaming analytics (see “Forrester Wave 1Q16 on Streaming Analytics”).

Streaming analytics vs. messaging

The market calls quite a few products “streaming analytics,” but many of them aren’t really streaming. AWS Kinesis, for example, works much like Apache Kafka: it ‘streams’ data into a data store that holds it for 24 hours by default, allowing you to read it out and analyze it on some other server. But Kinesis still claims to be a streaming analytics offering.

What about Spark Streaming?  Is that streaming?

It all depends on your perspective. Certainly, the folks who created Spark Streaming at UC Berkeley’s AMPLab thought it was streaming, hence the name. But like Kinesis, it ‘streams’ data into an in-memory data store (a resilient distributed dataset, or RDD) and then launches a batch job on Spark to process that micro-batch of data.
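To make the micro-batch idea concrete, here is a minimal sketch in plain Python. It does not use the Spark API; the function names, the one-second interval and the sample events are all assumptions chosen for illustration. It groups time-ordered events into fixed-interval batches and shows how long each event waits before its batch can even start processing.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[int, str]  # (arrival time in milliseconds, payload)


def micro_batches(events: Iterable[Event],
                  interval_ms: int) -> Iterator[Tuple[int, List[Event]]]:
    """Group time-ordered events into fixed-interval batches.

    Yields (batch_close_time, events_in_batch). Intervals that
    received no events produce no batch.
    """
    batch: List[Event] = []
    close_at = None
    for ts, payload in events:
        # The current batch is closed once an event arrives past its boundary.
        if close_at is not None and ts >= close_at:
            yield close_at, batch
            batch = []
            close_at = None
        if close_at is None:
            # End of the interval that contains this event.
            close_at = (ts // interval_ms + 1) * interval_ms
        batch.append((ts, payload))
    if batch:
        yield close_at, batch


def waiting_times(events: Iterable[Event],
                  interval_ms: int) -> List[Tuple[str, int]]:
    """Return (payload, ms spent waiting for the batch to close) per event."""
    return [(payload, close_at - ts)
            for close_at, batch in micro_batches(events, interval_ms)
            for ts, payload in batch]


# Hypothetical sample data: five events spread over about three seconds.
sample_events = [(100, "a"), (400, "b"), (950, "c"), (1200, "d"), (3050, "e")]

if __name__ == "__main__":
    for payload, wait in waiting_times(sample_events, interval_ms=1000):
        print(f"event {payload} waited {wait} ms before its batch closed")
```

In this sketch an event that arrives just after a batch boundary waits almost the full interval before any analysis can begin, which is the latency cost of micro-batching compared with handling each event as it arrives.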

Is Kafka a streaming offering?

It depends on your perspective. It was originally a messaging product, like IBM MQ or IBM Integration Bus or Apache NiFi. But while you’re moving the message from one place to another, you can introduce some extra Java code to look at the data or transform the data. So you’re doing more than just delivering the message; only the market will decide if that’s really streaming, or just enhanced message delivery.

What is messaging?

A simple answer is that messaging products are designed to deliver electronic messages to one or more destinations.

So messaging is like email or SMS texts?

Well, no. Email is text (with possible attachments) that you send to one or more people; it is the electronic version of the postal service. SMS is text sent between phones. Messaging is about delivering messages to computers, a broad class of systems that may include smartphones.

Can a messaging system like Kafka do streaming analytics?

Now we’re getting to the heart of things. Streaming is all about analytics: getting data in (like a message from Kafka, an event from a sensor or a record from a database) and then analyzing it. A messaging system can take a message and analyze it, but it’s really not designed to do that. Think of a postal service. It’s certainly possible for a postal service to open a letter, read the contents, create a summary, and then put the summary back in the envelope and send it to the recipient, but that’s not what it’s designed to do. Kafka and other messaging services are designed to deliver the message, not analyze it. But of course, they are computer systems, and they can do all sorts of things they weren’t designed for!

Streaming analytics defined

Since 2009, IBM has defined “streaming” as a system with the following characteristics:

  • It can handle a huge volume of data, more than traditional systems. It can run on a cluster of computers and can scale horizontally and vertically to handle millions of messages (sometimes called events or records) every second.
  • It can handle a huge variety of data and analytics: unstructured data like video, images or text, or structured data like IoT temperature sensor readings. It also supports a huge variety of analytics to see patterns in images, distinguish cod spawning from haddock spawning, detect fraudulent transactions and more.
  • It can perform across a cluster of computers with very low latency to handle data velocity, with latencies down to microseconds, as required for algorithmic trading or web advertising. This low latency requires continuous processing and the ability to handle each event as it arrives, not in a batch.
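The last characteristic, handling each event as it arrives, can be sketched as an operator that updates its result on every single event instead of waiting for a batch. This is a minimal single-process illustration in plain Python, not the API of any streaming product; the class name, the window size and the sensor readings are assumptions.

```python
from collections import deque


class SlidingAverage:
    """Keep a running average over the most recent `size` readings.

    Each call to on_event() does a constant amount of work and returns
    an up-to-date result immediately, with no batch boundary to wait for.
    """

    def __init__(self, size: int) -> None:
        if size <= 0:
            raise ValueError("size must be positive")
        self._window = deque(maxlen=size)
        self._total = 0.0

    def on_event(self, value: float) -> float:
        # When the window is full, the oldest reading is about to be evicted.
        if len(self._window) == self._window.maxlen:
            self._total -= self._window[0]
        self._window.append(value)
        self._total += value
        return self._total / len(self._window)


if __name__ == "__main__":
    # Hypothetical temperature readings arriving one at a time.
    operator = SlidingAverage(size=3)
    for reading in [20, 22, 24, 32]:
        print(f"reading={reading} running average={operator.on_event(reading)}")
```

A real streaming system would distribute operators like this across a cluster and chain many of them together; the sketch only shows the per-event processing model.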

We know that everyone has adopted volume, variety and velocity as the very epitome of big data, but I think we’d all agree that Hadoop clusters, and even Spark, can’t truly deliver on velocity, meaning responses in milliseconds or less. Nor can Kafka or any other messaging system: to guarantee message delivery, they write to persistent storage, which takes extra time.

Must I have all three requirements to need streaming?

No. A need for any one of volume, variety or velocity may lead you to streaming analytics. For example, a streaming product could have customers running batch applications (such as Telco CDR Mediation) or near-real-time applications (such as ICU customers) because of volume or variety requirements. Algorithmic trading customers, on the other hand, may require velocity even though their volume fits on a single server and doesn’t involve a variety of data types.

OK, but Kafka streaming claims to have some analytics, doesn’t it?

Kafka can perform simple aggregations, join streams, and even deliver messages in milliseconds. But it doesn’t offer a wide variety of analytics such as speech to text, facial recognition, machine learning, or scoring of models. And the question is, when you take something and start doing things it’s not designed to do, can it still handle the original work? Will it meet the availability and reliability requirements of a streaming solution?
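For readers unfamiliar with what “simple aggregations” and “joining streams” mean in practice, here is a minimal sketch of both operations in plain Python. It is not the Kafka API; the function names, window size and sample events are assumptions, and for brevity it computes over complete lists rather than incrementally.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def windowed_counts(events: List[Tuple[int, str]],
                    window_ms: int) -> Dict[Tuple[int, str], int]:
    """Count events per (window_start_ms, key) using tumbling windows."""
    counts: Counter = Counter()
    for ts, key in events:
        window_start = ts // window_ms * window_ms
        counts[(window_start, key)] += 1
    return dict(counts)


def join_within(left: List[Tuple[int, str, str]],
                right: List[Tuple[int, str, str]],
                max_gap_ms: int) -> List[Tuple[str, str, str]]:
    """Pair left and right events that share a key and whose
    timestamps differ by at most max_gap_ms."""
    by_key = defaultdict(list)
    for ts, key, value in right:
        by_key[key].append((ts, value))
    return [(key, left_value, right_value)
            for left_ts, key, left_value in left
            for right_ts, right_value in by_key.get(key, ())
            if abs(left_ts - right_ts) <= max_gap_ms]


# Hypothetical sample streams: (timestamp_ms, key[, value]).
clicks = [(100, "u1"), (200, "u2"), (900, "u1"), (1100, "u1")]
orders = [(100, "u1", "order-1"), (5000, "u2", "order-2")]
payments = [(150, "u1", "pay-1"), (9000, "u2", "pay-2")]

if __name__ == "__main__":
    print(windowed_counts(clicks, window_ms=1000))
    print(join_within(orders, payments, max_gap_ms=500))
```

Counting and key-based joins like these are a long way from analytics such as speech to text or model scoring, which is the distinction drawn above.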

So, back to the question: to stream or not to stream?

Everyone needs streaming. All data is created in real time, so why not process it right away in real time? It’s more efficient than first landing the data, even in an in-memory data store. Streaming is analysis of data in motion, before you persist it. It can handle huge amounts of data, both structured and unstructured, and it can do so with very low latencies.


Roger Rea

About Roger Rea

Roger Rea is the InfoSphere Streams Product Manager at IBM.
