Streaming Analytics Basics: Kafka, Spark, and Cassandra

Using data pipelines with Apache Kafka, Spark and Cassandra

For many companies that have already invested heavily in analytics solutions, the next big step, and one that opens up new opportunities, is streaming analytics. This type of analytics lets companies ingest data and gather insights from it immediately, which enables a different and more immediate kind of agility.

While many people use “streaming” and “real-time” analytics interchangeably, it’s important to recognize the difference between the two. Streaming analytics refers to how data is processed, whereas real-time refers to the speed at which that processing happens. In streaming analytics, data is ingested and processed as soon as it becomes available and reaches the front of the processing queue; it is held back only by the processing capacity of the underlying infrastructure.

A streaming analytics solution can be real-time if it processes data continuously and delivers results quickly enough for the needs of the particular application. In a 2016 survey conducted by ESG Global, most companies responded that for a system to be “real-time” for them, it would need to update data within seconds (34 percent) or milliseconds (35 percent).
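One way to make "quickly enough for the application" concrete is to compare measured end-to-end latency against a budget. The sketch below is only an illustration: the `ProcessedEvent` record, the `meets_budget` helper, and the choice of a 95th-percentile check are assumptions made here, not something taken from the survey or the webinar.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessedEvent:
    """Hypothetical record of one event's trip through the pipeline."""
    event_id: str
    ingested_at: float  # seconds, when the event arrived
    emitted_at: float   # seconds, when its result became available


def latency_seconds(event: ProcessedEvent) -> float:
    """End-to-end latency of a single event."""
    return event.emitted_at - event.ingested_at


def meets_budget(events, budget_seconds: float, quantile: float = 0.95) -> bool:
    """Return True if the given latency quantile is within the budget.

    Uses the nearest-rank method, so the result is always one of the
    observed latencies. An empty sample trivially meets the budget.
    """
    latencies = sorted(latency_seconds(e) for e in events)
    if not latencies:
        return True
    rank = max(1, math.ceil(quantile * len(latencies)))
    return latencies[rank - 1] <= budget_seconds
```

Under these assumptions, a team that defines real-time as "within seconds" might call `meets_budget(events, budget_seconds=1.0)`, while a team that needs millisecond updates would pass a much smaller budget.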

Nik Rouda, a senior analyst at ESG Global, said in a webinar about real-time and streaming analytics that more companies are starting to recognize the importance of processing data faster and in ways that are vastly different from traditional batch analytics. Their priorities align with the areas in which streaming analytics is most powerful: they want faster tactical responses to customers (54 percent), reduced risks (54 percent), stronger sales and marketing performance (50 percent), and operational efficiency (49 percent).

For example, let’s say a company is launching a new product and a new online ad campaign. With streaming analytics, they can simultaneously ingest campaign performance (engagement and clicks) alongside social media data with customer feedback, and immediately send it to the streaming analytics solution. Once the data is parsed, they can tweak the campaign and address potential concerns immediately, rather than waiting until it’s over.

Kafka, Spark and Cassandra: mapping out a ‘typical’ streaming model

Rouda and Nanda Vijaydev, the director of solutions at BlueData Software, both propose one streaming analytics solution built from three components: Kafka, which handles ingest and stream processing; Spark, which performs streaming analytics; and Cassandra, which stores the data. This is a popular and completely open source option, although both Rouda and Vijaydev are quick to reiterate that streaming analytics can be done with many different configurations.

“Part of the thing that’s incumbent on you, as decision makers, is to figure out what are the best set of tools to deliver the right functionality you need,” Vijaydev says. “There is no vendor lock-in—these are decoupled, individual systems.”

To that end, Rouda emphasizes that those seeking streaming analytics must keep an important end goal in mind. He says, “A lot of vendors are happy to say, ‘Download, install it, spin it up, and you’ll be ready to go in 15 minutes,’ but the reality is for the business to actually change the way it works, for end users to make analytics a part of their daily activities. That should be your end point.”

Spark and Spark Streaming are the core of this particular streaming workflow. Spark Streaming is an extension of the Apache Spark API, and can be used to integrate data from different event streams (such as Kafka and Twitter) asynchronously. It’s made for working with continuous streams of data, and is praised for its ease of programming, the ability to combine it with many different data stores, and the flexibility to run on-premises or in the cloud.
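Spark Streaming handles continuous data by cutting the stream into small batches at a fixed interval and running a computation on each one. The sketch below imitates that idea in plain Python so it runs without a cluster. It does not use the Spark API, and the function names and the `(timestamp, key)` event shape are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_interval: float):
    """Group (timestamp, key) events into fixed-length batches.

    Batches are returned in time order. Intervals that received no
    events are skipped in this simplified version.
    """
    batches = defaultdict(list)
    for timestamp, key in events:
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(key)
    return [batches[index] for index in sorted(batches)]


def count_per_batch(events, batch_interval: float):
    """Count occurrences of each key within every micro-batch."""
    return [Counter(batch) for batch in micro_batches(events, batch_interval)]
```

With a one-second interval, for example, clicks and views that arrive during the same second are counted together, and the next second starts a fresh count.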

The Kafka-Spark-Cassandra pipeline has proved popular because Kafka scales easily to a big firehose of incoming events, on the order of 100,000 per second and more, and offers easy connectors to popular streams of data, such as social media. Spark can be configured in a multitude of ways, such as running SQL queries or machine learning on the same data stream, and it is backed by a vigorous developer community. Cassandra, as the final piece, enables further analytics through low-latency, high-throughput storage, which suits dashboards that help reveal new insights after the fact.
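The division of labor among the three stages can be sketched with in-memory stand-ins. In the toy example below, a `queue.Queue` plays the role of a Kafka topic, a plain function plays the role of the Spark job, and a dictionary keyed by `(campaign, minute)` plays the role of a Cassandra table. None of this touches the real systems, and the event fields (`campaign`, `ts`, `clicks`) are hypothetical.

```python
import queue


def produce(topic: queue.Queue, events) -> None:
    """Ingest stage: append incoming events to the topic stand-in."""
    for event in events:
        topic.put(event)


def process(topic: queue.Queue, table: dict) -> int:
    """Processing and storage stages.

    Drains every event currently in the topic and upserts the click
    total for its (campaign, minute) key into the table stand-in.
    Returns the number of events processed.
    """
    processed = 0
    while True:
        try:
            event = topic.get_nowait()
        except queue.Empty:
            break
        key = (event["campaign"], int(event["ts"] // 60))
        table[key] = table.get(key, 0) + event["clicks"]
        processed += 1
    return processed
```

Because each stage only shares a queue or a table with its neighbors, any one of them could be replaced without changing the others, which mirrors the decoupling the speakers describe.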

Experimenting with streaming analytics

Rouda and Vijaydev both agree that for the most part, particularly in the early stages of implementing streaming analytics, one can’t go wrong with the Kafka-Spark-Cassandra pipeline, but they also emphasize the importance of experimentation and iteration.

Because most of the solutions are decoupled, it’s easy to swap and reconfigure them as necessary, and even to run streaming and more traditional query analytics from the same platform. A number of providers, such as BlueData, offer software that helps deploy pre-configured Docker containers running every aspect of this pipeline, which simplifies the rollout even further. And once the infrastructure is there, everyone can start playing around with the data, regardless of their skill level.

From there, it’s all about figuring out what doesn’t work, and fixing it. Vijaydev says, “Once you have the necessary infrastructure to work on these things, it is a matter of iteration. You start, you consume messages, it breaks, you fix it, you see how it works, and you do it over.”

Joel Hans

Joel Hans is the former managing editor of Manufacturing.net. He earned his master's degree from the University of Arizona, and currently lives and writes in Tucson.
