Using Apache Pulsar for Data-Driven Applications


A discussion of the benefits Apache Pulsar brings versus other streaming technologies, and how companies are using it.

At its core, Apache Pulsar uses a publish-and-subscribe technique for building events and streaming data applications. It offers a multi-tenant scalable messaging system that can serve as a unified platform for a wide variety of demanding use cases. As a result, it has a lot to offer when developing and running data-driven applications.

RTInsights recently sat down with Chris Latimer, Vice President of Product Management at DataStax. We discussed what drew DataStax to Pulsar, what benefits Pulsar brings versus other streaming technologies, and how companies use it. Here is a summary of our conversation.

RTInsights: Why did DataStax decide to invest in Pulsar?

Chris Latimer, VP of Product Management, DataStax

Latimer: When we started to think about getting into streaming, we talked with as many customers and potential customers as we could to understand what types of problems they were facing. We found that organizations were struggling in the space of data in motion, and they were struggling for a few reasons.

The first thing that we heard was that a lot of the data they wanted to capture was not readily available. So, they brought in things over the years, like various MQ-type platforms. Our talks with customers were almost like doing an archeological dig of the enterprise architecture space.

You’d see the early MQSeries era and the JMS era. As folks went into the cloud, you would see cloud-native services like Amazon SNS and SQS and Google Cloud Pub/Sub. And more recently, you’d see things happening in streaming. All of these things were disconnected.

The thing that we heard the most is that stitching these things together was a major challenge. Anytime you had a system that was on AWS, and it needed to exchange messages with something that was on-prem, you had to solve an integration problem. It was basically an impossible challenge for data scientists and folks in that type of role who were trying to get a unified view of all of the events and messages that were happening within the organization.

So, we started to look at technologies that would be able to address those issues. We looked at Kafka, and we found that a lot of the organizations already had Kafka, and it really was not solving the core problems that they had.

That’s when we started to look more closely at Apache Pulsar. There were a few things we really liked about it. The first is that from an architectural perspective, Pulsar is really the next-generation architecture for messaging and streaming platforms. It combines both messaging semantics, as well as streaming semantics, which then gives companies a way of unifying that messaging and streaming capability into a single platform.
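To make the idea of combining messaging and streaming semantics concrete, here is a minimal sketch in plain Python. It does not use the Pulsar client library. The `Topic` class, the subscription names, and the worker names are all hypothetical, chosen only for illustration. It shows one append-only log serving both a streaming-style reader, which sees every message in order, and a queue-style worker pool, which splits the messages among its workers.

```python
from collections import defaultdict
from itertools import cycle


class Topic:
    """Toy in-memory topic: one append-only log shared by all subscriptions."""

    def __init__(self):
        self.log = []      # every published message, in publish order
        self.cursors = {}  # subscription name -> index of the next unread message

    def publish(self, message):
        self.log.append(message)

    def subscribe(self, name):
        # In this sketch a new subscription starts at the beginning of the log.
        self.cursors.setdefault(name, 0)

    def read_new(self, name):
        """Return messages this subscription has not seen and advance its cursor."""
        start = self.cursors[name]
        self.cursors[name] = len(self.log)
        return self.log[start:]


def deliver_streaming(topic, subscription):
    """Streaming semantics: a single reader receives every message, in order."""
    return topic.read_new(subscription)


def deliver_queue(topic, subscription, workers):
    """Messaging (work queue) semantics: messages are spread across workers."""
    assignments = defaultdict(list)
    for worker, message in zip(cycle(workers), topic.read_new(subscription)):
        assignments[worker].append(message)
    return dict(assignments)


topic = Topic()
topic.subscribe("analytics")      # hypothetical streaming-style subscription
topic.subscribe("order-workers")  # hypothetical queue-style subscription
for i in range(5):
    topic.publish(f"order-{i}")

stream_view = deliver_streaming(topic, "analytics")
queue_view = deliver_queue(topic, "order-workers", ["worker-a", "worker-b"])
print(stream_view)
print(queue_view)
```

Each subscription keeps its own cursor, so the two consumption styles do not interfere with each other even though they share the same stored messages. A real platform adds acknowledgments, persistence, and ordering guarantees that this sketch leaves out.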

DataStax is also very pro-open-source. We’re known as the company that’s focused on Cassandra, and obviously, we’ve done a lot in that space. We always look to strengthen the open-source technologies that we’re working with. As we looked at some of the other platforms that are out there, what we found is that there was a disconnect between the capabilities of open-source and the capabilities of vendor-specific products that were built on top of those open-source solutions.

The thing we really like about Pulsar is that it’s complete from an open-source perspective. You have things like geo-replication, multi-region support, multi-tenancy support, and schema registries. These capabilities are missing on other platforms, especially a platform like Kafka.

You’d have to go to a vendor to get those capabilities, or you would have to cobble together open-source libraries that other folks made. So, we really liked the fact that Pulsar had a complete architecture. And we also liked the fact that there is a comprehensive feature set in the core Apache Pulsar project.

We looked at these factors, and we matched them up with the problems our customers were having with data in motion and with unifying that data across both messaging and streaming. Then we looked at Pulsar from a technical fit. We found that, from both a technology perspective and the perspective of DataStax’s values and philosophy around open source, Pulsar was a great fit.

RTInsights: What are the common things you see in companies adopting Pulsar?

Latimer: There are a few common threads. One is the breadth of messaging technologies that are in use and the challenges that are associated with managing all those different MQ, JMS, and Kafka platforms and wanting to consolidate them.

We see organizations that are getting more serious about event-driven architecture and trying to figure out how to unify across cloud and on-prem. They also want to unify across streaming and messaging.

We also see a shift in some companies that are thinking about messaging and event streaming less as an application integration problem and more as a data problem. They are thinking about the value of the data that is currently trapped and siloed inside legacy platforms.

If you talk with data scientists or even data architects, a lot of times, they are starting to say, “If this data were stored in a relational database, we would never accept the state of things as they currently stand. We would be looking to do things like master data management. We would be looking to have a single source of truth for this data. And we have these same problems with our data in motion.”

It just so happens that, instead of residing in a database, the data is being published into the variety of messaging platforms that I mentioned. That is why companies that are starting to think about this as a data problem are really attracted to Pulsar: it is able to capture all of that message data across the entire organization.

Sometimes the data is not particularly useful on its own. Sometimes it’s just an event. But if you start putting those events together into a time series, that becomes a really interesting data set. Data scientists can use that data set for things like predictive analytics or anomaly detection. And if you start to look at it just through a data architecture lens, you have a lot of the same challenges that you want to solve around data consistency. If real-time streaming data is being used for things like data pipelines, you want to make sure that the data being pushed into that operational data store is in a form and a structure that aligns with the organizational data standards that you’ve put in place.

These are the common threads for companies that are adopting Pulsar. They are looking to unify messaging and streaming. Oftentimes, they are driven to Pulsar because they’re advancing their event-driven architecture. And they are making an investment in data scientists and data science, treating this data as data and not just as simply an artifact of integration.

RTInsights: How is Pulsar better/different than Kafka?

Latimer: At an architectural level, I think about Kafka as a pioneer in this space. And to a certain extent, I feel like the industry owes Kafka and the folks at LinkedIn a debt of gratitude for coming up with this early first-generation solution. It opened up a lot of use cases that made people aware of the value of event streaming. But like any pioneer, there are things that you’re going to get wrong, and that if you could go back in time and make a different decision, you probably would.

With Kafka, some fundamental architectural decisions turned out, at least in my view, to be incorrect. For example, Kafka decided to go with a monolithic architecture. And what I mean by that is they decided to combine the storage capabilities with the message-serving capabilities. That creates some serious bottlenecks when it comes to scaling out Kafka. Certainly, Kafka can scale up. But the process of scaling Kafka is not easy, and the process of scaling down is basically impossible.

Kafka has an architecture that tightly couples the topic storage with serving messages on that topic. That means that if you need to add brokers, you have to rebalance that topic across the new brokers you’ve added, because each broker is a monolith that both stores and serves the data. The net of that is that when you need to scale, at the worst possible time, Kafka decides, “I’m going to take some capacity from these already overextended brokers, and I’m going to dedicate it to rebalancing these topics across all the brokers.” That’s just an unfortunate choice.
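The cost of rebalancing described here can be illustrated with some simple arithmetic. The sketch below is a toy model, not a description of how any real broker reassigns data. The cluster size, the 50 GB partition size, and the `rebalance_minimal` helper are all assumptions made for illustration. It counts how much stored data has to be copied when a fourth broker joins, even if the rebalance moves as few partitions as possible.

```python
from collections import Counter


def assign_round_robin(partitions, brokers):
    """Map each partition to a broker in simple round-robin order."""
    return {p: brokers[i % len(brokers)] for i, p in enumerate(partitions)}


def rebalance_minimal(assignment, new_broker):
    """Move just enough partitions to the new broker to even out the counts."""
    broker_count = len(set(assignment.values())) + 1
    target = len(assignment) // broker_count  # partitions per broker afterwards
    counts = Counter(assignment.values())
    result = dict(assignment)
    moved = []
    for partition, broker in assignment.items():
        if len(moved) == target:
            break
        if counts[broker] > target:
            result[partition] = new_broker
            counts[broker] -= 1
            moved.append(partition)
    return result, moved


# Hypothetical cluster: 12 partitions of 50 GB each, spread over 3 brokers.
partitions = [f"p{i}" for i in range(12)]
size_gb = {p: 50 for p in partitions}
before = assign_round_robin(partitions, ["b1", "b2", "b3"])
after, moved = rebalance_minimal(before, "b4")

# Coupled storage and serving: each moved partition drags its stored data along.
coupled_copy_gb = sum(size_gb[p] for p in moved)

# Decoupled model (assumed here): data stays in a separate storage layer,
# so handing a topic to another broker changes ownership without copying data.
decoupled_copy_gb = 0

print(moved, coupled_copy_gb, decoupled_copy_gb)
```

In this toy setup, the existing brokers have to ship a quarter of the stored data to the newcomer while they are still serving traffic, which is the overhead the coupled design imposes at scale-out time.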

It has other side effects as well. If you are an organization that wants to store your event and stream data indefinitely with Kafka, that’s a really expensive proposition. You must maintain expensive virtual or physical machines, usually with high throughput and high-performance storage. It becomes a problem from a TCO perspective.

Pulsar, by contrast, supports the notion of tiered storage out of the box. So, you can offload that historical event stream data onto a really cheap solution like HDFS, S3, or a data lake and still maintain the ability to use all of the interfaces that you have within Pulsar to access the data and process messages. From a data science perspective, that historical record stays conveniently available, and you avoid the trade-off of either building some out-of-band process or accepting the high cost of data storage. With Pulsar, you get the best of both worlds.
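The tiered storage idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Pulsar’s implementation. The `TieredLog` class, its segment size, and the two dictionaries standing in for fast disks and cheap object storage are all hypothetical. The point it shows is that older segments move to the cheap tier while readers keep using a single `read` interface.

```python
class TieredLog:
    """Toy log that keeps the newest segments hot and offloads older ones."""

    def __init__(self, segment_size, hot_segments):
        if segment_size < 1 or hot_segments < 1:
            raise ValueError("segment_size and hot_segments must be at least 1")
        self.segment_size = segment_size  # messages per segment
        self.hot_segments = hot_segments  # how many newest segments stay hot
        self.hot = {}    # segment id -> messages (stands in for fast local disks)
        self.cold = {}   # segment id -> messages (stands in for cheap object storage)
        self.count = 0   # total messages appended so far

    def append(self, message):
        segment_id = self.count // self.segment_size
        self.hot.setdefault(segment_id, []).append(message)
        self.count += 1
        self._offload_old_segments()

    def _offload_old_segments(self):
        # Everything except the newest `hot_segments` segments moves to cold storage.
        for segment_id in sorted(self.hot)[:-self.hot_segments]:
            self.cold[segment_id] = self.hot.pop(segment_id)

    def read(self, offset):
        """Readers use one interface; which tier holds the data is hidden."""
        segment_id, index = divmod(offset, self.segment_size)
        tier = self.hot if segment_id in self.hot else self.cold
        return tier[segment_id][index]


log = TieredLog(segment_size=3, hot_segments=1)
for i in range(10):
    log.append(f"event-{i}")

print(sorted(log.hot), sorted(log.cold))
print(log.read(0), log.read(9))
```

Reading offset 0 reaches into the cold tier and reading offset 9 hits the hot tier, yet the caller cannot tell the difference. That is the property that makes long retention affordable without a separate out-of-band archive process.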

RTInsights: What are some of the use cases you are seeing with your customers?

Latimer: Many of them come down to getting better visual experiences and just understanding their customer better. This is a place where that disjointed collection of messaging technologies can be a problem.

Imagine that you’re trying to create what we would call a digital twin of a customer: a complete digital representation of what that customer has been doing and how they’ve been interacting with the business. You start by asking how you would capture all of the places where customers are interacting with you. They may be interacting with certain systems that are in the cloud. They may be interacting with legacy systems. You end up in that same situation where all of the customer events that are happening are scattered throughout the organization. So, if you want to get a digital twin of that customer, you have a really big and complex integration challenge.

One of the benefits that organizations get by looking at their event-driven architectures, trying to modernize them, and trying to consolidate those systems, is that something like creating a digital twin of a customer becomes pretty easy. You have all the messages and all the events for that customer in a consolidated place. For many Pulsar customers, it boils down to a customer 360 type of use case.

IoT is another big use case. Sometimes we tend to think IoT is really just a manufacturing-specific problem, but we’re seeing this in many different industries. We’ll see it with banking with things like ATMs.

We’ll see it with retailers’ point-of-sale systems or with in-store experiences. All those things capture important event data that can potentially be streamed. Sometimes, those are things that are tied in with the supply chain and real-time supply chain. Sometimes, they are just trying to drive organizational efficiency. And sometimes, they are being used to create better personalized experiences for customers.

Most of the use cases are being driven by forward-looking enterprise architects who are saying, “Hey, our business, the industry that we’re in, is increasingly going to be under competitive pressures because our competition is becoming more real time.” So, they need to get technology out ahead of that to do a better job of capturing events and acting on those events in real time.

Those are some of the main drivers that we’re seeing as we talk with customers who are using Pulsar.

Another class of applications has more of a data engineering perspective, in that companies want to create a real-time data pipeline. Use cases like Change Data Capture can be really convenient if you’re trying to modernize your event-driven architecture and you want to figure out how to start capturing all of the events that are going on in the organization.

A lot of those events can be captured just by listening to your database because oftentimes, they’re being captured in the form of inserts or updates that are happening. So, Change Data Capture and data engineering is another big use case that we see. It can be things like moving from a batch ETL into real time, but it can also use those change events to propagate events downstream to other systems that want to be notified and take action.
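As a rough illustration of Change Data Capture, the sketch below turns inserts and updates on a table into change events and fans them out to downstream listeners. It is a toy in-memory model. The `ObservedTable` class and the two listeners are hypothetical, and real CDC tools typically read the database’s transaction log rather than wrapping writes as this example does.

```python
class ObservedTable:
    """Toy table that emits a change event for every insert or update."""

    def __init__(self, name):
        self.name = name
        self.rows = {}        # primary key -> current row
        self.listeners = []   # downstream callbacks that receive change events

    def on_change(self, listener):
        self.listeners.append(listener)

    def upsert(self, key, row):
        event = {
            "table": self.name,
            "op": "update" if key in self.rows else "insert",
            "key": key,
            "before": self.rows.get(key),  # None for an insert
            "after": row,
        }
        self.rows[key] = row
        for listener in self.listeners:
            listener(event)


replica = {}         # downstream copy kept in sync in real time
notifications = []   # another system that simply wants to be told


def update_replica(event):
    replica[event["key"]] = event["after"]


def notify(event):
    notifications.append(f'{event["op"]}:{event["key"]}')


orders = ObservedTable("orders")
orders.on_change(update_replica)
orders.on_change(notify)
orders.upsert(1, {"status": "new"})
orders.upsert(1, {"status": "shipped"})

print(replica)
print(notifications)
```

The same stream of change events feeds both consumers: one replaces a batch ETL job by keeping a replica current, and the other reacts to each change as a business event.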

Salvatore Salamone

About Salvatore Salamone

Salvatore Salamone is a physicist by training who has been writing about science and information technology for more than 30 years. During that time, he has been a senior or executive editor at many industry-leading publications including High Technology, Network World, Byte Magazine, Data Communications, LAN Times, InternetWeek, Bio-IT World, and Lightwave, The Journal of Fiber Optics. He also is the author of three business technology books.
