Apache Streaming: Why Enterprises Struggle With Storm, Spark


Enterprises looking to support streaming analytics often turn to Apache Storm and Apache Spark Streaming, two popular open-source projects. Here, RTInsights contributor Phu Hoang discusses the benefits and challenges enterprises discover when using Apache Storm and Apache Spark Streaming.

With so much hype around Apache Storm and Apache Spark Streaming, many assume that open-source projects are the obvious choices to support streaming analytics. After all, there are three benefits that enterprises can reap by choosing Apache streaming technologies.

The first benefit that comes to mind is that Apache Hadoop and YARN (Yet Another Resource Negotiator) have enabled large-scale and extremely fast in-memory streaming and batch processing. The second benefit is that Apache Storm and Apache Spark Streaming’s open-source ecosystem are playing a big role in helping to define and expand the broader market around streaming. The third benefit is, given Apache Storm and Apache Spark Streaming’s popularity, many companies are rethinking their data processing use cases and using streaming to reduce time to insight and action.

However, contrary to popular belief, enterprises often struggle to reconcile Apache streaming's narrow focus with broader enterprise business objectives, and rarely get beyond the proof-of-concept (POC) stage.

To be truly successful, a solution needs to meet the needs of data scientists, developers and data center operations teams, without needing extensive and expensive patchworks of custom code that are prone to failure. While there have been some successful use cases with Apache Storm and Apache Spark Streaming, these are typically simple streaming data transformations and cleansing before loading to the Hadoop distributed file system (HDFS).

The following are the top three reasons why enterprises deploying open-source solutions struggle with Apache Storm and Apache Spark Streaming:

1: Enterprises require an enterprise-grade streaming analytics platform

Enterprises using Apache streaming technologies are quickly finding out that it is not sufficient to take open-source code (that might work for some large web-scale organizations with scores of platform-level developers) and try to deploy it in an enterprise data center. Apache Storm's and Apache Spark Streaming's applicability is limited by their core architectures.

When evaluating options, enterprises should be asking the following three questions:

1. Is the streaming analytic solution’s fault tolerance complete (i.e., raw events, application state and engine state), abstracted from the developer and done natively in Apache Hadoop using HDFS?

2. Does it run natively on YARN for scheduling?

3. Is it automatically scalable and able to process increased data loads without reprogramming and redeployment?

If the answer to any of these is “no,” then the odds are that your organization’s developers will spend unnecessary hours coding at the platform level to make the solution work, hours they could be spending on your product.
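To make the first question concrete, the following is a minimal, hypothetical sketch of what "fault tolerance abstracted from the developer" means for application state: the platform periodically checkpoints operator state so a restarted worker resumes where it left off instead of losing everything. This is illustrative stdlib Python writing to the local filesystem; a real platform would persist to HDFS and also checkpoint raw events and engine state, as the question describes.

```python
import os
import pickle
import tempfile

# Hypothetical checkpoint location; a real platform would use an HDFS path.
CHECKPOINT = os.path.join(tempfile.gettempdir(), "app_state.ckpt")

def save_state(state: dict) -> None:
    """Atomically persist the operator's state."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)  # atomic rename: no half-written checkpoints

def restore_state() -> dict:
    """Recover state after a crash; start empty on the first run."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {}

# A restarted process picks up the prior count rather than starting from zero.
state = restore_state()
state["events_processed"] = state.get("events_processed", 0) + 100
save_state(state)
```

The point of the sketch is that this bookkeeping should live in the platform, not in every application: when developers must write it themselves, those are exactly the "unnecessary hours coding at the platform level" described above.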

2: Apache streaming is not friendly to data scientists and application developers

The path to production-ready streaming analytics entails a lot of upfront experimentation. Data scientists and application developers should be able to use intuitive visual tools to quickly create streaming applications and iterate over their hypothesis. These iterations should not involve cumbersome coding by developers.

The Java APIs in Apache Storm and Apache Spark Streaming require a lot of hand coding, as there is no library of prebuilt code and there are few data input and output connectors. Programming for Apache Storm and Apache Spark Streaming is also tedious, as the developer must manually account for scalability, handle input data skews, hand-code fault tolerance for the application data and attempt to force event ordering/re-ordering. Both options lack visual development tools and require coding by developers. Additionally, neither platform enables non-data scientists to create simple applications and quickly iterate over their analysis.
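As one illustration of the hand coding described above, event re-ordering is the kind of logic a developer ends up writing manually: out-of-order events must be buffered and released in timestamp order once it is safe to do so. The sketch below is a hypothetical, simplified version in stdlib Python (the class name, the fixed `max_delay` watermark and the integer timestamps are all assumptions for illustration).

```python
import heapq

class Reorderer:
    """Buffer out-of-order events; emit them in timestamp order."""

    def __init__(self, max_delay: int):
        self.max_delay = max_delay  # how late an event may arrive
        self.heap = []              # min-heap keyed by event timestamp

    def push(self, ts: int, payload) -> list:
        """Accept one event; return any events now safe to emit, in order."""
        heapq.heappush(self.heap, (ts, payload))
        # Anything older than the watermark can no longer be preceded
        # by a late arrival, so it is safe to release.
        watermark = ts - self.max_delay
        ready = []
        while self.heap and self.heap[0][0] <= watermark:
            ready.append(heapq.heappop(self.heap))
        return ready

# Events arrive out of order; output comes back in timestamp order.
r = Reorderer(max_delay=2)
out = []
for ts, ev in [(1, "a"), (3, "b"), (2, "c"), (6, "d")]:
    out.extend(r.push(ts, ev))
# out == [(1, "a"), (2, "c"), (3, "b")]; (6, "d") is still buffered
```

Even this toy version has to make policy decisions (how late is too late, what to do with events still buffered at shutdown) that a platform with built-in ordering would handle for the developer.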

To maximize developer productivity, look for a solution that:

1. Includes connectors to support fault-tolerant, auto-scaling data ingestion and distribution for all of your data sources and analytics destinations.

2. Provides common data analytics capabilities such as joins, aggregations and statistical analysis, available immediately.

3. Enables complex capabilities such as dimensional cube creations and integration with machine learning tools.
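To ground point 2 in the list above, a typical "common data analytics capability" is a windowed aggregation. The sketch below is a hypothetical batch-style version in stdlib Python (the function name, the 10-second tumbling window and the sample events are assumptions for illustration) of what a platform should offer out of the box, so developers are not rewriting it per application.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_sec=10):
    """Count events per key per fixed (tumbling) time window.

    events: iterable of (timestamp, key) pairs.
    Returns {(window_start, key): count}.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_sec) * window_sec  # bucket the timestamp
        counts[(window_start, key)] += 1
    return dict(counts)

stream = [(3, "login"), (7, "click"), (12, "click"), (14, "click"), (21, "login")]
result = tumbling_window_counts(stream)
# result[(10, "click")] == 2: two clicks in the 10-20 second window
```

In a real streaming engine the same aggregation would run continuously over unbounded input, with the fault tolerance and scaling concerns from the previous section handled underneath it.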

[Figure: Spark Streaming architecture. Spark’s core API rests on R, SQL, Python, Scala, and Java.]

3: Rudimentary capabilities for applications

Fast Big Data doesn’t stop, and neither can the insights and actions that your business takes. As a result, streaming analytics applications are designed to run 24x7x365 with no downtime. Data center operations teams must ensure that the full life cycle of application deployment, including monitoring, updating and problem resolution, meets the organization’s business commitments. Management requirements extend not only to on-premises deployments but also to cloud and hybrid cloud/data center deployments.

Apache Storm and Apache Spark Streaming provide only rudimentary capabilities across the application life cycle. Their management and monitoring tooling does not provide full visibility into all of the metrics of the streaming application and infrastructure, and neither architecture makes provision for dynamic application updates.

Enterprises need to have a streaming solution that meets the following criteria:

1. Includes easy-to-use tools for the full application deployment and management operations cycle.

2. Provides operational support via visual, automated alerting and command line for your data center operations team.

While Apache Storm and Apache Spark Streaming are satisfactory for sandbox and POC purposes, they often fall short of production deployments in the enterprise. Both are limited by their core architectures, neither is friendly to data scientists and application developers, and both lack capabilities across the application life cycle. The aforementioned questions and criteria will help guide the enterprise user to find a real-time streaming analytics platform that makes the grade.


About Phu Hoang

Phu Hoang is the cofounder and CEO of DataTorrent. Prior to DataTorrent, Phu was a founding member of the engineering team at Yahoo. Throughout his 12 years there, he led many of the engineering teams that built critical parts of the Yahoo network and portfolio including Search, Advertising, Commerce, News, and Finance and Sports. He served as their Executive Vice President of Engineering. He also served as Executive Chairman of Buysight, an ad exchange technology startup which was acquired by AOL in 2012. Prior to that, Phu was Vice President of Engineering & Product at Kiva, a nonprofit, micro-lending exchange. Follow him on Twitter @phuhoang.
