StreamSets Centralizes Management of Data Pipelines

The firm’s goal is to accelerate application development by providing a cloud service through which data pipelines can be managed.

StreamSets today moved to bridge the divide between DevOps and DataOps by making it simpler to manage data flows across multiple application pipelines.

The goal is to accelerate development of modern applications by providing a cloud service through which a set of graphical data flows and pipelines can be managed, says Clarke Patterson, head of product marketing for StreamSets. “It provides a cloud-based approach to creating and orchestrating pipelines,” he adds.

This update to the company’s namesake platform adds a continuous integration/continuous deployment (CI/CD) framework for automating frequent changes to data flows, says Patterson.

Other new capabilities include support for Kubernetes clusters and a data flow designer that comes with pre-configured connectors for data sources such as Amazon S3, Elastic MapReduce (EMR) and Redshift; Azure Data Lake Storage, HDInsight and Azure Databricks; Google Dataproc; and Snowflake.

In addition, StreamSets now provides a data drift handling capability, which automatically reflects updates to source schema in the Amazon Athena, Azure SQL, and Google BigQuery cloud data services. Finally, a StreamSets Data Protector tool allows policies attached to sensitive data to be detected and enforced.
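To make the idea of drift handling concrete, here is a minimal Python sketch of the general technique: when an incoming record carries a field the target schema has not seen, the schema is extended instead of the pipeline failing. This is an illustration only, not StreamSets code or its API. The function name `reconcile_schema` and the dictionary-based schema are hypothetical, and a real system would map types to the destination's own column types.

```python
def reconcile_schema(target_schema, record):
    """Add fields found in `record` but missing from `target_schema`.

    Hypothetical helper for illustration: `target_schema` maps column
    names to type names and is updated in place. Returns only the
    columns that were newly added.
    """
    added = {}
    for field, value in record.items():
        if field not in target_schema:
            # Assumption: infer the column type from the Python value's type.
            inferred = type(value).__name__
            target_schema[field] = inferred
            added[field] = inferred
    return added


# The source starts sending an extra "email" field.
schema = {"id": "int", "name": "str"}
added = reconcile_schema(schema, {"id": 1, "name": "a", "email": "a@example.com"})

# A second record with the same shape causes no further change.
added_again = reconcile_schema(schema, {"id": 2, "name": "b", "email": "b@example.com"})
```

After the first call the schema has gained an `email` column, and the second call reports nothing new, which is the behavior a drift-tolerant pipeline wants: adapt once, then carry on.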

In general, many of the DevOps concepts and processes originally pioneered to accelerate application development are now being applied to how data gets managed, a practice known as DataOps. Rather than waiting weeks for a database administrator to construct a schema to expose a set of data pipelines, DataOps enables data pipelines to be exposed in a much faster, more agile manner.

Fresh off raising an additional $35 million in funding, StreamSets is applying these concepts at a time when the way applications are being developed and deployed is fundamentally changing thanks to the rise of microservices-based architectures. Rather than having to update an entire monolithic application, new functionality can be added to an application more easily by updating only a limited number of microservices. That microservices approach, in theory, enables IT organizations to build and maintain a much larger portfolio of applications.

But each of those microservices is tapping into a pipeline to process data. Those pipelines are increasingly being connected to platforms that make that data available in real time. Most microservices are being built using containers running mainly on Kubernetes, the container orchestration engine that is quickly emerging as a de facto standard.

Those container-based microservices are all trying to access data at a level of concurrent scale that is unprecedented. Given the massive volume of data involved, a more consistent approach to managing that data in the form of DataOps is now required. Organizations will clearly need to meld their DevOps and DataOps initiatives to construct applications capable of analyzing data in real time.

StreamSets claims that in the previous four fiscal quarters it has doubled its commercial customer count and tripled its revenues, and that the open source StreamSets Data Collector at the core of the platform has been downloaded well over two million times by thousands of companies. Commercial customers include GSK, Chesapeake Energy and Solera Holdings, and the company notes that over two-thirds of StreamSets’ commercial customers subscribe to one or more of its proprietary software offerings.

The challenge now is getting everyone within IT on the same data pipeline page at a time when microservices-based applications will have more dependencies than ever.
