Digital Integration Hub architectures let businesses query across real-time and historical data to improve operations and drive better business decision making.
The ever-faster pace at which companies compete has them focused on “data-driven” digital transformations in which their real-time business processes can be driven based on analysis of the entirety of their operational and historical data. However, most organizations still struggle to achieve this. A recent HBR survey found that 69 percent of companies had yet to create a data-driven organization. The challenge can be great for companies developing applications for use cases such as predictive maintenance that require real-time analysis across their subset of their data lake and operational data.
Fortunately, a Digital Integration Hub architecture built using a suite of open source solutions powered by an in-memory computing platform now makes it possible to bridge operational and historical data, enabling real-time analytics and machine learning across the combined dataset. These processes can, in turn, drive real-time business processes, which can result in better business performance and happier customers.
Worlds Apart: Operational vs. Analytical Databases
The split between operational and analytical datastores was created decades ago to ensure the performance and reliability of operational systems. Attempting to analyze data in the same database used for transactions has the potential for serious application performance degradation, so separate data warehouses evolved to support fast analytics without impacting operational performance. However, data warehouses require clean, structured data, and a time-consuming extract, transform, and load (ETL) process, so the data is old as soon as it arrives at the data warehouse. The structured nature of the data warehouse also limits what data can be stored and what insights can be derived. These limitations led to the development of data lakes.
Today, companies typically push all of their operational data into a data lake, often built on Apache Hadoop, which can store unstructured and non-normalized data. However, data lakes were optimized for data exploration and not to drive real-time business processes. Their purpose is primarily as a data repository on which data scientists can run ad hoc analytics or long-running analytics jobs. The data in a data lake may span years, and analytics queries may take hours or days to complete.
The Data Lake Digital Transformation Challenge
A common requirement of digital transformation use cases is the need to drive real-time business processes based on a holistic view across a company’s data lake and operational data sets.
Consider the following Internet of Things (IoT) use case. A trucking company has thousands of vehicles constantly on the move. To improve asset utilization, each truck is equipped with a myriad of sensors that detect location, altitude, engine temperature, fuel consumption, internal trailer temperature, humidity, tire pressure, etc. These sensors send massive amounts of data in real-time to a centralized IoT data store. The IoT platform ingests, processes, and analyzes the streaming device data, providing the company with insights into the current state of each vehicle, as well as asset utilization across the fleet.
Suppose the company wants to implement predictive maintenance to be able to service trucks before they fail and thereby reduce costs associated with out-of-service vehicles. To accomplish this, the company needs to create a comprehensive view of the state of the vehicle by identifying anomalies in the streaming data and combining this information with the broader view provided by the historical data stored in the data lake for the specific vehicle in order to identify when it is time to provide service to that specific vehicle. While a seemingly anomalous reading may be out of the norm across the entire fleet of vehicles, it is only possible to understand whether the reading indicates a problem which needs attention by simultaneously comparing it with the current state of the entire fleet as well as to a deep dive of the historical data for that specific vehicle and for that specific reading. What is normal for one vehicle may not be normal for all vehicles.
This comparison requires pulling relevant data from the data lake and delivering it to a platform where real-time analytics can be run across the combined historical and operational data. An increasingly popular strategy for doing this is to use an in-memory computing platform. The platform can power a hybrid transactional/analytical processing (HTAP) platform with the built-in capability to run the federated queries. In addition, machine learning model training can be run across this combined operational and historical data set in order to improve and update the model, which is used to identify readings that indicate a need for maintenance. To understand how this works, it’s important to review the components of an in-memory computing platform.
The Basics of In-Memory Computing
In-memory computing platforms are now commonly deployed in enterprises. Gartner has predicted in-memory computing will be incorporated into most mainstream products by 2020. A basic in-memory computing platform enables ingestion through a flexible set of APIs of data from multiple datastores simultaneously, real-time processing of data within the data compute grid, the flexibility to be scaled out or up to support petabytes of in-memory data, and a flexible set of data access APIs to provide easy integration to one or multiple business applications. An in-memory computing platform deployed for an in-memory data grid (IMDG) or in-memory database (IMDB) use case is deployed on a cluster of servers on-premises, in a public or private cloud, or in a hybrid environment.
An IMDG can be used with existing database or databases—with no need to rip and replace the existing databases —and keeps the underlying databases in sync. An IMDB holds data in-memory for processing, and to disk for backup and recovery. The IMDB should also be able to process against the disk-based dataset, enabling fast restarts as well as a tradeoff between application speed and infrastructure costs by maintaining only some of the data in memory.
The IMDG or IMDB pools the available RAM and CPUs of the servers in the cluster, allowing all data to be maintained and processed in real-time in memory. This eliminates the constant disk reads and writes typical with applications built on disk-based databases. In addition, the compute cluster performs massively parallel processing (MPP), leveraging the computing power of each server in the cluster to improve processing speed even further versus databases that run on a single server. Eliminating the disk read/write bottleneck and implementing MPP can improve application performance by up to 1,000X.
The performance and scale achieved by in-memory computing platforms can enable HTAP, which is also known as hybrid operational/analytical processing (HOAP), or translytical processing. HTAP provides the ability to maintain a single dataset on which simultaneous transactional and analytical processing is possible. With HTAP, standard, tested analytics queries can be processed on the latest operational data without the need for an ETL process and without degrading the performance of the business applications using the data in the HTAP platform. In addition to enabling real-time analytics on operational data, HTAP eliminates the cost and complexity of maintaining separate infrastructures for operations and analytics.
An in-memory computing platform may also include:
Streaming data processing capabilities – Streaming data processing is the ability to perform continuous ingestion and queries on real-time data that is flowing into the platform from external sources, such as IoT devices or third-party data providers. Many in-memory computing platforms integrate with a streaming data platform solution that manages the complexity of routing data streams (for example, Apache Kafka®), enabling the in-memory computing platform to rapidly ingest and analyze high-volume data streams with real-time performance.
Machine learning and deep learning capabilities – In-memory computing platforms that incorporate machine learning (ML) libraries enable a continuous learning environment, or what Gartner refers to as “in-process HTAP.” A continuous learning environment allows real-time training of machine learning models based on analyzing the operational dataset in real-time and subsequent updates to the machine learning model used in the associated business application. For example, to minimize the spread of new loan scams, a bank must continuously update its machine learning model of what indicates a loan fraud attempt based on the real-time data of new loan applications and push the resulting updated model into its production systems.
Some in-memory computing platforms also incorporate native integrations with deep learning (DL) platforms, such as TensorFlow. This integration can dramatically decrease the cost and complexity of transferring data to DL training platforms and updating DL models following training.
Querying Across Data Lake and Operational Data
With an in-memory computing platform, it is possible to add the ability to run federated real-time queries across combined data lake and operational data. For this article, we will look at three open source solutions that can play important roles in enabling this capability:
- Apache Hadoop: Hadoop is commonly used as the data store for data lakes. Hadoop has a distributed file system that provides easy-to-deploy, inexpensive data storage for large, unstructured datasets. Hadoop can scale to thousands of cluster nodes, supporting the storage of massive, historical datasets. However, while Hadoop supports data storage, it does not include a computational engine, so development is necessary to deploy a system capable of analyzing the data stored in Hadoop.
- Apache Kafka: Kafka is used for building real-time streaming data pipelines that enable incoming data to flow to the operational database and thus to the in-memory computing platform for real-time processing. It can also be used to create a data pipeline to stream application data directly into the in-memory computing platform.
- Apache Spark: Spark is a unified analytics engine that can perform large-scale data processing on the data, such as transferring the data from a Hadoop data lake to the in-memory computing platform and performing federated queries across the data lake and operational datasets.
The combination of the in-memory computing platform and these open source solutions enables real-time analytics across the combined data lake and operational data. Returning to the example of the trucking company, we can see how these components work together:
- Apache Kafka feeds the live streaming data from the IoT sensors into the in-memory computing platform, the operational database, and the Hadoop data lake.
- When an anomalous reading in the live data is detected by the application, such as high engine temperature or fuel consumption, Spark retrieves the relevant historical data from the data lake and delivers it to the in-memory computing platform.
- The in-memory computing platform maintains the combined data lake and historical dataset in memory, so Spark can run queries across the combined dataset, resulting in deep and immediate insight into the causes of the anomalous reading.
The demand to run real-time analytics across data lake and operational data to drive real-time business processes cuts across industries. In addition to enabling predictive maintenance in an endless number of scenarios (washing machines, copy machines, power grids, computing infrastructure, etc.), it can help:
- Credit card companies reduce the risk associated with approving new payment charges during the checkout process.
- IT departments detect and understand security threats in real-time to prevent breaches.
- Retailers create 360-degree customer views to drive upsell and cross-sell opportunities.
In each case, creating Digital Integration Hub architectures to provide the ability to query across real-time and historical data can improve operations and drive better business decision making, improved products, enhanced customer experiences, and increased asset utilization and ROI. Powered by in-memory computing, real-time processing across live and historical data has become a standard enterprise capability for leading companies undergoing digital transformations.