The ever-faster pace at which companies compete has them focused
on “data-driven” digital transformations in which their real-time business
processes can be driven by analysis of the entirety of their operational
and historical data. However, most organizations still struggle to achieve
this. A recent HBR
survey found that 69 percent of companies had yet to create a data-driven
organization. The challenge can be great for companies developing applications
for use cases such as predictive maintenance that require real-time analysis across
a subset of their data lake and their operational data.
Fortunately, a Digital
Integration Hub architecture built using a suite of open source solutions powered
by an in-memory computing platform now makes it possible to bridge operational
and historical data, enabling real-time analytics and machine learning across
the combined dataset. These processes can, in turn, drive real-time business
processes, which can result in better business performance and happier
customers.
Worlds Apart: Operational vs. Analytical Databases
The split between operational and analytical datastores was created
decades ago to ensure the performance and reliability of operational systems. Attempting
to analyze data in the same database used for transactions has the potential
for serious application performance degradation, so separate data warehouses evolved
to support fast analytics without impacting operational performance. However, data
warehouses require clean, structured data and a time-consuming extract, transform, and load (ETL) process, so the data is already stale by the time it reaches the warehouse. The structured nature of the data warehouse also limits what data
can be stored and what insights can be derived. These limitations led to the
development of data lakes.
Today, companies typically push all of their operational
data into a data lake, often built on Apache Hadoop, which can store unstructured
and non-normalized data. However, data lakes were optimized for data
exploration, not for driving real-time business processes. They serve primarily as a repository on which data scientists run ad hoc or long-running analytics jobs. The data in a data lake may span
years, and analytics queries may take hours or days to complete.
A common requirement of digital transformation use cases is
the need to drive real-time business processes based on a holistic view across
a company’s data lake and operational data sets.
Consider the following Internet of Things (IoT) use case. A trucking
company has thousands of vehicles constantly on the move. To improve asset
utilization, each truck is equipped with a myriad of sensors that detect location,
altitude, engine temperature, fuel consumption, internal trailer temperature,
humidity, tire pressure, and more. These sensors send massive amounts of data in real time to a centralized IoT data store. The IoT platform ingests, processes,
and analyzes the streaming device data, providing the company with insights
into the current state of each vehicle, as well as asset utilization across the
fleet.
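To make the ingest step concrete, here is a minimal sketch of how a truck (or an edge gateway) might publish a reading to Apache Kafka using the kafka-python client. The broker address, topic name, and message fields are illustrative assumptions rather than part of any specific platform.

```python
import json
import time

from kafka import KafkaProducer  # kafka-python client

# Connect to the Kafka cluster that feeds the IoT data store.
producer = KafkaProducer(
    bootstrap_servers="kafka-broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# One hypothetical sensor reading from one truck.
reading = {
    "vehicle_id": "truck-0042",
    "ts_ms": int(time.time() * 1000),
    "engine_temp_c": 96.2,
    "fuel_lph": 31.4,
    "tire_pressure_kpa": 760,
    "trailer_temp_c": 4.1,
}

# Publish to the telemetry topic; downstream consumers (the operational
# database, the data lake, the in-memory platform) read from this stream.
producer.send("truck-telemetry", reading)
producer.flush()
```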
Suppose the company wants to implement predictive
maintenance to be able to service trucks before they fail and thereby reduce
costs associated with out-of-service vehicles. To accomplish this, the company needs a comprehensive view of each vehicle's state: it must identify anomalies in the streaming data and combine them with the historical data for that vehicle stored in the data lake to determine when the vehicle needs service. A reading that looks anomalous relative to the entire fleet can only be interpreted correctly by comparing it both with the current state of the fleet and with the historical data for that specific vehicle and that specific reading. What is normal for one vehicle may not be normal for all vehicles.
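The distinction matters in practice. The following minimal sketch, in plain Python with illustrative numbers, scores the same engine-temperature reading against a fleet-wide baseline and against that one vehicle's own history; only the vehicle-specific score should trigger a maintenance action.

```python
from statistics import mean, stdev

def zscore(value, history):
    """How many standard deviations the value sits from the historical mean."""
    return (value - mean(history)) / stdev(history)

engine_temp = 104.5                               # live reading
fleet = [88.0, 90.5, 92.1, 89.7, 91.3]            # recent fleet-wide temps
this_truck = [101.0, 103.2, 102.8, 104.0, 103.5]  # this truck always runs hot

fleet_z = zscore(engine_temp, fleet)         # roughly 9: alarming fleet-wide
vehicle_z = zscore(engine_temp, this_truck)  # roughly 1.4: normal for this truck

# Flag for service only if the reading is anomalous for this vehicle.
needs_service = abs(vehicle_z) > 3.0
print(fleet_z, vehicle_z, needs_service)     # high, low, False
```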
This comparison requires pulling relevant data from the data
lake and delivering it to a platform where real-time analytics can be run across
the combined historical and operational data. An increasingly popular strategy for doing this is to use an in-memory computing platform, which can serve as a hybrid transactional/analytical processing (HTAP) system with the built-in capability to run federated queries. In addition, machine learning model training can be run across the combined operational and historical dataset to improve and update the model used to identify readings that indicate a need for maintenance. To understand how this works, it's important to review the components of an in-memory computing platform.
The Basics of In-Memory Computing
In-memory computing platforms are now commonly deployed in
enterprises. Gartner has predicted in-memory
computing will be incorporated into most mainstream products by 2020. A basic in-memory computing platform provides a flexible set of APIs for ingesting data from multiple datastores simultaneously, real-time processing of data within the compute grid, the ability to scale out or up to support petabytes of in-memory data, and a flexible set of data access APIs for easy integration with one or more business applications. Whether deployed as an in-memory data grid (IMDG) or an in-memory database (IMDB), the platform runs on a cluster of servers on-premises, in a public or private cloud, or in a hybrid environment.
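As a concrete illustration of these data access APIs, here is a minimal sketch using pyignite, the Python thin client for Apache Ignite, which is one example of such an in-memory computing platform. The host, cache name, key, and value are illustrative assumptions.

```python
from pyignite import Client

# Connect to one node of the in-memory cluster via the thin-client protocol.
client = Client()
client.connect("127.0.0.1", 10800)

# Simple key/value access: store and read back the latest state of a vehicle.
cache = client.get_or_create_cache("vehicle_state")
cache.put("truck-0042", '{"engine_temp_c": 96.2, "fuel_lph": 31.4}')
print(cache.get("truck-0042"))

client.close()
```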
An IMDG can be used with one or more existing databases, with no need to rip and replace them, and keeps the underlying databases in sync. An IMDB holds data in memory for processing and writes it to disk for backup and recovery. The IMDB should also be able to process data against the disk-based dataset, enabling fast restarts as well as a tradeoff between application speed and infrastructure costs by keeping only some of the data in memory.
The IMDG or IMDB pools the available RAM and CPUs of the
servers in the cluster, allowing all data to be maintained and processed in memory in real time. This eliminates the constant disk reads and writes typical of applications built on disk-based databases. In addition, the compute cluster
performs massively parallel processing (MPP), leveraging the computing power of
each server in the cluster to improve processing speed even further versus
databases that run on a single server. Eliminating the disk read/write
bottleneck and implementing MPP can improve application performance by up to
1,000X.
The performance and scale achieved by in-memory computing
platforms can enable HTAP, which is also known as hybrid operational/analytical
processing (HOAP), or translytical processing. HTAP provides the ability to
maintain a single dataset on which simultaneous transactional and analytical
processing is possible. With HTAP, standard, tested analytics queries can be processed
on the latest operational data without the need for an ETL process and without
degrading the performance of the business applications using the data in the
HTAP platform. In addition to enabling real-time analytics on operational data,
HTAP eliminates the cost and complexity of maintaining separate infrastructures
for operations and analytics.
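As a rough sketch of what HTAP looks like in practice, the snippet below writes a new operational record and immediately runs an analytical aggregate over the same in-memory table, again using pyignite against an assumed Apache Ignite cluster; the schema and data are illustrative.

```python
from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)

# Operational write path: record a new sensor reading as it arrives.
client.sql(
    "CREATE TABLE IF NOT EXISTS readings ("
    "id INT PRIMARY KEY, vehicle_id VARCHAR, engine_temp_c DOUBLE)")
client.sql(
    "INSERT INTO readings (id, vehicle_id, engine_temp_c) "
    "VALUES (1, 'truck-0042', 104.5)")

# Analytical read path: aggregate over the same dataset, with no ETL step
# and no separate analytics copy of the data.
for vehicle_id, avg_temp in client.sql(
        "SELECT vehicle_id, AVG(engine_temp_c) FROM readings "
        "GROUP BY vehicle_id"):
    print(vehicle_id, avg_temp)

client.close()
```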
An in-memory computing platform may also include:
Streaming data processing capabilities – Streaming data processing is the ability to perform continuous ingestion and queries on real-time data that is flowing into the platform from external sources, such as IoT devices or third-party data providers. Many in-memory computing platforms integrate with a streaming data platform solution that manages the complexity of routing data streams (for example, Apache Kafka®), enabling the in-memory computing platform to rapidly ingest and analyze high-volume data streams with real-time performance.
Machine learning and deep learning capabilities – In-memory computing platforms that incorporate machine learning (ML) libraries enable a continuous learning environment, or what Gartner refers to as “in-process HTAP.” A continuous learning environment allows machine learning models to be trained in real time against the operational dataset, with the updated models then applied in the associated business application. For example, to minimize the spread of new loan scams, a bank must continuously update its model of what indicates a loan fraud attempt based on the real-time data of new loan applications and push the resulting updated model into its production systems. (A sketch of such a retraining step appears after this list.)
Some in-memory computing platforms also incorporate native integrations with deep learning (DL) platforms, such as TensorFlow. This integration can dramatically decrease the cost and complexity of transferring data to DL training platforms and updating DL models following training.
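To illustrate the continuous-learning loop described above, here is a hedged sketch that retrains a predictive-maintenance model on the combined historical and recent operational telemetry using Spark MLlib. The paths, feature columns, the 0/1 `needs_service` label, and the choice of logistic regression are all assumptions made for illustration, not a prescribed design.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("retrain-maintenance-model").getOrCreate()

# Historical readings from the data lake plus the most recent operational data.
historical = spark.read.parquet("hdfs:///datalake/truck_telemetry")
operational = spark.read.parquet("hdfs:///staging/recent_telemetry")
training = historical.unionByName(operational)

# Assemble numeric sensor columns into a single feature vector.
assembler = VectorAssembler(
    inputCols=["engine_temp_c", "fuel_lph", "tire_pressure_kpa"],
    outputCol="features")
dataset = assembler.transform(training).select("features", "needs_service")

# Retrain and publish the updated model for the business application to load.
model = LogisticRegression(labelCol="needs_service").fit(dataset)
model.write().overwrite().save("hdfs:///models/maintenance/latest")
```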
Querying Across Data Lake and Operational Data
With an in-memory computing platform, it becomes possible to run federated, real-time queries across combined data lake and operational data. For this article, we will look at three open source solutions that can play
important roles in enabling this capability:
- Apache Hadoop: Hadoop is commonly used as
the data store for data lakes. Hadoop has a distributed file system that
provides easy-to-deploy, inexpensive data storage for large, unstructured datasets.
Hadoop can scale to thousands of cluster nodes, supporting the storage of
massive, historical datasets. However, while Hadoop supports data storage, it does
not include a computational engine, so development is necessary to deploy a
system capable of analyzing the data stored in Hadoop.
- Apache Kafka: Kafka is used for building
real-time streaming data pipelines that enable incoming data to flow to the
operational database and thus to the in-memory computing platform for real-time
processing. It can also be used to create a data pipeline to stream application
data directly into the in-memory computing platform.
- Apache Spark: Spark is a unified analytics engine for large-scale data processing. It can, for example, transfer data from a Hadoop data lake to the in-memory computing platform and perform federated queries across the data lake and operational datasets (a sketch of such a query follows this list).
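To make the Spark role concrete, the following minimal sketch runs a federated query that joins historical telemetry from the Hadoop data lake with current operational data read over JDBC. The HDFS path, the JDBC endpoint (shown here as an Apache Ignite thin-client URL), and all table and column names are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("federated-query").getOrCreate()

# Historical telemetry stored in the data lake.
historical = spark.read.parquet("hdfs:///datalake/truck_telemetry")

# Current operational data read from the in-memory platform over JDBC.
operational = (spark.read.format("jdbc")
               .option("url", "jdbc:ignite:thin://imc-node:10800")
               .option("driver", "org.apache.ignite.IgniteJdbcThinDriver")
               .option("dbtable", "CURRENT_READINGS")
               .load())

historical.createOrReplaceTempView("history")
operational.createOrReplaceTempView("current")

# Compare each vehicle's live engine temperature with its own historical average.
spark.sql("""
    SELECT c.vehicle_id,
           c.engine_temp_c      AS current_temp,
           AVG(h.engine_temp_c) AS historical_avg
    FROM current c
    JOIN history h ON h.vehicle_id = c.vehicle_id
    GROUP BY c.vehicle_id, c.engine_temp_c
""").show()
```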
The combination of the in-memory computing platform and
these open source solutions enables real-time analytics across the combined data
lake and operational data. Returning to the example of the trucking company, we
can see how these components work together:
- Apache Kafka feeds the live streaming data from
the IoT sensors into the in-memory computing platform, the operational
database, and the Hadoop data lake.
- When an anomalous reading in the live data is
detected by the application, such as high engine temperature or fuel
consumption, Spark retrieves the relevant historical data from the data lake and delivers it to the in-memory computing platform (a sketch of this step follows the list).
- The in-memory computing platform maintains the combined historical and operational dataset in memory, so Spark can run queries across the combined dataset, providing deep and immediate insight into the causes of the anomalous reading.
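As a sketch of the retrieval-and-delivery step, the snippet below pulls one vehicle's history from the data lake and pushes it into the in-memory platform over JDBC so it can be queried alongside the live operational data. The vehicle ID, paths, JDBC endpoint, and table name are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("anomaly-context").getOrCreate()

vehicle_id = "truck-0042"  # vehicle that produced the anomalous reading

# Pull only that vehicle's historical telemetry from the data lake.
history = (spark.read.parquet("hdfs:///datalake/truck_telemetry")
           .filter(F.col("vehicle_id") == vehicle_id)
           .select("vehicle_id", "ts_ms", "engine_temp_c", "fuel_lph"))

# Deliver the subset to the in-memory platform (here via an assumed Ignite
# JDBC endpoint) so real-time queries can span live and historical data.
(history.write.format("jdbc")
 .option("url", "jdbc:ignite:thin://imc-node:10800")
 .option("driver", "org.apache.ignite.IgniteJdbcThinDriver")
 .option("dbtable", "VEHICLE_HISTORY")
 .mode("append")
 .save())
```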
The demand to run real-time analytics across data lake and operational
data to drive real-time business processes cuts across industries. In addition
to enabling predictive maintenance in an endless number of scenarios (washing
machines, copy machines, power grids, computing infrastructure, etc.), it can help:
- Credit card companies reduce the risk associated
with approving new payment charges during the checkout process.
- IT departments detect and understand security
threats in real time to prevent breaches.
- Retailers create 360-degree customer views to
drive upsell and cross-sell opportunities.
In each case, creating Digital Integration Hub architectures
to provide the ability to query across real-time and historical data can improve
operations and drive better business decision making, improved products, enhanced
customer experiences, and increased asset utilization and ROI. Powered by
in-memory computing, real-time processing across live and historical data has become
a standard enterprise capability for leading companies undergoing digital
transformations.