By 2025 only a tiny percentage of real-time data will be stored because nowhere near enough storage hardware will exist to handle the predicted data volumes. So, what happens to the rest?
The headline from the recent IDC white paper, “Data Age 2025: The Evolution of Data to Life-Critical,” was that the data generation rate will grow from 16 ZB per year today (a zettabyte is a trillion gigabytes) to 160 ZB per year in 2025. While the white paper provided numerous insights, such as the critical nature of data, security concerns, and the importance of IoT, one key takeaway deserves further attention: the rise of real-time data, and why stream processing is essential to cope with the coming massive shortfall in storage capacity.
The Data Storage Shortfall
IDC pointed out that, of the 160 ZB, about a quarter will be real-time in nature (generated, processed and instantly accessible), up from around 5 percent today, and most of that real-time data (95 percent) will come from the world of IoT. But here’s the kicker: Data storage capacity has not been, and will not be, able to keep up. In the words of IDC, “…the quantity of data generation can and will continue to outpace any reasonable expectation of our ability to store all of the data.”
It turns out that by 2025 only a tiny percentage of data – between 3 percent and 12 percent, depending on the source – can be stored. Quite literally, nowhere near enough storage hardware can be manufactured to handle the predicted data volumes.
So, what happens to all that data? The only logical conclusion is that it must be collected, processed and analyzed in-memory, in real-time, close to where the data is generated.
Back to Basics
This concept is usually referred to as edge processing, or edge computing when related to IoT. However, it can also be applied to other data sources such as log files, network flow data, and even real-time database information (through change data capture), to identify and act on salient information.
This goes back to basic information theory. Data is not information. Data is raw, unfiltered, unprocessed facts. Much of that data can be repetitive (think high-speed sensor data), generally useless (a lot of developer-produced application log files), or lacking sufficient context (data in the form of binary codes). The primary goal of edge computing must then be to transform data into information.
This transformation can be achieved through a combination of filtering, aggregation and change detection, and enrichment.
Processing at the Edge
Filtering is a very broad capability and uses a variety of techniques. It can range from simple (only allow error and warning messages from a log file to pass through), through intermediate (only allow events that match one of a set of regular expressions to pass through), to complex (match data against a machine learning model to derive its relevance and only pass through relevant data). Since filtering acts on individual events – by either including or excluding them – it’s easy to see how this can be applied in real-time, in-memory, across one or more data streams.
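The three levels of filtering described above can be sketched as small stream functions. This is a minimal illustration, not a reference implementation: the log-line format ("LEVEL message"), the function names, and the score callback standing in for a machine learning model are all assumptions made for the example.

```python
import re
from typing import Callable, Iterable, Iterator, Sequence

def level_filter(lines: Iterable[str],
                 levels: Sequence[str] = ("ERROR", "WARN")) -> Iterator[str]:
    """Simple: pass only lines whose first token is an allowed level.
    Assumes a hypothetical 'LEVEL message' line format."""
    for line in lines:
        if line.split(" ", 1)[0] in levels:
            yield line

def regex_filter(lines: Iterable[str],
                 patterns: Sequence[str]) -> Iterator[str]:
    """Intermediate: pass lines matching any one of a set of regular expressions."""
    compiled = [re.compile(p) for p in patterns]
    for line in lines:
        if any(c.search(line) for c in compiled):
            yield line

def model_filter(lines: Iterable[str],
                 score: Callable[[str], float],
                 threshold: float) -> Iterator[str]:
    """Complex: pass lines that a scoring function rates as relevant.
    `score` is a placeholder for a real model's scoring call."""
    for line in lines:
        if score(line) >= threshold:
            yield line

# Hypothetical sample stream; each filter sees one event at a time.
sample = [
    "INFO service started",
    "WARN disk usage at 91%",
    "DEBUG cache miss for key 42",
    "ERROR timeout contacting db-1",
]
for line in level_filter(sample):
    print(line)
```

Because each function is a generator that looks at one event and decides to keep or drop it, nothing is buffered, which is why filtering is cheap enough to run in memory at the edge.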
Aggregation is the common term for condensing or grouping data, usually time-series data, to reduce its granularity. This can involve basic statistical analysis, sampling, or other means that retain the information content, but reduce the frequency of the data. A related notion is change detection which, as the name suggests, only outputs data when it changes. The most appropriate technique will depend on the source data and use case.
For example, if you have a sensor that is generating data 1,000 times a second, you may need all of the data to react to change instantly at the edge, but downstream processing might not need that level of granularity. It does, however, require all the information. The aggregation in this case could be to send information every second if the data remains constant, but send immediately if it changes. “Remains constant” could be “exactly equal,” or within some statistical range that approximates “constant.”
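One way to sketch the sensor example is a function that forwards a reading when it differs from the last value sent, and otherwise only once per heartbeat interval. The names, the integer-millisecond timestamps, and the tolerance parameter (the "statistical range" reduced to a simple absolute band) are assumptions made for illustration.

```python
from typing import Iterable, Iterator, Optional, Tuple

Reading = Tuple[int, float]  # (timestamp in milliseconds, sensor value)

def report_on_change(readings: Iterable[Reading],
                     heartbeat_ms: int = 1000,
                     tolerance: float = 0.0) -> Iterator[Reading]:
    """Forward a reading immediately if it differs from the last value sent
    by more than `tolerance`; otherwise forward at most one reading per
    `heartbeat_ms` so downstream still sees that the value is unchanged."""
    last_ts: Optional[int] = None
    last_value: Optional[float] = None
    for ts, value in readings:
        first = last_ts is None
        changed = not first and abs(value - last_value) > tolerance
        due = not first and ts - last_ts >= heartbeat_ms
        if first or changed or due:
            yield ts, value
            last_ts, last_value = ts, value

# Two seconds of 1 kHz data: constant at 20.0, stepping to 25.0 at t = 1500 ms.
readings = [(t, 20.0 if t < 1500 else 25.0) for t in range(2000)]
sent = list(report_on_change(readings))
print(len(readings), "readings in,", len(sent), "sent downstream")
```

Setting tolerance to zero gives the "exactly equal" interpretation of constant; a positive tolerance treats small fluctuations around the last sent value as no change.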
Aggregation of data, by definition, occurs over multiple events. As such, the scope of aggregation is usually a window of time, or some other rule that determines which events to retain. Aggregation is therefore more memory intensive than filtering, as thousands or millions of events need to be kept in memory, and it requires some sizing to determine hardware requirements for edge devices.
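To make the memory point concrete, here is a rough sketch of a time-based sliding window together with a back-of-the-envelope sizing helper. The class name, the 64-byte event size, and the 60-second window are assumptions chosen for the example, not figures from the IDC paper.

```python
from collections import deque
from typing import Deque, Optional, Tuple

class SlidingWindow:
    """Keeps every event from the last `span_ms` milliseconds in memory
    and maintains a running sum so the mean is cheap to read."""

    def __init__(self, span_ms: int) -> None:
        self.span_ms = span_ms
        self.events: Deque[Tuple[int, float]] = deque()
        self.total = 0.0  # running sum; may accumulate float error on long runs

    def add(self, ts_ms: int, value: float) -> None:
        self.events.append((ts_ms, value))
        self.total += value
        cutoff = ts_ms - self.span_ms
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] <= cutoff:
            _, old = self.events.popleft()
            self.total -= old

    def mean(self) -> Optional[float]:
        return self.total / len(self.events) if self.events else None

def estimate_window_bytes(events_per_second: int,
                          window_seconds: int,
                          bytes_per_event: int) -> int:
    """Rough payload size of one window: rate x duration x event size.
    Ignores container and interpreter overhead."""
    return events_per_second * window_seconds * bytes_per_event

window = SlidingWindow(span_ms=1000)
for t in range(3000):          # three seconds of 1 kHz data
    window.add(t, 1.0)
print(len(window.events), "events retained")
print(estimate_window_bytes(1000, 60, 64), "bytes for a 60-second window")
```

Multiplying the per-window estimate by the number of sensors and windows an edge device must hold gives a first approximation of the memory it needs, before runtime overhead is taken into account.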
Enrichment of data can also be essential for IoT and other use cases. In many instances, the raw data may not contain sufficient context to be deemed useful. It could contain IDs, codes or binary data that would provide little value to downstream analysts. By joining real-time data with some context (about devices, parts, customers, etc.), it is turned into valuable information. Real-time enrichment of data streams is akin to denormalization in the database world, and will typically increase the size of data, not decrease it. However, since we have turned low-value data into high-value information, and have ideally already performed filtering and aggregation of the data, we have improved the utilization of our finite storage capacity.
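Enrichment can be pictured as a lookup join between each event and a table of reference data held in memory. The sketch below assumes events are dictionaries carrying a device_id field and that context is a small in-memory mapping; the field names and sample values are hypothetical.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical reference data, e.g. loaded once from a device registry.
DEVICE_CONTEXT: Dict[str, Dict[str, str]] = {
    "d-17": {"site": "plant-a", "part": "coolant pump", "customer": "acme"},
    "d-42": {"site": "plant-b", "part": "conveyor motor", "customer": "globex"},
}

def enrich(events: Iterable[dict],
           context: Dict[str, Dict[str, str]]) -> Iterator[dict]:
    """Join each event with the context for its device_id.
    Events for unknown devices pass through unchanged."""
    for event in events:
        extra = context.get(event["device_id"], {})
        # Event fields win if a context key collides with an event key.
        yield {**extra, **event}

raw = [
    {"device_id": "d-17", "code": 3, "value": 71.5},
    {"device_id": "d-99", "code": 1, "value": 12.0},
]
enriched = list(enrich(raw, DEVICE_CONTEXT))
for event in enriched:
    print(event)
```

Each enriched event is larger than the raw one, which mirrors the denormalization trade-off described above: more bytes per event, but each event is now self-explanatory.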
Creating a Smart Data Architecture
If edge processing is an essential part of any data architecture that must handle the predicted deluge of data, the edge can also be considered for additional work, including more sophisticated analytics. Many use cases call for very fast reaction times to complex scenarios. In these situations, bringing analytics capabilities such as real-time correlation, complex event processing, and machine learning scoring to the edge has numerous advantages.
The edge can be scaled as source devices and data ingestion grow, while still retaining fast response times, and it often requires lower-cost hardware. Analytics at the edge has the added side effect that the source data for analytics doesn’t have to leave the edge, only the results, further reducing the need for data storage.
Of course, an overall data architecture includes more than edge processing and analytics. There may be on-premises and cloud requirements for further processing, analytics, storage and machine learning. The key is to recognize that data can be collected, processed, analyzed and stored in many different zones of a smart data architecture, and it is essential to perform each of these actions where it most makes sense to gain the greatest value from the exponential growth in data.