
Considerations for Successful Continuous Data Ingestion and Analysis


Given the many ways CI applications work with data, organizations need flexible, high-performance, highly scalable data ingestion and analysis solutions.

Written By

Salvatore Salamone

Mar 18, 2020


Continuous intelligence (CI) relies on the rapid analysis of streaming data. For many CI applications, the question is where the analysis is done. There is no one good answer. The decision about where to do the analysis will be based on several factors, including where the data is generated, how much is generated, and how fast the analysis must be done.

Many CI applications will use data generated off-premises and, in some cases, from widely distributed sources. For example, an autonomous vehicle would make use of telemetry and video streams, while a supply chain and logistics application would use status data from hundreds, thousands, or hundreds of thousands of Internet of Things (IoT) devices.

Where to conduct the analysis of such data comes down to two basic choices: the analysis can be done at the point where the data is generated, or the data can be transmitted to a regional or geographic cloud compute facility or to a corporate data center. The work could also be split across a combination of locations in client/server style. For example, an edge device might be used for facial detection, discarding image frames with no faces. Images with faces are then sent to a central location for facial recognition, which spends some compute power at the edge to save bandwidth.
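The split described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the Frame type, the filter_frames_with_faces function, and the toy_detector stand-in are all hypothetical names introduced here, and a real edge device would call an actual face-detection model in place of the toy check.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Frame:
    frame_id: int
    pixels: bytes  # raw image payload (hypothetical format)


def filter_frames_with_faces(
    frames: Iterable[Frame],
    has_face: Callable[[Frame], bool],
) -> Iterator[Frame]:
    """Yield only the frames in which the edge-side detector reports a face."""
    for frame in frames:
        if has_face(frame):
            yield frame  # worth the bandwidth: forward for recognition
        # frames without faces are dropped at the edge and never transmitted


def toy_detector(frame: Frame) -> bool:
    """Stand-in for a real face detector (assumption for illustration only)."""
    return frame.pixels.startswith(b"FACE")


frames = [
    Frame(1, b"FACE...."),
    Frame(2, b"EMPTY..."),
    Frame(3, b"FACE...."),
]
to_upload = list(filter_frames_with_faces(frames, toy_detector))
uploaded_ids = [f.frame_id for f in to_upload]
print(uploaded_ids)
```

Only the frames that pass the cheap local check are sent on, so the central facial recognition service sees a fraction of the original stream.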

The tradeoffs to consider when deciding where to do the analysis include the following:

Availability of compute power. CI applications need to ingest streaming data and then perform real-time analysis. The results are needed in milliseconds to seconds in order to give a human information to take actions or have a system automatically take actions based on the situational awareness derived from the analysis. If there is adequate processing power at the source of data generation to handle these processes, the analysis can be done at that point. In cases where there is not enough processing power or processing power at the source costs too much, the data will need to be streamed to a company data center or a cloud provider’s facility.

Connectivity. In some situations, the data may come from a device that is part of a corporate network. For example, a manufacturing plant might have IoT devices throughout a factory, all of which are on a Wi-Fi network. In such cases, the data can be transmitted to a central system for analysis. However, there are scenarios where there is no connectivity. For instance, consider IoT sensors monitoring the performance of downhole boring equipment in a remote oil and gas exploration field. Such devices may not have any way to get the data from the field to a central location for ingestion into an event-processing engine for analysis. In such a case, the analytics must be run locally.

Bandwidth. In other situations, a device might have a communications path to send data to a location for analysis, but that proves not to be the best option. If bandwidth costs to transmit the data are too high, it may not be economically practical to send the data to a central location for analysis. In such cases, it makes more sense to have a CI app ingest the data locally and perform the analysis on-site.

Latency. In yet another scenario, there may be both connectivity and suitably priced bandwidth, but transmission time is a factor. A prime example of where this issue arises is with autonomous vehicles. Suppose there is enough reasonably priced bandwidth to get a video stream from the car to a data center for analysis. There may not be enough time to transmit the stream, perform the object analysis, and send information back (for example, that a pedestrian is walking in front of the car) in time to take evasive action. Again, in this case, the analysis would need to be done at the source.
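One way to reason about these four tradeoffs together is to write them down as a simple decision rule. The sketch below is an assumption-laden illustration, not a prescribed method: the SiteProfile fields, the thresholds, and the choose_analysis_location function are hypothetical, and the rule for breaking ties (prefer the edge when nothing forces a choice) is a design choice made here for simplicity.

```python
from dataclasses import dataclass


@dataclass
class SiteProfile:
    has_connectivity: bool            # is there any path to a central site?
    local_compute_sufficient: bool    # can the source run the analysis itself?
    bandwidth_cost_per_gb: float      # what transmission would cost
    max_bandwidth_cost_per_gb: float  # the most the business will pay
    round_trip_ms: float              # transmit + analyze + respond
    deadline_ms: float                # how quickly a result is needed


def choose_analysis_location(site: SiteProfile) -> str:
    """Return where the analysis should run under the stated assumptions."""
    # Connectivity, bandwidth, and latency can each rule out a central site.
    forced_local = (
        not site.has_connectivity
        or site.bandwidth_cost_per_gb > site.max_bandwidth_cost_per_gb
        or site.round_trip_ms > site.deadline_ms
    )
    if forced_local:
        if site.local_compute_sufficient:
            return "edge"
        return "edge (add local compute)"
    # Nothing forces a local choice, so compute availability decides.
    return "edge" if site.local_compute_sufficient else "central"


# Illustrative profiles loosely modeled on the article's examples.
oil_field = SiteProfile(False, True, 0.0, 0.0, 0.0, 1000.0)
factory = SiteProfile(True, False, 0.05, 0.10, 200.0, 1000.0)
vehicle = SiteProfile(True, True, 0.05, 0.10, 250.0, 50.0)

for name, site in [("oil field", oil_field), ("factory", factory), ("vehicle", vehicle)]:
    print(name, "->", choose_analysis_location(site))
```

In practice the inputs would be estimates rather than fixed numbers, but even a rough version of this check makes it clear which constraint is driving the decision.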

Making a choice about where to do the analysis might be easier thanks to some emerging technologies. Developments to watch include:

5G. Many service providers are rolling out 5G services that offer low-latency, high-bandwidth connectivity. In fact, many early 5G offerings are targeting markets that need IoT connectivity.

Edge computing. While 91 percent of data is currently created and managed within centralized data centers, within the next two years, 75 percent of that data will need analysis and action at the edge. The market is responding with the introduction of new edge compute systems and edge cloud services.

Multi-Access Edge Computing (MEC). Some technology and service providers are teaming up to offer a service that combines 5G IoT connectivity with edge computing services. These service combos embed cloud compute and storage services at the edge of a provider’s 5G network.

Real-Time vs. Long-Term Issues to Consider

While making the decision as to how the data will make its way from the source of generation to a compute system for analysis, other factors also should be considered.

CI applications use large volumes of streaming data. What happens to the data after the real-time analysis is performed? Does an organization keep just the insights derived in the analysis? Or should it keep the entire data stream as well?

We are talking about very large volumes of data that will need storage capacity (which is not free). That data will need to be managed and protected over time. Given the costs and staff time needed, why would anyone save the data?

There are two key issues that will likely force an organization’s hand. First, the data may need to be retained for regulatory or legal reasons. There may be specific regulations on a type of data that mandate retention for a set number of years. If actions are taken on data and there is a problem (e.g., an autonomous car crashed into a telephone pole), there may be a need for an audit trail to show why decisions were made. In other cases, an organization may have an internal policy to save data for a specific time (say six months) for different uses before discarding it to lower the legal risks of discovery via subpoena.
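A retention rule of this kind is straightforward to express in code. The following is a minimal sketch under stated assumptions: the is_expired function, the legal_hold flag, and the 183-day figure used for "six months" are hypothetical choices made for illustration, and real retention periods would come from the applicable regulation or internal policy.

```python
from datetime import date, timedelta


def is_expired(
    recorded_on: date,
    today: date,
    retention_days: int,
    legal_hold: bool = False,
) -> bool:
    """Return True if a record is past its retention period and may be discarded."""
    if legal_hold:
        # Data under a regulatory or legal hold is never discarded automatically.
        return False
    return today - recorded_on > timedelta(days=retention_days)


SIX_MONTHS = 183  # assumed day count for a six-month internal policy

today = date(2020, 8, 1)
old_record = is_expired(date(2020, 1, 1), today, SIX_MONTHS)
recent_record = is_expired(date(2020, 6, 1), today, SIX_MONTHS)
held_record = is_expired(date(2020, 1, 1), today, SIX_MONTHS, legal_hold=True)
print(old_record, recent_record, held_record)
```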

Second, an organization may want to use the captured streamed data for further analysis. For instance, streaming data about the state of equipment on an assembly line may be used by a CI app to spot potential problems in real time. Alerts from such an app could be used to fix problems before they happen, thus reducing downtime. If the data is retained, additional analysis could be performed to look for trends (e.g., on average, the motor from this vendor has an 80 percent chance of breaking down within two weeks after its internal temperature exceeds a certain threshold).
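As a rough illustration of the kind of trend analysis that retained data makes possible, the sketch below estimates how often a motor fails within two weeks of first overheating. The MotorHistory record, the failure_rate_after_overheat function, and the sample data are all made up for this example; they are not drawn from any real equipment or product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MotorHistory:
    motor_id: str
    overheat_day: Optional[int]  # first day the temperature threshold was exceeded
    failure_day: Optional[int]   # day of breakdown, or None if still running


def failure_rate_after_overheat(
    histories: List[MotorHistory],
    window_days: int = 14,
) -> Optional[float]:
    """Fraction of overheated motors that failed within window_days of overheating."""
    overheated = [h for h in histories if h.overheat_day is not None]
    if not overheated:
        return None  # no overheating events, so no rate can be estimated
    failed = sum(
        1
        for h in overheated
        if h.failure_day is not None
        and 0 <= h.failure_day - h.overheat_day <= window_days
    )
    return failed / len(overheated)


# Made-up retained history for five motors.
histories = [
    MotorHistory("m1", overheat_day=10, failure_day=18),
    MotorHistory("m2", overheat_day=5, failure_day=30),
    MotorHistory("m3", overheat_day=20, failure_day=21),
    MotorHistory("m4", overheat_day=None, failure_day=None),
    MotorHistory("m5", overheat_day=40, failure_day=54),
]
rate = failure_rate_after_overheat(histories)
print(rate)
```

This kind of estimate is only possible if the raw stream, or at least the overheating and failure events, is kept after the real-time alert has fired.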

Finding a Real-Time Streaming Data Analysis Partner

Given the many ways CI applications work with data, organizations need flexible, high-performance, highly scalable data ingestion and analysis solutions.

IBM has been a leader in helping organizations turn streaming data into real-time insights. IBM addresses streaming data ingestion and analysis issues with IBM Cloud Pak for Data.

IBM Cloud Pak for Data is a fully integrated data and AI platform that helps businesses collect, organize, and analyze data and infuse AI throughout their organizations. Built on Red Hat OpenShift, IBM Cloud Pak for Data integrates IBM Watson AI technology with IBM Hybrid Data Management Platform, DataOps, data governance, streaming analytics, and business analytics technologies. Together, these capabilities provide the architecture for CI that can meet ever-changing business needs.

IBM Cloud Pak for Data is easily extendable using a growing array of IBM and third-party services. It runs across any cloud, allowing businesses to integrate their analytics and applications to speed innovation.

Complementing IBM Cloud Pak for Data, IBM Cloud Pak for Data System is a cloud-native data and AI platform in a box that provides a pre-configured, governed, and secure environment to collect, organize, and analyze data. Built on the same Red Hat OpenShift Container Platform, IBM Cloud Pak for Data System gives businesses access to a broad set of data and AI services and allows quick integration of these capabilities into applications to accelerate innovation. The hyperconverged, plug-and-play system is easily deployable in four hours.

To learn more about how a cloud-native architecture can help with your CI efforts, visit Cloud Pak for Data.


Salvatore Salamone

Salvatore Salamone is a physicist by training who writes about science and information technology. During his career, he has been a senior or executive editor at many industry-leading publications including High Technology, Network World, Byte Magazine, Data Communications, LAN Times, InternetWeek, Bio-IT World, and Lightwave, The Journal of Fiber Optics. He also is the author of three business technology books.
