All businesses today are
data-driven. Those that excel use data to make fast, intelligent decisions and
take quick actions. Reaping the benefits that data has to offer requires automated data pipelines that make data easy to access. This, in turn, helps speed up and automate business processes.
RTInsights recently sat down
with Guillaume Moutier, Senior Principal Technical Evangelist at Red Hat, to
talk about data pipelines. We discussed why it is important to automate
them, issues that arise when implementing them, tools that help, and the
benefits such pipelines deliver. Here is a summary of our conversation.
RTInsights: We hear a lot
about data pipelines. Could you define what they are or explain what they do?
Moutier: Data
pipelines describe all the steps that data can
go through over its life cycle. That includes
everything from ingestion to transformation, including processing, storing, and
archiving. With this definition, just the simple copying of data, from point A
to point B, could be considered a data pipeline. But in this case, it
might just be considered a small pipe. Usually, with
data pipelines, we are talking about more complicated scenarios. It often involves different sources being merged or split
into different destinations and multiple steps of transformations happening
during this process. It comes down to this: I have some data at point A, I want something at point B, and along the way it must go through some transformation or processing steps. That's the definition of a data pipeline.
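To make that definition concrete, here is a minimal sketch of such a pipeline in Python, with hypothetical file names and a hypothetical transformation rule: data is ingested from point A, transformed in transit, and stored at point B.

```python
# Minimal pipeline sketch: ingest from point A, transform in transit,
# store at point B. File names and the transformation are hypothetical.
import csv
import json

def ingest(path):
    """Read raw records from the source (point A)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(records):
    """Clean and enrich each record along the way."""
    for r in records:
        r["amount"] = float(r["amount"])      # type coercion
        r["large"] = r["amount"] > 10_000     # simple enrichment
    return records

def store(records, path):
    """Write the processed records to the destination (point B)."""
    with open(path, "w") as f:
        json.dump(records, f, indent=2)

if __name__ == "__main__":
    store(transform(ingest("claims.csv")), "claims_processed.json")
```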
RTInsights: Why is it
important to automate data pipelines?
Moutier: As with any
processing that we are doing now in modern application development, automation
comes with aspects that are really, really important from a business
perspective. First, there is the management of the data itself. With automation
comes reproducibility. Usually, you would want to implement this automation as
code so that you can review everything, and you can have different versions of
everything.
Automation also makes the difference when you ask why a data pipeline is behaving differently now than it did before. Without automation, we often cannot determine who touched it or who made some configuration change. These are the same questions you would ask about an application: who made the change? The business improvement comes from automation by coding everything, being able to replicate, and being able to reproduce things from development to production across the different stages.
Plus, automation brings
scalability and the ability to reapply different recipes on different business
cases. It’s a better way to do things. As I often say, it’s the equivalent of
going from an artisanal mode to an industrial mode. Maybe the things you are
producing are really good when you’re doing this as an artisan. You are in your
workshop, and you are crafting things. That’s fantastic. But if you have to
produce hundreds of those items, you must automate the process, whether it is
preparing your parts or finishing or anything. It's exactly the same for your data pipelines.
RTInsights: What are some industry-specific
use cases that especially benefit from automating data pipelines?
Moutier:
Automating pipelines offers benefits to every industry. Whenever you have data
to process, an automated data pipeline helps. I will cite some examples. I’m
working with people in healthcare who want to automatically process patient data for different purposes. It can be to automate image recognition to
speed up a diagnosis process. It can be to pre-process MRI scans or X-rays to
get to the diagnosis faster. It’s not only about getting the different data.
You have to pre-process it, transform it, and then apply some machine learning
process to generate a prediction: is there a risk of this disease, or can we detect it in the image? That is something
that you can automate to be processed in real time.
I’m also working with another group in healthcare to speed up some ER processes. Instead of waiting to see a doctor to prescribe a treatment, further analysis, a blood sample, or any other exam, we are trying to implement a data-driven model. Here, a machine learning model uses preliminary exams, patient history, and other information like that to automatically predict the next exam the patient should take. Instead of waiting maybe two hours at the ER to see a doctor, the nurse can directly send the patient for those further tests, which is what the doctor would have ordered anyway. Of course, these models are trained and endorsed by doctors. It’s just a way to speed up the process at the ER.
In insurance, you might set up an automated pipeline to
analyze an incoming claim. For example, you might have an email with some
pictures attached to it. You might do some pre-processing to analyze the
sentiment in the letter. Is your client really upset, just complaining, or
making a simple request? You can use natural language processing to automate
this analysis. If this is a claim about a car accident, the attached pictures
are supposed to be of the damaged car. You can automatically detect if a
picture is indeed a picture of a car and not a banana or a dog or a cat. But
more seriously, you can tag it with information gathered from the image:
location, weather conditions, and more.
Those kinds of automated pipelines can speed up any kind of business process. As a result, automating data pipelines applies to any industry that wants to accelerate or tighten its processes.
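As a rough illustration of the claim-intake scenario Moutier describes, the sketch below chains a sentiment step and an image-tagging step into one triage function. The two model callables are placeholders, not any specific product; in practice they would be an NLP model and an image classifier served elsewhere in the pipeline.

```python
# Sketch of an automated claim-intake step. The two model callables are
# hypothetical placeholders; in a real pipeline they would be an NLP
# sentiment model and an image classifier served elsewhere.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Claim:
    email_body: str
    attachments: List[bytes]

def triage_claim(
    claim: Claim,
    sentiment_model: Callable[[str], str],   # returns "upset" / "complaint" / "request"
    image_tagger: Callable[[bytes], Dict],   # returns e.g. {"label": "car", ...}
) -> Dict:
    """Pre-process an incoming claim before a human ever sees it."""
    sentiment = sentiment_model(claim.email_body)
    tags = [image_tagger(img) for img in claim.attachments]
    # Route upset customers and non-car pictures to a human first.
    needs_review = sentiment == "upset" or any(t.get("label") != "car" for t in tags)
    return {
        "sentiment": sentiment,
        "image_tags": tags,
        "route": "manual_review" if needs_review else "auto_processing",
    }
```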
RTInsights: What are some
of the challenges in automating data pipelines?
Moutier: The first
challenge that comes to mind is the tooling. You must find the right tools to
do the right job, whether the tools are used for ingestion, processing, or
storing. But where it gets difficult is that nowadays there are so many different tools and projects, especially open-source ones, and the number is growing at a really fast pace. If you look at common tools today, and here I'm thinking, for example, of Airflow or Apache NiFi or similar tools that help you automate those processes, their use changes rapidly. Often a tool is only mainstream for a year or a year and a half before it is replaced by something else.
The pace at which you must track all the
tooling is a real challenge.
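Since Airflow comes up as a common choice, here is a minimal sketch of what a pipeline expressed as an Airflow DAG might look like, assuming a recent Airflow 2.x installation; the DAG id, schedule, and task bodies are placeholders. The point is that the whole pipeline lives in version-controlled code.

```python
# Minimal Airflow DAG sketch (Airflow 2.x): the pipeline is expressed as
# code that can be versioned, reviewed, and promoted across environments.
# The dag_id, schedule, and task bodies are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("pull raw data from the source")

def transform():
    print("clean, merge, and enrich the data")

def publish():
    print("write the result to its destination")

with DAG(
    dag_id="claims_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    publish_task = PythonOperator(task_id="publish", python_callable=publish)

    ingest_task >> transform_task >> publish_task
```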
On top of that, another challenge I've seen people struggle with is a good understanding of the data itself: its nature, its format, its frequency. Does it change often? Especially with real-time data, you must understand the cycles at which the data may vary.
Also, the sheer volume of the data can be an issue. Sometimes people design data pipelines that look fantastic on paper: "Oh, I'm going to take this and apply that." And
it works perfectly in the development
environment because they are only using 100 megabytes of data. But when your
pipeline is handling terabytes, even petabytes of
data, it may behave differently. So, you need a good understanding of the
nature of the data. That will help you face the challenges that come with it.
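One way the 100-megabytes-versus-terabytes problem shows up in practice is memory: a pipeline that loads everything at once works in development and fails at scale. Here is a minimal sketch of the alternative, processing the data in bounded chunks (file and column names are hypothetical).

```python
# Sketch: stream the data through in bounded chunks so memory stays flat
# whether the input is 100 MB or several terabytes. The file name, column
# names, and chunk size are hypothetical.
import pandas as pd

CHUNK_ROWS = 1_000_000

def clean(chunk: pd.DataFrame) -> pd.DataFrame:
    chunk["amount"] = chunk["amount"].astype(float)
    return chunk[chunk["amount"] > 0]

reader = pd.read_csv("events.csv", chunksize=CHUNK_ROWS)
for i, chunk in enumerate(reader):
    clean(chunk).to_csv(f"events_clean_{i:05d}.csv", index=False)
```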
RTInsights: Are there
different issues with pipelines for batch vs. real-time vs. event-driven
applications?
Moutier: The issues,
of course, can be seen as different. But from my perspective, they are
different mainly in the way they are tackled. The root causes of the issues may
be the same. It comes down to scalability and reliability. Let’s take an
example. What do I do when there is a loss in the transmission or a broken pipe
in my pipeline? It can happen to batch processing or real-time processing. But
you do not use the same approaches to solve this problem in each case. A broken pipe during batch processing is usually not a major problem; we just restart the batch process.
You need a different approach
in a real-time infrastructure or event-driven data pipeline. You have to
address the issues from the business perspective. What happens when a server
goes down? What happens when there is a security breach? It’s all those root
causes that will help you identify exactly the challenges you face in those
different cases and what solution to apply to them. Your approach must be driven with these considerations in mind. It is not only that there is a specific issue, but where does it come from? And never forget what you want to achieve, because that's the only way to apply the proper mitigation to those issues.
RTInsights: What technologies
can help?
Moutier: A data
streaming platform like Kafka is something that is
almost ubiquitous now. More and more, it is used on almost all types of
pipelines, event-driven, real-time, or batch. You do not see any modern
application development or data pipelines without Kafka at some step. That’s a
great tool allowing many different architectures to be built upon it.
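Here is a minimal sketch of Kafka sitting in the middle of a pipeline, assuming the kafka-python client, a broker on localhost:9092, and a hypothetical "claims" topic: one producer publishes events, and a consumer group reads and processes them.

```python
# Sketch: Kafka in the middle of a pipeline. Assumes the kafka-python
# package, a broker on localhost:9092, and a hypothetical "claims" topic.
import json

from kafka import KafkaConsumer, KafkaProducer

# Producer side: publish an event to the topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("claims", {"claim_id": 42, "status": "received"})
producer.flush()

# Consumer side: an independent consumer group reads and processes events.
consumer = KafkaConsumer(
    "claims",
    bootstrap_servers="localhost:9092",
    group_id="claims-enrichment",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print("processing", message.value)   # transform, enrich, forward downstream
```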
I’ve also seen more and more
serverless functions being used. They are fantastic, especially if you are working in Kubernetes or in an OpenShift environment. OpenShift Serverless, which is the Knative implementation in OpenShift, is a perfect match for an event-driven architecture. It lets you scale down to zero in terms of resource consumption and scale up to whatever you need, depending on the flow of data coming in.
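A serverless function in this kind of event-driven setup is often just a small HTTP handler that the platform scales with traffic. Below is a minimal sketch using Flask, with a hypothetical payload shape; the scale-to-zero behavior itself comes from Knative, not from this code.

```python
# Sketch: a small HTTP handler that a Knative service could run and scale
# from zero with incoming events. Uses Flask; the payload shape is hypothetical.
from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["POST"])
def handle_event():
    event = request.get_json(force=True)
    # ... transform, enrich, or forward the event here ...
    return {"processed": True, "event_id": event.get("id")}, 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```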
If we go back a few years, we had those batch processing servers sitting there 24/7 but only working at 2:00 a.m. to do some processing on data. That's a huge consumption of resources for something that runs 10 minutes a day. That doesn't make any
sense. Today you would use cloud-native
architectures and tools developed for Kubernetes for your data pipelines.
Another technology that helps is an intelligent storage solution, such as Ceph. In the latest releases of Ceph, you have, for example, bucket notifications. That means your storage is no longer some dumb dumpster where your data just lands. Now, it can react to the data, and it can react in many different ways. With bucket notifications in Ceph, you can send a message to a Kafka topic or to an API endpoint saying, "There is this event that happened on the storage; this file has been uploaded, modified, or deleted."
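A consumer of those bucket notifications might look like the following sketch, which assumes the notifications are delivered to a hypothetical Kafka topic and follow the S3-style event-record layout that Ceph RGW emits.

```python
# Sketch: reacting to Ceph RGW bucket notifications delivered to a Kafka
# topic. Topic name and broker address are assumptions; the payload is
# expected to follow the S3-style event-record layout.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "bucket-events",
    bootstrap_servers="localhost:9092",
    group_id="storage-watcher",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    for record in message.value.get("Records", []):
        event = record.get("eventName")            # e.g. "ObjectCreated:Put"
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"{event}: s3://{bucket}/{key}")     # trigger processing here
```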
Also, upcoming in Ceph, a feature like S3
Select is a fantastic add-on for analytics workloads. Instead of bringing all
the data up to the processing cluster, you only retrieve the data you are
interested in. You select this directly in the source, and you retrieve for
processing only the data you want. That's the kind of feature that makes storage a more interesting part of the pipeline you can play with. It's part of the architecture and, again, it makes storage much more than just a simple repository of data.
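S3 Select is exposed through the standard S3 API, so a sketch with boto3 looks like the following; the endpoint, bucket, key, and column names are placeholders, and the exact SQL subset available depends on the Ceph release, since the feature was described here as upcoming.

```python
# Sketch: retrieving only the rows and columns of interest with S3 Select.
# Endpoint, credentials, bucket, key, and column names are placeholders.
import boto3

s3 = boto3.client("s3", endpoint_url="http://rgw.example.com:8080")

response = s3.select_object_content(
    Bucket="claims",
    Key="events.csv",
    ExpressionType="SQL",
    Expression="SELECT s.claim_id, s.amount FROM s3object s WHERE s.amount > 10000",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```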
RTInsights: Once you
automate the data pipelines, what are the benefits?
Moutier: Scalability,
reliability, security, and always being able to know what's happening because it's all defined in code. If you have automated your pipeline, that means at some point,
you have coded it. What runs is exactly what is supposed to run. From a
business perspective, that gives you a great advantage.
Once you have intelligently
implemented your data pipeline, it’s easy to multiply the outcomes. For
example, you could directly send data from your storage to processing or an
event stream to processing. That's great. But suppose you have taken the step, for example, of putting Kafka in the middle. In that case, even if you started with only one consumer of the data, it's easy to add another one, or hundreds of different consumers, to the topics.
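In Kafka terms, adding that extra consumer is just another consumer group subscribed to the same topic, as in this small sketch (topic, group id, and broker address are hypothetical); the existing pipeline does not change.

```python
# Sketch: a second, independent consumer on the same topic is just a new
# consumer group. Topic, group id, and broker address are hypothetical.
import json

from kafka import KafkaConsumer

fraud_checker = KafkaConsumer(
    "claims",
    bootstrap_servers="localhost:9092",
    group_id="fraud-detection",   # new group gets its own copy of the stream
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in fraud_checker:
    print("fraud check on", message.value)
```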
If you architect from the
start with automation and scaling in mind, it gets easier to complement a
pipeline with other processes, branches, or any other processing you want to do
with the data.
Salvatore Salamone is a physicist by training who writes about science and information technology. During his career, he has been a senior or executive editor at many industry-leading publications including High Technology, Network World, Byte Magazine, Data Communications, LAN Times, InternetWeek, Bio-IT World, and Lightwave, The Journal of Fiber Optics. He also is the author of three business technology books.