Continuous intelligence (CI) applications need to analyze streaming data and make decisions within milliseconds to minutes. Many traditional approaches to data staging, ingestion, and analysis are not adequate for this. What’s needed is a purpose-built data architecture that can accommodate streaming data and its fast analysis, which is typically aided by artificial intelligence (AI).
The bottom line is that managing data is at the heart of all CI projects. Data is needed to train AI models, and an application must then be able to use suitable data streams for the analysis that produces information to act on in the moment.
Why is data such an issue with CI? By some estimates, collecting, curating, and tagging data accounts for about 80% of the effort in modern AI projects. And CI applications that use AI to derive actionable information require a constant stream of data.
Most traditional analytics efforts work differently. In a typical scenario, a source of data is identified; in many cases, the original data is maintained in a structured database. Access to that data is then provided, and the data is put into the right format for use by the application.
With CI, there are often multiple data sources, and many include streams of data that are being created in the moment. Examples include telemetry data from sensors or Internet of Things (IoT) devices, social media conversations, clickstreams, and event logs.
Such data must be analyzed as it is being created because it loses value with age. Deciding which product to offer a customer after they hang up misses the one opportunity to engage in real time. Similarly, determining that a suspect transaction is indeed fraudulent after it is executed does a financial services organization no good. So, ingestion and analysis speed are critical.
The first step to a successful AI effort, then, is managing the deluge of unstructured data that is pouring in from siloed, disparate systems. These requirements should be used to define the infrastructure capabilities that allow data to be captured rapidly, stored safely, accessed transparently, and shared globally wherever and whenever it is needed.
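To make the point about acting in the moment concrete, here is a minimal sketch of a per-event check that returns a decision as each transaction arrives rather than in a later batch. Everything in it is an assumption for illustration: the VelocityCheck class, the 60-second window, and the three-transaction limit are hypothetical and do not come from any particular fraud system.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60.0   # assumed look-back window, illustrative only
MAX_TXNS_IN_WINDOW = 3  # assumed threshold, illustrative only


class VelocityCheck:
    """Flags an account that transacts too often within a short window."""

    def __init__(self, window=WINDOW_SECONDS, limit=MAX_TXNS_IN_WINDOW):
        self.window = window
        self.limit = limit
        self._recent = defaultdict(deque)  # account -> timestamps in window

    def is_suspect(self, account, timestamp):
        recent = self._recent[account]
        # Drop timestamps that have aged out of the window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        recent.append(timestamp)
        return len(recent) > self.limit


def decide(events, check):
    """Yield a decision for each (account, timestamp) event as it arrives."""
    for account, timestamp in events:
        yield account, timestamp, check.is_suspect(account, timestamp)
```

Because decide is a generator, each decision is available as soon as its event has been read, which is the property the fraud example depends on. A real deployment would replace the simple count with a trained model.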
Adding to the Challenge
With CI applications, data collection is not a one-time thing – it’s a continuous process (as the name implies). As such, CI projects based on AI analysis must adopt new data collection and curation strategies. Building an AI-ready data infrastructure requires more than basic pipeline functions used to ingest, cleanse, transform, and validate data.
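The basic pipeline functions named above (ingest, cleanse, transform, validate) can be sketched as chained generators, so that each record flows through every stage as it arrives instead of waiting for a batch. This is only an illustration under stated assumptions: the record fields sensor and temp_f, the Fahrenheit-to-Celsius step, and the plausibility range are hypothetical.

```python
import json


def ingest(lines):
    """Parse raw JSON lines; skip anything that is not valid JSON."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def cleanse(records):
    """Keep only records that carry a sensor name and a numeric reading."""
    for rec in records:
        if (isinstance(rec, dict) and "sensor" in rec
                and isinstance(rec.get("temp_f"), (int, float))):
            yield rec


def transform(records):
    """Convert Fahrenheit readings to Celsius (hypothetical field names)."""
    for rec in records:
        celsius = (rec["temp_f"] - 32.0) * 5.0 / 9.0
        yield {"sensor": rec["sensor"], "temp_c": round(celsius, 2)}


def validate(records, low=-50.0, high=150.0):
    """Drop readings outside an assumed plausible range."""
    for rec in records:
        if low <= rec["temp_c"] <= high:
            yield rec


def pipeline(lines):
    """Compose the four stages; nothing runs until the result is iterated."""
    return validate(transform(cleanse(ingest(lines))))
```

The input can be any iterable of lines, including an open socket or a message-queue consumer, which is what makes the same code usable for continuous collection.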
Since data can come in many forms and from many sources, the data infrastructure needs to be flexible. Data can range from large binary objects to small files to JSON-formatted key-value pairs. Access requirements can vary depending on the stage in the AI data pipeline. For example, a fast SSD-backed parallel file system or a distributed Cassandra database may be optimal for ingesting real-time streaming data. Video or image files may best be placed in a local or cloud-resident object store. Tools such as TensorFlow, PyTorch, and Spark expect to access data in different ways using native methods: for example, via HDFS, via an AWS S3-compatible object API, or through a standard POSIX file system interface.
An effective data infrastructure needs to support system-generated metadata from diverse data sources (object stores, file systems, cloud repositories, and more) as well as user-defined metadata. It also needs to provide mechanisms that make this metadata, in the form of tags, accessible to higher-level machine learning frameworks regardless of the underlying storage technology.
The data infrastructure should ideally support auto-tagging (extracting tags from existing metadata) as well as deep inspection policies that pull text and metadata directly from raw data files using tools such as Apache Tika.
In some cases, a data extractor may be a pre-trained model, such as a program that classifies images or infers customer sentiment from various types of correspondence.
Increasingly, businesses that want to build CI applications that use AI need high-quality data to train the AI models. This need has created a new market for data annotation services. The companies that provide such services offer greater value than a public crowdsourcing service might. Rather than relying on crowdsourcing, this new breed of company uses highly trained data labelers, and many develop their own advanced annotation tools.
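One way to picture this flexibility is a thin layer that picks an access method from the URI scheme. The sketch below is an assumption-laden illustration, not any product's API: the ACCESS_METHODS table and both function names are hypothetical, and only the POSIX reader is implemented, because it needs nothing beyond the standard library. It also assumes POSIX-style paths.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URI scheme to the access method a tool expects.
ACCESS_METHODS = {
    "hdfs": "HDFS",
    "s3": "S3-compatible object API",
    "file": "POSIX file system",
    "": "POSIX file system",  # bare paths such as /data/train.csv
}


def access_method(uri):
    """Return the name of the access method for a URI or bare path."""
    scheme = urlparse(uri).scheme
    if scheme not in ACCESS_METHODS:
        raise ValueError(f"no access method registered for scheme {scheme!r}")
    return ACCESS_METHODS[scheme]


def read_posix(uri):
    """Read a local file given a bare path or a file:// URI."""
    with open(urlparse(uri).path, "rb") as fh:
        return fh.read()


# Only the POSIX reader exists in this sketch; HDFS and S3 readers would be
# registered here using whichever client library a project has chosen.
READERS = {"file": read_posix, "": read_posix}


def read_bytes(uri):
    """Dispatch a read to the reader registered for the URI's scheme."""
    scheme = urlparse(uri).scheme
    reader = READERS.get(scheme)
    if reader is None:
        raise NotImplementedError(f"no reader registered for {scheme!r}")
    return reader(uri)
```

Keeping the dispatch table separate from the readers means a new storage backend can be added without touching the code that consumes the data.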
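As a rough illustration of auto-tagging, the sketch below derives tags from metadata the file system already holds and keeps user-defined tags alongside them. It uses only the Python standard library and does not perform deep inspection of file contents, which is the job of a tool such as Apache Tika. The function names and the tag layout are hypothetical.

```python
import mimetypes
import os


def auto_tags(path):
    """Derive tags from metadata the file system already holds."""
    info = os.stat(path)
    media_type, _ = mimetypes.guess_type(path)  # guessed from the extension
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified": info.st_mtime,
        "media_type": media_type or "application/octet-stream",
    }


def tag_record(path, user_tags=None):
    """Combine system-derived tags with user-defined ones.

    User tags live under their own key so they can never overwrite
    system-generated values.
    """
    return {"system": auto_tags(path), "user": dict(user_tags or {})}
```

A record such as tag_record("scan.txt", {"project": "churn"}) could then be stored in whatever catalog the machine learning framework queries, independent of where the file itself lives.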
The new data labeling companies differentiate themselves from traditional crowdsourcing platforms that offer labeling services. The companies in this new category often tout their offerings as managed data labeling services. They deliver domain-specific labeled data that undergoes quality control.
Naturally, scale and performance are also critical to an AI-ready data infrastructure. Businesses will almost certainly introduce new models and enhance existing models to include additional data sources over time.