
Your AI Is Only as Smart as Your Metadata


Delivering the right metadata for each AI job improves the accuracy of the results and delivers valuable insights for an organization and its constituents.

Written By
Paul Chen
Mar 3, 2026

In IT circles, AI data management is still a nascent practice, but it has taken center stage. Not long ago, IT teams were copying all their data into cloud data lakes, where the latest AI tools and services live. That tactic soon revealed itself as a recipe for high cloud bills and waste.

Now, as organizations scale AI beyond pilots into production, technical leaders are discovering that raw data volume is not the issue; specificity is. That means curating data that not only complies with security and governance requirements but also meets the exact needs of the project at hand.

It’s not hard to find evidence that properly preparing data for AI workflows is essential to achieving ROI. Informatica’s CDO Insights 2025 survey reports that data quality and readiness are a top obstacle to AI success. Gartner predicts that through this year, organizations will abandon 60% of AI projects because their data is not AI-ready.

See also: Key Lessons for Building Effective RAG Systems

What sets AI workloads apart?

AI workloads, especially RAG and inference pipelines, require continuously curated, deduplicated, and metadata-rich datasets. The trouble is that the unstructured data feeding AI pipelines is difficult to find and leverage because it is widely distributed and lacks context for search. After all, this data has been piling up for decades with very little oversight or quality controls.

It is becoming increasingly apparent that to search across large volumes of unstructured data efficiently, enterprises need to enrich metadata aligned with niche industry and security requirements. Rich metadata that is easily discoverable is how data scientists and researchers can rapidly curate what they need for AI projects and nothing more.

With better metadata, you not only give AI project leaders a pathway to better outcomes, but you can also cut costs by 50-90% because you can send much less data to your AI processors. Increased investment in AI has led to vastly higher storage and infrastructure costs for enterprises, due partly to the need for high-performance, AI-ready GPU storage.

Yet other factors are also in play, including higher egress fees as data moves more frequently between clouds and apps, and higher infrastructure costs due to the current supply chain squeeze on memory and SSDs, which is doubling prices for storage and compute. By sending just the right data to AI, you shrink the storage capacity needed while improving AI accuracy.

The elephant in the room is that unstructured data is noisy and messy. It is poorly classified, leading to wholesale dumps of petabyte-size data sets into cloud storage for use in AI. Avoiding this highly inefficient tactic is critical to keep budgets in check and AI projects moving forward.

See also: Data vs. Metadata: The Overlooked Challenge in Data Management


Meeting the growing requirements for advanced metadata enrichment

File systems automatically generate metadata for unstructured data, indicating file characteristics such as time of creation, owner, time of last access, and file type. Time of last access is especially helpful for creating cold data tiering policies: by knowing which data is active and which is cold, IT can move rarely accessed data to archival storage for cost savings. This metadata is useful but basic, and it is often difficult to gather because it is scattered across the silos where data is stored.

Richer and more specific metadata is needed to achieve better accuracy from AI systems. Such systems often benefit from contextual metadata, such as the project or finance code a file belongs to or the body part imaged by a medical scanner. This capability is available in some security and unstructured data management tools, or by connecting to third-party cloud AI-based scanning and classification tools.

Examples of contextual metadata:

  • A healthcare research director may want to read custom metadata headers from medical DICOM files for tagging.
  • A manufacturing data analyst needs to apply ERP project tags to R&D files.
  • A law firm IT director needs to import sensitive data labels from Microsoft Purview to client files.
  • A genomics researcher wants to import project context from other platforms, such as Electronic Lab Notebooks (ELNs).
  • A media and entertainment company wishes to tag files with specific EXIF metadata from digital media assets. 
  • A bank wants to mask sensitive data before feeding it to AI.
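One way to picture contextual enrichment like the examples above is a rules engine that maps file paths to business tags and emits a searchable sidecar index. The patterns, tag names, and project codes below are illustrative assumptions; in practice the rules might come from an ERP export, a DICOM header parser, or an ELN API:

```python
import fnmatch
import json

# Illustrative path-pattern -> tag rules; real rules would be sourced
# from systems of record (ERP, ELN, Purview labels, etc.).
TAG_RULES = {
    "*/rnd/*": {"project": "R&D-2026", "cost_center": "CC-4411"},
    "*.dcm":   {"modality": "CT", "body_part": "unknown"},
}

def enrich(path: str) -> dict:
    """Return contextual metadata tags for a file based on path rules."""
    tags = {}
    for pattern, extra in TAG_RULES.items():
        if fnmatch.fnmatch(path, pattern):
            tags.update(extra)
    return tags

def build_index(paths: list[str]) -> str:
    """Emit a JSON sidecar index mapping each file to its tags."""
    return json.dumps({p: enrich(p) for p in paths}, indent=2)
```

A sidecar index like this is what makes curation fast: data scientists query the tags rather than crawling the storage silos themselves.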

Writing traditional ETL plug-ins and connectors to produce these richer types of metadata is common practice, yet the approach doesn’t scale well for today’s data volumes and fast-changing AI requirements. By the time a custom plug-in has spent months in development, the requirements have already evolved, and AI cycles are moving too fast to afford such lengthy development cycles.

IT teams must also create and manage such rich metadata while maintaining compliance with corporate AI governance: finding PII, PHI, PCI, and other sensitive information and filtering it out of data sets sent to AI systems. That is yet another monumental task when scaled to billions of files across many storage systems and global sites. In the age of low-code development and AI, these costly custom processes are dated and no longer viable. Enterprises need a modern, agile framework that delivers rapid development and customization of metadata extraction and enrichment.
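The filtering step can be sketched simply. The regexes below are illustrative stand-ins; production systems typically rely on dedicated classifiers (for PHI and PCI in particular), but the shape of the redaction pass is the same:

```python
import re

# Illustrative patterns for two common PII types; real deployments
# would use a vetted classifier rather than hand-rolled regexes.
PII_PATTERNS = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def mask_pii(text: str) -> str:
    """Redact matched sensitive values before text reaches an AI pipeline."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()} REDACTED]", text)
    return text
```

Running the redaction at enrichment time means the sensitive values never leave the governed environment, rather than being scrubbed downstream.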

See also: Why Data-Driven Enterprises Need Data Observability


Automating Custom Metadata for AI

Enterprise IT teams will benefit from a more automated, orchestrated process to develop and run custom data actions for AI. This saves time and effort and expands the possibilities for unique AI data pipelines while aligning with the strictest corporate governance policies.

Ideally, IT users should be able to focus on the metadata-enrichment requirements of each project while avoiding two time sinks: executing instructions across large data sets and infrastructure, and filtering sensitive information from data sets and metadata to comply with privacy and security regulations.

Automation can take care of provisioning the compute infrastructure; repeating the function across a large dataset, which can easily span petabytes and billions of files; scaling as needed; and deprovisioning any cloud resources once the operation is complete.
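The orchestration pattern is fan-out, collect, tear down. As a hedged sketch, a local worker pool stands in for provisioned cloud compute, and `enrich_one` is a hypothetical placeholder for any metadata-extraction action:

```python
from concurrent.futures import ThreadPoolExecutor

def enrich_one(path: str) -> tuple[str, dict]:
    # Stand-in for a real extraction action, which might parse file
    # headers or call a classification service.
    return path, {"chars_in_name": len(path)}

def run_enrichment(paths: list[str], workers: int = 8) -> dict:
    """Fan the enrichment function out across the dataset and collect
    results; `workers` plays the role of provisioned compute, and the
    `with` block deprovisions the pool once the operation completes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(enrich_one, paths))
```

In a cloud setting, the same shape applies with serverless functions or batch jobs in place of threads; the orchestrator's job is making that substitution invisible to the person defining the action.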

Of course, one should look for guardrails ensuring that custom operations cannot delete or modify any data. IT should control who can create custom actions and what approvals are needed to put them into production. But with the proper controls, custom actions provide a powerful way to rapidly tailor metadata extraction to each enterprise’s needs without building full-blown connectors and plug-ins.


Conclusion

Deeply customized data actions are integral to driving positive AI outcomes and measurable progress for enterprises. These data operations add structure to unstructured data so that users can search across it to build AI data sets. Done correctly, this delivers the right metadata for each AI job, improving the accuracy of the results and delivering valuable insights for the organization and its constituents. The data needs of AI grow more complex all the time; IT leaders will need to keep refining their data management ecosystems and architectures so they can flex with new demands as they arise.
