Why Most AI Projects Fail Before They Reach the Algorithm - RTInsights

Why Most AI Projects Fail Before They Reach the Algorithm

The AI revolution isn’t happening in GPUs or model architectures. It’s happening quietly in the data pipelines that feed them.

Apr 23, 2026
4 minute read

Every executive wants to talk about the latest transformer architecture or GPU breakthrough. Meanwhile, most AI projects die quietly in the data preparation phase.

Research from Cognilytica shows that over 80% of the time in a typical machine-learning project is spent on data work—identifying, cleaning, labeling, and organizing the data—rather than the model training that gets most of the attention.

After more than a decade of building AI systems, I’ve seen the same pattern repeat: data scattered across silos, slowing every initiative. The organizations that succeed aren’t necessarily the ones with the biggest models or the most GPUs. They’re the ones that solve data accessibility early, creating a foundation that scales with their AI maturity.

The Equation Everyone Gets Wrong

NVIDIA’s Jensen Huang has argued that “every company’s business data is their gold mine,” and that abundance of data is strategic—but volume alone won’t save a project; usability will.

Think of the equation this way: AI = Models × Compute × Data.

Most organizations pour resources into the first two variables and treat data as an afterthought. But raw information lives everywhere—logs, spreadsheets, APIs, sensor streams—fragmented across SaaS tools and cloud silos. Before you can train anything meaningful, you have to solve the unglamorous problem of how data moves, lives, and connects.

It’s not hoarding if it’s data. But unless that data is structured, searchable, and governed, it’s just digital clutter.
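One concrete reading of "structured and searchable": walk the scattered raw files and register each one in a queryable metadata catalog. A minimal sketch using SQLite; the schema and fields are illustrative assumptions, not any particular product's format.

```python
# Sketch: turn scattered raw files into a searchable metadata catalog.
# The table schema is an illustrative assumption.
import hashlib
import os
import sqlite3

def build_catalog(root: str, db_path: str = ":memory:") -> sqlite3.Connection:
    """Index every file under `root` with the metadata needed for discovery."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, ext TEXT, bytes INTEGER, sha256 TEXT)"
    )
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            con.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                (path, os.path.splitext(name)[1], os.path.getsize(path), digest),
            )
    con.commit()
    return con
```

Once data is indexed this way, "find all CSVs over 1 GB" becomes a one-line query instead of an engineering ticket.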

See also: Is AI Compute Becoming the Next Bottleneck?

The Five Stages Where AI Projects Win—or Fail

Every successful AI initiative follows the same broad path: ingest, process, train, deploy, and monitor. Understanding what data flows through each stage—and how your systems enable that flow—determines whether the project scales or stalls.

1. Ingest and Archive: This is where many teams stumble. Raw data exists in dozens of formats across disconnected systems. Smart ingest means centralizing it into a structured, searchable archive. Unified formats and strong metadata make future discovery possible.

2. Data Processing: This is the notorious 80%. Cleaning, labeling, and engineering features from raw data consume most of the time. As Peter Norvig has long argued, while more data often wins against clever algorithms, the real edge comes from better-quality data. Good storage practices—versioning, traceability, and tamper-proof backups—turn chaos into a repeatable process.

3. Experimentation and Training: By this stage, the heavy lifting is done. Now throughput matters. Slow data transfer can double your cloud bill. I’ve seen organizations cut training costs by more than half simply by improving data access and eliminating idle GPU time.

4. Deployment and Inference: Production models need reliable, versioned data pipelines. But deployment isn’t the finish line—it’s a feedback loop. User interactions, predictions, and performance metrics all feed back into your systems. Latency and throughput at this stage directly affect both experience and margin.

5. Monitoring and Feedback: Continuous monitoring catches model drift and feeds fresh insights into future training cycles. Smart teams treat inference logs and customer feedback as valuable new training data, not digital exhaust.
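The monitoring stage above can be sketched as a simple statistical check: compare recent inference inputs against the training baseline and flag when they diverge. A minimal illustration assuming a single numeric feature and a z-score threshold; production systems typically apply richer tests (PSI, KS) per feature.

```python
# Sketch: flag drift when a feature's recent inference distribution
# moves away from its training baseline. The 3-sigma threshold is an
# illustrative assumption.
import statistics

def drift_score(baseline: list[float], recent: list[float]) -> float:
    """Absolute shift of the recent mean, in baseline standard deviations."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(recent) - mu) / sigma

def has_drifted(baseline: list[float], recent: list[float],
                threshold: float = 3.0) -> bool:
    return drift_score(baseline, recent) > threshold
```

Wired into the feedback loop, a check like this turns inference logs from digital exhaust into an early-warning signal for retraining.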

Why Infrastructure Matters More Than Algorithms

Every decision about where and how data lives compounds across the AI lifecycle. Slow loading extends training times, driving up compute costs. Manual data movement burns engineering hours. Lack of version control means losing work when something breaks.
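The version-control failure mode has a simple antidote: content-addressed snapshots, where each dataset version is stored under the hash of its bytes, so nothing is ever silently overwritten. A hedged sketch; the flat-directory layout is an assumption for illustration, not how any specific tool stores data.

```python
# Sketch: content-addressed dataset versioning. Identical content maps to
# the same version id, so snapshots are deduplicated automatically.
import hashlib
import os

def snapshot(data: bytes, store_dir: str) -> str:
    """Write `data` under its SHA-256 digest and return the version id."""
    version = hashlib.sha256(data).hexdigest()
    os.makedirs(store_dir, exist_ok=True)
    path = os.path.join(store_dir, version)
    if not os.path.exists(path):  # same bytes, same id: nothing to rewrite
        with open(path, "wb") as f:
            f.write(data)
    return version

def restore(version: str, store_dir: str) -> bytes:
    """Fetch the exact bytes of a previously recorded version."""
    with open(os.path.join(store_dir, version), "rb") as f:
        return f.read()
```

Because a version id is derived from content rather than a timestamp, "which dataset trained this model?" always has an exact, recoverable answer.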

As models combine text, images, audio, and video, the storage challenge multiplies. A video is orders of magnitude heavier than an image, which is heavier than a text file. That growth demands a foundation built for both performance and cost efficiency.

In my experience, four principles consistently separate leaders from laggards:

  • Performance when it matters: Match storage performance to workload needs—fast where it counts, efficient everywhere else.
  • Integrity and traceability: Version control and tamper-proof storage protect your data and reputation.
  • Cost-aware lifecycle management: AI creates mountains of intermediate data. Automated tiering controls cost without losing critical history.
  • Proximity to compute: Keep data close to where models train and run. Distance equals latency.
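The cost-aware lifecycle principle above can be expressed as a tiny tiering policy: route each object to a storage tier based on its age and recent access. The tier names and thresholds below are illustrative assumptions, not any provider's defaults.

```python
# Sketch: an automated tiering rule for AI lifecycle data.
# Thresholds are illustrative assumptions; tune them to real access patterns.
def choose_tier(age_days: int, reads_last_30d: int) -> str:
    if reads_last_30d > 10:
        return "hot"      # actively read by training/inference: keep fast
    if age_days < 90:
        return "warm"     # recent but quiet: cheaper, still quick to fetch
    return "archive"      # cold intermediate artifacts: cheapest tier
```

Run nightly over the catalog, a rule like this keeps critical history without paying hot-storage prices for every intermediate artifact.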

See also: How AI Is Forcing an IT Infrastructure Rethink


The Economics of Getting It Right

The winners in AI won’t necessarily have the most advanced models or the biggest clusters. They’ll be the ones who treat data infrastructure as a strategic advantage.

Startups often struggle to gather enough data; once they have it, architecture and cost become the barriers. Enterprises face the opposite challenge: plenty of data, but trapped in silos. Both succeed when they think in pipelines, not projects—treating data movement as an ongoing capability, not a one-off exercise.

They also design for developer velocity. Data scientists should spend time on experiments, not storage administration. Standardized APIs and automation help teams move faster.

And they plan for governance. As AI enters regulated industries, knowing where your data came from and how it’s used is essential for both compliance and trust.

See also: Groups Focus on Infrastructure for AI and High-Performance Workloads

The Foundation That Makes AI Possible

Storage used to be an afterthought—a place to park data once the work was “done.” Now it’s an active tool: a foundation that enables fast access, efficient movement, and reliable orchestration across environments.

The AI revolution isn’t happening in GPUs or model architectures. It’s happening quietly in the data pipelines that feed them.

Get the foundation right, and everything else becomes possible. Get it wrong, and no amount of algorithmic brilliance will save you.

Jeronimo De Leon

Jeronimo De Leon is Senior Product Manager, AI at Backblaze. A product management leader with over 10 years of experience driving AI innovation across enterprise and startup environments, he leads the development of AI/ML features, focuses on how Backblaze enhances the AI data lifecycle for customers' MLOps architectures, and implements AI tools and agents to optimize internal operations.
