
Why Data Gravity is Winning in the AI Era



Written By

Ugur Tigli

Jun 4, 2026


For years, enterprise data architecture was built around one assumption: to get the most value from your data, move it to wherever the compute is.

That assumption worked well enough in the warehouse era. Analytics jobs were relatively predictable, the data was mostly structured, and the cost of copying it from one environment to another was tolerable. If a team needed a new dashboard, model, or data product, the standard answer was to build another pipeline.

Today, however, AI workloads have made that strategy untenable.

In the AI era, “time-to-insight” is no longer just a productivity metric; it is increasingly a financial one. The longer it takes to make fresh data available for training, inference, or decision-making, the more value is left on the table. And when expensive compute sits idle waiting on data movement, that delay shows up not only in project schedules but in infrastructure economics.

This is why data gravity is winning again.
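To make the idle-compute point concrete, here is a back-of-the-envelope sketch in Python. Every input (dataset size, link speed, accelerator count, hourly rate) is a hypothetical placeholder rather than a figure from any real deployment, and the model deliberately ignores protocol overhead, retries, and parallel links.

```python
def idle_compute_cost(dataset_tb, link_gbps, accelerator_count, hourly_rate_usd):
    """Estimate what it costs to keep accelerators waiting on a bulk copy.

    All parameters are illustrative assumptions:
      dataset_tb        - size of the data to move, in decimal terabytes
      link_gbps         - sustained network throughput, in gigabits per second
      accelerator_count - number of accelerators reserved for the job
      hourly_rate_usd   - price per accelerator per hour
    Returns (transfer_hours, idle_cost_usd).
    """
    dataset_bits = dataset_tb * 1e12 * 8            # terabytes -> bits
    transfer_seconds = dataset_bits / (link_gbps * 1e9)
    transfer_hours = transfer_seconds / 3600
    idle_cost_usd = transfer_hours * accelerator_count * hourly_rate_usd
    return transfer_hours, idle_cost_usd


if __name__ == "__main__":
    # Hypothetical scenario: 100 TB over a 10 Gbps link, 64 accelerators at $2/hour.
    hours, cost = idle_compute_cost(100, 10, 64, 2.0)
    print(f"transfer takes about {hours:.1f} hours")
    print(f"idle accelerators cost about ${cost:,.0f}")
```

Under these assumed inputs the copy alone takes roughly 22 hours. The point is not the specific number but that the cost scales linearly with both dataset size and the amount of compute left waiting.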

Despite years of investment in modern lakehouse architectures, many organizations still rely on a copy-first operating model. Data is replicated into cloud regions for analytics, synchronized into separate environments for AI, exported for partners, and staged again for governance or compliance review. Every new use case seems to create another copy, another policy surface, another pipeline, and another failure mode.

At a small scale, this feels manageable. At enterprise scale, it becomes technical debt.

The hidden cost of the copy-first lakehouse

The problem with moving data is not only bandwidth or storage cost. The deeper issue is operational sprawl. Every replicated dataset immediately raises four questions: which copy is current, which is governed, which contains sensitive fields, and which one downstream systems are actually using. What began as architectural flexibility becomes a data management headache.

This matters most in environments where moving data is difficult for reasons unrelated to technology preference. For instance, regulated industries face data sovereignty requirements that prevent data from being moved across jurisdictions. Edge and manufacturing environments generate their most valuable data outside the core cloud estate, where low-latency action matters more than centralized aggregation. At petabyte scale, each additional copy adds latency, multiplies governance surface, and raises breach risk. “Just replicate it” stops being an answer and becomes the problem.

Open table formats like Apache Iceberg and Delta Lake were a genuine step forward in making data lakes transactional and interoperable, but they only solved the format problem and not the movement problem. Most architectures still lack a production-grade way to share data with governed access across tools and teams without copying it first.

What organizations need next is a more practical model for governed access to data where it already lives. The answer is query-in-place. Data stays where it is created, governed where it resides, and accessed in place when teams need it. That means no extra ETL pipeline to maintain, no replication lag between systems, and no shadow copy that expands the audit surface. The real challenge is not inventing another way to move data, but making in-place access fast, secure, and practical in production.

Most architects agree that fewer copies are better, but in practice, many sharing approaches fail when they meet production requirements. If you’ve ever tried to deploy a sharing service using a reference-style implementation, you’ve seen the same failure modes show up repeatedly.

The first issue is security hygiene. A design that looks elegant in a diagram often becomes brittle when real access control, token lifecycle management, encryption policies, and audit expectations are applied. A sharing mechanism is not production-ready simply because it can expose a table. It must fit cleanly into the organization’s identity, security, and policy model.

The second issue is operational behavior. In too many systems, configuration changes are disruptive, service boundaries are unclear, and day-two operations become heavier than expected. What begins as “simple sharing” grows into a set of sidecar services, gateways, proxies, restarts, and custom scripts that must be maintained indefinitely. The operational cost creeps in slowly, then all at once.

The third issue is network reality. Enterprise environments are segmented by design. There are private domains, restricted zones, partner boundaries, and edge locations with inconsistent connectivity. A sharing model that assumes flat connectivity or frictionless trust between systems is unlikely to survive contact with a real enterprise.

Finally, there is endpoint sprawl. Many organizations attempting to reduce copies end up replacing them with something equally problematic, such as a collection of loosely managed synchronization services and shared endpoints scattered across teams. The data isn’t duplicated, but the operational surface is, and it becomes harder to govern than a replicated dataset because ownership is diffuse.

A sharing model can’t just work in a demo. The question is whether it reduces the total infrastructure you need to own and operate, including security, networking, operations, and governance. If it doesn’t, it’s just complexity with better branding.

Where architects should start: put sharing in the data path

Hybrid analytics at enterprise scale breaks down when data sharing is treated as a separate layer bolted on top of storage. The sharing plane needs to live with the data, inheriting the same durability, scalability, and security properties as the storage system itself.

This means three things in practice:

1) Security that travels with the data. Identity and authorization need to sit in the access path. If the sharing plane is separate from the system of record, authorization often becomes fragmented: configured in one place, enforced in another, and audited somewhere else.

A better model is authorization in the access path:

Use enterprise identity (OIDC/SAML integrations, service principals, workload identities).
Prefer short-lived credentials (JWT/OAuth-based flows, scoped tokens) over static secrets.
Enforce policies at request time, not only at publish time.
Produce audit logs that map who accessed what and when, across environments.
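The four points above can be sketched with the Python standard library alone. This is a minimal illustration, not a production design: the token is a simple HMAC-signed payload standing in for a real JWT/OAuth token, and `SECRET`, `issue_token`, `authorize`, and the `dataset:action` scope format are all hypothetical names invented for this example.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical signing key; a real deployment would rely on the identity provider's keys.
SECRET = b"demo-signing-key"


def issue_token(subject, scopes, ttl_seconds, now=None):
    """Return a signed, short-lived token carrying a subject and its scopes."""
    now = time.time() if now is None else now
    claims = {"sub": subject, "scopes": sorted(scopes), "exp": now + ttl_seconds}
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode())
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def authorize(token, dataset, action, audit_log, now=None):
    """Check signature, expiry, and scope on every request, and log the decision."""
    now = time.time() if now is None else now
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    subject, allowed, reason = None, False, "bad signature"
    if hmac.compare_digest(signature, expected):
        claims = json.loads(base64.urlsafe_b64decode(payload))
        subject = claims["sub"]
        if now >= claims["exp"]:
            reason = "token expired"
        elif f"{dataset}:{action}" not in claims["scopes"]:
            reason = "scope not granted"
        else:
            allowed, reason = True, "ok"
    # Audit record: who accessed what, when, and with what outcome.
    audit_log.append({"who": subject, "what": f"{dataset}:{action}",
                      "when": now, "allowed": allowed, "reason": reason})
    return allowed


if __name__ == "__main__":
    log = []
    token = issue_token("svc-analytics", ["sales.orders:read"], ttl_seconds=300)
    print(authorize(token, "sales.orders", "read", log))
    print(authorize(token, "sales.orders", "write", log))
    print(json.dumps(log, indent=2))
```

Because the check runs on each request rather than once at publish time, an expired or out-of-scope token is rejected even for a dataset that was shared successfully earlier, and every decision, allowed or denied, lands in the same audit trail.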

2) Operations that don’t require downtime to change. Once sharing becomes part of the analytics and AI workflow, it also becomes part of the availability requirement. Policies, endpoints, and configurations will evolve constantly as datasets, users, and projects change. That makes hot-reloadable configuration, safe rollout patterns, and predictable failure behavior essential. If routine changes require restarts or service disruption, the architecture will struggle in real production environments. The goal is for the sharing capability to be embedded in the underlying data platform, not operated as a separate service beside it.
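One way to picture hot-reloadable configuration with predictable failure behavior is a policy store that validates a new configuration before swapping it in, and keeps serving the previous one if validation fails. The sketch below is a minimal in-process illustration under assumed names: the `PolicyStore` class and its JSON shape (dataset name mapped to a list of principals) are hypothetical, not any product's format.

```python
import json
import threading


class PolicyStore:
    """Holds access policies and swaps them atomically, without a restart."""

    def __init__(self, initial_json):
        self._lock = threading.Lock()
        self._policies = {}
        self.version = 0
        if not self.reload(initial_json):
            raise ValueError("initial policy configuration is invalid")

    def reload(self, raw_json):
        """Validate first, then swap. Returns True if the new config was applied."""
        try:
            candidate = json.loads(raw_json)
            if not isinstance(candidate, dict):
                raise ValueError("top level must be an object")
            for principals in candidate.values():
                if not isinstance(principals, list) or not all(
                    isinstance(p, str) for p in principals
                ):
                    raise ValueError("each dataset must map to a list of names")
        except ValueError:
            # Predictable failure: a bad config is rejected, the old one stays live.
            return False
        snapshot = {dataset: frozenset(p) for dataset, p in candidate.items()}
        with self._lock:
            self._policies = snapshot   # a single reference swap
            self.version += 1
        return True

    def is_allowed(self, principal, dataset):
        policies = self._policies       # readers see the old or the new snapshot
        return principal in policies.get(dataset, frozenset())


if __name__ == "__main__":
    store = PolicyStore('{"sales.orders": ["alice"]}')
    print(store.is_allowed("alice", "sales.orders"))
    store.reload('{"sales.orders": ["bob"]}')        # applied while serving
    print(store.is_allowed("bob", "sales.orders"))
    print(store.reload("{not valid json"))           # rejected, old policy kept
    print(store.is_allowed("bob", "sales.orders"))
```

The design choice worth noting is that readers never take the lock and never observe a half-applied configuration: each request works against one complete snapshot, so a routine policy change does not interrupt requests in flight.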

3) Scale that matches the underlying platform. Hybrid enterprises have many environments, including factories, regions, clouds, business units, and subsidiaries. If each environment requires a bespoke sharing deployment, the sharing plane becomes a fleet.

The goal is for the sharing capability to scale with the underlying data platform, the same way storage scales: add nodes, add capacity, add throughput, and the sharing layer scales with it. The fewer moving parts you introduce, the more likely you are to maintain consistent security.

A final word

In the end, data gravity is not “winning” because the industry suddenly became nostalgic for on-premises infrastructure. It is winning because enterprises are rediscovering the obvious fact that the fastest, most secure, and most economical path to insight moves the least data.

The modern lakehouse needs a secure sharing plane. If you get that layer right, hybrid analytics becomes a solvable engineering problem rather than a permanent compromise.

And in the AI era, that’s the difference between systems that look modern on paper and systems that actually deliver outcomes in production.

Ugur Tigli

Ugur Tigli is CTO at MinIO, overseeing enterprise strategy and working with its enterprise client base. Ugur has almost two decades of experience building high-performance data infrastructure for global financial institutions. Prior to MinIO, he was senior vice president, global head of hardware engineering, at Bank of America. Ugur joined BofA through the acquisition of Merrill Lynch, where he was the vice president for storage engineering. Ugur has a Bachelor of Science in electrical engineering from Lafayette College.