Apache Hudi: How Uber Gets Data a Ride to its Destination

Apache Hudi provides tools to ingest data into HDFS or cloud storage, and is designed to get data into the hands of users and analysts quickly.

Written By
Joe McKendrick
Jun 4, 2021

At a busy, data-intensive enterprise such as Uber, the volumes of real-time data that need to move through its systems on a minute-by-minute basis reach epic proportions. This calls for a data lake extraordinaire, in which data can be extracted immediately and put to use across a range of functions, from back-end business applications to front-end mobile apps. Uber depends on up-to-the-minute bookings and alerts as part of its appeal to customers, so its reliance on real-time data streaming platforms is off the charts. It has turned to Apache Hudi, an emerging platform that brings stream processing to big data, providing fresh data while being an order of magnitude more efficient than traditional batch processing.

I recently had the opportunity to moderate a webcast about Apache Hudi with Nishith Agarwal and Sivabalan Narayanan, both engineers with Uber. Agarwal and Narayanan are both active members of the Hudi project management committee.

The Hudi data lake project was originally developed at Uber in 2016, open-sourced in 2017, and submitted to the Apache Incubator in January 2019. Apache Hudi data lake technology enables stream processing on top of Apache Hadoop-compatible cloud stores and distributed file systems. It provides tools to ingest data into HDFS or cloud storage, and it offers an incremental alternative to resource-intensive ETL, Hive, or Spark jobs. It is designed to get data into the hands of users and analysts much more quickly.

At Uber, “Hudi powers many different use cases,” says Agarwal, noting that the company’s enterprise data lake is built on Hudi. “We have about 250 petabytes of data that’s managed by the data lake platform. The kinds of use cases that it enables are, for example, whenever you build machine learning pipelines. One of the challenges are if data is changing upstream, and I want to update my feature set, how do I update my feature set without actually reading the entire data and re-snapshotting it? That becomes a really costly process. For example, if we run the data models for UberEats, which are massive, hundreds and hundreds of terabytes and consuming that data becomes tricky. One of the ways where Hudi is being employed is to make all of this incremental, with all of these primitives.”
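To make the incremental idea Agarwal describes more concrete, here is a minimal Python sketch. It is a toy in-memory model, not Hudi's API: the `ToyTable` class, its methods, the integer commit times, and the sample order records are all hypothetical and exist only for illustration. It shows a downstream feature set being refreshed from just the records that changed since the last commit it processed, rather than from a full re-read of the table.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical in-memory table that remembers which commit last wrote each record."""
    # record key -> (commit_time, row); commit times are simple increasing integers here
    rows: dict = field(default_factory=dict)
    last_commit: int = 0

    def upsert(self, batch):
        """Insert or update rows by key under a new commit and return that commit time."""
        self.last_commit += 1
        for key, row in batch.items():
            self.rows[key] = (self.last_commit, row)
        return self.last_commit

    def snapshot(self):
        """Full read: the latest version of every record."""
        return {key: row for key, (_, row) in self.rows.items()}

    def incremental(self, since_commit):
        """Incremental read: only records written after since_commit."""
        return {key: row for key, (commit, row) in self.rows.items() if commit > since_commit}


def to_feature(row):
    """Stand-in for a feature computation (assumed for this example)."""
    return row["total"] * 2


table = ToyTable()

# First commit, followed by a one-time full build of the feature set.
c1 = table.upsert({"order-1": {"total": 10}, "order-2": {"total": 25}})
features = {key: to_feature(row) for key, row in table.snapshot().items()}

# Upstream data changes: one record is updated and one is new.
c2 = table.upsert({"order-2": {"total": 30}, "order-3": {"total": 5}})

# Refresh the features from only the records changed since c1.
changed = table.incremental(since_commit=c1)
features.update({key: to_feature(row) for key, row in changed.items()})
```

In this sketch the second refresh touches two records instead of three. At the scale Agarwal mentions, the same pattern is what avoids re-reading hundreds of terabytes to pick up a small set of changes.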

Another use case is around managing earnings data, Agarwal continues. “As we go through all of the business use cases that Uber has, exposing different data to different customers to different users, how do we do that in an efficient way? How do you point out exactly where the data lies and then be able to expose this data again to the record level, indexing all of these things? Hudi helps immensely in those kinds of use cases.”
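The record-level indexing Agarwal mentions can be pictured with a small Python sketch. This is an assumption-laden toy, not how Hudi implements its indexes: the `ToyIndexedStore` class, the file names, and the driver records are hypothetical. The idea it illustrates is that an index from record key to storage file lets a lookup or update go straight to the one file that holds the record instead of scanning every file.

```python
class ToyIndexedStore:
    """Hypothetical store that keeps a record-key-to-file index next to the data files."""

    def __init__(self):
        self.files = {}  # file_id -> {record_key: row}
        self.index = {}  # record_key -> file_id

    def write(self, file_id, records):
        """Write records into a file and register each key in the index."""
        self.files.setdefault(file_id, {}).update(records)
        for key in records:
            self.index[key] = file_id

    def lookup(self, key):
        """Find a record through the index; returns (file_id, row)."""
        file_id = self.index[key]
        return file_id, self.files[file_id][key]

    def update(self, key, row):
        """Rewrite one record in place and return the only file that was touched."""
        file_id = self.index[key]
        self.files[file_id][key] = row
        return file_id


store = ToyIndexedStore()
store.write("file-a", {"driver-1": {"earnings": 120}, "driver-2": {"earnings": 80}})
store.write("file-b", {"driver-3": {"earnings": 200}})

located_in, record = store.lookup("driver-3")
touched = store.update("driver-2", {"earnings": 95})
```

Here the update to `driver-2` touches only `file-a`, and `file-b` is never read or rewritten.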

Going forward, Agarwal anticipates tighter integration with other streaming platforms such as Kafka. “Generally, Hudi will connect to Kafka directly and pull streams. Kafka Streams itself is also an execution framework, like Apache Flink, but has some custom semantics, and right now, there is no support for running Hudi on Kafka Streams, but we are looking at providing connectors that may be able to do that.”

Joe McKendrick is RTInsights Industry Editor and an industry analyst focusing on artificial intelligence, digital, cloud, and Big Data topics. His work also appears in Forbes and Harvard Business Review. Over the last three years, he served as co-chair for the AI Summit in New York, as well as on the organizing committee for IEEE's International Conferences on Edge Computing. Follow him on Twitter @joemckendrick.
