Summit speakers discussed the open-source data ecosystem and its role in modern businesses.
Data infrastructure is critical as data volumes continue to explode and businesses try to extract more value and insight from their data. Open-source technologies and solutions play an increasingly essential role in that infrastructure. These and other themes were the focus of the recent Open Source Data Summit.
According to Onehouse, one of the summit’s sponsors, the live virtual event attracted thousands of registrants from around the world and included more than 30 speakers.
Onehouse Founder and CEO Vinoth Chandar kicked off the day with a keynote address on the role of open source in data infrastructure. Chandar discussed the history of open source and surveyed the tools and technologies in the open data ecosystem, including databases, data lakes, data warehouses, stream processing, and more. He emphasized the need for a thoughtful strategy when adopting open-source data solutions and highlighted the challenges and considerations involved. The talk concluded with a blueprint for an open data architecture that offers flexibility, interoperability, and control.
The keynote presentation is available to view on demand.
Industry leaders talk open-source data
The one-day summit included speakers from Netflix, Uber, Walmart, LinkedIn, Tesla, Wayfair, Google, Microsoft, and more.
One particularly interesting session covered OneTable, a new open-source project that “unlocks omni-directional interoperability between the popular lakehouse projects Apache Hudi, Delta Lake, and Apache Iceberg.” Speakers included Ashvin Agrawal, Senior Researcher at Microsoft; Tim Brown, Engineering at Onehouse; and Anoop Johnson, Senior Staff Software Engineer at Google.
According to the speakers, OneTable offers lightweight conversion mechanisms that take a source metadata format and sync it into one or more target metadata formats. The session featured a live demo, and the speakers described how to build open data foundations that could accelerate workloads across a variety of open-source query engines, including Spark, Presto, Trino, Flink, and more. The session is available on demand.
Other sessions included talks by:
- Jordan West, Staff Software Engineer at Netflix, on the practicalities of deploying open-source databases.
- Patrick McFadin, VP of Developer Relations at DataStax, on a petabyte-scale vector store for generative AI.
- Ankur Ranjan, Data Engineer III, and Ayush Bijawat, Senior Data Engineer, both from Walmart, on enabling Walmart’s data lakehouse with Apache Hudi.
- Tun Shwe, VP of Data at Quix, and Jay Clifford, Developer Advocate at InfluxData, on data plumbing basics: how to build, deploy, and scale ML models for time series data.
- Nishith Agarwal, Head of Data & ML Platforms at Lyra Health, on making decisions that are right for your data platform.
- Siddharth Jain, Senior Engineering Manager at Wayfair, on options for real-time data pipelines.
In addition to these and other sessions, several panel discussions took place throughout the day. One focused on batch, streaming, and real-time data processing for ML, with speakers from Eastern Bank, Intuit, and Tecton. Another examined the growing role of open-source technology in today’s data architectures, with speakers from Onehouse, Microsoft, Confluent, LinkedIn, Starburst, Uber, and Google.
A complete list of the sessions and panels, all of which are available on demand, can be found on the summit’s website.