To Batch or to Stream: That Is the Question of the Day


Claypot AI founder Chip Huyen designs machine-learning systems that combine streaming data and batch processing to accommodate multiple data types.

As part of our media partnership with Tecton’s apply(conf), which took place earlier this year, RTInsights had the opportunity to speak with Chip Huyen, author, Stanford instructor, and startup founder, about her apply(conf) session, “Machine Learning Platform for Online Prediction and Continual Learning” and the path that led to this subject matter expertise. Chip Huyen recently wrote “Designing Machine Learning Systems” (O’Reilly, 2022) and has authored four bestselling books in Vietnamese.

Note: This interview was edited and condensed for clarity.

RTInsights: How did you get from writing Vietnamese books to becoming a machine-learning systems designer?

Chip Huyen: I started with a background in writing and then in college started taking computer science classes. At Stanford, it’s almost impossible not to take computer science courses. My friends told me that since computer science is part of the fundamental requirement there, I should just take a course as soon as possible to get it out of the way and enjoy it.

When I started at Stanford, I thought I would be a creative writing major, definitely not a computer science major. Then I took that first course, and I was very lucky because the professor actually inspired me to become a teacher. His course totally changed my mind about computer science. He actually made it fun. So I decided to become a computer science major.

On Quora, someone asked, “Who do you think is the best professor in the world and why?” I wrote about my first computer science professor, and the answer got around a million views.

RTInsights: When you were taking those courses and for your major, did they include data science and statistics? How did you end up in machine learning specifically?

Huyen: My first class was a general introductory course, but we learned by creating games, which was really fun.

A few classes were computational, which suited me as I had a background in math (I actually was on the math team in high school). Statistics and probability are really fun. I took a systems class and a database class, and these were miserable and so hard. That’s when I knew that I was not going to be a database person.

An AI class changed my mind about data. I really think my perception of the classes was due to the teachers and how they approached the subject. Since AI was very interesting to me and seemed easy enough, I decided to go with it.

RTInsights: We’d like to hear more about your take on streaming data and batch data. It seems to us that the technical skills for dealing with batch are really different from those for streaming. Analyzing streaming data is still the subject of a lot of academic research. Is there a way to bridge the gap?

Huyen: You do need both. The difficulty with streaming is that it’s still new and we don’t have enough tooling for it. A well-designed platform should abstract away the complexity. You should not have to worry about how you get the data, just that you get the best data for what you’re trying to accomplish.

The fundamental difference between the two is that batch data usually refers to data at rest, data that has been collected in a data warehouse or some other storage system. Streaming data is data in motion.

One method to access both kinds of data is to dump the streaming data into a data warehouse. Even if you’re able to dump it every hour or every ten minutes, and it’s super fast, you are still waiting for data. Realistically, people are still collecting their streamed data on a daily basis, so there are a lot of businesses waiting for data.
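The waiting Huyen describes can be made concrete with a back-of-the-envelope calculation. This sketch (with made-up interval and load-time numbers, purely for illustration) shows the worst-case staleness of warehouse data under periodic dumps: an event that arrives just after a dump begins sits unqueryable for a full interval plus the load time.

```python
from datetime import timedelta

def max_staleness(dump_interval: timedelta, load_time: timedelta) -> timedelta:
    """Worst-case age of a record when streaming data is periodically
    dumped into a warehouse: an event arriving just after a dump starts
    waits a full interval, plus the time the load itself takes."""
    return dump_interval + load_time

# Hypothetical numbers: daily dumps vs. ten-minute micro-batches.
daily = max_staleness(timedelta(days=1), timedelta(minutes=30))
ten_minute = max_staleness(timedelta(minutes=10), timedelta(minutes=2))
print(daily, ten_minute)  # even "fast" micro-batching leaves data minutes old
```

Even aggressive ten-minute micro-batching leaves data up to roughly twelve minutes stale in this toy model, which is why truly real-time use cases push teams toward streaming.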

Some businesses tap directly into social media data that is fresher and newer but stay away from actually streaming data because it’s so hard. The ideal would be to have it be equally easy to do streaming and batch.

That’s what Claypot AI does. The software handles the two types of data feeds in the background and gives the user access to all of the data. We can also provide access to streaming data in almost real-time. The business doesn’t have to collect streamed data on an almost batch-like schedule. Usually, batch and streamed data have to be accessed with different tools. We make that go away too. The business user doesn’t have to choose between types of data or the tools to analyze it.
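The interview doesn't describe Claypot AI's actual interface, but the general idea of hiding the batch/stream split behind a single access layer can be sketched in a few lines of Python. All class and function names below are hypothetical stand-ins, not any real product's API:

```python
# Hypothetical sketch: one interface over batch and streaming feature sources,
# so the caller never knows where a value came from.
from abc import ABC, abstractmethod

class FeatureSource(ABC):
    @abstractmethod
    def get(self, key: str) -> float: ...

class BatchSource(FeatureSource):
    """Stand-in for a warehouse table of precomputed features."""
    def __init__(self, table: dict):
        self.table = table
    def get(self, key: str) -> float:
        return self.table[key]

class StreamSource(FeatureSource):
    """Stand-in for a consumer that keeps the latest value per key."""
    def __init__(self):
        self.latest = {}
    def on_event(self, key: str, value: float):
        self.latest[key] = value
    def get(self, key: str) -> float:
        return self.latest[key]

def feature_vector(sources: list, key: str) -> list:
    # Batch and streamed features are fetched through the same interface.
    return [source.get(key) for source in sources]
```

A model-serving caller would assemble its features with one call, e.g. `feature_vector([batch, stream], "user_1")`, regardless of which values came from the warehouse and which from the stream.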

RTInsights: Claypot AI sounds like it is bridging the divide between the streaming world and the batch world. It’s going to make the end user’s life easier, but it’s also taking on some important ML engineering work. What advice do you have for someone starting out in ML engineering?

Huyen: They should be going into crypto instead [laughs]. It’s interesting how quickly the field got so crowded. Three years ago, no one was talking about MLOps. From two years ago until even now, people have had different definitions of what it is.

I think people don’t really understand who’s on an MLOps team or who actually does MLOps. The field is definitely still in flux.

What I am seeing is that a lot of people are trying to get ML skills. They think the quickest way is to follow the tutorial approach. There are examples of how parts of the workflow are done, and they learn templates that they think they can just copy and paste.

The problem with this approach is that you might not have a clear understanding of the problem you are trying to solve, and you certainly don’t know why you are doing things a certain way.

If you learned MLOps through online tutorials, you’ll be fine as long as things are running smoothly, but you won’t have a good basis for complex problem solving. To me, it’s very important to be less focused on technology and more problem-oriented. To find people like that, you have to look through a lot of resumes.

The problem with technologies is that they get outdated very quickly. Just having TensorFlow on your resume isn’t enough. A good interview question is, “How would you organize a project to address a specific problem?”

Another misunderstanding is not realizing that machine learning engineering is mostly engineering and very little machine learning. A lot of people interested in MLOps might take some machine learning classes but lack a deep engineering background. If you want to be successful in MLOps, focus on being a great engineer.

RTInsights: What advice would you have for data scientists working on small teams that are tasked with bringing a project into production and might not have those kinds of engineering skills?

Huyen: That is a tough question because the answer depends on what kind of infrastructure exists and what stage their company is at.

Companies that developed their data engineering approaches early have a very different set of tooling and a different platform from a company that’s just adopting machine learning now.

Everything is changing so fast, and this might be one case where the first movers are actually at a disadvantage because they had to develop some of their own tooling in-house or adapt platforms that weren’t designed for the purpose they are used for.

RTInsights: What do you see looking forward? What will be some of the important trends in data and analytics?

Huyen: People are getting more comfortable working with real-time data. And a lot of that work is going to migrate to the cloud. In the cloud, a workload can scale from using five servers to a thousand and then scale back down. Having this flexibility to manage fluctuations and to stay on the same compute platform regardless of the need will have a tremendous impact.

Some of the companies I’ve talked to have a completely separate platform for fast access to data from their more traditional data warehouse.

RTInsights: What are some of the use cases for the fast or real-time access to data?

Huyen: Fraud detection is probably the largest one: not only detecting fraud immediately after a transaction occurs, but also predicting when fraud is likely and being able to mitigate it or cancel a transaction. That is very powerful.

Another important use case is dynamic pricing, which is optimizing a price that reflects the context in the moment. That’s what Uber, Lyft, and Airbnb are able to do.

Another is more sophisticated recommendation engines. Currently, these work on historical data, but access to fresh data, the current context, will increase their accuracy immensely.

View Chip’s apply(conf) talk here.

*Lisa Damast contributed to this article.

Read the rest of the series:

6Q4: Demetrios Brinkmann, on the role of community in solving MLOps’ greatest challenges

Amplify Partners’ Sarah Catanzaro on the evolution of MLOps

apply(conf) puts everything on the table to help data and ML teams succeed


About Elisabeth Strenger

Elisabeth Strenger is a senior technology writer at
