ODSC brought renewed attention to the need for operationalized data processes when using advanced analytics, machine learning, or artificial intelligence.
For years, most efforts involving sophisticated analytics, AI, or machine learning were like lab experiments. They were overseen by data scientists and data engineers, and each undertaking was a one-off endeavor. Numerous talks at the recent Open Data Science Conference (ODSC) in California point to those days being over.
While the conference covered many aspects of data science, what emerged (at least from my perspective) is that data efforts once done as one-time projects are now considered essential. As such, they must be production-ready and developed to address security, performance, and reliability from the start.
In fact, a sampling of the talks shows organizations are adopting and implementing practices such as DataOps and MLOps. Such efforts, by definition, tie operational concerns to a project from the outset. Some notable examples include:
ODSC: “DataOps 2.0” – How the Changing MLOps Landscape is Reinventing DataOps
Speaker: Rajsekhar (Raj) Aikat, Chief Technology & Product Officer | iMerit Technology
This session covered how commercial deployments of MLOps at scale are going through an evolutionary change across almost all verticals and use cases. Notably, deployments and products at scale are seeing a whole slew of commercialization issues arise as “edge cases” become more significant at scales of thousands, millions, or even billions.
As a result, “traditional pre-deployment training” is no longer sufficient to ensure quality and reliability standards in production, especially on edge devices. With personalization at the edge, devices such as cars, phones, and cameras need to be continuously trained with custom, specific data relevant to their individual usage or environment to be safe and accurate.
The speaker noted, for example, that the neural networks that govern the way an autonomous vehicle drives in New York City are not the same as those needed in Kansas City or Mexico City, and surely not in Yellowstone National Park. The explosion of data streams from billions of “sensors” in real and virtual worlds needs to be constantly processed for retraining models, and data preparation is getting extremely complex as use cases strive to replicate normal human behavior.
The “Ops” angle here relates to both DataOps and MLOps. The speaker noted that DataOps activities such as data preparation and annotation, whose KPIs are fast emerging as critical for almost all ML applications, are no longer a “one-way street” process feeding the MLOps cycle but are intricately fused to it at multiple stages.
As such, technology’s role in DataOps and MLOps has become more important than ever before. A robust platform for managing the workforce, projects, tasks, and data is one of the most critical pieces for operating at that scale.
ODSC: Why you can’t Apply Common Software Best Practices Directly to Data Workflows, and What you can do About it
Speaker: Anna Filippova, Director, Community & Data | DBT
This session looked at the specific challenges to adopting software engineering best practices for data and analytics workflows, why they exist, and how data scientists can craft environments to best address common pitfalls and encourage reproducibility.
Why is all of this needed? The speaker noted that every day, virtual mountains of data are collected and stored at unfathomable speeds. As data volume grows exponentially, the data workflow becomes more complex: the avalanche of data makes it challenging to identify, cleanse, mine, pivot, and use that data for both insights and AI-powered product features.
To derive the most value from their data, data professionals must be able to set up their workflow in a way that maximizes not only their own efficiency and productivity but also data reproducibility. To do this, data teams borrow many best practices from software engineering, such as testing, version control, documentation, and continuous integration and deployment (CI/CD). However, there are important differences in how these practices are implemented in data workflows, and those differences hamper the success of data teams.
The session also covered specific actions leaders can take and offered real-life examples and use cases. Attendees got a deeper understanding of how to avoid common pitfalls and how to improve team collaboration and reproducibility in data workflows.
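To make the idea of borrowing software testing for data workflows more concrete, here is a minimal sketch in Python. It is an illustration only, not something shown in the session: the function names, the `order_id` column, and the sample rows are all hypothetical. It applies two common data checks (no missing values, no duplicates) in the way a unit test would be applied to code.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_not_null(rows, column):
    """Fail if any row is missing a value in `column`."""
    missing = [i for i, row in enumerate(rows) if row.get(column) is None]
    return CheckResult(f"not_null({column})", not missing, f"{len(missing)} null row(s)")


def check_unique(rows, column):
    """Fail if any non-null value in `column` appears more than once."""
    seen, dupes = set(), set()
    for row in rows:
        value = row.get(column)
        if value is None:
            continue
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return CheckResult(f"unique({column})", not dupes, f"{len(dupes)} duplicated value(s)")


def run_checks(rows):
    """Run every check; a CI job could fail the build if any check fails."""
    return [check_not_null(rows, "order_id"), check_unique(rows, "order_id")]


# Hypothetical sample data: one duplicated id and one missing id.
orders = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 1, "amount": 35.5},
    {"order_id": None, "amount": 12.0},
]

results = run_checks(orders)
for result in results:
    status = "PASS" if result.passed else "FAIL"
    print(f"{status} {result.name}: {result.detail}")
```

Unlike a typical unit test, these checks run against the data itself, so the same unchanged code can pass on one day's batch and fail on the next. That is one reason such checks are usually run on every pipeline execution rather than only when the code changes.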
ODSC: Cloud Directions, MLOps, and Production Data Science
Speaker: Joe Hellerstein, Ph.D., Jim Gray Professor of Computer Science | University of California, Berkeley
This session focused on how cloud computing promises to simplify infrastructure, but somehow MLOps remains deeply technical, even in the cloud. The speaker noted that the complexity of MLOps tends to lead to an organizational antipattern: data scientists who know the data and models best have to mind-meld with data engineers who know the infrastructure best. This is particularly problematic in the highest-value stage of the ML lifecycle – managing models in production.
The speaker noted that recent trends in cloud technology, including serverless computing, promise new approaches for abstracting away infrastructure. Unfortunately, these offerings fall short of the challenge of MLOps. Dr. Hellerstein then covered some of the important promises and weaknesses of current cloud offerings and described research from Berkeley’s RISELab and the resulting open-source Aqueduct system, which aims to put Production Data Science at the fingertips of anyone working with data and models.
A final word from ODSC
These and other ODSC sessions brought renewed attention to the need for operationalized data processes when using advanced analytics, machine learning, or artificial intelligence.