MLOps can be used to improve time to market and ensure ML models meet organizational, compliance, and end-user requirements.
MLOps stands for Machine Learning Operations. It is a core part of machine learning engineering, which involves deploying, maintaining, and monitoring machine learning models in production. MLOps is often a collaborative effort carried out by data scientists, machine learning engineers, DevOps engineers, and IT specialists.
MLOps can improve the quality of machine learning solutions. It lets data scientists and machine learning engineers collaborate more effectively, by implementing continuous integration and deployment (CI/CD) practices, together with monitoring, validation, and governance of ML models. The end result is to improve time to market and ensure ML models meet organizational, compliance, and end-user requirements.
Here are the important elements needed to deploy MLOps successfully:
- Adequate infrastructure resources—machine learning models need resources throughout their lifecycle. Additionally, these resources change as the model progresses from concept to later stages like development and production.
- Support for different ML model formats—an MLOps solution needs to be independent of details like the programming languages an ML model uses and its development strategy. After all, most enterprises use multiple languages and frameworks to develop their models.
- Support for software dependencies—an ML model will have multiple dependencies, more so if built on open-source technologies. The MLOps solution will need to support such dependencies and their version control.
- Monitoring models—ML models are trained on historical data, and when their environment changes, they need to be trained again. Hence, an MLOps solution should monitor models to ensure they don’t drift from expected behavior while in production.
- Ability to deploy anywhere—an ML model may need to be deployed in the cloud, on-premises, or at the edge. Hence, an MLOps solution must allow multiple deployment patterns so the production environment can remain flexible.
- Adequate data and governance—an ML model needs to have sufficient data to reach an adequate level of performance. Synthetic data is helping make larger datasets available without privacy concerns. In addition, an MLOps solution needs to provide sufficient data governance abilities so a model’s processes can gain the trust of businesses and regulators.
- Model retraining—an ML model must adapt to new data, and teams must ensure it doesn’t break because of the new data. Hence, an MLOps solution should allow models to be retrained on newer data while retaining the original algorithms, data pipelines, and codebases.
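The monitoring and retraining points above can be sketched with a simple statistical check. The mean-shift test, threshold, and synthetic data below are illustrative assumptions, not a prescribed MLOps method:

```python
import numpy as np

def mean_shift_drift(reference, live, threshold=3.0):
    """Flag drift when the live sample's mean sits more than `threshold`
    standard errors away from the reference (training-time) mean."""
    se = reference.std(ddof=1) / np.sqrt(len(live))
    z = abs(live.mean() - reference.mean()) / se
    return z > threshold

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=10_000)  # training-time feature
shifted = rng.normal(loc=0.5, scale=1.0, size=1_000)     # drifted production feature

print(mean_shift_drift(reference, reference))  # no drift against itself
print(mean_shift_drift(reference, shifted))    # mean moved by 0.5: drift flagged
```

In a real pipeline, a check like this would run on a schedule against fresh production inputs and trigger the retraining workflow while the original code and data pipelines stay in place.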
Data is crucial for machine and deep learning algorithms. After all, their predictions’ accuracy depends on how well the data is selected, collected, and preprocessed through methods like categorization, filtering, and feature extraction. Therefore, how data is aggregated from various sources and stored for AI applications significantly influences hardware design.
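As a toy illustration of the preprocessing steps just mentioned (filtering, feature extraction, categorization), here is a small NumPy sketch; the sensor readings, validity range, and bin edges are invented for the example:

```python
import numpy as np

# Hypothetical raw sensor readings; NaN marks failed measurements.
raw = np.array([12.0, np.nan, 15.5, 14.2, np.nan, 90.0, 13.1])

# Filtering: drop missing values and implausible outliers.
clean = raw[~np.isnan(raw)]
clean = clean[(clean > 0) & (clean < 50)]

# Feature extraction: standardize to zero mean, unit variance.
features = (clean - clean.mean()) / clean.std()

# Categorization: bucket the cleaned values into coarse bins
# (0 = below 13.0, 1 = 13.0-15.0, 2 = 15.0 and above).
categories = np.digitize(clean, bins=[13.0, 15.0])

print(clean)       # [12.  15.5 14.2 13.1]
print(categories)  # [0 2 1 1]
```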
Resources for an AI application’s data storage and computational power usually don’t scale together, so many systems handle the two separately. For example, some systems dedicate large, fast local storage to each AI compute node to feed the algorithm, ensuring there is ample storage bandwidth to execute the algorithm and drive the AI’s performance.
Machine and deep learning algorithms involve a great number of matrix multiplications and floating-point calculations. Moreover, such algorithms perform these calculations in parallel, much like computer graphics workloads such as ray tracing and pixel shading.
While machine and deep learning calculations require high parallelism, they don’t need the same level of accuracy as graphics calculations. This makes it possible to reduce floating-point precision—using fewer bits per number—to improve performance. Early deep learning research used standard GPU cards originally designed for graphics applications, but GPU manufacturer NVIDIA has since developed data center GPUs specifically for AI workloads.
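The effect of reduced floating-point precision can be seen directly in NumPy by comparing a float32 matrix product against one computed in float16. The matrix size and error bound here are arbitrary, and real accelerators use specialized reduced-precision formats (e.g., bfloat16, TF32) rather than plain float16 matmuls:

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.standard_normal((256, 256)).astype(np.float32)
b = rng.standard_normal((256, 256)).astype(np.float32)

exact = a @ b                                                    # float32 reference
reduced = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

# Halving the bits per number introduces only a small relative error here,
# while cutting memory traffic in half (and, on supporting hardware,
# roughly doubling arithmetic throughput).
rel_err = np.abs(exact - reduced).max() / np.abs(exact).max()
print(f"max relative error: {rel_err:.4f}")
```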
Here are the system elements most crucial for AI performance:
| Component | Role in AI performance |
|---|---|
| CPU | Runs the virtual machines or containers, dispatches code to the GPUs, and handles I/O operations. Modern CPUs can also accelerate ML and DL inference, making them useful for production AI workloads that serve models previously trained on GPUs. |
| GPU | Responsible for training ML and DL algorithms, and often handles inference as well. Modern GPUs embed high-bandwidth memory modules that are much faster than regular DDR4 or GDDR5 DRAM; a system with eight GPUs can offer 256–320 GB of high-bandwidth memory. |
| Memory | Since AI operations mainly run on the GPU, system memory isn’t normally a bottleneck. Servers usually carry about 128–256 GB of DRAM. |
| Network | AI systems are commonly clustered for better performance and use Ethernet interfaces of 10 Gbps or higher. Some systems also have dedicated GPU interconnects for node-to-node communication within the cluster. |
| Storage speed | The data transfer speed between storage and compute resources affects the performance of AI workloads. Hence, NVMe drives are generally preferred over SATA SSDs. |
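To make the storage-speed point concrete, the snippet below times a sequential read of a scratch file. It is a rough sketch, not a benchmark: the OS page cache, file size, and drive type dominate real-world numbers, and the freshly written file will often be served from cache:

```python
import os
import tempfile
import time

# Write a 16 MiB scratch file, then time a sequential read of it.
size_mb = 16
payload = os.urandom(size_mb * 1024 * 1024)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

start = time.perf_counter()
with open(path, "rb") as f:
    data = f.read()
elapsed = time.perf_counter() - start
os.unlink(path)

print(f"read {size_mb} MiB in {elapsed:.4f} s (~{size_mb / elapsed:.0f} MiB/s)")
```

Comparing numbers like these across an NVMe drive and a SATA SSD (with the cache dropped) shows why data-hungry training jobs favor NVMe.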
MLOps can be hosted both on-premises and in the cloud:
Cloud-based MLOps provides access to a variety of managed services and features. Leading cloud providers let you run MLOps processes in the cloud, providing the tools and computing power you need, without having to procure and set up hardware and build an in-house ML environment. Here are examples of services provided by the leading cloud providers:
- Amazon SageMaker is an ML platform that helps you build, train, manage, and deploy machine learning models in production-ready ML environments. SageMaker accelerates experiments with specialized tools for labeling, data preparation, training, tuning, and administrative monitoring.
- Azure ML is a cloud-based platform for training, deploying, automating, managing, and monitoring any machine learning experiment. Like SageMaker, it supports supervised and unsupervised learning.
- Google Cloud AI Platform is an end-to-end, fully managed platform for machine learning and data science. It offers features to help you manage ML services and create efficient ML workflows for developers, scientists, and data engineers, and it enables fully automated machine learning lifecycle management.
On-premises MLOps requires deploying resources such as multi-GPU AI workstations in the on-premises data center. For large-scale AI initiatives, this may also call for orchestration software, such as Kubernetes, to manage clusters of computational nodes.
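For the Kubernetes case, a minimal sketch of a Pod that requests one GPU through the NVIDIA device plugin might look like the following; the pod name, container image, and entry point are illustrative, not a recommended configuration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job            # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3   # example image tag
      command: ["python", "train.py"]           # hypothetical entry point
      resources:
        limits:
          nvidia.com/gpu: 1     # requires the NVIDIA device plugin on the node
```

The scheduler places the Pod only on nodes advertising a free `nvidia.com/gpu` resource, which is how a cluster of multi-GPU nodes gets shared across training jobs.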
In this article, I explained the basics of MLOps, and how it impacts organizations and their data centers. I described the key elements of AI hardware infrastructure:
- CPU – modern CPUs can be used to accelerate certain types of ML models.
- GPU – essential for running deep learning and some ML algorithms at scale.
- Memory – becoming a non-critical resource due to reliance on GPU on-board memory.
- Network – fast network connections are needed between GPU clusters.
- Storage – data transfer speed affects AI workload performance, requiring NVMe drives.
Lastly, I explained how AI infrastructure can be set up in the cloud vs. on-premises. I hope this will be useful as you plan the data center requirements for your organization’s AI initiatives.