There still isn’t a leading player in open-source AIOps tooling for DevOps. Here are some of the reasons why and some of the contenders.
The complexity of today’s real-time, cloud-native deployments, from containers to Kubernetes, is a mixed bag for DevOps and SRE teams. On the one hand, microservices and event-driven architectures relieve many of their operational headaches. On the other hand, keeping tabs on the sheer number of moving parts can verge on impossible.
In the past, DevOps and SRE teams often turned to open-source projects to build custom pipelines of observability data, dashboarding, alerting, and incident response tooling to get the job done. These tools have worked brilliantly to help them stay on top of thousands of pieces, deal with constant internal change, ensure there’s zero downtime, and adapt to all the new applications and integrations led by citizen developers.
But for some, there’s still too much to keep track of. Enter AIOps, or artificial intelligence for IT operations, as a newfangled solution to a seemingly unsolvable problem. Is it the answer, and is there an option for the open-source advocates on your DevOps or SRE team?
AIOps platforms are designed to combine observability data (metrics, logs, and traces) with machine learning models and data science to help teams solve operational problems faster and with fewer headaches. That means building more robust incident response playbooks, reducing downtime, and driving down key metrics like the mean time to resolution (MTTR). Some solutions even promise to help predict incidents before they affect the end-user experience.
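To make the MTTR metric concrete, here is a minimal sketch of how it could be computed from incident records. The timestamps and the `mean_time_to_resolution` helper are hypothetical and used only for illustration; a real team would pull these values from its incident management tooling.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, resolved_at).
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 45)),
    (datetime(2024, 1, 10, 14, 30), datetime(2024, 1, 10, 16, 0)),
    (datetime(2024, 1, 21, 2, 15), datetime(2024, 1, 21, 2, 30)),
]


def mean_time_to_resolution(records):
    """Average the (resolved - detected) duration across all incidents."""
    if not records:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in records),
        timedelta(),  # start value so the durations add up as timedeltas
    )
    return total / len(records)


mttr = mean_time_to_resolution(incidents)
print(f"MTTR: {mttr}")
```

Tracking this number before and after adopting an AIOps tool is one simple way to check whether the tool is delivering on its promise.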
Ultimately, the goal of AIOps is to relieve DevOps and SRE teams of manual tasks and of the need to respond tirelessly to every possible incident out of fear they might be missing “the big one.” Instead, they can focus on building incident response playbooks, improving Kubernetes manifests, or reconfiguring public cloud resources to scale better or fail over cleanly.
And it’s an appealing story. According to a 2020 study from Global Market Insights, the AIOps industry was worth more than $2 billion in 2020 and is poised to grow at a 20% compound annual growth rate (CAGR) through 2027. At a $10 billion projected value, there’s plenty of room for startup unicorns and entrenched DevOps providers to battle over market share.
The people who make up DevOps and SRE teams tend to have a long-standing affinity for open source. Given that Linux itself, plus most of the tooling used to develop and deploy software and infrastructure, is rooted in open source, they’ve almost certainly used it professionally. Some companies even insist on using open-source tooling across everything they do for deployment, infrastructure, and incident response.
Core observability tools have many open-source contenders, like Grafana, Opstrace, Jaeger, Fluentd, Thanos, Cortex, Sentry, and others. There are also the OpenMetrics and OpenTelemetry projects, which are trying to standardize how metrics, logs, and traces are stored and organized to prevent vendor lock-in and simplify the process of integrating discrete platforms.
But there still isn’t a leading player in open-source AIOps tooling. seldon-core, an “MLOps framework to package, deploy, monitor and manage thousands of production machine learning models,” is a clear leader in its own niche, but it’s only a platform for deploying the machine learning models that a company might use to implement AIOps. That means companies still need to implement the logic themselves using one of the supported toolkits, like TensorFlow, Spark, or R. loglizer is a “machine learning-based log analysis toolkit” capable of running automated anomaly detection on logs, but it has far less reach. netdata includes a handful of built-in anomaly detection tools but only captures and visualizes metrics data, not logs or traces.
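For readers unfamiliar with what anomaly detection on metrics looks like in practice, the following is a deliberately simple sketch based on a rolling z-score. It is not how any of the projects above work internally; the function name, window size, threshold, and sample data are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return the indices of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` standard deviations."""
    history = deque(maxlen=window)  # the most recent `window` samples
    anomalies = []
    for i, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip perfectly flat windows to avoid dividing by zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Hypothetical CPU utilization samples (percent) with one obvious spike.
cpu = [41, 43, 42, 40, 44, 42, 43, 95, 42, 41]
print(rolling_zscore_anomalies(cpu))
```

A statistical baseline like this is easy to build in-house. The harder parts, which are where a mature AIOps platform would earn its keep, are correlating anomalies across metrics, logs, and traces and keeping false alarms low.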
Another option is to build an AIOps pipeline using in-house development talent. That’s what Wells Fargo did, assembling a comprehensive pipeline from Apache Flink, Keras, Grafana, MySQL, and Prometheus. But given what we can assume are enormous DevOps/SRE resources at the company’s fingertips, this kind of DIY AIOps solution is far beyond reach for most companies.
Why isn’t there more consensus around AIOps in open source?
The shortest answer is that effective AI/ML applications are extraordinarily complex, and open source isn’t the first place this burgeoning industry looks for answers. The data scientists and engineers with enough talent to solve these problems are likely building solutions for closed-source providers, and the momentum needed to inspire open-source clones or alternatives just hasn’t built up yet.
But an open-source contender, or at least a well-documented and easy-to-implement stack like the ELK stack for analyzing log data, would satisfy many of the must-haves for companies that want to avoid going all-in on a single observability vendor. They would have access to tools that solve specific operational issues while still being able to modify or extend components as needed and develop the source code with a community, which widens their access to talent and innovative thinking.
It’s likely an appealing vision for DevOps and SRE professionals who want to dip into the AI side of operations while staying true to their open-source advocacy. And if the forecasts prove accurate, we may be only a few years from a “gold rush” of open-source AIOps, driven by the ongoing need to wrangle increasingly complex infrastructure.