Addressing Modern Cloud App Problems with Observability and AIOps

Increasingly, the industry is migrating from monitoring to observability and solutions that use AI to assist in managing alerts and correlating incidents.

The complexity of modern cloud applications makes it hard to detect problems in the making and to find the root causes of issues that affect users. Traditional monitoring and tracing solutions produce a glut of data and alerts that can overwhelm operations staff and SREs. At best, the data they produce can be used after an incident to try to find the cause of a problem.

Recently, RTInsights sat down with Phil Tee, Chairman, CEO, and Co-Founder of Moogsoft, to talk about common problems with modern cloud applications, the shortcomings of traditional tracing and monitoring solutions in detecting them, and the need for AIOps and observability. Here is a summary of our conversation.

RTInsights: In modern cloud applications, what are some of the most common problems that lead to downtime, service disruptions, and poor performance? 

Tee: It’s been an interesting journey going from monolithic applications to the microservices intrinsic to the design of a modern SaaS platform. It may seem to be a trite thing to say, but common problems are often due to errors, particularly unforced errors in either the design of the microservices or the new code pushes that cause these issues.

When you think about how a service is composed of microservices and how an application works, you’re decomposing the functionality into independent operating units that interface with each other through a defined interface. The sort of edge cases that arise are very difficult to foresee and test for.

For example, in times of very, very high load, you might find that there’s one microservice that just can’t keep up with other services in the application as a whole. As a result, it starts to run slow. It’ll start to build up queues on a message bus, and the whole thing will eventually collapse. That’s often the case if you haven’t really considered how the application performs under horizontal scaling.
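The queue buildup described above can be illustrated with a few lines of arithmetic. This is a minimal sketch, not a model of any real message bus: the function name `simulate_backlog` and the per-second rates are hypothetical, chosen only to show that a backlog grows without bound once arrivals outpace what one service can process.

```python
def simulate_backlog(arrival_rate, service_rate, seconds):
    """Return the queue depth at the end of each second.

    arrival_rate and service_rate are messages per second (assumed,
    illustrative values). Depth can never drop below zero.
    """
    depth = 0
    history = []
    for _ in range(seconds):
        depth = max(0, depth + arrival_rate - service_rate)
        history.append(depth)
    return history


# Normal load: the service keeps up, so the queue stays empty.
normal = simulate_backlog(arrival_rate=80, service_rate=100, seconds=5)

# Very high load: the same service falls 30 messages further behind each second.
overload = simulate_backlog(arrival_rate=130, service_rate=100, seconds=5)

print(normal)    # [0, 0, 0, 0, 0]
print(overload)  # [30, 60, 90, 120, 150]
```

In this toy model the backlog under overload never recovers on its own, which is the slow collapse Tee describes.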

Ultimately one thing that hasn’t changed is that it’s the scenarios that you do not anticipate that catch you out. In a microservice-composed SaaS application, it is often the interplay between individual microservices that is at the heart of the issue. And the key, in terms of being able to deal with that, is high-quality observability of the application as a whole.

RTInsights: Why are these problems hard to detect using traditional tracing and monitoring solutions? 

Tee: I characterize the level of sophistication of most tools being used in the modern observability space as straight out of the 1990s. They gather some metrics, stick a threshold on those metrics, and generate an alert when the threshold is exceeded. The world’s just not that simple anymore.
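A minimal sketch of the metric-plus-threshold approach Tee describes might look like the following. The function name, the two latency series, and the 90 ms limit are all hypothetical, and no specific monitoring product is being reproduced.

```python
def threshold_alerts(samples, limit):
    """Return the indices of samples that exceed a fixed limit."""
    return [i for i, value in enumerate(samples) if value > limit]


# Hypothetical latency readings in milliseconds for two services.
checkout_latency_ms = [20, 21, 19, 20, 70, 72, 71]   # jumps from ~20 to ~70
batch_latency_ms = [95, 96, 94, 95, 97, 96, 95]      # steady, normally ~95

LIMIT_MS = 90  # one static threshold applied to both services

checkout_alerts = threshold_alerts(checkout_latency_ms, LIMIT_MS)
batch_alerts = threshold_alerts(batch_latency_ms, LIMIT_MS)

print(checkout_alerts)  # []
print(batch_alerts)     # [0, 1, 2, 3, 4, 5, 6]
```

With these made-up numbers, the service whose latency more than triples raises no alert, while the service behaving normally alerts on every sample. That is one way a fixed threshold can both miss a change and add noise.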

There’s a requirement for these tools to get much more sophisticated in their anomaly detection and correlation. This sophistication is necessary for quality correlation because of the volume of data and also the complete absence of a definitive reference model of how everything interacts. As such, from our perspective, the problem with traditional solutions is the absence of AI in operations.

RTInsights: What’s needed to identify problems? 

Tee: In most cases, tools use the equivalent of high school statistics and maybe high school AI, in the sense of linear regression or something similar. This makes identifying problems much harder than it needs to be. What is needed are AI algorithms, correlation techniques, and next-generation anomaly detection techniques. A solution also massively benefits from the normalization of the data, so you have context. Most particularly, the tool must be able to work on data in motion to detect and identify emergent correlations that, ultimately, downstream will turn into a problem.
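To make "working on data in motion" concrete, here is a small sketch of a detector that evaluates each sample as it arrives rather than after the fact. It uses a rolling z-score, which is exactly the kind of basic statistic Tee calls insufficient, so it should be read as an illustration of the streaming shape only, not of the AI or correlation techniques he refers to or of Moogsoft's product. The class name, window size, and z-score limit are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingDetector:
    """Flag a sample that deviates sharply from the recent window."""

    def __init__(self, window=5, z_limit=3.0):
        self.window = deque(maxlen=window)  # holds only the most recent samples
        self.z_limit = z_limit

    def observe(self, value):
        """Score one incoming sample, then add it to the window."""
        anomalous = False
        if len(self.window) == self.window.maxlen:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            # Skip scoring when the window is perfectly flat (sigma == 0).
            if sigma > 0 and abs(value - mu) / sigma > self.z_limit:
                anomalous = True
        self.window.append(value)
        return anomalous


detector = RollingDetector(window=5, z_limit=3.0)
stream = [20, 21, 19, 20, 22, 21, 70, 20]  # hypothetical latency samples
flags = [detector.observe(v) for v in stream]

print(flags)  # [False, False, False, False, False, False, True, False]
```

The detector learns its baseline from the stream itself, so the jump to 70 is flagged the moment it arrives. Correlating such signals across many services is the harder problem the interview points to, and it is outside the scope of this sketch.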

RTInsights: How does Moogsoft help companies become more proactive and address problems in the making before they cause downtime?

Tee: You need a system that doesn’t do root cause analysis in a post-mortem mode. It’s about being able to evolve your understanding of the state of the system in real time, as opposed to waiting for a complete set of evidence before you deduce that the world is on fire. That is why you need AIOps. There are very few of us who do it that way. Most of the tools that claim to do it that way do not; most of them are very similar to the platforms I built in the 1990s, but perhaps with the addition of SaaS delivery. It makes me sad to contemplate people replacing Netcool with essentially Netcool in the cloud. The outcome will be the same: downtime.
