Reducing Complexity to Increase Availability

Business in the cloud is dynamically changing and any disruption in services has an immediate effect on customers. In this video, industry analyst Helen Beal interviews Phil Tee, CEO of Moogsoft, who talks about reducing complexity, providing early problem detection and root cause analysis for outage avoidance, and automating problem resolution so the MTTR is at a minimum.

Beal: Hello, and welcome to this RTInsights video with Phil Tee, CEO of Moogsoft. Today, we’re talking about availability because business in the cloud is dynamically changing and any disruption in services has an immediate effect on customers. With so many offerings in the cloud, it’s really easy for customers to switch based on their satisfaction. So Phil, tell us why is it important to minimize MTTR? And is it more actually MTD, or the mean time to discovery, that’s reduced? Or is it both that and the mean time to resolution?

Tee: It’s a very good question, and actually, in some circumstances, it depends upon the type of failure which one ends up being the bigger part of the availability restoration job. So, ultimately, MTTR, or the mean time to resolve, is the end-to-end time cost of fixing an availability threat, and really that’s what has the biggest impact on the customer. But kind of nested inside of MTTR you’ve got: Can I detect what’s going on? Can I work out who is responsible for resolving the problem? And when can I resolve it?

I guess there’s also the postmortems and the communications around that. And in actual fact, Moogsoft’s AIOps and our SaaS products are actually somewhat unique in that they straddle pretty much all of those stages in the problem resolution life cycle, and so we are sort of uniquely placed to help ultimately shape MTTR, which is where our customers effectively get their payback from the product.
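To make the nesting Tee describes concrete, here is a minimal sketch of MTTR broken into detection, assignment, and fix stages. The `IncidentTimeline` class, its field names, and the sample durations are hypothetical illustrations, not Moogsoft data or a Moogsoft API.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class IncidentTimeline:
    """Durations, in minutes, of the stages nested inside one resolution (hypothetical fields)."""
    detect_min: float   # fault onset until someone knows something is wrong
    assign_min: float   # working out who is responsible
    fix_min: float      # applying the fix once it is owned

    @property
    def total_min(self) -> float:
        # End-to-end time to resolve is the sum of the nested stages.
        return self.detect_min + self.assign_min + self.fix_min


def mean_times(timelines):
    """Average each stage and the end-to-end total across incidents."""
    return {
        "mean_time_to_detect": mean(t.detect_min for t in timelines),
        "mean_time_to_assign": mean(t.assign_min for t in timelines),
        "mean_time_to_fix": mean(t.fix_min for t in timelines),
        "mttr": mean(t.total_min for t in timelines),
    }


# Made-up sample values, purely for illustration.
sample = [IncidentTimeline(30, 10, 20), IncidentTimeline(10, 10, 40)]
summary = mean_times(sample)
```

Under this breakdown, the stage averages add up to the MTTR, so shortening detection alone lowers the end-to-end figure even if the fix itself takes just as long.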

Beal: Beautiful. And if we focus maybe on discovery, that one early bit, to begin with, in terms of detection, how early is it possible to detect?

Tee: Yeah, that’s a really good question. I kind of want to make a sort of a blanket excuse for the entire industry. It’s a really tough problem to solve, by the way. There’s a kind of data reduction and discovery problem, which is enormous by its nature. Modern applications and application infrastructure generate a lot of data. At a certain scale, lots of data: terabytes and petabytes of log messages, events, metrics, and so on and so forth. And what you’re hunting for is the early indicators. One thing we’ve always been extremely keen on is data in motion rather than at rest. So, in other words, don’t wait until you have all of the evidence, if you will, to diagnose what the problem is and where it is. Try and have an evolving view, a time-resident view, if you will, of the evolution of an incident.

And that’s also pretty unique and requires some significant bending of AI and machine learning algorithms, which is why we maintain an at-scale science and research division at Moogsoft. Because like I say, the problem is pretty difficult. But here’s the rub of it. We’ve had customers that think we get it so early that we’re actually being predictive. So, compared to the legacy systems, we may be 12 hours in advance, being able to pick up the real sort of pre-history of the evolution of an incident. And it’s also why we’re so keen on the convergence play as well, by which we mean handling metrics, time series, logs, and traces in the same platform as events and alerts. Because it pushes you earlier and earlier in the problem detection timeline.

Beal: So, is it fair to say that you’re effectively seeing patterns from the past being repeated again in the present that you know will lead to an outage? So you are basically predicting there’s going to be an incident?

Tee: We are, but we’re also doing something akin to the science fiction B movie from the 1950s. You don’t need to have seen an alien landing in Central Park to know something unusual is going on. So our algorithms are also very good at detecting novel, unusual patterns in the data and flagging alerts early in the life cycle as well. And I had to, of course, get an allusion to “The Day the Earth Stood Still.”
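The idea of flagging something unusual without having seen it before, on data in motion, can be sketched with a simple rolling z-score check. This is an assumed, generic technique chosen for illustration; the interview does not say which algorithms Moogsoft uses, and the class name, window size, and threshold below are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags a value that deviates strongly from a sliding window of recent values."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.values = deque(maxlen=window)  # only recent history is kept
        self.threshold = threshold          # z-score above which a value is unusual
        self.min_samples = min_samples      # stay quiet until there is some history

    def observe(self, x):
        """Score x against the window seen so far, then add it to the window."""
        unusual = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0:
                unusual = abs(x - mu) / sigma > self.threshold
            else:
                # A perfectly flat history: any change at all is novel.
                unusual = x != mu
        self.values.append(x)
        return unusual


# Made-up latency readings (ms): steady traffic, then a sudden spike.
detector = RollingAnomalyDetector(window=10, threshold=3.0, min_samples=5)
readings = [100, 101, 99, 100, 102, 98, 100, 150]
flags = [detector.observe(r) for r in readings]
```

Each reading is scored as it arrives rather than after the full data set is collected, which is the "evolving view" contrasted with data at rest. No prior example of the spike is needed for it to be flagged.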

Beal: Lovely. You mentioned this kind of concept of evolution and this idea that lots of things, or it felt like lots of things, are happening over time that are leading to something that ultimately becomes catastrophic in some way for the customer experience. It made me think about the complexity of the world that we live in, and you talked about that certain level of complexity around data, but can you describe in a little bit more detail the difficulties that people are facing in their environments today and what those different levels of complexity are?

Tee: Yeah. I mean, when I started out in this industry, when you learned a new language or an application development paradigm, famously, you always got to the Hello World problem first. So whether you’re learning C or Java or whatever, Hello World is a pretty simple, basic program that you can type out in 30 seconds, and away you go. You’ve got your first application running. To do “Hello World” as a cloud service, you are talking about layers and layers of coding. So yes, you’ve got to write the application. Then you’ve got to have the image for the container it runs in. Then you’ve got to have the service definitions in Kubernetes, which usually involves something like a Helm chart. You need the Terraform behind the scenes to lay out all of the infrastructure.

I mean, by the time you’ve done the sort of minimum viable Hello World, it’s a pretty significantly complex, multi-layered system to deliver that SaaS experience of Hello World to the putative customer. So it is incredibly more complex. It is one of the reasons why we decided some years back that we wanted to extend our reach into this space, because we originally got off the ground because scale equals complexity in the larger organizations, but actually, scale doesn’t equal complexity in the SaaS world anymore.

Complexity is kind of baked into the cake, as it were. It’s just a feature, and it’s a huge challenge. It’s also a huge opportunity because people don’t do complexity for no reason. The complexity is there to give you all of the benefits of horizontal scalability, reliability, ease of use, ease of control, the different engineering paradigm, et cetera. I dare say, but ultimately it’s part of the backdrop.

Beal: I think in our world as well, because the nature of our systems is generally that they’re quite invisible, we rely on monitoring tools to tell us what is going on. We can be getting alerts from several systems that would indicate that the problem is coming from several systems, but it may not be; it may just be impacting different systems. So what can you do to help us manage those levels of sophistication in our systems?

Tee: I mean, it comes down to the many years we’ve spent honing techniques of correlation ultimately in our platform, and you’re quite right, Helen. I mean, ultimately, failure is not a single individual event anymore, and indeed, I mean, it’s kind of, again, part of the culture of building SaaS microservices that you engineer in failure. You assume that things are going to fail, so oftentimes, what we really mean by that are degradations in service rather than complete sort of decapitation of the service.

But ultimately, you need to be able to pull the signals from a very wide range of different sources and be able to spot the commonality. In essence, the core of our correlation tends to be all about being able to spot the commonality across very diverse platform formats, very diverse types of data from diverse sources. So you’ve got to be able to be wide, and you’ve got to be flexible enough to be able to spot those similarities.

So automation, there’s sort of two halves to automation [inaudible 00:11:34] and of course, one of the great pleasures of having moved to the, I kind of live in the automation nation really. I mean, this is kind of where it all sort of came to be with the mass production line and so on and so forth, but there are kind of two halves to automation. One is kind of robotizing the response. So instead of having to have somebody go and log into a box and do some stuff, or reboot this, or commission this bit of storage, or whatever it is.

Having the execution framework that goes and makes all of that happen from a single or very small restricted number of, kind of push buttons that kick it off. That’s one-half of it. We don’t, to be clear, specialize in the robotization side, but the robotization side is highly dependent upon how accurate the button pushing is on the initiation side of it.

That’s where we play. So if you think of, forget about automation notification systems, where somebody gets a page at the other end of it. If what you’re doing is you are kicking off that page for every alert that you get into the system, you get page fatigue. If, on the other hand, you’re doing it when you get a correlated incident, which may capture tens, hundreds, thousands of alerts, tens of thousands of events, millions of metrics behind it, you are funneling that down into a much more actionable initiation point for the robotic part of automation.
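The funneling Tee describes, from many alerts down to a few correlated incidents that each trigger one page, can be sketched as below. The correlation rule here (same `service` field, arriving within a time gap) is a deliberately simple assumption for illustration; it is not Moogsoft’s correlation method, and the `Alert` and `CorrelatedIncident` types are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    timestamp: float   # seconds since some reference point
    service: str       # the attribute alerts are assumed to have in common
    message: str


@dataclass
class CorrelatedIncident:
    service: str
    alerts: list = field(default_factory=list)

    @property
    def last_seen(self) -> float:
        return self.alerts[-1].timestamp


def correlate(alerts, gap_seconds=300):
    """Group alerts that share a service and arrive within gap_seconds of each other."""
    open_incidents = {}   # service -> incident currently collecting alerts
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = open_incidents.get(alert.service)
        if current is None or alert.timestamp - current.last_seen > gap_seconds:
            # Nothing recent for this service: start a new incident.
            current = CorrelatedIncident(service=alert.service)
            open_incidents[alert.service] = current
            incidents.append(current)
        current.alerts.append(alert)
    return incidents


# Made-up alerts: a burst on one service, one alert on another, and a later recurrence.
alerts = [
    Alert(0, "checkout", "latency high"),
    Alert(30, "db", "connection pool exhausted"),
    Alert(60, "checkout", "error rate rising"),
    Alert(120, "checkout", "timeouts"),
    Alert(1000, "checkout", "latency high"),
]
incidents = correlate(alerts)
pages_per_alert = len(alerts)        # paging on every alert
pages_per_incident = len(incidents)  # paging once per correlated incident
```

Paging once per incident rather than once per alert is what makes the trigger for the robotic half of automation more actionable. A real system would look for commonality across far more diverse data than a single shared field.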

Beal: Fabulous. Thanks so much for your time today, Phil. I think the audience has learned a lot about how to achieve high availability in sophisticated environments that are continually evolving. So thank you very much.

Tee: Excellent. My pleasure.
