Observability for SREs, DevOps and IT operations allows teams to focus on developing better services with superior customer experience.
Observability with AIOps can complement the work of site reliability engineers and IT service management while helping operators stop chasing tickets and start focusing on what matters to the business: customer experience.
IT service management (ITSM) was the conventional way to manage IT services. Yet there are more and more claims that, in the age of DevOps and Agile, ITSM is obsolete and no longer fit for its purpose.
There is some truth to this attitude. Some of the original assumptions behind ITSM no longer hold, at least not in their original sense. Because of the way legacy applications and IT systems operate, there is a built-in expectation that a single incident relates to a single event: one thing went wrong. The guiding presumption is that the thing that went wrong has an owner who is responsible for it, so the incident is assigned to that person to investigate and remediate.
The old days of ticketing
Over the last few years, and certainly since the last major review of the IT Infrastructure Library (ITIL) in 2011, infrastructures have become more complex, and they are still changing at increasing speed.
From a practical perspective, it is no longer true (if it ever was) that a single event describes an entire incident. With modern systems, it is rare for incidents to be triggered by a single failure. Root causes are more frequently tied to multiple issues occurring together in some unexpected combination. This means there is no single owner of the resulting ticket.
Solving the puzzle
With multiple operating issues at play, site reliability engineering (SRE), IT Ops and other teams should be quite concerned that there is no single owner of the resulting ticket. A major roadblock to investigation and remediation is that multiple incident tickets may be created and routed to different teams, who each receive only a single part of the puzzle.
It can take teams a long time to figure out how these separate records relate to each other, what the overall impact is, and who is ultimately responsible for fixing the problem. This problem is often quantified with a measure called Mean Time to Innocence (MTTI), which describes how long it takes to determine whether something is a cause or a symptom.
These issues make for a bad experience for the IT professionals caught up in the “catch and dispatch” process of trying to deal with a constant torrent of incidents, only some of which are real and only some of which are their responsibility. They are also bad for the users of the business services, which are the reason for all of this activity in the first place.
A modern approach to making tickets useful
Emerging approaches adapt ITSM to these new realities. One of the most interesting is “swarming.” Under that concept, ad hoc virtual teams come together on the fly across organizational and technological boundaries to deal with incidents that cut across their respective areas of responsibility.
Observability with AIOps is designed to efficiently deal with this cross-functional challenge. It processes all telemetry and operational data with algorithms to filter the event storm, and to build a “meta-ticket” that encapsulates all the different aspects of a particular incident.
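To make the “meta-ticket” idea concrete, here is a minimal sketch in Python. It is not how any particular AIOps product works. It assumes that events arriving close together in time belong to the same incident, which is a deliberate simplification; real systems typically also draw on topology, text similarity and learned patterns. The names `Event`, `MetaTicket`, `build_meta_tickets` and the 120-second window are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One alert from a monitoring source (all fields are illustrative)."""
    timestamp: float  # seconds since an arbitrary start
    source: str       # service or host that raised the alert
    check: str        # name of the failing check
    team: str         # team that would normally receive the ticket


@dataclass
class MetaTicket:
    """A single record that gathers every facet of one incident."""
    events: list = field(default_factory=list)
    suppressed: int = 0  # repeated alerts filtered out of the event storm

    def add(self, event: Event) -> None:
        # Filter noise: a repeat of the same (source, check) adds no new facet.
        if any((e.source, e.check) == (event.source, event.check)
               for e in self.events):
            self.suppressed += 1
        else:
            self.events.append(event)

    @property
    def teams(self) -> list:
        # Every team with a piece of the puzzle, i.e. the candidate "swarm".
        return sorted({e.team for e in self.events})


def build_meta_tickets(events, window=120.0):
    """Group events into meta-tickets by time proximity (an assumption).

    An event joins the current meta-ticket if it arrives within `window`
    seconds of the previous event; otherwise it opens a new one.
    """
    tickets = []
    last_seen = None
    for event in sorted(events, key=lambda e: e.timestamp):
        if tickets and event.timestamp - last_seen <= window:
            ticket = tickets[-1]
        else:
            ticket = MetaTicket()
            tickets.append(ticket)
        last_seen = event.timestamp
        ticket.add(event)
    return tickets


if __name__ == "__main__":
    storm = [
        Event(0, "db-1", "disk_latency", "storage"),
        Event(5, "db-1", "disk_latency", "storage"),   # repeat, gets filtered
        Event(30, "api", "error_rate", "backend"),
        Event(45, "lb", "5xx_rate", "network"),
        Event(4000, "batch", "job_failed", "data"),    # unrelated, much later
    ]
    for ticket in build_meta_tickets(storm):
        print(ticket.teams, len(ticket.events), ticket.suppressed)
```

In this sketch, the first four alerts collapse into one meta-ticket that names three teams, instead of three or four separate tickets routed to each team in isolation. The later batch failure becomes its own record.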
There is also a growing interest in two-tier models, with observability-based systems of engagement to complement the ITSM system of record. A system of engagement is more agile, flexible, and self-organizing. Therefore, it’s better able to deal with rapidly evolving situations, while the system of record is where everything gets documented and other processes pick up on that documentation.
Leverage new benefits of observability-based ticketing
Now is a good time to consider a modern iteration of ITSM using observability-based ticketing. There is a clear need to evolve capabilities that can continue supporting business requirements in the future. We must move beyond the legacy notion of ticket-driven incident management with a single owner and a sequential process.
By pairing observability with AIOps, your teams will be able to extend that model and complement it with other tools and techniques. As a result, organizations will see better quality of IT service, lower costs for DevOps and IT Ops, and less impact on the personal lives of operational staff, who are constantly on call for incidents that all too often turn out not to be their problem.