Automation holds the promise of helping to mitigate and manage system incidents in real time. But a new study finds that many organizations are still doing things manually.
It seems ideal to respond to any system glitches, outages, breaches, and other maladies in real time, if not prevent them altogether. Unfortunately, even in an age of sophisticated, self-healing, self-routing systems, predictive analytics, and automation, this is still not the case.
That is the finding of a recent Constellation Research analysis and survey of 317 enterprises, which reports that most responses to system issues are handled manually, resulting in delayed responses and fixes – to the tune of more than $100 million a year for some large companies.
Automation can bring incident management up to speed – but it is not there yet. “Even more eye-opening, 49% of those incidents are straightforward and repetitive, and can be automated away,” the report’s author, Andy Thurai of Constellation Research, states. “Most enterprises today are not set up to handle IT-related incidents or crises in real time.”
Avoiding incident responses through automation “is a key success ingredient for many digitally savvy organizations, yet 46% of the respondents, despite knowing how to reduce incidents automatically without needing to involve incident responders, haven’t done it yet,” Thurai writes.
In addition, “very simple automation of known solutions for known incidents could reduce human latency by fixing incidents as soon as they happen,” he notes. “Surprisingly, however, only a third of incidents are automatically fixed (35%, to be exact). The rest of them go through the manual incident response process, increasing the workload of IT support teams.”
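As a rough illustration of what "automation of known solutions for known incidents" might look like, here is a minimal Python sketch. It is not taken from the report: the registry `KNOWN_FIXES`, the `known_fix` decorator, the `disk_full` incident type, and the `handle` function are all hypothetical names, and the fix itself is a placeholder. The idea is simply that incidents with a registered fix are resolved immediately, while everything else is escalated to human responders.

```python
from typing import Callable, Dict

# Hypothetical registry: known incident type -> known fix.
KNOWN_FIXES: Dict[str, Callable[[dict], str]] = {}


def known_fix(incident_type: str):
    """Register a remediation function for a known incident type."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        KNOWN_FIXES[incident_type] = func
        return func
    return register


@known_fix("disk_full")  # assumed example of a repetitive, well-understood incident
def clear_temp_files(incident: dict) -> str:
    # Placeholder: a real handler would invoke the relevant tooling here.
    return f"cleared temp files on {incident['host']}"


def handle(incident: dict) -> dict:
    """Apply the known fix if one exists; otherwise escalate to responders."""
    fix = KNOWN_FIXES.get(incident["type"])
    if fix is None:
        return {"status": "escalated", "detail": "no known fix; routed to responders"}
    return {"status": "auto_resolved", "detail": fix(incident)}
```

In this sketch, a call such as `handle({"type": "disk_full", "host": "web-1"})` is resolved without a responder, while an incident type that has never been seen falls through to the manual process.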
To put it simply, IT managers are overwhelmed by incident reports and alarms. “Continuing with older IT service management, IT operations management, and incident management practices leads to too many alerts, creating alert fatigue,” says Thurai. In the survey, more than half of IT managers, 57%, indicate that they get more incidents than they can handle. Still, even automation can go awry. Manual or human error (cited by 42%) and automation error (cited by one in ten) are two issues that cause major incidents.
“A sizable percentage of major incidents (as high as 43%) could have been avoided if incident resolution had been properly automated,” the report states. “In other words, not only is incident resolution automation required but also the automation of incident resolution has to be done the right way so it won’t introduce new errors.”
Incident management and most incident responses “are still shockingly manual,” Thurai observes. “Manual response does not scale. In fact, a majority of the respondents chase every possible alert they can, are alert-fatigued from receiving too many alerts, or don’t have a proper site reliability engineering (SRE) team set up that can automate things.”
Automating to address at least the most mundane incidents – such as misconfigurations – “needs to be done as soon as the first-of-its-kind incident occurs,” Thurai advocates. “There is a possibility the incident can repeat. If those incidents can be automatically resolved, the incident response teams can spend time responding to other incidents, unknown unknowns, that have not been seen before.”