For Alaska Airlines’ website, one way to keep systems running is to break them on purpose using chaos engineering and observability.
Observability in the IT space is valued for its ability to help operations teams respond to incidents. Such problems range from poor response times up to complete outages. But observability also can be a useful tool when the technical team wants to create chaos and break things on purpose.
That’s what the site reliability engineers (SREs) at Alaska Airlines do with Sumo Logic’s observability capabilities when the team embarks on chaos engineering. They break the website on test systems to learn how to respond if and when an incident hits the airline’s e-commerce website in the full production environment.
The combination of Sumo Logic software and Kubernetes supports the team’s Chaos Game Days. The experiments that they embark on with chaos engineering have already delivered benefits when things did go wrong in real life.
Bryce Lindsey, senior site reliability engineer on the SRE team supporting Alaskaair.com, described the airline’s use of Sumo Logic, Kubernetes, Microsoft Azure, and chaos engineering in a breakout session, Alaska Airlines’ Observability Journey, during Sumo Logic’s recent Illuminate conference.
Like so many other organizations, Alaska Air has been moving its operations to modern architectures and intelligent systems. “Historically, we had the monolith that is the website running on servers in data centers,” said Lindsey. However, in the past few years, new applications have been hosted with Azure. In addition, SRE team members who were “passionate about Kubernetes” started a shift to that open-source system to automate the deployment and management of containerized applications.
The website handles a wide variety of business functions, including customer-facing applications like flight search, booking, seat selection, and trip management. Some applications are internal and still hosted locally, while access to others is via service providers. Regardless of where an application originates or resides, if the site goes down, it quickly becomes a business-critical matter. A site outage may impact up to 10,000 guests during business hours.
The SRE team set up a response structure based on service-level agreements and the nature of an issue. If a non-critical issue pops up during non-business hours, it may generate an alert for the SRE team to follow up the next business day. It doesn’t have to be fixed in a rush at, say, 2 a.m. A product team handles more serious issues appearing in the off-hours, and the SREs tackle anything that arises during the workday.
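The routing rules described above can be expressed as a small decision function. This is only an illustrative sketch, not the airline’s actual tooling: the function names, the returned labels, and the definition of business hours (09:00 to 17:00, Monday through Friday) are all assumptions made for the example.

```python
from datetime import datetime

# Assumption for illustration: business hours are 09:00-17:00, Monday to Friday.
BUSINESS_START_HOUR = 9
BUSINESS_END_HOUR = 17


def is_business_hours(when: datetime) -> bool:
    """Return True if `when` falls inside the assumed business hours."""
    is_weekday = when.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and BUSINESS_START_HOUR <= when.hour < BUSINESS_END_HOUR


def route_incident(critical: bool, when: datetime) -> str:
    """Pick a responder following the rules described in the article.

    The returned labels are hypothetical names, not real queue identifiers.
    """
    if is_business_hours(when):
        # SREs tackle anything that arises during the workday.
        return "sre_now"
    if critical:
        # More serious off-hours issues go to a product team.
        return "product_team_now"
    # Non-critical off-hours issues wait for SRE follow-up the next business day.
    return "sre_next_business_day"
```

For example, under these assumptions a non-critical alert at 2 a.m. on a Wednesday would be routed to `sre_next_business_day`, while the same alert at 11 a.m. would go straight to the SREs.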
Sumo Logic tools typically detect an issue within 10 minutes of it occurring. Depending on the nature of the incident, the time to resolution tends to be anywhere from a few minutes to several hours.
Introducing chaos and observability
So, how does chaos engineering keep the time for resolution tight?
What the SRE team discovered was that too often engineers were faced with fixing problems that they had never encountered before. Sometimes that was because a system or application was new, or it was new to that engineer. In either case, mitigation took time.
The SRE team wanted to explore the ways that a system could fail, the varied causes for failure, and how it could be fixed quickly.
Thus, the Chaos Game Day was introduced. During each two-hour session, engineers may run a dozen experiments. Each experiment involves a potential problem. Leaders ask team members to develop a hypothesis for what could happen and how the issue could be addressed. Then, the problem to be tested is injected into the system, and the team gets to see whether their hypothesis was correct.
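The hypothesis, inject, and observe loop of a game-day experiment can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the team’s actual harness: the `Experiment` class, the dictionary standing in for a test system, and the health check are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Experiment:
    name: str
    hypothesis: str                             # what the team predicts will happen
    inject: Callable[[Dict[str, bool]], None]   # introduces the fault
    expected_healthy: bool                      # predicted health after the fault


def page_is_healthy(system: Dict[str, bool]) -> bool:
    # Toy health check: the page works only if every dependency is up.
    return all(system.values())


def run_experiment(exp: Experiment, system: Dict[str, bool]) -> bool:
    """Inject the fault into a copy of the test system and report
    whether the observed outcome matched the hypothesis."""
    test_system = dict(system)  # never mutate the baseline
    exp.inject(test_system)
    observed = page_is_healthy(test_system)
    return observed == exp.expected_healthy


def take_down_cache(system: Dict[str, bool]) -> None:
    # Hypothetical fault: mark the cache dependency as down.
    system["cache"] = False


baseline = {"cache": True, "database": True}
experiment = Experiment(
    name="cache outage",
    hypothesis="The page survives a cache outage",
    inject=take_down_cache,
    expected_healthy=True,
)
hypothesis_held = run_experiment(experiment, baseline)
```

In this toy run the hypothesis does not hold, because the simulated page depends on the cache. That kind of surprise is what the team would then write up.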
“There are lots of different ways to break things,” said Lindsey. “The idea is that we break the systems or their dependencies on purpose to learn how they break, how they are impacted, and how they work.”
He noted that many things can cause issues, including a CPU outage, insufficient memory, or a network issue. So, each experiment run on a game day provides a glimpse into what could cause a problem. It also looks at the potential business and operations impacts, and how the break should be repaired.
Sumo Logic’s observability functions provide insight into how an issue could progress. Sumo’s Global Intelligence for Kubernetes Service suggests how to fix it. Lindsey noted that the chaos experiments occur on test systems, not in production.
Chaos in the real world
At the close of each experiment, the team develops a “runbook.” That document is to be kept on hand and used as a reference to recognize and address future incidents.
One experiment in the first game day involved a look at what would happen if the website’s Redis cache failed. Redis is an open-source in-memory data store.
“We started with taking down the Redis cache, and discovered that it broke the home page,” said Lindsey. The team wrote up a runbook detailing what it looks like when Redis goes down, and the steps to take. “Three weeks later, it did go down in production. We went to the runbook and had it fixed in 10 minutes. Previously it would have taken hours.”
The Global Intelligence for Kubernetes Service also provides Alaska Air with a chance to optimize performance and ensure efficiency. The color-coded dashboards allow the SRE team to look at how resources are being allocated and used in Kubernetes.
The team examined the resources being used by the flight search application on a test platform, based on the previous eight days of activity. The dashboard showed the application had been configured to ask for 40% more CPU capacity than it needed. That represented a waste of money, according to Lindsey. “But we were asking for 40% less memory than we needed. So we were going to be having a bad day, but now we’re going to be good,” he added.
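The arithmetic behind that finding is a simple comparison of what an application requests against what it actually uses. The sketch below is illustrative only: the function name and the sample numbers are hypothetical, chosen so that the results match the 40% figures Lindsey cited, and they are not taken from the airline’s dashboards.

```python
def provisioning_gap(requested: float, used: float) -> float:
    """Return the fractional gap between requested and used resources.

    Positive means over-provisioned (paying for unused capacity);
    negative means under-provisioned (at risk of running out).
    """
    if used <= 0:
        raise ValueError("used must be positive")
    return (requested - used) / used


# Hypothetical sample values, picked only to reproduce the 40% figures.
cpu_gap = provisioning_gap(requested=1400, used=1000)     # millicores
memory_gap = provisioning_gap(requested=600, used=1000)   # MiB

print(f"CPU gap: {cpu_gap:+.0%}")
print(f"Memory gap: {memory_gap:+.0%}")
```

A positive CPU gap points to wasted spend, while a negative memory gap is the more urgent signal, since an application that needs more memory than it requested is likely to fail under load.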