RTInsights editor Joe McKendrick and Sumo Logic CTO and Co-Founder Christian Beedgen discuss the changing requirements for meeting the availability and reliability requirements of today’s demanding users and customers.
In this RTInsights Real-Time Talk podcast, Joe McKendrick, Industry Insights Editor at RTInsights, and Christian Beedgen, CTO and Co-Founder at Sumo Logic, discuss how new ways of thinking must be applied to reliability given the complexity of modern applications. What is needed is to move beyond traditional monitoring to observability, and to focus not on pure IT metrics but on the impact availability and reliability have on the customer experience.
About the Continuous Intelligence Insights Center:
From real-time fraud prevention to enhanced customer experience to dynamic energy load balancing, businesses of all types and sizes are realizing the benefits of Continuous Intelligence, helping them make decisions in real-time while events are happening.
Where do you begin? What are the key requirements? RTInsights’ Continuous Intelligence Insights center, sponsored by Sumo Logic, brings together the latest insights and advice on continuous intelligence to answer these questions and more.
Read the podcast transcript:
McKendrick: Welcome everyone to our RTInsights podcast series on continuous intelligence. This is Joe McKendrick, analyst with RTInsights, and I’m very pleased to have here with me Christian Beedgen, chief technology officer and co-founder of Sumo Logic. He’s joining us for another session. If you tuned in, we wrapped up the previous session on the ins and outs of digital transformation.
Christian, welcome back. Good to see you again.
Beedgen: Thanks again for having me.
McKendrick: When we talk about digital transformation, there’s a lot on IT’s plate. Reliability is a key factor. It’s been a challenge and concern to IT professionals since the mainframe days. With respect to the reliability factor, are we making progress? Are things better off than they were 40 years ago, even five years ago or two years ago? What’s the state of reliability in IT operations these days?
Beedgen: I think that the art has evolved quite a bit, but so have the challenges. This is one of those topics where things will never be perfect. Right? The way that I look at it is that, with digital transformation, business success is now predicated on applications through which the business is conducted.
Applications in today’s world are obviously delivered worldwide over the internet, 24/7, by and large via the cloud. Cloud has made it possible to get to the point where you can reasonably believe you can deliver a system that is actually up 24/7 and that delivers good service to many users via the edge. We can now use these applications everywhere in the world, versus in the mainframe days.
Applications have become a lot more complicated because of this, and that has elevated the requirements around reliability and security. The art of keeping systems reliable is also catching up, with culture changes, new roles emerging, and so forth. But as things stand today, I think people understand that ultimately, if your business is digital, digital experience matters.
First, the digital experience must be good enough. If it just plain sucks, then you will not be successful. There’s a minimum bar that you need to hit, but I would also argue that is not even good enough. Ideally, you would have turned your digital aptitude in terms of experience, into a competitive advantage.
I was an intern in the late ’90s at Amazon. What always stuck with me was the maniacal customer focus they had on creating these experiences, which in their case are entirely digital. That extended all the way down to the way the catalogs are designed, making it easy for you to buy, and ultimately shipping it to you for free the next day. This type of stuff, and all the micro things and the many, many layers, lead to this being a pleasant experience. If it is not, you can just return everything. The digital implementation behind everything is very, very slick.
That’s the importance of the experience and turning it into a competitive advantage. So, the reliability of your system is therefore important, but you can also potentially use it to be competitive.
So, if I’m Walmart and I’m trying to compete with Amazon, and my site is just flat out not reachable, which could be for all sorts of issues, including, I don’t know, God forbid, expired SSL certificates, customers are just going to go back to Amazon, or go to Amazon in the first place and never come back. So, reliability is therefore a large part of making sure the store is open.
McKendrick: I did a study a couple years ago of database administrators, the admins. We asked them personally, what has been your personal experience with eCommerce? Most of them came back and said if they’re stuck on a site too long or if the site hangs or if there’s too much of a delay in responsiveness, they’ll leave the site.
And these are the people that understand what’s going on behind the scenes. You think they’d be a little more sympathetic, but they are not tolerant of lag time. I heard somewhere else that it might have been about seven seconds. If you’re stuck seven seconds waiting for a response from an eCommerce vendor, it’s over.
Beedgen: And these people are probably not even the TikTok generation that we have now. Seven seconds feels like an eternity, but yes. I look at the next level of detail under reliability. There are basically two things: availability and performance. Availability is about the customer’s ability to get to you. Think about this: it’s whether the site is up or down. But the reality is that it’s almost never that binary.
That’s also why it’s becoming more complicated. You are all over the world, and there are different networks, data centers, edge locations, and so forth delivering the experience. You might be up everywhere except for, I don’t know, Southeast Asia. That’s still a part of the total availability picture. You must understand that there are many more complicated definitions of availability than just up and down.
Then of course, you don’t just run one site, or maybe your site has different departments or different aspects to it. You almost never have a hard down, and if you do, that’s pretty bad. But while a hard down almost never happens, you’re almost always down somewhere.
Even with tooling, an understanding of reliability, and monitoring, these things just happen. Suddenly there was a tornado yesterday in Texas. I just happened to move here and boom, we have a tornado. I did not expect that. Yet, there could be a tornado at any point in time somewhere in the US.
This might be stretching the metaphor a little bit, but there is the other part: performance. That goes back to your seven seconds or your DBA example. Performance in that sense is not necessarily a formula, and it’s not about being faster than everybody else. It is: are you fast enough? Meaning, does your catalog search come back fast enough? Otherwise, after a couple of seconds, people will just go somewhere else. A classic example is when everything works and then a customer wants to check out. You need to hand the transaction over to a payment gateway that’s not under your control. If something goes wrong, then suddenly the thing is spinning for 30 seconds. At that point, you’ve probably lost a customer.
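The payment-gateway scenario is the classic argument for putting a client-side timeout on a dependency you don’t control, so the checkout fails fast instead of spinning. A minimal sketch in Python, where the gateway URL, timeout value, and function name are all hypothetical:

```python
# Sketch: bound a call to an external payment gateway with a timeout,
# so a slow dependency can't leave the checkout spinning for 30 seconds.
# The URL, timeout value, and function name are hypothetical.
import urllib.error
import urllib.request


def charge(gateway_url: str, timeout_s: float = 2.0) -> bool:
    """Return True on a successful gateway response, False on error or timeout."""
    try:
        with urllib.request.urlopen(gateway_url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False  # fail fast: show an error or retry, don't spin


# Nothing listens on local port 1, so this returns False quickly
# (connection refused) instead of hanging:
print(charge("http://127.0.0.1:1/", timeout_s=0.5))
```

In a real checkout flow, the `False` branch is where you would fall back to a secondary gateway or queue the payment for retry, rather than letting the customer stare at a spinner.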
The complication here is that you don’t just have to monitor one path, especially if it’s a worldwide business. In this example, there are different gateways depending on where you are in the world, and so things just become complex. This is why I think, ultimately, inventiveness and innovation will always outpace reliability and security. In the end, reliability is an art.
McKendrick: I love your terminology, that it’s an art, reliability is an art. A lot of people think of it as a hardcore engineering-science type of problem, but it’s an art. I love that. What are some best practices you would recommend to ensure greater reliability? What are some of the best approaches you’ve seen?
Beedgen: Generally, the art (in the past) has basically been to monitor resource consumption and use. Resource consumption being the load on things like CPU, memory, and all these classic metrics. That would make sense if your app were just running on one server or system. But now your applications and services are delivered over hundreds or maybe thousands of random EC2 cloud instances, or Microsoft Azure instances, or maybe some mix of all of that. Ultimately, the user doesn’t care what resources you use. However, your monitoring system might care, and wake up some poor engineer who then has to, in the middle of the night, try to figure out what’s going on.
Going back to the practice of the art: much of this has evolved from the experience the web-scale companies have had over the last 20 years. Google has publicly written quite a bit on site reliability engineering, and how the way to monitor large, complicated sites is, instead of chasing down every resource exhaustion or hot CPU, to just accept a level of unreliability and apply an SLI/SLO [service-level indicator/service-level objective] approach.
That approach basically says that perfect is the enemy of the good. Things are too complex. You will not be able to maintain 100% of anything. So, figure out how to measure a bunch of things that are really important: basically, have SLIs that are what we call customer centric.
For example, in the case of a catalog search or shopping cart checkout, basically say, “hey, we want this particular thing coming back within no more than 500 milliseconds, not a hundred percent of the time, but a hundred minus a very, very small percentage of the time, meaning 99.9 or 99.99.”
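To make that example concrete, here is a minimal sketch of such a customer-centric latency SLI in Python. The 500-millisecond threshold and 99.9% target are the numbers from the example above; the sample data and function name are hypothetical:

```python
# Sketch: a latency SLI computed from a batch of request durations.
# The 500 ms threshold and 99.9% target are the example numbers from
# the conversation; the data and function name are hypothetical.

def latency_sli(durations_ms, threshold_ms=500.0):
    """Fraction of requests that came back within the latency threshold."""
    if not durations_ms:
        return 1.0  # no traffic: trivially compliant
    good = sum(1 for d in durations_ms if d <= threshold_ms)
    return good / len(durations_ms)


# 10,000 catalog searches, 5 of them slower than the threshold
durations = [120.0] * 9995 + [900.0] * 5
sli = latency_sli(durations)
slo_target = 0.999  # "a hundred minus a very, very small percentage"

print(f"SLI: {sli:.4f}, meets 99.9% SLO: {sli >= slo_target}")
# -> SLI: 0.9995, meets 99.9% SLO: True
```

The point is that the SLI measures what the customer sees (did my search come back in time?), not what any individual VM’s CPU is doing.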
As you manage to hit that reliably, then get better at it. Try to boil the level of complexity down. Instead of having, say, 10 different things times a thousand VM instances that you need to monitor, boil it down to a bunch of customer-centric things, and maybe a second layer where you look at something like the golden signals for each service. But basically, simplify down to fewer things that matter more to what your customer is actually experiencing, versus what some mechanistic worldview would suggest you should do. Then accept that it’s going to be very good, but not perfect. Don’t wake people up in the middle of the night.
If you still have what’s called, in this particular methodology, error budget left, you’re okay. Over a 30-day window, I am okay with a very small percentage of requests failing, let’s say. So, I’m not going to freak out if one out of a thousand requests fails. I’m just making these numbers up. That’s just the nature of the beast. I could potentially prevent that, but then I’m going to spend 10x or 100x more on engineering for that, versus building new features and competing on the actual business value.
You basically just set the SLO at 99-point-whatever it is. That gives you a rolling error budget. Once you start seeing depletion of the error budget, basically looking at the curve in terms of consumption and the acceleration of consumption, then you wake people up, because the chance that something that matters is going on is much higher.
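The error-budget arithmetic described here can be sketched in a few lines of Python. This assumes a 99.9% SLO over a 30-day window as in the example; the traffic numbers and function names are illustrative, not a real implementation:

```python
# Sketch: rolling error budget for a 99.9% SLO over a 30-day window.
# Traffic numbers and function names are illustrative only.

def error_budget(total_requests, failed_requests, slo=0.999):
    """Return (budget, consumed, remaining) in units of allowed failures."""
    budget = total_requests * (1.0 - slo)  # failures we can tolerate
    remaining = budget - failed_requests
    return budget, failed_requests, remaining


# 10 million requests so far in the window, 4,000 of them failed
budget, consumed, remaining = error_budget(10_000_000, 4_000)
print(f"budget={budget:.0f} consumed={consumed} remaining={remaining:.0f}")

# Alert on acceleration of consumption (burn rate), not individual failures.
# A burn rate of 1.0 means the budget runs out exactly at the window's end.
window_days, elapsed_days = 30, 6
burn_rate = (consumed / budget) / (elapsed_days / window_days)
if burn_rate > 1.0:
    print(f"burn rate {burn_rate:.1f}x -- wake someone up")
```

With these made-up numbers, 40% of the budget is gone after 20% of the window, a 2x burn rate, which is the kind of acceleration that justifies paging someone, rather than any single failed request.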
So, this is monitoring evolving. The flip side to that is what’s become known as observability. There’s a little bit of confusion around the term, because the whole monitoring or APM space is now being thrown together, and log management is starting to be called observability. But the narrower definition of observability is the one that I like a little bit better. It comes from control theory, where basically you have a chance to infer the state of the system from external signals. Let’s say you monitor at the SLO level. Now you need to wake somebody up, or a team needs to get on board, and they need to figure out what is going on. Then you need observability, meaning you need to be able to ask your production systems what is going on.
You have to be able to take hypotheses and turn them into analytical queries or workflows in the tools that you used to support that, like the one that we provide.
McKendrick: Great point about nothing being perfect. I mean, AWS has outages occasionally, and Apple was out the other day. So, nobody has 100% availability.
Beedgen: Existence is a house of cards. Mostly the cards are in the right place, most of the time. That’s about as good as it can get.
Even though we are coming to a point where we are philosophically thinking of a small acceptable range of failure, the stuff that people are building is still incredibly reliable. I mean, just look at the number of times your package arrives the next day. It’s incredible.
McKendrick: It’s amazing. I’m amazed by tolls, highway tolls. You have the transponder in your car, or it reads license plates now. I mean, just blinding. Split second.
Beedgen: Here in Texas, they have this barcode thing. It’s just a small barcode they put in your windshield. They don’t even do the license plate. It’s just a barcode, right? Well, the reality is, if you don’t have the barcode, they’ll go off that license plate and send you a nasty gram.
There’s some pretty amazing stuff happening there. Again, the backend systems are systems like the ones that we’re talking about, where you have the backend in the cloud. That’s IoT on a very large scale, but all the backend processing happens on a bunch of VMs in some cloud.
McKendrick: Well, as Arthur C. Clarke said at one point, any sufficiently advanced technology is indistinguishable from magic. With that, I just want to say thank you, Christian. Thank you very much for joining us in today’s podcast. Great speaking with you and learning from you. Be sure to join us for the third podcast of our series, where Christian and I will be talking about security in today’s age. Thank you.