Observability isn’t an additional feature. It’s not even a non-functional requirement. It’s a core architectural tenet and it is testable.
Observability is a characteristic of systems: the extent to which they can be observed. It’s closely related to the DevOps tenet of ‘telemetry everywhere,’ meaning that anything we implement emits data about its activities. It requires intentional behavior during digital product and platform design, and a conducive architecture. It’s not monitoring: monitoring is what we do when we observe our observable systems, and the name of the tools category that largely makes this possible.
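A minimal sketch of what ‘telemetry everywhere’ can look like in practice, using Python’s standard library. The service name, event names, and `place_order` operation are all hypothetical; the point is that the code emits structured data about its own activity as it runs.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")  # hypothetical service name

def emit_event(name: str, **fields) -> str:
    """Build a structured telemetry event, log it, and return the JSON line."""
    payload = json.dumps({"event": name, "ts": time.time(), **fields})
    log.info(payload)
    return payload

def place_order(order_id: str, amount: float) -> None:
    """A hypothetical operation that reports on its own activity."""
    start = time.time()
    emit_event("order.received", order_id=order_id, amount=amount)
    # ... business logic would run here ...
    emit_event("order.completed", order_id=order_id,
               duration_ms=round((time.time() - start) * 1000, 2))

place_order("A-1001", 49.99)
```

In a real system you would reach for a dedicated telemetry framework rather than raw JSON logs, but the design intent is the same: observability is built in at the point of implementation, not bolted on afterward.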
Since it requires intention, it relies on the human and cultural aspects of an organization for implementation.
DevOps ways of working lean heavily on cultural aspects, and one of the newer, emerging aspects is observability. To understand it, it’s best to think about what a culture without observability looks like. Systems are not designed to be observable, and therefore people can’t see into them. People are likely not set up to share knowledge and collaborate on problems and opportunities. Perhaps behavior reflects that “knowledge is power,” and there are heroes where there should be mentors and leaders. Conversations are opinion-driven, not data-driven, and decisions are made accordingly. Seeing the outcome of those decisions is hard, because nobody can access data on customer feedback.
In an observability culture, engineers design and build systems so that they are inherently observable. They do this because they know they are going to want to use and share that data to achieve outcomes such as resolving incidents faster and making the best choices about what features to develop or what technical debt to address.
As noted at the outset, observability isn’t an additional feature or merely a non-functional requirement; it’s a core architectural tenet, and it is testable. Indeed, testability relies on observability.
We build it; we own it
Another characteristic of a DevOps culture is that its hierarchy is flat; that is, authority has been distributed, and teams are autonomous, as addressed in the previous post in this series. Autonomy means that the team chooses its work and how to do it, and also that the team contains all of the skills and knowledge to take an idea in their product all the way through the development and delivery cycle and receive feedback directly from the customer in real time to feed into the next cycle of enhancements. The team is multifunctional, and team members are multiskilled. Job titles blur as individuals carry multiple roles and morph in and out of them as work priorities demand and change.
Amazon’s CTO, Werner Vogels, famously said back in 2006, “you build it, you run it,” which was groundbreaking at the time for challenging the boundaries between development (build) and IT operations (run). As DevOps has succeeded in creating teams that own the end-to-end value stream or product, the advice can be updated to ‘we build it; we own it,’ creating a heightened sense of shared ownership and accountability.
While we often think of these autonomous teams containing development, testing, and IT operations expertise, they also contain product owners or managers, user experience, and business analysis roles.
Total Cost of Ownership (TCO)
Product managers have long been described as ‘mini CEOs,’ reflecting how a product is a micro-business in its own right. It needs strategy, vision, execution plans, implementation management, marketing, continuous improvement, and investment to achieve all of this. It might be self-funding, but a good product team should be managing its own profit and loss (P&L). They need to know what it costs to run on a monthly (at least) basis and have real-time data on income. They need to know the total cost of ownership of their product.
Income can be affected very quickly by problems and incidents. Downtime can interrupt transactions that may never be retrieved or regained. It also causes customer dissatisfaction and reputational damage, stymying repeat business, referrals, and new customer acquisition later down the line. Poor customer experience as a result of defective or low-quality digital products has the same effect.
On the other side of the same coin, plowing costly development effort into something the customer doesn’t want or need, and ultimately doesn’t use, is a waste of money and resources. Time to market means nothing; it’s all about time to value, or time to learning. What matters is not the speed at which a new feature is delivered but the speed at which a positive, customer-driven value outcome is achieved and the related feedback is received.
To most effectively gain feedback (and be adaptable) from customer experience and desires, teams must continuously deliver. They must break work down into small components and release them quickly and monitor feedback. They practice continuous integration; all developers commit at least daily to trunk, and builds are automated along with unit, integration, and user acceptance tests. These activities are treated as experiments, with benefits hypotheses that are tested. Practicing continuous integration allows for continuous delivery; teams can release new features as customers demand them (pull, not push).
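To make the idea of a tested benefits hypothesis concrete, here is a deliberately simplified sketch. The conversion-rate metric, the counts, and the 10% minimum uplift are all hypothetical, and a real experiment would use a proper statistical significance test rather than a raw threshold.

```python
def conversion_rate(conversions: int, visitors: int) -> float:
    return conversions / visitors if visitors else 0.0

def hypothesis_met(control: tuple, variant: tuple,
                   min_uplift: float = 0.10) -> bool:
    """Benefit hypothesis: the new feature lifts conversion by at least
    min_uplift relative to control (an assumed 10% here)."""
    c = conversion_rate(*control)
    v = conversion_rate(*variant)
    return c > 0 and (v - c) / c >= min_uplift

# Hypothetical counts: (conversions, visitors)
print(hypothesis_met(control=(120, 4000), variant=(150, 4000)))  # prints True
```

The discipline matters more than the math: every small release states up front what outcome it expects, and the observability data confirms or refutes it.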
Working this way is not suited to heavyweight annual budgeting cycles; those are for large-batch, project-oriented ways of working. And anyway, these are long-lived teams working on long-lived products. And, remember, they understand their P&L and their TCO.
If a business is truly agile, all parts, including financing, are working in small-batch modes, with frequent feedback loops. These product-oriented, value stream teams should be continually funded, also known as capacity funded. If this seems like too much of a change, choose some pilot teams and start with a halfway house: rolling quarterly funding waves.
Capacity-based funding distributes available funding across persistent, self-organizing teams. Funding is allocated based on the team’s delivery capacity and the resources required to deliver the solution or product.
Capacity-based planning is performed at the team level, where teams prioritize work within the sprint and estimate hours available to commit to a specified set of activities. This method focuses on understanding the scope of products and future conditions to develop more accurate release schedules. Teams continuously re-estimate to ensure delivery predictability.
As established, DevOps culture is one of experimentation. Not all experiments confirm their hypothesis, and that’s the point: even ‘failure’ is a learning opportunity. More important to agile ways of working than speed, more even than adaptability, is the management of risk. Trying and testing small ideas before committing more time and funding de-risks the effort.
There’s a great story from the UK’s intelligence agency, GCHQ, about how they crowdsource experiments and direct funding. It’s a great example of why we need autonomous teams with direct access to customer feedback:
“It is likely that those front-line staff are much more in touch with latest ideas in technology than directors who have been at the organization for 30 years. If the iceberg of ignorance is accurate, those directors would only be aware of 4% of problems compared to 100% for front-line workers.”
Experimentation mitigates risk AND accelerates innovation.
How do teams know how their experiments performed? How do they keep themselves safe during experimentation? Observability, of course.
Investment cases for observability
It’s difficult to attribute value directly to observability, as it’s not a feature for customers to use but a capability for engineers to leverage. It is, however, a prerequisite for AIOps, because it is the foundation for making data accessible.
As established, observability should be a cultural and architectural imperative for teams. Everything should be built with it in mind. It’s as non-negotiable as performance, security, and stability. It’s an enabling factor.
Investment cases for AIOps
Teams can implement observability and monitoring and see reductions in the time to discover and recover from problems and incidents. But with monitoring systems and data proliferating rapidly, this is hard for humans to do effectively without a little help from our machines. AIOps supercharges product support and maintenance. Here’s how.
Consider this use case for incident management for AIOps and some guidance on writing a business case:
- Your volume and length of incidents are the starting point
- Then work out the cost of an incident in terms of loss in transactions
- Also, apply a cost to the loss in repeat business, referrals, and new customers as a result of reputational damage
- Also, work out how many working hours are spent on incidents and the cost of resourcing them
- Now decide how much time you will save on average per incident on discovering/detecting and resolving/recovering (MTTD/R) – estimate as a percentage or run a pilot and get some real data
- Calculate savings on transactions that are no longer lost and people’s time that’s no longer used on this unplanned work
- Estimate the gain in repeat business, referrals, and new customers not lost
- Now express that this time saved can now be spent working on new features that will deliver valuable outcomes for your customer and business
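The steps above can be sketched as a simple cost model. Every figure here is a placeholder to be replaced with your own incident data (or pilot results); the structure, not the numbers, is the point.

```python
def aiops_savings(incidents_per_year: int, avg_hours_per_incident: float,
                  revenue_lost_per_hour: float,
                  reputation_loss_per_incident: float,
                  responders: int, hourly_rate: float,
                  mttr_reduction: float) -> dict:
    """Cost out current incidents, then apply the estimated MTTD/R
    reduction to get an annual savings figure. All inputs are assumptions."""
    downtime_hours = incidents_per_year * avg_hours_per_incident
    current_cost = (
        downtime_hours * revenue_lost_per_hour                 # lost transactions
        + incidents_per_year * reputation_loss_per_incident    # reputational damage
        + downtime_hours * responders * hourly_rate            # people's time
    )
    return {"current_cost": current_cost,
            "annual_savings": current_cost * mttr_reduction}

# Hypothetical figures: 120 incidents/yr, 3h each, 30% MTTD/R reduction
print(aiops_savings(120, 3, 5_000, 2_000, 4, 75, 0.30))
```

The final step of the business case is then qualitative: the hours recovered become capacity for new features that deliver valuable outcomes for your customer and business.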