Observability and Organizational Design

As the world embraces digital transformation, businesses must focus on organizational performance based on observability.

Observability is a characteristic of systems: they can be observed. It’s closely related to a DevOps tenet, ‘telemetry everywhere’, meaning that anything we implement emits data about its activities. It requires intentional behavior during digital product and platform design, and a conducive architecture. It’s not monitoring. Monitoring is what we do when we observe our observable systems, and it’s also the name of the tools category that largely makes this possible.
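To make ‘telemetry everywhere’ concrete, here is a minimal Python sketch of a function that emits an event about every call it handles. All names in it (EVENTS, emit, instrumented, place_order) are hypothetical and chosen for illustration, and the in-memory list stands in for whatever telemetry backend a real team would use.

```python
import json
import time
from functools import wraps

# Hypothetical in-memory sink; a real service would ship events to a telemetry backend.
EVENTS = []


def emit(event):
    """Record one structured telemetry event as a JSON string."""
    EVENTS.append(json.dumps(event, sort_keys=True))


def instrumented(service):
    """Wrap a function so that every call emits an event, whether it succeeds or fails."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            outcome = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                outcome = "error"
                raise
            finally:
                # Runs on both paths, so failures are as visible as successes.
                emit({
                    "service": service,
                    "operation": func.__name__,
                    "outcome": outcome,
                    "duration_ms": round((time.perf_counter() - start) * 1000, 3),
                })
        return wrapper
    return decorator


@instrumented("checkout")
def place_order(order_id):
    """Illustrative business function; the negative-id rule is made up."""
    if order_id < 0:
        raise ValueError("invalid order id")
    return f"order {order_id} placed"
```

The point of the sketch is the intent, not the mechanism: the telemetry is designed into the function rather than bolted on afterwards.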

Since it requires intention, it relies on the human and cultural aspects of an organization for implementation.

Project to product

Global digitization puts us firmly in the middle of a technological revolution and, according to Carlota Perez, on the brink of a golden age of prosperity. All around the world, organizations are racing to transition from waterfall (project-centric) ways of working to agile/DevOps (product-centric) ways of working. This demands that we change the organizational structure from layers and silos to an organic, cellular design constructed of multifunctional, autonomous teams. That means those teams contain all the skills needed and are empowered to develop and deliver an idea for their product to their customer and service them. Getting this right directly correlates to organizational performance; it’s the difference between being a disruptor or being disrupted.

When we treat our work as projects, our non-functional requirements often fall by the wayside. We frequently think more about ‘the wedding than the marriage’; that is, we’re more concerned about the go-live date of the project than the years or decades of maintenance and enhancements our new capabilities will need to keep them serviceable and to keep them providing the sublime customer experience that keeps us ahead of our competitors.

But when we’re working with long-lived products, we’re more careful about thinking ahead. We’re better at controlling our time and assigning space in our sprints to paying down technical debt and applying higher-level capabilities like observability.

Conway’s law

System design is informed by the communications structure of our organization, says Conway. Organizational design (or the communication structure of an organization) and the systems it builds are intricately connected. When an organizational design is multilayered, with tightly coupled teams handing work and data off to each other, so is the system. It’s a monolith. Conversely, when an organizational design is made up of small, loosely connected parts (just as described in the previous section), so is the system – it’s microservices and API-driven. APIs mean small parts of the system have autonomy, and it’s possible to build, test and deploy in small increments without affecting the whole.

Agile methodologies drive adaptability in working practices through incrementality. Small pieces of work allow us to gain feedback faster and choose quickly whether to pivot or persevere.

In DevOps terms, we call this The Three Ways: the fast flow of value enables fast feedback from the customers receiving that value, which supports a culture of experimentation and learning. We use automation tools to assure quality and stability as we accelerate throughout (CI/CD). We use telemetry and observability to amplify and shorten those feedback loops. Feedback is going to be slow if we don’t have observability built in.

And whilst the advantages of microservices over monoliths are pretty clear cut, everything in life is a trade-off. What we gain in adaptability, we pay for in complexity. It’s pretty hard to keep track of all those smaller pieces of code, and the machines (or containers) proliferate too. Observability has to be embedded in every microservice, and we have to know we have the tools to see what’s happening.

Incident management and swarming

Traditionally, ITSM has promoted a three-tiered support model, where level 1 is the service desk. It’s a hierarchical model that relies on process-based escalation. The advantages are that the work is controlled, documented, and understood. The disadvantages are that the structure can cause delays, handoffs, and, consequently, frustration.

Those multifunctional, autonomous teams operate with a “we build it, we own it” mentality. Swarming means that when those incident tickets appear, it’s ‘all hands on deck’ to resolve them, so that the whole team can get back to innovating. It requires a flattened hierarchy – which is exactly what we have once we’ve distributed authority to empower our autonomous teams. And it certainly improves the flow when dealing with an incident: no escalations, no handoffs.

But does that come at the expense of the highly qualified team members being distracted from creating differentiating features that result in customer value? Yes. What is the cost of the trade-off and how can it be mitigated? The cost is 1:1 – every minute spent fixing a problem is a minute not invested in moving value forwards. But mitigation? What we need to do is reduce the Mean Time to Resolve (MTTR).

Enter observability. Most of the time spent in incident management goes on figuring out what the problem is. With observability, we have ensured the windows into our systems’ souls are open. Observability is key to supporting DevOps and swarming models. It helps traditional models too, but they’re still suffering from the work process problems they’ve always had and are probably lacking the automation technology too.

There’s a nice knock-on effect here too – the team is hyper-aware of where the problems are coming from, so it is more likely to invest in site reliability engineering and put aside time to pay down technical debt. All of which then releases time for innovation and differentiation.

AIOps and noise reduction

So observability reduces our MTTR. Or does it? Not on its own. Observability, remember, is a system characteristic. It’s how humans exploit this characteristic that matters. Monitoring helps, but the downside of monitoring these days is that we have so much of it. So much that teams are reporting ‘alert fatigue’. Enter AIOps. Now our autonomous teams have insights when they need to swarm. Noise is reduced through deduplication and correlation, and insights into patterns and trends help us pinpoint the problem much faster. That’s how we really reduce that MTTR.
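Deduplication and correlation can be sketched in a few lines of Python. This is a deliberately naive illustration, not how any particular AIOps product works: the alert fields (service, check, ts), the function names and the fixed 60-second window are all assumptions made for the example.

```python
from collections import defaultdict


def deduplicate(alerts):
    """Collapse alerts that share a (service, check) fingerprint, keeping a count."""
    merged = {}
    for alert in alerts:
        key = (alert["service"], alert["check"])
        if key not in merged:
            # Keep the first occurrence and start counting repeats.
            merged[key] = {**alert, "count": 0}
        merged[key]["count"] += 1
    return list(merged.values())


def correlate(alerts, window_seconds=60):
    """Group alerts that fall into the same fixed time window as one candidate incident."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[alert["ts"] // window_seconds].append(alert)
    return [incidents[bucket] for bucket in sorted(incidents)]


# Made-up sample data: timestamps are seconds since an arbitrary start.
RAW_ALERTS = [
    {"service": "payments", "check": "latency", "ts": 100},
    {"service": "payments", "check": "latency", "ts": 105},
    {"service": "payments", "check": "latency", "ts": 110},
    {"service": "orders", "check": "errors", "ts": 112},
    {"service": "search", "check": "cpu", "ts": 900},
]

unique_alerts = deduplicate(RAW_ALERTS)
incidents = correlate(unique_alerts)
```

With this sample data, five raw alerts collapse into three unique alerts and then into two candidate incidents, which is the kind of reduction that lets a swarming team look at one problem instead of five notifications. Real tooling would typically correlate on topology and learned patterns rather than a fixed time bucket.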

Strictly speaking, we’re reducing the MTTD, the mean time to discover – the teams still need to build and deploy any updates, roll back to the last known good state, or do whatever they choose to fix the problem and resolve the incident.
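The relationship between the two metrics is simple arithmetic, as the following Python sketch shows. The incident figures are invented for illustration, and the field names (started, detected, resolved, in minutes) are assumptions rather than any standard schema.

```python
from statistics import mean

# Invented incident timings, in minutes from the start of each incident.
INCIDENTS = [
    {"started": 0, "detected": 30, "resolved": 50},
    {"started": 0, "detected": 10, "resolved": 40},
    {"started": 0, "detected": 20, "resolved": 30},
]


def mttd(incidents):
    """Mean time to discover: from the incident starting to the team knowing what is wrong."""
    return mean(i["detected"] - i["started"] for i in incidents)


def mttr(incidents):
    """Mean time to resolve: from the incident starting to the fix being in place."""
    return mean(i["resolved"] - i["started"] for i in incidents)


def with_faster_discovery(incidents, minutes_saved):
    """Model better insight: discovery is earlier, the time needed to fix is unchanged."""
    return [
        {
            "started": i["started"],
            "detected": i["detected"] - minutes_saved,
            "resolved": i["resolved"] - minutes_saved,
        }
        for i in incidents
    ]


improved = with_faster_discovery(INCIDENTS, minutes_saved=10)
```

In this made-up data the MTTD is 20 minutes and the MTTR is 40. Shaving 10 minutes off discovery brings them to 10 and 30: the MTTR falls by exactly the time saved on discovery, while the time spent on the fix itself stays the same.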

But AIOps is a self-fulfilling prophecy – it keeps giving teams insights into where problem patterns occur. It’s not just about the ‘hair on fire’ moments; it’s also about continuous improvement. It pinpoints where that technical debt is and offers up insights into where to apply automation to kill toil or create self-healing.

Blame-free, safety-driven organizations

What’s even better is that having this data supports another DevOps cultural goal: to be data-driven. That in turn supports the desire for a blame-free culture. Blame is incredibly wasteful, time-consuming, and, frankly, upsetting. It’s a motivational drain and damages our relationships with others, reducing our ability to collaborate, which is one of the fundamentals of flow-based working.

When we are blame-free it’s much easier to treat an incident as a learning opportunity, not a disaster to be avoided at all costs. When we have the mechanisms and tools in place (read: observability and AIOps) that reduce the impact of an incident we feel safer – because we are safer. We work together so much better.