Real-time visibility lets organizations improve agility and scalability by embracing the hybrid cloud without losing the insight needed to keep systems and data secure, performant, and resilient.
The de facto operating model for IT organizations, according to a recent cloud adoption report by Cisco Systems, is the hybrid cloud. Companies generally find it combines the best of both worlds: the scaling ability and agility of the cloud, along with the predictable and sometimes cost-effective control of an on-premises environment.
Yet, despite its widespread adoption, the cloud can create blind spots, only letting organizations see what network traffic goes in and what comes out. What does or doesn’t happen inside the cloud remains largely invisible. That presents a risk to the security, integrity, and resiliency of an organization’s systems and data.
These unseen aspects of hybrid cloud infrastructure, left unaddressed, are an open invitation to exploitation by malicious actors, providing unauthorized access to client and customer data and intellectual property. Moreover, this lack of visibility can allow performance issues to fester and even lead to service outages. How can you remediate or manage performance issues when you don’t have any insight? Is it an SLA problem with the cloud vendor? Is something wrong with the configuration of the systems? How can you tell?
Whatever the case, the result can be operational disruption, financial losses, regulatory penalties if sensitive data is compromised, and damage to an organization’s reputation.
For cloud infrastructure and service delivery, there are two top priorities: resiliency and robustness. Observability, or the lack of it, affects both.
Resiliency, Robustness, and Observability
Resilience is the ability to recover and reset after a disruptive event that causes a service outage. Robustness, in this context, is the ability of systems to work effectively under adverse conditions. The cloud, in general, is quite robust but not necessarily resilient. To operate at the sought-after “five nines” service level, an IT operation must be able to recover from outages quickly. Yet that recovery requires visibility into the service chain to identify and remediate issues.
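To make the "five nines" target concrete, the short sketch below converts an availability percentage into a yearly downtime budget. The function name and the choice of a 365-day year are illustrative assumptions, not part of any standard definition.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year (assumption)


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


if __name__ == "__main__":
    targets = [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]
    for label, target in targets:
        minutes = downtime_budget_minutes(target)
        print(f"{label} ({target}%): {minutes:.2f} minutes of downtime per year")
```

At five nines the budget is a little over five minutes a year, which leaves almost no room for a slow diagnosis.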
Not all industries are affected equally. Companies in financial services and healthcare have a bigger target on their backs and are therefore subject to stricter regulatory and compliance requirements for observability across both on-prem and cloud services.
As Slack and Zoom both discovered, the consequences of a lack of observability can be dire. In 2021, Slack experienced a seven-hour service outage when AWS load balancers couldn’t keep up with a sudden increase in load from employees returning from holiday vacation. The company’s algorithms interpreted the load balancers’ dropping of packets as “low load” and scaled down the servers in response. For Slack, the lack of visibility led to a misreading of the cause and the wrong response.
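The failure mode described here, where dropped packets look like idle servers, can be illustrated with a small sketch. This is not Slack's actual scaling logic. The metric names, thresholds, and decision labels are all hypothetical, chosen only to show how a network-level signal changes the decision.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    observed_load: float     # load seen by the servers, 0.0 to 1.0 (hypothetical metric)
    packet_drop_rate: float  # fraction of packets dropped upstream, 0.0 to 1.0 (hypothetical metric)


def naive_decision(m: Metrics, low: float = 0.3, high: float = 0.7) -> str:
    """Scale on server-side load alone; upstream packet loss looks like idle servers."""
    if m.observed_load < low:
        return "scale_down"
    if m.observed_load > high:
        return "scale_up"
    return "hold"


def network_aware_decision(
    m: Metrics, low: float = 0.3, high: float = 0.7, max_drop_rate: float = 0.01
) -> str:
    """Refuse to act on server load while the network in front of the servers drops packets."""
    if m.packet_drop_rate > max_drop_rate:
        return "hold_and_alert"
    return naive_decision(m, low, high)


if __name__ == "__main__":
    # Servers look idle only because 20% of packets never reach them.
    starved = Metrics(observed_load=0.1, packet_drop_rate=0.2)
    print(naive_decision(starved))
    print(network_aware_decision(starved))
```

With only the server-side view, the starved pool is scaled down. With the drop rate visible, the same reading triggers an alert instead.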
Zoom experienced a similar incident in 2020 when a software update resulted in increased transactions to their billing system. Since the overloaded billing system was integrated with the company’s web portal, signed-in users couldn’t join meetings or webinars. Zoom’s lack of real-time network observability prevented them from seeing the increase in transactions and acting before the servers stopped responding.
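A surge like this one is the kind of signal a real-time rate check can surface before servers stop responding. The following is a minimal sketch of a sliding-window rate monitor. The class name, window size, baseline, and spike factor are assumptions made for illustration and do not describe Zoom's systems.

```python
from collections import deque


class RateMonitor:
    """Track events in a sliding time window and flag a spike against a baseline rate.

    Timestamps are in seconds and must be passed in non-decreasing order.
    """

    def __init__(self, window_seconds: float, baseline_per_second: float, spike_factor: float = 3.0):
        self.window_seconds = window_seconds
        self.threshold = baseline_per_second * spike_factor
        self._timestamps = deque()

    def record(self, timestamp: float) -> bool:
        """Record one transaction; return True if the windowed rate exceeds the threshold."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Discard events that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        rate = len(self._timestamps) / self.window_seconds
        return rate > self.threshold


if __name__ == "__main__":
    monitor = RateMonitor(window_seconds=10, baseline_per_second=1.0)
    normal = [monitor.record(i * 0.5) for i in range(20)]        # 2 transactions per second
    surge = [monitor.record(10 + i * 0.02) for i in range(50)]   # 50 transactions per second
    print(any(normal), surge[-1])
```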
In its 2022 outages report, Uptime Institute, which monitors the world’s digital infrastructure, found that the costs and effects of this downtime are getting worse. According to Uptime, a full 40% of outages cost between $100,000 and $1 million, while another 16% cost more than $1 million. Taken together, that means 56% of outages, more than half, cost at least $100,000.
Robustness and resiliency will remain top priorities for management teams in service delivery. Yet they can conflict with the desire to boost productivity and minimize costs through outsourcing operations to cloud services and colocation. According to research from Uptime Institute, most data center operators and IT managers surveyed experienced outages caused by issues with third-party providers over the past three years. Quickly understanding and remediating root-cause issues is vital to minimizing or avoiding these expensive service outages.
Cloud Observability in Real-time
So, how can companies manage what they can’t see? Cloud providers are often quick to acknowledge failures yet seldom provide the context and commentary that would allow companies to understand them and prevent future occurrences. Even in the best-case scenario, customers are at the mercy of cloud providers for insight.
Relying on logs and traces for monitoring is risky and leaves companies particularly vulnerable to problems with cloud provider infrastructure. In the cases of Slack and Zoom, a lack of real-time visibility obscured their understanding of the issue and hampered their ability to fix the source of the error in the cloud.
Packet brokering and capture is the high-water mark for cloud observability because it feeds service health and security monitoring tools in real time. However, the major public cloud providers, namely Google, Amazon, and Microsoft, vary in how they implement this packet technology to address a lack of visibility.
Google Cloud (GCP) and Amazon Web Services (AWS) employ similar methods, using endpoint traffic mirroring options and, on AWS only, Gateway Load Balancers. Essentially, this packet mirroring duplicates the traffic running to and from client applications, forwarding it to security and performance monitoring tools for assessment, then on to a packet capture device to be reviewed later. The primary benefit is that it eliminates the need to deploy ad hoc forwarding agents and sensors in each VPC instance for every monitoring tool.
Even so, sending the right data to the right tool requires a cloud-based packet broker when operating multiple monitoring and security tools in parallel. Thankfully, combining mirroring options with virtual packet brokers has the added benefit of reducing operating and traffic costs, since each data stream only needs to be mirrored once rather than once per network performance monitoring (NPM) or security tool.
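The broker's role can be sketched in a few lines: traffic is mirrored once, and the broker decides which tool receives which packets. This is a toy model under stated assumptions, not any vendor's API. The `Packet` fields, tool names, and filter rules are hypothetical, and a real broker works on live traffic rather than Python objects.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    dst_port: int


@dataclass
class Tool:
    name: str
    wants: Callable[[Packet], bool]  # filter: which mirrored packets this tool receives


def broker(mirrored: List[Packet], tools: List[Tool]) -> Dict[str, List[Packet]]:
    """Fan one mirrored stream out to the tools, delivering only what each filter accepts."""
    delivered: Dict[str, List[Packet]] = {tool.name: [] for tool in tools}
    for packet in mirrored:
        for tool in tools:
            if tool.wants(packet):
                delivered[tool.name].append(packet)
    return delivered


def mirrored_copies(packet_count: int, tool_count: int, use_broker: bool) -> int:
    """Copies made at the source: one per packet with a broker, one per packet per tool without."""
    return packet_count if use_broker else packet_count * tool_count


if __name__ == "__main__":
    traffic = [
        Packet("10.0.1.5", "10.0.2.9", 443),
        Packet("10.0.1.5", "10.0.2.9", 5432),
        Packet("10.0.1.7", "10.0.2.9", 443),
    ]
    tools = [
        Tool("npm", lambda p: True),                    # performance tool sees everything
        Tool("web_ids", lambda p: p.dst_port == 443),   # security tool sees HTTPS only
        Tool("db_audit", lambda p: p.dst_port == 5432), # audit tool sees database traffic only
    ]
    for name, packets in broker(traffic, tools).items():
        print(name, len(packets))
    print(mirrored_copies(len(traffic), len(tools), use_broker=True))
    print(mirrored_copies(len(traffic), len(tools), use_broker=False))
```

The copy count is where the cost saving comes from: with three tools, mirroring per tool triples the duplicated traffic at the source, while a broker keeps it at one copy per packet.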
Azure uses a different approach to visibility, relying on something known as “bump-in-cloud” mode. Only available on certain packet brokers, this allows the device itself to monitor ingress and egress traffic at the subnet level, capturing, processing, and delivering data in real time to performance management, security, capture, analytics, and similar tools.
Whatever platform or path is chosen, real-time visibility lets organizations improve agility and scalability by embracing the hybrid cloud without losing the insight needed to keep systems and data secure, performant, and resilient.