The Cloud-Native Path to Data Anywhere

Architecting with Kubernetes is the indispensable centerpiece that makes data analytics exceptionally flexible, running anywhere at the point of business need—and doing so at scale and at high concurrency, performance, efficiency, and availability.

Thousands of enterprises in verticals ranging from financial services and insurance to manufacturing and healthcare are finding that they need public and private cloud, hybrid, and edge deployments to best fit their data management and analytics needs. So it’s no surprise that the concept of distributed cloud is part of cloud’s maturation. Bringing data warehouses, data lakes, and advanced analytics to a distributed cloud architecture is where markets are heading. Extending this architecture to encompass higher-level data management and analytics services naturally leads to the idea of a distributed data cloud. Within a distributed data cloud, enterprise data warehouses won’t just provide analytics to a few hundred business analysts or data scientists in a company; they will ultimately power real-time analytic applications used directly by an enterprise’s end customers, who number in the tens of thousands. The data will be immediately accessible, and yield insights, anywhere.

Exploring the destination

Cloud-native is a term tossed around a lot, but it has real meaning when software architecture is designed from the ground up to take advantage of distributed cloud’s benefits. A fully realized cloud-native data warehouse should logically leverage a distributed data cloud architecture. In the broadest terms, that brings analytics to the data wherever it resides (not the other way around), mitigates concentration risk, increases efficiency, and enables modernization that keeps spend under control and delivers a competitive edge.

To put a finer point on it, a cloud-native data management and analytics technology should display five key traits to align with the distributed data cloud blueprint:

  • A platform-agnostic runtime allowing the provisioning of data and analytics anywhere
  • A common user experience anywhere
  • Common security and governance features on any deployment target
  • Cost and technology efficiency anywhere, minimizing resources and allowing for strong cost management (FinOps) and spend guardrails
  • A single control plane tying all deployments together: public cloud, on-premises, and network edge

Deployable at the point of need anywhere, a fully realized cloud-native data warehouse following this pattern will also abstract away the complexity of cloud, on-premises, and network edge infrastructure from end-users. The point is to free them from infrastructure details and allow them to focus on generating value from analytics and managing data while still imparting the native power of cloud.

Choosing the right guide

So, how is that destination reached? Kubernetes, the open-source container orchestration tool, provides the most popular path to cloud-native operations. While the idea of partitioning workloads in Unix has been around since the 1970s, it was only about a decade ago that containers were widely adopted to make application development easier, more portable, and more resource-efficient. But deploying hundreds or thousands of applications across a vast microservices architecture proved extremely tricky. While other options exist, Google’s open-source Kubernetes project, now maintained by the Cloud Native Computing Foundation, rose to prominence by solving microservices application orchestration: it enables applications to run on generic infrastructure, be monitored and managed in standard ways, and be authenticated using open standards.

That’s well and good for applications. But what about the world of data? The same foundational container orchestration is required for cloud-native data warehouses to offer elasticity and deployment flexibility across public and private clouds, network edge, hybrid, and fully distributed clouds.

Cloud-native re-architecture for scale-out web applications is commonplace, but databases have mostly just been “lifted-and-shifted” into the cloud-native world. Plonking a database into a container allows it to run in modern infrastructure, but it doesn’t offer an experience that demonstrates all the benefits of cloud. The software is largely unaware that it’s running in a container environment, and operations such as managing elastic clusters have to be handled clumsily from outside the database, by hand, using Operators and hand-edited Helm charts. Features such as allowing multiple, elastic on-demand compute clusters to share the same underlying data in object storage are often unavailable. Users seeking to get business value from an elastic, cloud-based data warehouse don’t want to know about Helm charts, pods, nodes, or configuration files. They just want to provision data warehouses, manage elastic clusters, and gain insights from their data.

Providing a SQL interface over Kubernetes to provision multiple, elastic clusters on demand and to hide Kubernetes complexities from DBAs and end-users is the answer.

In this manner, different users can be assigned to run workloads on different compute clusters, and the compute cluster in use can be changed at runtime through SQL, subject to permission. Clusters can be configured to automatically suspend after an idle period and spin back up again on demand. For example, one compute cluster could be created to run ETL processes when needed, another for ad-hoc business intelligence (BI), and several more for data science. Compute clusters can be expanded online during periods of heavy usage or turned off during quiet periods to save money. Clusters can be created to run daily, weekly, or monthly batch reporting tasks and be active only during those time periods. Both the size of the nodes in a compute cluster and the number of nodes are controllable in this model, and limits on resource consumption can be established at the instance level for predictability. Similarly, it’s possible to set up a low-cost replica system that receives replication traffic from a primary data warehouse instance and can be scaled up on demand when the replica needs to be used.
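
The behavior described above (idle auto-suspend, resume on demand, node size and count, and an instance-level cap) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class names, fields, and the use of vCPUs as the measure of node size are hypothetical and do not reflect any particular product’s implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeCluster:
    """Hypothetical model of one elastic compute cluster."""
    name: str
    node_count: int            # number of nodes in the cluster
    vcpus_per_node: int        # assumption: stands in for "node size"
    idle_timeout_s: float      # auto-suspend after this many idle seconds
    running: bool = True
    last_activity_s: float = 0.0

    def capacity(self) -> int:
        """vCPUs the cluster needs while it is running."""
        return self.node_count * self.vcpus_per_node


@dataclass
class Instance:
    """Hypothetical warehouse instance with an instance-level vCPU cap."""
    max_vcpus: int
    clusters: dict = field(default_factory=dict)

    def used_vcpus(self) -> int:
        # Suspended clusters consume no compute.
        return sum(c.capacity() for c in self.clusters.values() if c.running)

    def _check_limit(self, extra: int) -> None:
        if self.used_vcpus() + extra > self.max_vcpus:
            raise ValueError("instance vCPU limit exceeded")

    def create_cluster(self, cluster: ComputeCluster) -> None:
        self._check_limit(cluster.capacity() if cluster.running else 0)
        self.clusters[cluster.name] = cluster

    def run_query(self, name: str, now_s: float) -> None:
        cluster = self.clusters[name]
        if not cluster.running:
            # Resume on demand, still subject to the instance limit.
            self._check_limit(cluster.capacity())
            cluster.running = True
        cluster.last_activity_s = now_s

    def suspend_idle(self, now_s: float) -> list:
        """Suspend every running cluster idle for at least its timeout."""
        suspended = []
        for cluster in self.clusters.values():
            idle_for = now_s - cluster.last_activity_s
            if cluster.running and idle_for >= cluster.idle_timeout_s:
                cluster.running = False
                suspended.append(cluster.name)
        return suspended
```

Passing timestamps in explicitly, rather than reading a clock, keeps the sketch deterministic. In a real system the suspend check would presumably be driven by a scheduler, and the cap would cover more than vCPUs.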

This kind of elasticity is implemented not just by deeply integrating with Kubernetes but by using SQL itself as the “user interface” for creating, suspending, resuming, and managing clusters instead of developer tools. Kubernetes is the authoritative source of truth for the state of all clusters. System views showing the state of the clusters source their data from Kubernetes using its APIs. When cluster management SQL statements are entered, the cloud-native data warehouse reaches out to Kubernetes to change the desired state of an instance; Kubernetes then implements the necessary changes. If a node in the cluster becomes unhealthy, Kubernetes will bring a replacement online.
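
The flow in the paragraph above, where a SQL statement changes desired state and the orchestrator converges on it, can be sketched as a toy reconciliation loop. Everything here is an assumption made for illustration: the statement grammar (`CREATE CLUSTER ... WITH NODES n` and so on) is invented and is not any product’s actual syntax, and `FakeOrchestrator` is an in-memory stand-in rather than the Kubernetes API.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ClusterSpec:
    size: int                      # node count requested through SQL
    suspended: bool = False
    nodes: list = field(default_factory=list)   # observed healthy nodes

    @property
    def desired_nodes(self) -> int:
        return 0 if self.suspended else self.size


class FakeOrchestrator:
    """In-memory stand-in for the orchestrator (not the Kubernetes API)."""

    def __init__(self) -> None:
        self.clusters = {}
        self._serial = 0

    def reconcile(self) -> None:
        """Drive observed state toward desired state for every cluster."""
        for name, spec in self.clusters.items():
            while len(spec.nodes) < spec.desired_nodes:
                self._serial += 1
                spec.nodes.append(f"{name}-{self._serial}")
            # Scale down (or suspend) by dropping surplus nodes.
            del spec.nodes[spec.desired_nodes:]

    def cluster_view(self) -> list:
        """Rows of (name, desired, observed), like a system view."""
        return [(n, s.desired_nodes, len(s.nodes))
                for n, s in sorted(self.clusters.items())]


# Hypothetical statement grammar, invented for this sketch.
_PATTERNS = [
    (re.compile(r"CREATE\s+CLUSTER\s+(\w+)\s+WITH\s+NODES\s+(\d+)$", re.I), "create"),
    (re.compile(r"ALTER\s+CLUSTER\s+(\w+)\s+SET\s+NODES\s+(\d+)$", re.I), "resize"),
    (re.compile(r"(SUSPEND|RESUME)\s+CLUSTER\s+(\w+)$", re.I), "power"),
]


def execute(sql: str, orch: FakeOrchestrator) -> None:
    """Translate one statement into a desired-state change only."""
    text = sql.strip().rstrip(";").strip()
    for pattern, kind in _PATTERNS:
        m = pattern.match(text)
        if not m:
            continue
        if kind == "create":
            orch.clusters[m.group(1)] = ClusterSpec(size=int(m.group(2)))
        elif kind == "resize":
            orch.clusters[m.group(1)].size = int(m.group(2))
        else:
            orch.clusters[m.group(2)].suspended = m.group(1).upper() == "SUSPEND"
        return
    raise ValueError(f"unsupported statement: {sql!r}")
```

Note that `execute` never creates or removes nodes itself. It only records intent, and `reconcile` does the work, which is also how a node that disappears gets replaced without any further SQL being issued.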

This represents a unique, inside-out relationship with Kubernetes: rather than Kubernetes being the “user interface” for driving the state of the cluster, the database itself, which is managed by Kubernetes, becomes the user interface. This architecture creates a symbiotic relationship that delivers a unique, fully realized cloud experience. The power and cross-platform flexibility of Kubernetes become available to a data warehouse, driven entirely through SQL.

As more data is generated and more use cases are deployed, it’s easy for enterprises to enter a vicious cycle in which their ecosystem becomes increasingly entrenched within a particular cloud. Systemic risks can arise in that single cloud that present too much exposure for critical IT infrastructure in heavily regulated sectors like financial services and insurance.

Architecting with Kubernetes isn’t the only core concept that brings a fully realized cloud-native data warehouse to life, nor is it the only architectural component aligned with the distributed data cloud pattern. But it’s the indispensable centerpiece that makes data analytics exceptionally flexible, running anywhere at the point of business need, and doing so at scale and at high concurrency, performance, efficiency, and availability. The upshot is that thousands of users in any given enterprise, across different business lines and geographical regions, can make extremely fast decisions and generate value from analytics in motion in near real time.

Mark Cusack

About Mark Cusack

Mark Cusack is the CTO at Yellowbrick. Mark has worked in the data warehousing and advanced analytics space for the past 20 years. He was a co-founding developer at the data warehouse archiving company RainStor, which was acquired by Teradata in 2014. Mark holds a PhD in computational physics, and has worked in academia, government, startups, and enterprises over the course of his career.
