Many organizations are considering how best to support multidimensional analytics in the cloud with optimal performance and costs.
Analytics cover every type of exploratory work with data, from complex queries in SQL to extensive machine learning (ML) models and real-time artificial intelligence (AI) processes. Some even include traditional business intelligence (BI) in its scope.
Multidimensional analytics have been at the heart of how business people want to explore sales, inventory, and other core data about performance for many years. Previously, multidimensional analysis was done on premises, on PCs, and in spreadsheets. It has had various names, including OLAP and multidimensional analytics (MDA).
The Origins of Multidimensional Analytics and How It Works
Dr. E.F. Codd, the inventor of relational databases years earlier, first described Online Analytical Processing (OLAP) in 1993. It was a play on words, set against the well-known OLTP (Online Transaction Processing) of operational systems. He defined OLAP as “multidimensional data analysis,” which is actually a subset of online analytics. Two alternative implementations quickly arose: MOLAP, multidimensional OLAP with the data stored in a proprietary cube format, and ROLAP, relational OLAP, where the data resides in a relational database (RDB). There are other acronyms and implementations, but ROLAP has long been the preferred approach, despite some performance concerns. So, why relational?
My original data warehouse architecture in 1988 had an RDB at its heart, chosen for its powerful, logical structure that well represented the relationships between data objects—tables and views of rows and columns linked via foreign keys—and the structured query language (SQL) that business people could easily use. Early RDBs were optimized for OLTP. Data on disk was laid out in row format, and good analytical query performance (which mainly retrieves data by column) required substantial indexes and carefully designed table structures. One such RDB structure—promoted by Dr. Ralph Kimball since the early 1990s—is the star/snowflake schema, or dimensional model, with data arranged into fact and dimension tables. It is optimized for multidimensional analytics and is one of two main designs for data warehouses and marts.
Beyond this logical table structure and indexing, many hardware and software approaches have been applied to RDBs to improve their analytical performance. These include:
- Columnar storage, which arranges the columns (rather than rows) sequentially on disk, providing significant query performance gains for disk-based data
- Compression of columnar data, improving access times further
- Massively Parallel Processing (MPP) using many processors for multiple data streams
- In-memory data storage, which is orders of magnitude faster than disk, albeit far more expensive, as well as solid-state disk (SSD), at an intermediate speed/cost, for caching and even long-term data storage
- Use of advanced CPU internal functions for more efficient data handling
- Specialized logic chips designed for particular calculations or disk access methods
Multiple products—often called analytical appliances—including Greenplum, Netezza, and Vertica, were introduced over the first decade of the millennium, often offering significant performance improvements (up to 500X) over traditional row-oriented databases. Many of these products were later acquired by larger database vendors and aligned with or embedded in existing RDB offerings. On-premises MDA today is largely delivered through such database products, front-ended with traditional business intelligence (BI) tools, such as MicroStrategy and Cognos, or visual BI tools, such as Tableau or Qlik.
Multidimensional Analytics Lifts Off for the Cloud
Moving computing functions, particularly BI and analytical processes, to the cloud to support digital transformation, growing data volumes, and cost reduction is now widespread. However, such migration requires careful thought and planning. The cloud environment differs substantially from that found on premises in its technology base and maturity.
The long evolution of on-premises MDA products and underlying databases, as well as IT support strategies, means that many businesses have consolidated multiple types of reporting, BI, and analytics (including MDA) to a single, common relationally-based platform: their data warehouse.
Migration to the cloud may therefore be seen simply as the selection of, and move to, a cloud-based data warehouse based on a single vendor’s RDB and associated tools. Several cloud-based data warehouse products have become popular in recent years. Often using column-based storage, compression, and cloud MPP approaches (bullets 1-3 of the above list), they are better optimized for analytical uses than OLTP-centric databases. So, the obvious strategy might be to use one of these. However, experiences of performance and costs for specific types of analytics, particularly MDA, have sometimes been disappointing in practice.
As happened in the on-premises environment in the early 2000s, higher performance solutions for MDA needs have appeared, often built on open-source data stores such as ClickHouse, Apache Druid, and Trino. However, more is needed and can be done.
Next Generation Multidimensional Analytics Engines
The StarRocks project, an independent open-source analytical database, offers a set of functions that is particularly suited to modern high-performance MDA needs. With a specialized cost-based query optimizer and vectorized query engine, caching of materialized views, and the use of advanced CPU functions, it can speed up multidimensional queries by an order of magnitude compared to general-purpose databases and a 2-3X improvement over other cloud analytical appliances.
Modern MDA requirements often include analysis of real-time data. While many column-based databases require batch pre-processing to load data, or manage data in immutable stores, StarRocks has its own storage and uses an update-in-place approach, offering low latency for real-time MDA applications such as real-time demand forecasting and anomaly detection.
Beyond performance and latency, MDA implementations must also consider data management concerns, in particular, the need to avoid multiple copies of the same data in different formats to support varying needs. StarRocks supports multidimensional data in either a star/snowflake schema or a denormalized, wide table format, both in its own database, as required. Furthermore, it can also access data in external tables, such as Amazon S3, and enables federated queries across a variety of sources, further reducing the need to copy data in the MDA environment.
The migration of analytic and informational processes from on-premises to cloud is now well underway and accelerating at pace. Having begun with big data analytics in data lakes, many organizations are considering how best to support data warehouses and marts, reporting and BI, and multidimensional analytics in the cloud with optimal performance and costs. At its current stage of evolution, best practice in the cloud likely requires multiple platforms with differing strengths to address different types of analytical functions.
For MDA / OLAP requirements, delivered with good cost performance, especially at high data volumes and low latency, specialized cloud analytical appliances currently offer good options.