Use cases that demand real-time analytics need a combination of technologies, including distributed storage, a distributed query engine, and caching, to handle petabytes of data and deliver results in seconds or even under a second.
Real-time and customer-facing analytics require analytical data systems to adopt capabilities typically reserved for transactional databases. Customer-facing systems demand databases that can handle high concurrency, while real-time workloads demand the ability to append and update (“upsert”) data swiftly. Not every business decision labeled “real-time” requires a dedicated analytical database; sometimes a transactional database suffices. The case for real-time analytics becomes compelling, however, when dealing with large volumes of data, intricate analytics, and the need for low-latency responses.
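To make the upsert requirement concrete, here is a minimal sketch using SQLite’s `ON CONFLICT` upsert syntax. SQLite stands in for an analytical store here purely for illustration; the table and column names are hypothetical, and real-time analytical databases expose similar primary-key upsert semantics at much larger scale.

```python
import sqlite3

# In-memory database standing in for an analytical store with upsert support.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE user_metrics (
        user_id    INTEGER PRIMARY KEY,
        page_views INTEGER NOT NULL
    )
    """
)

def upsert_views(user_id: int, views: int) -> None:
    # Insert a new row, or overwrite the existing one if user_id already exists.
    conn.execute(
        """
        INSERT INTO user_metrics (user_id, page_views) VALUES (?, ?)
        ON CONFLICT(user_id) DO UPDATE SET page_views = excluded.page_views
        """,
        (user_id, views),
    )

upsert_views(1, 10)   # first write: insert
upsert_views(1, 25)   # second write: update in place
print(conn.execute(
    "SELECT page_views FROM user_metrics WHERE user_id = 1"
).fetchone()[0])  # → 25
```

The point is the write path: the same statement either inserts or updates, so freshly arriving data replaces stale rows without a separate read-then-write round trip.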
The perceived need for real-time analytics often outweighs the actual need. Many supposedly “real-time” insights are, by technology standards, quite dated. Business users may mean “from today” rather than “as of the end of the month” when they say “real-time,” and batch systems can handle some of those cases. If “real-time” at your company simply means fast queries, an efficient query engine will suffice; you do not need a full real-time analytics system. For simple tasks like retrieving a recently added row, a transactional system is appropriate. A full-fledged real-time analytics system becomes essential only when you must run complex analytical queries, such as aggregations, that also need to include the most recently inserted data.
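The distinguishing requirement, an aggregation that reflects data inserted moments earlier, can be sketched with a toy event table (again using SQLite in memory as a stand-in; the `events` table and its columns are hypothetical). In a batch warehouse the new row might not appear until the next load, which is exactly the gap a real-time system closes.

```python
import sqlite3

# Toy event table standing in for a real-time analytical store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("us", 100.0), ("us", 50.0), ("eu", 75.0)],
)

# A freshly ingested event must show up in the very next aggregation.
conn.execute("INSERT INTO events VALUES ('us', 25.0)")
total = conn.execute(
    "SELECT SUM(amount) FROM events WHERE region = 'us'"
).fetchone()[0]
print(total)  # → 175.0 (includes the row inserted a moment ago)
```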
The concept of “real-time” in real-time analytics lacks a universal definition. How recent must the data be? Can a view tolerate being 10 seconds old, or must it reflect the absolute freshest result? The answer is tightly coupled to your use case, and it is critically important: the delay in updating data and processing queries, known as latency, affects more than technical performance. It significantly shapes both your system’s architecture and its operational costs.
In general, lower latency costs more than higher latency. For many use cases, the cost of subsecond or fully real-time updates is unjustifiable. Certain scenarios, however, demand it: the world’s largest gaming platforms need immediate updates on in-game activity, which, analyzed alongside other user data, influences in-game actions. In industries like high-speed financial transactions, fraud detection tools must operate in real time to prevent substantial losses or competitive disadvantages.
Implementing real-time analytics at scale requires fast distributed storage (often AWS S3) and a high-performance distributed query engine (like the Linux Foundation’s StarRocks). Customer-facing analytics may also require a query engine that can cache results at various levels or process data in memory. Data complexity is another consideration: is preprocessing, such as denormalizing complex data, necessary for efficient querying? Preprocessing introduces latency before the data is queryable, but slow query performance adds latency of its own.
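The caching mentioned above embodies the freshness-versus-cost tradeoff directly. A minimal sketch of one common pattern, a time-to-live (TTL) result cache, is below; the class and query names are hypothetical, and production engines implement far more sophisticated variants of the same idea.

```python
import time
from typing import Any, Callable, Dict, Tuple

class TTLCache:
    """Serve cached query results for up to `ttl` seconds.

    Hits skip the query engine entirely (cheaper, faster), at the price
    of results being up to `ttl` seconds stale.
    """

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        entry = self._store.get(key)
        now = time.monotonic()
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]          # cache hit: no query engine work
        result = compute()           # cache miss: run the expensive query
        self._store[key] = (now, result)
        return result

cache = TTLCache(ttl=10.0)
calls = []

def revenue_query() -> int:
    calls.append(1)          # track how often the "engine" is actually hit
    return 12345             # pretend this is an expensive aggregation

cache.get_or_compute("revenue", revenue_query)
cache.get_or_compute("revenue", revenue_query)  # served from cache
print(len(calls))  # → 1
```

Choosing `ttl` is the same decision discussed earlier in the article: a larger TTL lowers cost and load but widens the window in which users see stale numbers.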
Not every use case demands real-time analytics, but for those that do, a strategic combination of technologies, including distributed storage, a distributed query engine, and caching, can handle petabytes of data and millions of users while delivering results in seconds or even under a second. Achieving this balance means deploying the right analytical technologies and ensuring data feeds are fast enough to meet the application’s demands. With informed tradeoffs, you can achieve real-time analytics and customer-facing scalability with minimal latency, provided the use case justifies the cost.