What’s the one thing that never changes in the Data World? Even with the rise of the Modern Data Stack, new tools hatch like little chicks from eggs; some go on to glory while others wither on the vine. The circle of life. But one thing has stayed the same through the eons … analytics.
It’s safe to say Data Warehouse Analytics is the workhorse of our age, ever grinding and churning out aggregation after aggregation.
Having worked in and around the data space, analytics, and Data Warehouses lo these many years, I can tell you nothing has changed.
The business still needs dashboards.
A multitude of analytics still need to be calculated.
Analytics are still hard to get right.
Another thing that hasn’t changed much? People want their data fast and Johnny-on-the-spot. No waiting. If they have to wait, they assume something is wrong.
Latency. It’s still a big deal and a hard problem to solve at scale.
This blog is brought to you by Firebolt. It’s a fast cloud data warehouse I’m currently exploring for low-latency use cases, and it’s able to deliver sub-second response times over TBs of data.
Latency + Analytics + Data Warehouse
Latency is a simple problem to understand, one that has been around for almost two decades, but it is not an easy one to solve.
Find yourself a Data Warehouse producing analytics, and you will find analysts, data engineers, and business intelligence folk all fighting latency problems at some point.
The real question is, do you have a consistent process and design to deal with it? Most likely, it is some haphazard approach, throwing gum at the wall to see what sticks.
The Concept of Latency in Data Analytics
We should probably stop for a minute and review what we mean by latency problems inside a Data Warehouse.
Most often, the entire reason behind creating and accumulating data inside a Data Warehouse is to do analytics on that data: create dashboards, join different datasets, and generally draw insights from large-scale datasets that drive business decisions.
Of course there is a lot of work that goes into data models, ETL, and business logic massaging of the data to get it to a certain point where it becomes useful.
But, that isn’t the end. Those insights have to be presented to the consumers, typically in the form of reports and dashboards. This is where the latency problem comes in, and it is a real problem.
Simply put, the consumers of analytics aren’t willing to wait around even 15 seconds for a report or dashboard to load. They are busy, have a lot to do, and, like it or not, will simply ignore or mistrust the data if it isn’t snappy, quick, and easy to consume.
Users expect a seamless interaction with analytics.
They are not willing to wait more than a few seconds for data to load.
Analytics, which at their core are aggregations over datasets, are time-consuming and resource-intensive.
To prove that data latency is still a serious problem in analytics, I simply raised the question on LinkedIn: how do people solve it?
Not surprisingly, the question drew a flood of comments and interactions within a single day. The comments made it clear this is a problem people struggle with, and one people struggle to solve. There were all sorts of ideas …
If this still doesn’t make any sense to you, let’s look at a simple example. This query is probably run millions of times a year in some form or fashion.
Simply calculating dollars of sales, by some product, over some time period.
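Something like this minimal sketch, where the table and column names (sales, product_id, sale_amount, order_date) are hypothetical:

```sql
-- Total sales dollars for one product over a date range.
-- Table and column names are made up for illustration.
SELECT product_id,
       SUM(sale_amount) AS total_sales
FROM sales
WHERE product_id = 42
  AND order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY product_id;
```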
There could not be an easier query, and the query itself isn’t really the problem. The problem is when you have millions or billions of records spanning many years and someone opens a dashboard or report … and then nothing happens for a minute or two.
They close it out, get frustrated, and go tell the Data Platform team that nothing is working, and ask why we are spending all this money if we can’t get a simple answer like this in under a few seconds.
Definition of latency in the context of data processing and analytics.
I think it’s important to get a “lay of the land,” so to speak, of latency in the context of what Data Engineers usually deal with when it comes to analytics.
Types of latency:
Data latency
Compute latency
Query latency
Network latency
We could probably argue back and forth about the nuances of latency when it comes to analytics. Still, generally speaking, these four main areas summarize where most of the problems lie.
Of course, each one of these areas deserves its own article and deep dive, but we can cruise past them with a 10,000-foot view, which is helpful for understanding the latency problems analytics solutions inside Data Warehouses usually struggle with.
Diving into latency
Data latency
Let’s start with data I/O latency, which I will also refer to as file I/O. Why? Because it’s common for many of the current Modern Data Stack Data Warehouses to be built on file-based systems like Delta Lake, Iceberg, etc.
The differences between Data Lakes, Lake Houses, and Data Warehouses are more in the eye of the beholder and have a lot to do with data modeling and implementation, rather than the underlying technology.
When processing large datasets stored in some file system and file type, say Parquet files, all of a sudden file I/O becomes a major factor. Three attributes in particular drive this cost (see the sketch after this list):
File size
Type of files
Number of files
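A common mitigation is compacting many small files into fewer large ones. Delta Lake, for instance, ships an OPTIMIZE command for exactly this; here is a minimal sketch, assuming our hypothetical sales table is stored as a Delta table:

```sql
-- Compact small Parquet files into larger ones, and cluster the data
-- by a commonly filtered column so queries can skip irrelevant files.
OPTIMIZE sales
ZORDER BY (order_date);
```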
Compute latency
Next on the list is compute latency. This simply refers to the latency of using CPU and memory to literally compute the analytics in question.
Most analytics, reports, and dashboards are at some point doing aggregations of many different kinds. This work must be done by the CPU, and on large datasets, simply calculating a SUM of sales over a long time series, like in the example above, will take time, depending on the compute architecture.
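When compute latency bites, the first step is usually to look at the query plan. Most warehouses support some form of EXPLAIN (syntax and output vary by engine); a sketch against our hypothetical sales table:

```sql
-- Ask the engine how it plans to execute the aggregation:
-- full scan vs. pruned scan, aggregation strategy, parallelism, etc.
EXPLAIN
SELECT SUM(sale_amount) AS total_sales
FROM sales
WHERE product_id = 42
  AND order_date BETWEEN '2020-01-01' AND '2023-12-31';
```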
Query latency
This is probably one of the most common issues in Data Warehouse analytics: simply put, bad queries.
It’s hard to overcome bad queries with hardware and other antics like caching data; bad engineering practices can undo even the best-laid plans and architectures.
Understanding how to implement different query optimizations, like when to filter, when to join, indexes, and a plethora of other practices, will have a major impact on the latency of Data Warehouse analytics.
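To make that concrete, here is a hedged before/after sketch using the same hypothetical sales table plus a made-up customers table; the idea, not the exact syntax, is the point, and specifics vary by engine:

```sql
-- Worse: wrap the filter column in a function, which can prevent the
-- engine from using indexes or partition pruning, and filter late.
SELECT c.region, SUM(s.sale_amount) AS total_sales
FROM sales s
JOIN customers c ON c.customer_id = s.customer_id
WHERE YEAR(s.order_date) = 2023
GROUP BY c.region;

-- Better: a plain range filter on the raw column, applied before the
-- join, so far fewer rows are scanned, shuffled, and joined.
SELECT c.region, SUM(s.sale_amount) AS total_sales
FROM (
  SELECT customer_id, sale_amount
  FROM sales
  WHERE order_date >= '2023-01-01'
    AND order_date < '2024-01-01'
) s
JOIN customers c ON c.customer_id = s.customer_id
GROUP BY c.region;
```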
Network latency
Last but not least, one of the hardest problems to overcome is network latency: simply moving all those bits and bytes over the network, and possibly between nodes in distributed computing environments.
This is a problem that has been hassling Data Engineers working on analytics over large datasets for decades, and the solutions aren’t always simple, often requiring multiple approaches.
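One broadly applicable tactic is simply to move fewer bytes. On columnar storage, selecting only the columns a dashboard needs cuts both the disk reads and the data crossing the network; a small sketch with the same hypothetical table:

```sql
-- Drags every column of every matching row across the wire.
SELECT *
FROM sales
WHERE order_date >= '2023-01-01';

-- Moves only the two columns the dashboard actually uses.
SELECT order_date, sale_amount
FROM sales
WHERE order_date >= '2023-01-01';
```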
Wait, there’s more.
Unfortunately, this isn’t the end of latency problems inside Data Warehouses; the list of critical paths leading to problems is long and varied.
Lest you think I’m ignorant of other major concerns, here is a list of more …
Wrapping it up with possible solutions.
Why is latency a critical issue in data warehousing and analytics? Because, like it or not, we data teams rise and fall with the trust the business has in us, our data, and the analytics and products we produce.
Nothing will sour the mood, and the way we are viewed, like a poor user experience and latency in what we build.
It’s no overstatement to say that latency seriously impacts business operations. Product and marketing teams, as well as C-suite stakeholders, rely on data to make decisions.
It’s also clear that the data world is headed towards using more near-real-time data analysis. Latency problems can make this impossible.
What is the solution to latency issues?
It’s hard to come up with a one-size-fits-all solution to latency since the problems are very context-specific: What type of Data Warehouse are you running? What technologies are in use? What are the front-end tools, and how much data is being processed?
There are two sets of possible solutions to the above problems. First, there are engineering and software principles that can be applied to alleviate as much as possible some of these latency issues. Second, several companies have done a good job of solving these latency issues on large datasets, such as Firebolt, ClickHouse, Rockset, and Imply.
What are some of the practical steps that you can put in place immediately?
Data modeling work
Advanced data caching strategies
Indexing
Use of faster hardware and scalable cloud solutions
Pre-aggregating datasets and rollups (see the sketch after this list)
Technical debt deep dives and fixes
Better architecture
Query optimization
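To illustrate one of these, here is a minimal sketch of pre-aggregation, assuming the same hypothetical sales table; the refresh schedule and incremental strategy are left to you:

```sql
-- Roll the raw fact table up to one row per product per day.
CREATE TABLE daily_sales_rollup AS
SELECT product_id,
       CAST(order_date AS DATE) AS sale_day,
       SUM(sale_amount)         AS total_sales,
       COUNT(*)                 AS order_count
FROM sales
GROUP BY product_id, CAST(order_date AS DATE);

-- Dashboards now scan the small rollup instead of billions of raw rows.
SELECT sale_day, total_sales
FROM daily_sales_rollup
WHERE product_id = 42
  AND sale_day BETWEEN '2023-01-01' AND '2023-12-31';
```

Many engines bake this same pattern in as materialized views or aggregating indexes, so the rollup stays fresh automatically instead of being rebuilt by hand.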
In future articles, hopefully, we can start to dive in-depth into more of the problems AND solutions around latency in Data Warehouses and analytics, with detailed content and thoughts. With luck, we will see improvement in these areas in the near future.