Hello readers. Today I’m bringing you a guest post from an anonymous Rust developer. He’s no stranger to Data Engineering Central.
This is someone I know personally, have been on car rides with, camped in the woods with, sat around a fire with. They are one of the smartest and most wonderful programmers I have ever met.
Enjoy.
If you're new to window functions, you might be tempted to Google the term to see what pops up. When I do so, and see the relevant Wikipedia article, I get the following definition:
In signal processing and statistics, a window function (also known as an apodization function or tapering function) is a mathematical function that is zero-valued outside of some chosen interval.
Whoops, wrong page. They have a different one for the SQL variant (which is why we're here)... and yet, maybe we can take something away from this definition. Let's look again at that statement:
... function that is zero-valued outside of some chosen interval
And therein lies the gist of it: a window function is aggregating; and, in SQL statements, this can be done on the row level, so maybe by adding one and one you can guess that this is a tool for row-level aggregation.
Is that the whole story, though? Stick around to learn more about one of my favorite tools for building meaningful queries.
The naive approach
First, to know what a window function is, we should look at the kinds of problems they solve. For this, I'll be turning to the trusty Northwind database, and in particular, I'm interested in the orders table.
To start, a lay of the land:
SELECT
min(order_date), max(order_date), count(*), count(distinct customer_id) as distinct_customers
FROM orders;
Over a roughly two-year period, we can see 830 orders placed by 89 customers - surely, at least one of those customers has placed repeat orders. We might want to see on a per-customer basis how they rank in terms of order frequency, and this could be done in a number of different ways; for now, we'll use temp tables to collate our information.
First, we'll need to know the count by customer_id
:
select customer_id, count(*)
from orders
group by customer_id
order by count(*)
... and so on.
As can be seen above, the same count could be shared between multiple customers. Also, scanning all of the results, you might even see gaps in the counts:
...So, clearly, we can't just use the count as a rank since we jump around a bit. So, the first thing we need is to know how the counts rank; and, since our ranks are probably assigned in terms of descending quantity, we'll also need to ensure our largest count gets the lowest rank figure.
CREATE TABLE count_rankings (
ranking serial primary key,
quantity int
);
INSERT INTO count_rankings (quantity)
SELECT distinct count(*)
FROM orders
GROUP BY customer_id
ORDER BY count(*) DESC;
If you were to peek at the contents of our count_rankings
table, you'd notice something like:
With this as a lookup table, we can then cross-reference our row-level counts as we consider each customer:
Surely, there's an easier way (?)
...and you wouldn't be wrong. But first, let's think about why we had to break it up into multiple steps.
Basically, we needed state, but a SQL query is by its nature stateless. And, by "state" I'm referring to some kind of running tally or accumulator to keep track of how one count compares against the next. The lookup table count_rankings
in the above example served as this state, but it's pretty clear to see that this is a tedious way of figuring out the ranking.
Alternatively, thinking of this problem in terms of code (e.g. Python), you might approach this as follows:
Create a dictionary of counts (value) by customer ID (key), by looping over all of the orders
Create a set of counts (or create a list and deduplicate it)
Sort the counts list in descending order
Update the dictionary of counts so that the value is a tuple of a) count, and b) the index of that count in the sorted list
Any way you look at the problem, it seems like it shouldn't be possible to do it with a one-liner... right?
Enter: DENSE_RANK
:
The DENSE_RANK() assigns a rank to every row in each partition of a result set. Different from the RANK() function, the DENSE_RANK() function always returns consecutive rank values.
Wow, that's a mouthful. Note from the description that the RANK()
function doesn't give us consecutive values, though if you had approached this problem naively you may have been led to believe it would.
Also, there are a few keywords I would like to pluck out from that description:
rank
partition
consecutive
The first of those, "rank", is what we're attempting to accomplish.
The second, "partition", describes how we break apart our initial results to determine how things are ranked. For now, since our rankings apply over the entirety of the order history, we won't further partition the results.
Finally, we wanted to describe a sequential ("consecutive") ranking of our counts. Two customers with the same order count are ranked identically using the
DENSE_RANK()
function; had we wanted to give unique ranks, we might instead have gone with theROW_NUMBER()
function.
So, with all that in mind, let's first identify our base query:
SELECT customer_id, count(*) AS count_of
FROM orders
GROUP BY customer_id;
Adding a ranking value is as simple as adding a new projection column with a call to the appropriate function:
SELECT
customer_id,
count(*) AS count_of,
DENSE_RANK() OVER (ORDER BY count(*) DESC) AS ranking
FROM orders
GROUP BY customer_id;
This gives us identical output to what we got with our naive approach, and with far less effort. Also, it's probably a lot easier to follow along with than the multi-statement approach from earlier.
Partitioning
Because this example skipped over partitioning, I'd like to revisit that by modifying our query's requirements a bit: now that your Pointy-haired Boss (PHB) knows that you can rank your orders, he might next suggest we figure out how to rank salespeople within a team (defined by a shared manager).
As before, we'll need to start with a base query that defines these grouping characteristics:
SELECT
CASE
WHEN employees.reports_to IS NULL THEN '(None)'
ELSE TRIM(CONCAT(managers.first_name, ' ', managers.last_name))
END AS manager,
CONCAT(employees.first_name, ' ', employees.last_name) AS salesperson,
count(*) AS count_of
FROM orders
INNER JOIN employees
ON orders.employee_id = employees.employee_id
LEFT JOIN employees managers
ON managers.employee_id = employees.reports_to
GROUP BY
employees.first_name,
employees.reports_to,
employees.last_name,
managers.first_name,
managers.last_name
ORDER BY
managers.first_name,
managers.last_name,
employees.first_name,
employees.last_name;
This one was a bit more work; I've added in employee names for both the manager and the salesperson, and because some salespeople don't actually have anyone they report to it was necessary to account for that.
So, to illustrate why we need to partition our data, let's first just slap a DENSE_RANK
call onto this and see what happens:
SELECT
CASE
WHEN employees.reports_to IS NULL THEN '(None)'
ELSE TRIM(CONCAT(managers.first_name, ' ', managers.last_name))
END AS manager,
CONCAT(employees.first_name, ' ', employees.last_name) AS salesperson,
count(*) AS count_of,
DENSE_RANK() OVER (ORDER BY count(*) DESC)
FROM orders
INNER JOIN employees
ON orders.employee_id = employees.employee_id
LEFT JOIN employees managers
ON managers.employee_id = employees.reports_to
GROUP BY
employees.first_name,
employees.reports_to,
employees.last_name,
managers.first_name,
managers.last_name
ORDER BY
managers.first_name,
managers.last_name,
employees.first_name,
employees.last_name;
In case it isn't immediately apparent, we're pitting each salesperson in the company against each other.
And, let's be clear - Andrew Fuller is too important to be compared to the likes of his subordinates, so of course your PHB's going to push back.
This is where partitioning comes in; we want to divide these results up so that only those within the same team are ranked against each other. I'll start by modifying the DENSE_RANK's OVER
clause to include the partitioning scheme:
DENSE_RANK() OVER (PARTITION BY managers.employee_id ORDER BY count(*) DESC)
If you made this change and ran it right away, you'd be yelled at for referencing a field that isn't accounted for in our statements GROUP BY
clause. I'll make that change now, and the resulting query should transform into:
SELECT
CASE
WHEN employees.reports_to IS NULL THEN
'(None)'
ELSE TRIM(CONCAT(managers.first_name, ' ', managers.last_name))
END AS manager,
CONCAT(employees.first_name, ' ', employees.last_name) AS salesperson,
count(*) AS count_of,
DENSE_RANK() OVER (PARTITION BY managers.employee_id ORDER BY count(*) DESC)
FROM orders
INNER JOIN employees
ON orders.employee_id = employees.employee_id
LEFT JOIN employees managers
ON managers.employee_id = employees.reports_to
GROUP BY
employees.first_name,
employees.reports_to,
employees.last_name,
managers.employee_id,
managers.first_name,
managers.last_name
ORDER BY
managers.first_name,
managers.last_name,
employees.first_name,
employees.last_name;
Not everything's a contest in life
...and not every window function is a DENSE_RANK
. As mentioned earlier, there's also the commonly used ROW_NUMBER
, which works similarly to DENSE_RANK
except that it doesn't reuse the same number - if two of those salespeople had shared the same count of sales, they would still have been ranked differently were we to swap out one function for the other.
So far, everything you've seen is based on what PostgreSQL can do. Window function support may differ from one platform to the next, and some may even be vendor-specific. You can see the list of Postgres-supported window functions in their docs.
BONUS: Also in the same vein, but not exactly a traditional window function, is the aggregate function ARRAY_AGG
, which will effectively turn a column of data into a list value:
SELECT
CASE
WHEN employees.reports_to IS NULL THEN
'(None)'
ELSE TRIM(CONCAT(managers.first_name, ' ', managers.last_name))
END AS manager,
CONCAT(employees.first_name, ' ', employees.last_name) AS salesperson,
count(*) AS count_of,
ARRAY_AGG(DISTINCT orders.customer_id) AS customers
FROM orders
INNER JOIN employees
ON orders.employee_id = employees.employee_id
LEFT JOIN employees managers
ON managers.employee_id = employees.reports_to
GROUP BY
employees.first_name,
employees.reports_to,
employees.last_name,
managers.employee_id,
managers.first_name,
managers.last_name
ORDER BY
managers.first_name,
managers.last_name,
employees.first_name,
employees.last_name;
Last thoughts
Hopefully, it's clear at this point that you can save a LOT of work by making good use of window functions where appropriate. It gives power to SQL developers that have in the past only been available either through great effort or would have been handled in post-processing by developers of downstream applications.
Because they're stateful, there is a bit more work for the query planner to handle, but if that's important to you then you should benchmark for yourself the various ways you could solve such problems.
All things considered, this is a powerful tool and one that should hopefully bring both brevity and clarity to your SQL.
great way to explain window functions, thanks for this.