Struggling to get the most out of that SQL query? Wondering why it takes so long for that simple query to run? Getting angry emails from users and tired of being blamed for every slow report or dashboard?
If you aren't indexing your data, or you're not using the indexes you have, you may be leaving massive performance gains on the table (is that a pun? I'll let you decide...).
Indexes have been a foundational pillar of data teams' work for decades, yet if you scroll the endless LinkedIn or Twitter tech universe, I'd hazard a guess it's pretty much impossible to find folk talking about indexes.
Today, we set the record straight and crack the book open for those who've been wondering what indexes are all about, what they are, and how they are used. Hopefully, you'll leave this article with a broader knowledge of indexes and their uses; that's the plan!
This blog is brought to you by Firebolt. It’s a fast cloud data warehouse I’m currently exploring for low-latency use cases, and it’s able to deliver sub-second response times over TBs of data.
What even is an index?
For those old enough to remember, phone books were a popular way of letting you know how to get ahold of the people in your community. The "White Pages" gave residential numbers, while "Yellow Pages" provided a catalog of business numbers and related advertisements.
Consider what you might see in a typical white-page listing:
Last name
[Optional] First name(s) and/or initials
Street address
Local phone number (area code was already known by virtue of which book/section you were reading)
Before the internet, if you wanted to find a person's contact information, this was the tool. You would query the book by last name, and (fingers crossed) they had a distinctive name that was easily teased out. What you were dealing with was a table of people indexed by name.
What if, though, you were to start with a phone number and want to know who it belonged to? With a normal phone book, this was a herculean task - the phone numbers weren't sorted numerically, so to match a number to its owner you could spend days scanning from top to bottom until you eventually found the corresponding name. This technique, known in the database world as a scan, is the process of working top-down through your data to naively search for something.
So, to reiterate, when you're searching your data using indexed keys, you're going to get your data very quickly. Without that index, it's monumentally more difficult to track down the matching data.
This is why indexes matter in SQL and the “data world.”
Indexes are critical to operations in most SQL-based systems that involve any data of size. If you don't understand and use indexes, then as your data grows, trouble will inevitably start to show in the system.
If you think about finding a piece of data, say a customer record, within millions or billions of records, how can you quickly find what you need? Your options are to either scan all the records till you find what you want, which can take a long time, or use an index to jump straight to the record you need to read.
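To make that concrete, here's a minimal sketch, assuming a hypothetical customers table: without an index on the lookup column, the engine has to scan every row; with one, it can seek straight to the match.

```sql
-- Hypothetical customers table. Without an index on email, this lookup
-- forces the engine to scan every row in the table.
SELECT customer_id, first_name, last_name
FROM customers
WHERE email = 'jane.doe@example.com';

-- With an index on the lookup column, the engine can jump straight to the
-- matching row instead of scanning everything.
CREATE INDEX idx_customers_email ON customers (email);
```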
Ok, so we can all agree indexes are critical to SQL operations if you need performance, and we all do. Next, let’s talk about some common types of index you’ve probably heard of or will hear of, in your journey through SQL and indexes.
Clustered
Non-Clustered
These are the two most basic index types that you should understand in the context of most SQL-based indexes.
Clustered Indexes
Of course, there are nuances between different database systems, but there are two central factors that make clustered indexes useful tools in the belt of data engineers and others working on SQL-based systems.
Sorting
Uniqueness
As for sorting, the rows of data are physically stored on disk in the order of the index keys. This means the data is stored in a sequence that matches the indexed column, which is very helpful when you're searching for a piece of data alongside the rows related to it - range scans and ordered reads come almost for free.
Secondly, clustered indexes will usually be unique, or at least have a high degree of uniqueness. Many database systems will automatically create a clustered index on the "primary key" of a table.
Clustered Indexes In-Depth
In our white page example, we saw that the last name is the sort sequence and the data point we initially used to look up a person's record. Unfortunately, some last names aren't unique, and since the first name isn't guaranteed to be present or otherwise unique, you might get many hits for a single lookup.
The situation's a bit better in the Yellow Pages. A local business name will be unique by virtue of a state's business registry, and so if you know a business's name you're virtually guaranteed a 1:1 match. As with the White Pages, records are sorted alphabetically by name.
You could store both White and Yellow Page data as tables, but only the Yellow Pages allow you to use the name as a primary key, which must be a unique data point. Continuing with the Yellow Pages illustration, that business name functions as not only a primary key but also a clustered index.
What makes an index "clustered" is the fact that it is stored within the table. When you search a clustered index for a record, you're also looking directly at the record you're searching for. This index is created for you automatically when you define the table's primary key, and only one clustered index can exist per table.
Sidebar: Clustered indexes don't need to be unique, and don't require a primary key, but it's considered bad form to have non-unique clustered indexes.
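To put the Yellow Pages analogy into SQL, here's an illustrative table where the business name is the primary key. In engines like SQL Server, that primary key becomes the clustered index by default, so the rows themselves are stored sorted by business name (exact defaults and syntax vary by engine).

```sql
-- Illustrative only: the primary key on business_name becomes the clustered
-- index by default in engines like SQL Server, so the table's rows are
-- physically ordered by business name.
CREATE TABLE yellow_pages (
    business_name  VARCHAR(200) NOT NULL PRIMARY KEY,
    category       VARCHAR(100),
    street_address VARCHAR(200),
    phone_number   VARCHAR(20)
);
```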
Non-clustered indexes
What if you were to start with a phone number, and wanted to know who it belongs to? A traditional phone book wouldn't be able to help you with this, since it's only indexed by name. Still, phone numbers are unique and owned by a single account owner, so in theory, you could index by number rather than name.
A reverse-lookup service could tell you this information. These weren't commonly available (likely for privacy reasons), but with them, you could look up the owner of a phone number.
Supposing this lookup table contained a phone number and the account owner's name, you might then use that information to go back to the white pages to track down their primary listing and get a street address.
Stored separately from the data.
There can be multiple non-clustered indexes per table.
Data access is indirect.
They add overhead to the database system.
In this scenario, the reverse-lookup service is serving as a non-clustered index - one that is stored separately from the records and refers back to the originals via indirection.
Because this lookup table is stored outside of the source table, there is additional work involved in maintaining the lookup data any time a source record it refers to changes, and the two-step lookup is also a bit more work to perform than the direct search you get with clustered index queries.
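As a rough sketch of that reverse-lookup idea, assuming a hypothetical white_pages table (the NONCLUSTERED keyword is SQL Server syntax; most engines would just use CREATE INDEX):

```sql
-- A non-clustered index acting as the "reverse lookup": stored separately
-- from the table, it maps phone numbers back to the underlying rows.
CREATE NONCLUSTERED INDEX ix_white_pages_phone
    ON white_pages (phone_number);

-- A lookup by number can now seek the index instead of scanning the table.
SELECT last_name, first_name, street_address
FROM white_pages
WHERE phone_number = '555-0142';
```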
Heaps
Finally, if we have a table that isn't backed by a clustered index, it's considered a heap - an unsorted collection of records. Or rather, any ordering is at the discretion of the database engine.
If order isn't important and a primary key doesn't make sense, you might intentionally choose this data structure, but the only way to get a specific record out of this structure is to use the expensive table scan mentioned earlier.
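A quick sketch of what intentionally choosing a heap might look like, assuming a hypothetical append-only log table:

```sql
-- No primary key, no indexes: just an unsorted pile of rows. Fine for
-- append-heavy logging, but any lookup by source or message means a full
-- table scan.
CREATE TABLE event_log (
    logged_at TIMESTAMP,
    source    VARCHAR(100),
    message   VARCHAR(4000)
);
```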
When to (or why) index?
Ok, so we now know a little bit more about indexes and some of the most common types, but this doesn't tell us when or why we would need an index in the first place. One of the greatest reasons data professionals need to understand and use indexes is the sheer volume of data we deal with.
When it comes to complex analytics, think about all the complicated queries that get written, with many JOINs and WHERE clauses, to say the least. Indexes are important for several reasons.
Improved Query Performance
Efficient Data Access
Optimized for Specific Queries
Reduction in Resource Usage
Impact on the Execution Plan
Any data team knows well the importance of indexes when producing analytics from data warehouses; it's almost impossible to provide reasonable performance without them. Indexes are critical to providing fast and efficient queries.
Let’s look at an example to help.
Suppose your resident PHB comes up to you and wants to know how many reciprocating dingle arms were shipped to Dubai in 2023 by Bob, one of your lazier salespeople. Turns out you're the first person to ask this question of the database, and your DBA (who's out on vacation, but left you admin rights before he bailed) never crafted a suitable index.
Your query churns for a small eternity, since your company is one of the leading global suppliers of parts for turboencabulators and there's a lot of data to comb through.
The offending query:
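Something along these lines - the shipments table and its columns are hypothetical stand-ins for your actual turboencabulator schema:

```sql
-- Hypothetical schema: how many reciprocating dingle arms did Bob ship
-- to Dubai in 2023?
SELECT SUM(quantity) AS dingle_arms_shipped
FROM shipments
WHERE product_name     = 'Reciprocating Dingle Arm'
  AND destination_city = 'Dubai'
  AND salesperson      = 'Bob'
  AND shipped_date >= '2023-01-01'
  AND shipped_date <  '2024-01-01';
```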
Do I need an index?
Whether you need to play with indexes is a subjective decision. If you're doing a one-off analysis and can afford to walk off and grab a coffee while your query runs, you're in a very different position than someone who desperately needs an answer on the spot.
Also, indexes aren't free - they need to be kept in sync with source data, and they incur a write-time performance cost on top of the extra storage they require. If you're writing often and querying infrequently, you might be creating more problems than you're solving by adding indexes.
You might also get away with alternative solutions, such as periodic report generation or other caching mechanisms when real-time accuracy isn't needed. This isn't a tool that should be reached for every time you lose patience with a query, as it potentially falls into several traps (YAGNI, bikeshedding, premature optimization).
For our current illustration, though, we've decided it's worth it. You're the DBA for the day, and in your official capacity, you've decided it's the right call.
What do I need in my index?
First, let's distinguish between the two things that the query is doing:
Selection - in the WHERE clause, we have constraints that limit which records are returned.
Projection - the SELECT clause identifies the data returned by the query.
Both of these needs are addressed separately when crafting our index:
For indexing purposes, if it's already accounted for in the Selection columns, it doesn't need to be added to the Projection columns.
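Sticking with the hypothetical shipments table from the sketch above, one way to express that index is to key it on the selection columns and carry the projected quantity along as an included column (INCLUDE is SQL Server syntax; other engines have their own covering-index equivalents):

```sql
-- The selection columns become the index keys; the projected quantity
-- rides along as an included column so the query can be answered from
-- the index alone.
CREATE NONCLUSTERED INDEX ix_shipments_dingle_arms
    ON shipments (product_name, destination_city, salesperson, shipped_date)
    INCLUDE (quantity);
```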
What we're effectively creating here is a logical table that contains everything we need to speed up our query. It's a bespoke index tailored to this particular query. Note how it corresponds to our original query's columns.
Implicitly, this index will also have a pointer back to the original record, so you basically get the primary key for free in your projection columns.
We could include additional columns in either the selection or projection portions of the index, and in cases where you believe this index might have value for other queries if it just included that extra bit of data, it might be the right move.
Get too greedy, though, and you add to the incurred penalties described earlier, and you might also convince the query planner that it's not worth using your index.
That's right - even after creating your index, the database engine might not even end up using it. It gets a lot of discretion in deciding the "best" way to query for your results. There are ways to improve this situation if it occurs, but there's also a good reason why DBAs are well-paid professionals.
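One way to find out what the planner actually chose is to ask it for the execution plan; the exact syntax varies by engine (this is the PostgreSQL/MySQL flavor, while SQL Server offers graphical plans and SET SHOWPLAN options):

```sql
-- Ask the engine how it intends to run the query. If the plan still shows
-- a full table scan, the optimizer has decided your shiny new index isn't
-- worth using for this query.
EXPLAIN
SELECT SUM(quantity)
FROM shipments
WHERE product_name     = 'Reciprocating Dingle Arm'
  AND destination_city = 'Dubai'
  AND salesperson      = 'Bob'
  AND shipped_date >= '2023-01-01'
  AND shipped_date <  '2024-01-01';
```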
That's fine and all, but my query's more complex!
Okay, so maybe you're doing some fancy joins and have three pages of constraints in your WHERE clause. The situation is still the same - you need to identify all of the selection criteria and the projection values in play. Consider the below query for a better-fleshed-out illustration (this time using the infamous Northwind schema):
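A sketch of the kind of query in question, against Northwind's Customers and Employees tables (the join condition and filter values here are illustrative):

```sql
SELECT c.ContactTitle,
       c.Address,
       e.LastName,
       e.FirstName
FROM Customers c
JOIN Employees e
    ON e.City = c.City
WHERE c.Address   LIKE '%Str.%'
  AND c.Phone     LIKE '030%'
  AND e.Address   NOT LIKE '%Box%'
  AND e.HomePhone IS NOT NULL
  AND e.HireDate  >= '1994-01-01';
```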
Here, we have two datasets (Customers and Employees), each of which brings something different to the table. Let's break it down for each:
Customers
Selection
Address
City
Phone
Projection
ContactTitle
Address (but since it's already in Selection, we will ignore this)
Employees
Selection
Address
City
HomePhone
HireDate
Projection
LastName
FirstName
So, based on the above, we have two potential indexes:
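Expressed as DDL (again using SQL Server's INCLUDE syntax as an illustration), each table gets a covering index keyed on its selection columns, with its projected columns included:

```sql
CREATE NONCLUSTERED INDEX ix_customers_address_city_phone
    ON Customers (Address, City, Phone)
    INCLUDE (ContactTitle);

CREATE NONCLUSTERED INDEX ix_employees_selection
    ON Employees (Address, City, HomePhone, HireDate)
    INCLUDE (LastName, FirstName);
```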
Understanding Use Cases
Understanding the use cases and query access patterns is probably one of the most important steps when trying to come up with indexes to support a large suite of analytics and queries running on a data warehouse.
Without understanding common and overlapping data access patterns, it’s almost impossible to be smart about designing indexes to support those needs.
What tables are used the most?
What datasets are joined?
Where are the common filters applied?
One of the best things you can do is to inspect all the queries being run and compare them. Find the commonality between them; in many cases, there will be lots of overlapping datasets used for similar joins and filters, and this is where you should start.
Always remember, though, that too much of a good thing can be a bad thing. Indexes add overhead and maintenance, so be careful not to go overboard.
Trailing thoughts
Performance tuning is about so much more than indexes. Sometimes, all it takes is rethinking your query and whether you're asking the right questions of the database. However, when you're after repeatability and performance in your querying, indexing is one of the most impactful tools at your disposal.
It's critical to have a basic understanding of the most common types of indexes used across different types of databases and tools, starting with clustered and non-clustered indexes. It's also clear that good index creation and management is half art and half science. Significant time must be put into understanding the data itself and its common access (query) patterns.
Armed with this information, in most cases, it becomes very clear where indexes can help boost performance.