Doubling down on performance

As a follow-up to https://tuomaspelkonen.com/2023/07/26/building-my-own-analytical-dbms/ I’ve been working on improving the performance even further. I’ve made some good progress in the benchmark queries. I’ve practically caught up (or blown away) the fastest DB running on c6a.metal despite running only on a Macbook Pro M1. Since my solution is starting to be as fast on inferior hardware as all the other DBs on superior hardware, I might have to make this a real thing, but for now it’s a still a performance exercise targeting a specific benchmark.

Benchmark numbers

https://benchmark.clickhouse.com/

Numbers from the queries where my solution was not able to beat the fastest DB running on c6a.metal one month ago. For comparison I added numbers from Clickhouse (self-proclaimed fastest and most resource efficient open-source database for real-time apps and analytics) on my Macbook M1 Pro. Getting consistent results for query 32 with Clickhouse is hard since Clickhouse often uses more memory than the 16GB my machine has to offer.

Query	My solution 7/23 M1 Pro	My solution 8/23 M1 Pro	Best Benchmark c6a.metal	Relative performance	Clickhouse M1 Pro (relative to mine)
18	460 ms	418 ms	440 ms	0.95x	2.27 s (5.4x)
30	132 ms	94 ms	90 ms	1.04x	383 ms (4.1x)
31	211 ms	138 ms	140 ms	0.99x	509 ms (3.7x)
32	730 ms	547 ms	510 ms	1.07x	> 6 s (> 10x)

Below are benchmark numbers from some of the queries where my solution is clearly faster than the other databases in the benchmark

Query	My solution M1 Pro	Best Benchmark c6a.metal	Relative performance	Clickhouse M1 Pro (relative to mine)
12	31 ms	140 ms	0.22x	454 ms (14.7x)
14	53 ms	150 ms	0.35x	497 ms (9.4x)
15	61 ms	80 ms	0.76x	385 ms (6.3x)
16	193 ms	300 ms	0.64x	1.67 s (8.7x)
17	19 ms	100 ms	0.19x	779 ms (41x)
20	47 ms	80 ms	0.59x	1.03 s (22x)
26	9 ms	20 ms	0.45x	213 ms (24x)

Performance Improvements

In the last month I’ve tried to squeeze everything I can out of the performance. I’ve probably tried hundreds of different things and found a few which gave me good results. Below I list the things that made the performance improve the most.

Minimizing Dynamic Dispatch

Dynamic dispatch (the process of selecting which implementation of a polymorphic operation to call at run time) comes with a performance cost when the dispatch is done millions and millions of times. If that is the case, why use dynamic dispatch at all? The reason to use dynamic dispatch is to be able to not know the exact type of the data when making the call. The reason to not know the type of the data is to avoid having the compiler generate code for every single possible type combination.

Using too many generic types (templates in C++) quickly brings the Rust compiler to a complete halt (I managed to do this with 4 generic types and 10 options for each type). So using generics all the time everywhere is unfortunately not an option. Using Rust enums could be another option to avoid dynamic dispatch, however, match can be even slower than the dynamic dispatch. Additionally dynamic dispatch allows to use arrays (behind the boxed trait object) which are more performant than vectors. Using arrays with enums does not make sense since all the variants in the enum will be the same size.

Since dynamic dispatch is still sometimes the right choice, it makes sense to minimize the number of calls. Previously I invoked dynamic dispatch for every column to populate the group-by and aggregate data for each row. I changed this to be two calls: One for all group-by data columns and one for all aggregate data columns for each row. The reason not to try to do both of these in a single call is again the generic code generation limit. Keeping group-by columns and aggregates columns separate reduces the number of different generics types and keeps compilation times reasonable.

The more substantial reduction in using dynamic dispatch was to group-by and aggregate multiple rows at once. This, however, required the data to be copied to temporary data structures before it was copied to the final structures. Trying to copy the data directly to the final structures would again hit the generic code generation issues. In this case a single fixed sized array for group-by data and a single fixed sized array for aggregate data are used as these temporary data structures. Copying data in and out of the same arrays over and over again turned out to be quite performant and extra copy step was well worth it.

Doing less

The classic saying is that the fastest code is the one which never runs. I was able to do use this idea when populating the values for per row mapping. Since the data is compressed in various ways, I need to build a row to value mapping when multiple columns are involved. To do this I populate an array of size 2¹⁹(the total number of rows in block) with the values for each row. Some columns represent boolean values using 0 and 1. Previously I initialized an array with zeros and then wrote 400000 zeros in it. This was obviously unnecessary and it’s enough to populate the array with only non-zero values. Similar example is when the most common value is not a zero and majority of the rows have this value. Initializing the array with the most common value first and only populating the other values made things a bit faster in some cases. Filling a large array with non-zero values is definitely slower than filling it with zeros so this strategy has its trade-offs.

Faster code for the worst case

The worst case for a group-by is when all the values are unique and no rows can be combined. This makes the query very slow because all the rows are kept in memory. The benchmark queries 31 and 32 are examples of this. Despite this issue, building the group-bys and aggregates is very simple since each value gets its own row. Keeping this is mind I was able to write a very simple dedicated loop for this scenario to avoid everything unnecessary. This improved things a lot since the loop is run 100 million times in both of those queries.

Faster data format

In the previous post I briefly explained how I use variable-length delta encoding to pack the rows for each value very tightly. This solution comes with a performance cost since each row delta is preceded by the information how many bits/bytes are used for the delta. Instead of doing this, I split the data into two separate sets of rows. First set is the rows are “orphan” rows. The rows where the value before it is different and the value after it is different. The other set of rows is the opposite. The rows which are either preceded or followed (or both) by the same value. These sets of consecutive rows allow to speed things up in other ways so it makes sense to store them like that as well. These two groups are stored differently. For a given value the orphan rows are encoded using either 16-bit deltas or 20-bit values. This way each delta/value can be read directly without always having to read how it is encoded. This takes a bit more disk space but makes reading them more efficient. Adding 8-bit deltas as well uses a bit less data, but makes things slower so I’m not currently using them. The sets of consecutive rows are encoded using variable length encoding for the number of consecutive rows and the delta to the next set of consecutive rows.

Caching memory-mapped files

Previously I was opening a file and creating a memory-mapped file from it before each query for all the files used by the query. I noticed in one of the cpu profiles that __munmap was starting to take a bit of time. I changed the code to cache the memory mapped files. This gave quite a large performance benefit to queries which need a lot of data from large files. For example query 31 performance improved about 20% from the caching. Obviously this only works if the process stays alive and uses the same files again, but that is exactly how the benchmarks work. They execute the query three times in a row and take the best time out of those. Using a memory-mapped file cache is clearly allowed in the benchmark methodology since they can be considered source data.

“It is okay if the system performs caching for source data (buffer pools and similar). If the cache or buffer pools can be flushed, they should be flushed before the first run of every query.”

More efficient filtering

When filters are used to exclude most of the rows, it makes sense to avoid reading any data for those rows and skip those rows as quickly as possible. My previous implementation was not very efficient since it was reading whether to skip a row too frequently. If only 1 row is going to be skipping, it might be faster to simply read the value even if it’s not needed. Given some of the data format changes I was able to focus on skipping up to 255 rows at a time when all of those rows contain the same value. An array of 8-bit values (one per row) to indicate how many rows starting from that particular row are going to be skipped. Efficiently populating this array and selectively reading from it made big improvements to queries 30 and 31.

My approach to optimizing code

There are no silver bullets how to make the code run faster. A lot of it simply consists of profiling the code, finding bottlenecks and trying out different things to see if they improve the performance. For every improvement I’ve been able to make I’ve probably done at least 10 changes which didn’t improve anything or made things worse. Often I need to take a step back and try to think outside the box how to improve things more. Staring at the code and profiler output often leads to tweaking the local maximum. To be able to get out of it the code might need to run a bit slower, but leave room for other improvements. Sometimes good ideas don’t necessary work out immediately, but later on they might work better after other changes have been introduced to the code. I have lots of branches and git stashes of things which didn’t work out initially, but I might try them again later.

Obviously I don’t simply try random things. I keep things in mind that should make the code run faster and things that will probably make it run slower. The reality does not always match this though. Compilers can work their magic if the code is structured in a certain way. Accessing memory in different ways might improve things more than writing the most efficient code. I’ve found that it’s good to sometimes go against the theory and simply try out things. If they work, I always try to come up with an explanation why they worked though.

For profiling Rust code I’ve been using https://github.com/mstange/samply and it uses Firefox profiler to show the results in the browser. It works really well on Mac OS and I don’t miss perf one bit. It adds very little if any performance penalty when collecting the samples. It’s even able to show per line results, but these are estimates since the compiler is free to rearrange the code in any way it wants. Despite this, the per-line information has given me insights which basic flame graphs can’t (Firefox profiler has a flame graph view as well and I use it all the time).