The Last Mile of Performance (or so I thought)

This is another update on my personal hobby project to build an analytical database. A month ago I was getting close to beating all the other databases as documented in https://tuomaspelkonen.com/2023/08/27/doubling-down-on-performance/. I’m happy to say that I was able to squeeze even more performance out of my solution and it’s finally faster than all the other databases listed in https://benchmark.clickhouse.com/ despite running on inferior hardware.

Updated Benchmarks

| Query | My solution 8/23 (M1 Pro) | My solution 9/23 (M1 Pro) | Best benchmark (c6a.metal) | Relative performance | Clickhouse M1 Pro (relative to mine) |
|---|---|---|---|---|---|
| 18 | 418 ms | 393 ms | 440 ms | 0.89x | 2.27 s (5.8x) |
| 30 | 94 ms | 86 ms | 90 ms | 0.96x | 383 ms (4.5x) |
| 31 | 138 ms | 124 ms | 140 ms | 0.89x | 509 ms (4.1x) |
| 32 | 547 ms | 503 ms | 510 ms | 0.99x | > 6 s (> 10x) |

Optimizations

The biggest improvement to the performance came from how unsigned 128-bit values are constructed. In queries 30, 31, and 32 a single 128-bit value is used to represent all the aggregated values (COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth)). Those all fit nicely in 128 bits, with bits to spare even for 4B rows. It turns out that in Rust u128 math operations (+, <<) are quite slow compared to u64 operations. Instead of adding an 8-bit or 16-bit value directly into a 128-bit value, it's faster to add it to the lower or higher (or partially to both) 64-bit half first and then replace that half (or both halves) with the new value. In Rust a value must be converted to the right type before addition can happen, so adding a u8 to a u128 means the u8 must be converted to u128 first. Since CPUs (at least in my case) have a 64-bit architecture, it's much faster to convert the u8 to u64, do the addition in 64 bits, and splice the result back into the u128.
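To make this concrete, here is a minimal sketch of the idea. The field layout, names, and shift amounts are my own illustration, not the actual layout used in the project:

```rust
/// Minimal sketch: aggregates packed into one u128, updated with u64 arithmetic.
/// Assumes the field being updated fits entirely in the low 64 bits (the
/// "partially in both halves" case needs the high half updated as well).
#[inline]
fn add_to_low_field(acc: u128, value: u16, shift: u32) -> u128 {
    // Slow version for comparison: acc + ((value as u128) << shift)
    let lo = (acc as u64).wrapping_add((value as u64) << shift);
    (acc & !(u64::MAX as u128)) | lo as u128
}

fn main() {
    // Illustrative layout: COUNT(*) in bits 0..20, SUM(ResolutionWidth) above that.
    let mut acc: u128 = 0;
    for width in [1024u16, 1280, 1920] {
        acc = add_to_low_field(acc, 1, 0);      // bump the count
        acc = add_to_low_field(acc, width, 20); // accumulate the width sum
    }
    assert_eq!(acc & ((1 << 20) - 1), 3);       // count
    assert_eq!(acc >> 20, 4224);                // sum of widths
}
```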

Performance Surprise

My original plan was to add very limited SQL support once the performance numbers matched or beat the benchmarks. That's exactly what I did, but after adding the support I discovered some group by queries that weren't beating Clickhouse as much as I expected. I was only beating Clickhouse on my laptop by 1.5-2x when I'm used to beating it by more than 3x.

Most of the group by queries in the benchmark are done on data with millions of unique values. This is not necessarily a representative sample of group by queries when using an analytical database. I've done plenty of group by queries in my time where the number of unique values has been less than a thousand, for example grouping by source and destination region when investigating latencies.

This discovery meant that I had more performance work to do to improve the speed of group by queries with a limited number of unique values. So that's where my time has been going lately, and the performance has improved quite significantly since I started this work.

The Setup

Now, with the very limited SQL support, I was able to create a list of queries with two group by columns and a simple count aggregate. I chose columns with different numbers of unique values and different types to make sure the performance improves for all types of columns.

SELECT <Column1>, <Column2>, COUNT(*) FROM hits GROUP BY <Column1>, <Column2> ORDER BY COUNT(*) DESC LIMIT 10;

I ran the query with the following columns so that each column was paired with every other column twice (once as Column1 and once as Column2). This produced 72 (9 * 8) queries measuring the performance of very basic two-column group by queries; a sketch of the query generation follows the column list. The total number of rows in the dataset is about 100 million.

UserID, 17630976 unique values, i64
SearchPhrase, 6019103 unique values, string
IPNetworkID, 89907 unique values, u32
WindowName, 65548 unique values, i32
RegionID, 9040 unique values, u32
ResolutionHeight, 3244 unique values, u16
ResolutionWidth, 2159 unique values, u16
SearchEngineID, 96 unique values, u8
IsRefresh, 2 unique values, u8 (bool)
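
For reference, here is a minimal sketch of how such a query list can be generated. This is my own illustration, not the project's actual test harness:

```rust
// Generate the 72 benchmark queries: every ordered pair of distinct columns.
fn main() {
    let columns = [
        "UserID", "SearchPhrase", "IPNetworkID", "WindowName", "RegionID",
        "ResolutionHeight", "ResolutionWidth", "SearchEngineID", "IsRefresh",
    ];
    let mut queries = Vec::new();
    for a in &columns {
        for b in &columns {
            if a != b {
                queries.push(format!(
                    "SELECT {a}, {b}, COUNT(*) FROM hits GROUP BY {a}, {b} \
                     ORDER BY COUNT(*) DESC LIMIT 10;"
                ));
            }
        }
    }
    assert_eq!(queries.len(), 72); // 9 * 8 ordered pairs
    for q in &queries {
        println!("{q}");
    }
}
```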

The Results

Results are from the second run to make them as comparable as possible; results from the first run are harder to compare because the data is sometimes, but not always, already in the disk cache. The first run for Clickhouse can take twice as long as the second run.

Clickhouse: 27 seconds

My solution: 5.2 seconds

Performance Notes

Other than the speed, one big difference is memory usage. The Clickhouse server is running in the background before the queries are run for the second time, and the first run leaves the server holding about 8GB of memory. My solution has nothing running in the background when the second run is started; the operating system and/or disk has cached all the files, but there are no processes running. Despite this, my solution runs in less than 1/5th of the time and uses only 1.8GB of memory.

You might notice that my solution prints the columns in the opposite order at the end. This is because I always choose the column with the most unique values as the first column and I haven’t bothered switching them back when printing the results yet.

The Approach

As explained in https://tuomaspelkonen.com/2023/07/26/building-my-own-analytical-dbms/ I'm merging sorted vectors for group by queries. Unique values for each column are stored in sorted order, but since there are two columns involved, the data needs a bit of massaging before it can be merged. The strategy is to build a sorted vector of unique value pairs, using the column with the most unique values as the first value of the pair: each value from the first column maps to the rows which have that value, and those rows map to the values for the second column.

To build this vector in sorted order there are two options for each unique value from the first column: a) collect all the corresponding values from the second column and sort them, or b) store the values from the second column in an array with predefined slots for each value (unique value index, to be specific). Option a) works better when there are a lot more possible values for the second column than rows for the given first-column value; otherwise option b) is chosen. My previous solution mostly did option a) for everything, and the sorting started to be a bottleneck as the number of unique values decreased. Adding option b) improved the performance significantly since it avoids sorting completely, although it does have to iterate through all the possible value slots for the second column.
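Here is a hedged sketch of the two options for a single first-column value. The names, types, and the (value index, count) output shape are my own simplification, not the project's actual data structures:

```rust
/// Option a): collect the second-column value indexes for the matching rows
/// and sort them. Better when those rows are few compared to the number of
/// possible second-column values.
fn counts_by_sorting(second_col_rows: &[u32]) -> Vec<(u32, u64)> {
    let mut vals = second_col_rows.to_vec();
    vals.sort_unstable();
    let mut out: Vec<(u32, u64)> = Vec::new();
    for v in vals {
        match out.last_mut() {
            Some((last, count)) if *last == v => *count += 1,
            _ => out.push((v, 1)),
        }
    }
    out
}

/// Option b): count into a predefined slot per possible second-column value,
/// then walk the slots in order. No sorting, but every slot is visited once.
fn counts_by_slots(second_col_rows: &[u32], num_unique: usize) -> Vec<(u32, u64)> {
    let mut slots = vec![0u64; num_unique];
    for &v in second_col_rows {
        slots[v as usize] += 1;
    }
    let mut out = Vec::new();
    for (i, &count) in slots.iter().enumerate() {
        if count > 0 {
            out.push((i as u32, count));
        }
    }
    out
}

fn main() {
    // Second-column value indexes for the rows matching one first-column value.
    let rows = [3u32, 1, 3, 0, 1, 3];
    // Both options produce the same sorted (value index, count) pairs.
    assert_eq!(counts_by_sorting(&rows), counts_by_slots(&rows, 4));
}
```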

