Graph all the things

analyzing all the things you forgot to wonder about

Quantile Compression

interests: compression, statistics

As part of a database I'm building, I was interested in compressing numerical data for high cross-computer throughput. I built and published a Rust library, q_compress, implementing a new compression algorithm I designed for this purpose. q_compress typically compresses columns of numerical data to be 25% smaller than any alternative method I could find.

Almost every tech company uses columns of compressed numerical data in their data warehouses and data lakes. But current compression algorithms don't shrink numerical data by as much as they could:

  • There is impressive research on compressing and decompressing integers quickly - using as few as 1.5 CPU cycles per decompressed integer - but with an underwhelming compression ratio. The Parquet file format uses a simplistic compression scheme of this kind, similar to FastPFOR.
  • Sometimes people instead compress numerical data with LZ77 compression algorithms designed for text-like data: gzip, LZ4, Snappy, Zstandard, etc. These algorithms treat bytes like tokens, as if the byte \overline{00000000} is just as different from \overline{00000001} as it is from \overline{10101010}, and they aren't aware of the boundaries between numbers, so they don't compress numerical data very well.
  • It is common to layer both of the previous approaches, producing file types like .snappy.parquet or .gzip.parquet.

We can do better by tailoring a compression algorithm for high compression ratio on numerical data. q_compress's compression and decompression speeds are on par with Snappy and gzip to boot. While creating it, I found some interesting morsels that I'm not sure others have explicitly noticed before.

The idea

We often encode an integer as 64 bits representing a number between -2^{63} and 2^{63} - 1. But numbers like -1429438118075851838 typically show up a lot less often than 0, so by describing common numbers with fewer bits and uncommon numbers with more, we can use fewer bits on average without losing information. That's the principle behind lossless compression.

My idea was to encode each number as a few prefix bits to determine its range, plus a few more offset bits to determine its exact value in that range. A range is simply a lower and upper bound, inclusive. For instance, if our integers follow the geometric distribution P(x) = 2^{-x} where x \in \{1, 2, \ldots 2^{63} - 1\}, we could use the prefix \overline{0} to represent the range 1 to 1 and use the prefix \overline{1} to represent the range 2 to 2^{63} - 1. Then the number 1 would be compressed into a single bit \overline{0}, and any number 2 or higher would be compressed into 64 bits: the prefix bit \overline{1} followed by 63 offset bits. Instead of using 64 bits to represent each integer, this would represent 50% of our integers with 1 bit and the other 50% with 64 bits for an average of 32.5 bits per integer.
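That prefix-plus-offset scheme for the geometric example can be sketched in a few lines of Rust. This is illustrative only, not the actual q_compress implementation:

```rust
// A sketch of the toy scheme above: prefix 0 encodes the number 1;
// prefix 1 is followed by 63 offset bits covering the range 2 to 2^63 - 1.
fn encode(x: u64, bits: &mut Vec<bool>) {
    if x == 1 {
        bits.push(false); // the range 1 to 1 needs no offset bits
    } else {
        bits.push(true);
        let offset = x - 2; // position within the range 2 to 2^63 - 1
        for i in (0..63).rev() {
            bits.push((offset >> i) & 1 == 1); // 63 offset bits, MSB first
        }
    }
}

fn main() {
    let mut bits = Vec::new();
    encode(1, &mut bits);
    assert_eq!(bits.len(), 1); // 1 compresses to a single bit
    encode(5, &mut bits);
    assert_eq!(bits.len(), 1 + 64); // any other number costs 64 bits
}
```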

Side note: for this geometric distribution, the most efficient compression is: \overline{0}\to1, \overline{10}\to2, \overline{110}\to3, \overline{1110}\to4, and so forth. This requires only 2 bits per number on average, since the long bit sequences are very unlikely. We can actually prove that it is the most efficient compression algorithm since it achieves the lower bound given by the Shannon entropy:

\text{average bits per number} \ge -\sum_x P(x)\log_2(P(x))
which, for this geometric distribution, is also 2.
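To verify that: treating the upper limit as infinite and applying the standard series \sum_{x\ge1} x z^x = z/(1-z)^2 at z = 1/2,

-\sum_x P(x)\log_2(P(x)) = \sum_{x=1}^{\infty} x \, 2^{-x} = \frac{1/2}{(1-1/2)^2} = 2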

Here's another integer distribution and the ranges and prefixes q_compress came up with:

A probability distribution and the approximated distribution implied by Quantile Compression's Huffman codes

q_compress solves 2 subproblems to choose good compression parameters: choosing the ranges, and choosing the prefixes for those ranges. It actually makes more sense to explain choosing the prefixes first, so we'll start there.

Choosing the prefixes

After compression, we'll just have a long list of bits, but we need a way of knowing when a prefix ends and when an offset begins. The most efficient way to do this is with a binary tree of prefixes. Consider this binary tree:

binary tree where each edge is a bit and each leaf vertex is a numerical range

We'll follow the tree as we read bits until we reach a leaf node. For instance, say we read in the bits \overline{01011}. Starting at the root node, we follow the first bit 0 to the left node, which is a leaf, and use its range (3 to 6). This range contains 4 numbers, so we just need to read the next 2 bits \overline{10} in binary to get the offset of 2, which corresponds to the number 3 + 2 = 5.

Now we've decoded one number, and the bits \overline{11} remain. We start again at the root node, follow 1 to node 1, then 1 to the bottom middle node, which is a leaf. We use its range (9 to 9), and since that range only contains 1 number, we can return the number 9 + 0 = 9. So the bits \overline{01011} decoded into 5, 9 for this compression scheme.
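That decoding walk can be sketched in Rust. The Node type and bit layout here are my own illustration, not q_compress's actual internals; for simplicity it always reads \lceil\log_2(\text{range size})\rceil offset bits, which matches this example since every range size is a power of 2 (or 1):

```rust
// Each leaf of the prefix tree holds an inclusive numerical range.
enum Node {
    Leaf { lower: u64, upper: u64 },
    Branch(Box<Node>, Box<Node>), // (subtree for bit 0, subtree for bit 1)
}

fn decode_one(root: &Node, bits: &[bool], pos: &mut usize) -> u64 {
    // Follow the tree until we reach a leaf.
    let mut node = root;
    while let Node::Branch(zero, one) = node {
        node = if bits[*pos] { one.as_ref() } else { zero.as_ref() };
        *pos += 1;
    }
    let (lower, upper) = match node {
        Node::Leaf { lower, upper } => (*lower, *upper),
        _ => unreachable!(),
    };
    // Read ceil(log2(size)) offset bits, most significant first.
    let size = upper - lower + 1;
    let k = 64 - (size - 1).leading_zeros();
    let mut offset = 0u64;
    for _ in 0..k {
        offset = (offset << 1) | (bits[*pos] as u64);
        *pos += 1;
    }
    lower + offset
}

fn main() {
    // The tree from the figure: 0 => 3..=6, 10 => 7..=8, 11 => 9..=9.
    let tree = Node::Branch(
        Box::new(Node::Leaf { lower: 3, upper: 6 }),
        Box::new(Node::Branch(
            Box::new(Node::Leaf { lower: 7, upper: 8 }),
            Box::new(Node::Leaf { lower: 9, upper: 9 }),
        )),
    );
    let bits = [false, true, false, true, true]; // 01011
    let mut pos = 0;
    assert_eq!(decode_one(&tree, &bits, &mut pos), 5);
    assert_eq!(decode_one(&tree, &bits, &mut pos), 9);
}
```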

Choosing the optimal such tree (given our ranges) is a solved problem known as Huffman coding. Since we know the proportion of data that falls into each range, we can put uncommon ranges deep into the tree with a long prefix and describe common ranges easily with a short prefix. Huffman coding uses the minimum number of bits on average by recursively grouping the two least likely nodes (or ranges) under a new node, until all ranges are part of one tree. There are good visualizations of this available. Note that the resulting tree will be a binary search tree over prefixes, but probably not a binary search tree over our numbers.
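The greedy grouping can be sketched minimally in Rust. This is my own simplification, tracking only the resulting prefix lengths rather than building the tree:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Huffman code-length assignment: repeatedly merge the two least likely
// weights; every merge adds one prefix bit to each range underneath it.
fn huffman_code_lengths(weights: &[u64]) -> Vec<u32> {
    let mut depths = vec![0u32; weights.len()];
    let mut heap: BinaryHeap<Reverse<(u64, Vec<usize>)>> = weights
        .iter()
        .enumerate()
        .map(|(i, &w)| Reverse((w, vec![i])))
        .collect();
    while heap.len() > 1 {
        let Reverse((w1, mut leaves)) = heap.pop().unwrap();
        let Reverse((w2, l2)) = heap.pop().unwrap();
        leaves.extend(l2);
        for &i in &leaves {
            depths[i] += 1; // one more prefix bit for everything merged so far
        }
        heap.push(Reverse((w1 + w2, leaves)));
    }
    depths
}

fn main() {
    // Half the data in 3..=6, a quarter in 7..=8, a quarter in 9..=9
    // yields prefix lengths 1, 2, 2, matching the tree above.
    assert_eq!(huffman_code_lengths(&[2, 1, 1]), vec![1, 2, 2]);
}
```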

There's a nice property of Huffman coding I'll refer to later: if each range r_i constitutes a fraction 2^{-c_i} of the data where c_i are integers, then the Huffman tree will give each range a prefix of length c_i. This Huffman tree will also be perfectly efficient. So if half our numbers fall in the range 3 to 6, a quarter fall in the range 7 to 8, and a quarter fall in the range 9 to 9, we'll get the binary tree above, and it will achieve the Shannon entropy limit.

Which ranges to use

First of all, why use ranges at all when we could build a Huffman coding over all numbers that appear in the dataset? The reason is that serializing the Huffman coding and ranges also requires data, so if we have too many distinct numbers, all that metadata would require more bits than the original dataset. Ranges offer a good intermediate where we can build a small Huffman coding that describes most of the detail of the distribution.

Now consider the following distribution. How can we most concisely describe it with ranges?

toy probability distribution

Here are three options, A, B, C, along with the Huffman trees they would require:

A: the toy distribution as one range, with its one-range Huffman tree
B: the toy distribution as two ranges, with its two-range Huffman tree
C: the toy distribution as three ranges, with its three-range Huffman tree

If we assume describing a range and its prefix takes 100 bits of metadata, here is the bit size of each option:

    metadata bits   prefix bits   offset bits   total bits
A             100             0         2,200        2,300
B             200           800         1,200        2,200
C             300         1,200           800        2,300

So option B is most efficient. It saved bits over option A by using fewer bits to describe the offset of high-likelihood numbers (5 and 6). And it saved bits over option C by using one fewer range; splitting 5 and 6 into different ranges cost just as many prefix bits as it saved in offset bits, so C effectively only increased metadata size.

As one might glean from this example, it is most efficient to form ranges over numbers with approximately equal frequency. That's where quantiles come in. Most distributions that appear in real life have fairly smooth probability density functions, so if we chunk the distribution into M quantiles representing 1/M of the data each, we can rely on each quantile having a fairly uniform probability density AND constituting a meaningful proportion of the data.

In quantile compression, we choose M=2^d quantile ranges for some integer d. For instance, for d=2 with the distribution above, M=4, so we would start with a range for each quarter of the data (1 to 2, 3 to 4, 5 to 5, and 6 to 6). The library then greedily combines adjacent ranges if doing so will reduce bit size, so it would merge those quarters into 1 to 4 and 5 to 6, equivalent to option B above. By using a number of quantiles that is a power of 2, we improve the Huffman tree on average, due to the nice efficiency property for powers of 2 that I mentioned earlier.
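The merging decision can be sketched with a simple cost model. The 100-bit metadata cost and the formula below are assumptions for illustration (borrowed from the earlier example), not the library's exact accounting:

```rust
const METADATA_BITS: f64 = 100.0; // assumed flat cost per range

// Estimated total bits for a range covering `count` of `total` numbers
// and spanning `size` distinct values.
fn range_cost(count: f64, total: f64, size: f64) -> f64 {
    let prefix_bits = -(count / total).log2(); // ideal prefix length
    let offset_bits = size.log2().ceil(); // bits to pick one value in the range
    METADATA_BITS + count * (prefix_bits + offset_bits)
}

fn main() {
    // 400 numbers split into quarters: 1..=2, 3..=4, 5..=5, 6..=6.
    let total = 400.0;
    // Merging 1..=2 with 3..=4 (and 5..=5 with 6..=6) reduces cost...
    assert!(range_cost(200.0, total, 4.0)
        < range_cost(100.0, total, 2.0) + range_cost(100.0, total, 2.0));
    assert!(range_cost(200.0, total, 2.0)
        < range_cost(100.0, total, 1.0) + range_cost(100.0, total, 1.0));
    // ...but merging everything into one range 1..=6 does not.
    assert!(range_cost(400.0, total, 6.0)
        > range_cost(200.0, total, 4.0) + range_cost(200.0, total, 2.0));
}
```

Under this model the greedy pass stops exactly at option B from the table above: two ranges, 1 to 4 and 5 to 6.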

Encoding offsets

Let's say we have a range from 0 to 5. That's 6 possible offsets, so how can we encode the offset as efficiently as possible? We could use 3 bits, but that covers 2^3 = 8 possibilities, wasting some of the code space.

The best approach is to conditionally use 2 or 3 bits, depending on exactly which offset we need. A computationally efficient way to do this for an offset h between 0 and p-1, where 2^k < p < 2^{k+1}, is to first encode the k least significant bits, representing a number

s = \begin{cases}h,&h<2^k \\ h-2^k,&h\ge2^k\end{cases}
Then if s + 2^k \ge p, only one plausible offset (namely h = s) matches the k bits so far, so the most significant bit carries no information and we can stop. But if s + 2^k < p, both s and s + 2^k are plausible offsets, and we need a final, (k+1)st bit to distinguish them.

Overall, we can encode 2^{k + 1} - p of the integers with k bits, and the remaining 2p - 2^{k+1} will need k+1 bits. The average bit cost of encoding an offset is a funny-looking bumpy log plot, equalling \log_2(p) whenever it comes out to an integer:

average number of offset bits versus p, the size of the range
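This conditional k-vs-(k+1)-bit scheme can be sketched in Rust (illustrative; the real library's bit order may differ):

```rust
// For p possible offsets with 2^k < p < 2^(k+1), write the k low bits of h,
// then one more bit only when it's needed to disambiguate.
fn encode_offset(h: u64, p: u64, bits: &mut Vec<bool>) {
    let k = 63 - p.leading_zeros() as u64; // k = floor(log2(p))
    let s = h & ((1 << k) - 1); // the k least significant bits of h
    for i in 0..k {
        bits.push((s >> i) & 1 == 1);
    }
    if s + (1 << k) < p {
        // Both s and s + 2^k are plausible offsets; a final bit decides.
        bits.push(h >= (1 << k));
    }
}

fn main() {
    // For p = 6: 2^3 - 6 = 2 offsets take k = 2 bits; 2*6 - 2^3 = 4 take 3.
    let costs: Vec<usize> = (0..6u64)
        .map(|h| {
            let mut bits = Vec::new();
            encode_offset(h, 6, &mut bits);
            bits.len()
        })
        .collect();
    assert_eq!(costs.iter().filter(|&&c| c == 2).count(), 2);
    assert_eq!(costs.iter().filter(|&&c| c == 3).count(), 4);
}
```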

Crushing the competition

The most common compression algorithms people use on columns of numerical data are gzip and Snappy, often combined with Parquet's simple compression scheme. Gzip is considered to compress very well, but slowly. Snappy doesn't compress as well, but does so much faster. They both decompress fairly quickly. They are both based on LZ77, which means they're good at compressing long strings with repeating byte patterns. But at the task of compressing fixed-length numbers, they leave plenty of room for improvement. Here is the compression ratio (uncompressed file size / compressed file size) for each algorithm on various distributions, along with the theoretical ideal:

Compression ratio of different options on heavy-tail integers: Quantile Compression comes near the theoretical limit, much better than the others.
Compression ratio of different options on standard normal floats: Quantile Compression comes near the theoretical limit, much better than the others.
Compression ratio of different options on sparse integers: Quantile Compression comes near the theoretical limit, much better than the others.
Compression ratio of different options on dollars and cents: Quantile Compression comes near the theoretical limit, much better than the others.

I used d=6 for the quantile compression level, a mild setting that creates up to M=64 ranges.

For the heavy-tailed integers (a Lomax distribution with \alpha=0.5) and standard normal floats, quantile compression outperforms the competition and achieves nearly the theoretical limit.

Standard normal floats is the only floating point distribution I included here, and you may notice that its achievable compression ratio is much lower than the others'. This is because of the devilish nature of floating point numbers. For 64-bit floats, 1 bit describes the sign (\pm), 11 bits describe the exponent (e.g. an exponent of 12 describes a number with magnitude between 2^{12} and 2^{13}), and 52 "fractional" bits describe where the number lies between that power of 2 and the next one. One implication of this is that those fractional bits are usually all important. Even for a uniform distribution between 1 and 2, a compression ratio of only 64/52 = 16/13 \approx 1.23 can be achieved: all 52 fractional bits are needed, while the 12 sign and exponent bits can be compressed away. And consider the mind-boggling fact that there are 1023 times more floating point numbers between 0 and 1 than there are between 1 and 2. Only for sparse or extremely narrow floating point distributions can a substantial compression ratio be achieved. I'm not even going to get into the weirdness of floating point infinities, NaNs, and partial ordering.
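For concreteness, here is that 1/11/52 bit split pulled out of an f64 in Rust:

```rust
// Decompose an f64 into its sign, exponent, and fraction bits.
fn main() {
    let bits = (1.5f64).to_bits();
    let sign = bits >> 63;
    let exponent = ((bits >> 52) & 0x7ff) as i64 - 1023; // unbias the 11 exponent bits
    let fraction = bits & ((1u64 << 52) - 1); // the 52 fractional bits
    assert_eq!(sign, 0); // positive
    assert_eq!(exponent, 0); // magnitude in [2^0, 2^1)
    assert_eq!(fraction, 1u64 << 51); // 1.5 = (1 + 0.5) * 2^0
}
```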

For sparse integers (99% 0's and 1% 1's), quantile compression again outperforms the competition. It uses a special mode where the prefix for 0 is followed by a (variable-length) integer describing how many repetitions follow before the next 1 appears. There may still be ways I could improve the implementation to achieve nearly the theoretical limit, but this sort of data can already be compressed to quite a small size, so it's not a sticking point.
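The counting idea behind that sparse mode can be sketched like so. This is my own simplification; the actual format encodes the counts as variable-length integers after the prefix for 0:

```rust
// Run-length view of sparse data: store the gap between consecutive 1s
// instead of every individual 0.
fn run_lengths(data: &[u8]) -> Vec<usize> {
    let mut runs = Vec::new();
    let mut zeros = 0;
    for &x in data {
        if x == 0 {
            zeros += 1;
        } else {
            runs.push(zeros); // how many 0s preceded this 1
            zeros = 0;
        }
    }
    runs.push(zeros); // trailing 0s
    runs
}

fn main() {
    assert_eq!(run_lengths(&[0, 0, 0, 1, 0, 1, 0, 0]), vec![3, 1, 2]);
}
```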

The "dollars and cents" dataset is something I synthesized to look like retail store prices. Dollars and cents are kept as separate integer columns. Dollars follow a Lomax distribution with \alpha=1.5 and median 5, and cents follow a wacky distribution where amounts like 99¢ are more common:

hypothetical probability distribution of cents in retail prices

Another side note: I tried to incorporate empirical data from New York Public Library's fascinating menu dataset dating back to 1840. But the data turned out uninteresting, compression-wise, because most of their restaurants were too upscale to use prices like 99¢.

Quantile compression does slightly imperfectly on dollars and cents, since the wacky cents distribution isn't smooth, but it beats the competition nonetheless. I also devised a total cents distribution via \text{total\_cents} = 100 \cdot \text{dollars} + \text{cents}. I chose total cents as an antagonistic distribution for quantile compression, since it has high-frequency information that the quantiles will partly miss. Gzip and Snappy should theoretically be able to eke out some compression from that high-frequency information. But to my surprise, quantile compression was still efficient enough to outperform them:

Compression ratio of different options on total cents. Quantile Compression beats others slightly but is somewhat far from theory.

Even in this scenario designed specifically to benefit other methods, quantile compression was the winner.