Graph all the things
analyzing all the things you forgot to wonder about
2021-06-21
interests: compression, statistics
As part of a database I'm building, I was interested in compressing numerical data for high cross-computer throughput.
I built a Rust library implementing a new compression algorithm I made for this purpose and published it as the package q_compress on crates.io. q_compress typically compresses columns of numerical data to sizes 25% smaller than any alternative method I could find.
Almost every tech company uses columns of compressed numerical data in their data warehouses and data lakes. But current compression algorithms don't shrink numerical data by as much as they could: think of all the .snappy.parquet and .gzip.parquet files out there. We can do better by tailoring a compression algorithm for a high compression ratio on numerical data. And q_compress's compression and decompression speeds are on par with Snappy and gzip, to boot.
While creating it, I found some interesting morsels that I'm not sure others have explicitly noticed before.
We often encode an integer as 64 bits representing a number between $-2^{63}$ and $2^{63} - 1$. But numbers like -1429438118075851838 typically show up a lot less often than 0, so by describing common numbers with fewer bits and uncommon numbers with more, we can use fewer bits on average without losing information. That's the principle behind lossless compression.
My idea was to encode each number as a few prefix bits to determine its range, plus a few more offset bits to determine its exact value in that range. A range is simply a lower and upper bound, inclusive. For instance, if our integers follow the geometric distribution where $P(n) = 2^{-n}$, we could use the prefix $0$ to represent the range 1 to 1 and use the prefix $1$ to represent the range 2 to $2^{63} - 1$. Then the number 1 would be compressed into a single bit $0$, and any number 2 or higher would be compressed into 64 bits: the prefix bit $1$ followed by 63 offset bits. Instead of using 64 bits to represent each integer, this would represent 50% of our integers with 1 bit and the other 50% with 64 bits, for an average of 32.5 bits per integer.
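Here's a minimal Rust sketch of that two-range scheme, just to make the prefix-plus-offset idea concrete. The function names and bit layout are illustrative assumptions, not q_compress's actual API:

```rust
// A sketch of the two-range scheme above (which bit is used for which
// prefix is an arbitrary choice; this is not q_compress's API).
fn encode(n: u64, bits: &mut Vec<bool>) {
    assert!(n >= 1);
    if n == 1 {
        // prefix 0: the range 1 to 1 needs no offset bits
        bits.push(false);
    } else {
        // prefix 1: the range 2 to 2^63 - 1, followed by 63 offset bits
        bits.push(true);
        let offset = n - 2; // position within the range
        for i in (0..63).rev() {
            bits.push((offset >> i) & 1 == 1);
        }
    }
}

fn main() {
    let mut bits = Vec::new();
    for &n in &[1u64, 1, 7, 1] {
        encode(n, &mut bits);
    }
    // Three 1's take 1 bit each; the 7 takes 64 bits.
    println!("{} numbers -> {} bits", 4, bits.len()); // 4 numbers -> 67 bits
}
```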
Side note: for this geometric distribution, the most efficient compression is $1 \to 0$, $2 \to 10$, $3 \to 110$, $4 \to 1110$, and so forth. This requires only 2 bits per number on average, since the long bit sequences are very unlikely. We can actually prove that it is the most efficient compression algorithm, since it achieves the lower bound given by the Shannon entropy
$$H = \sum_n P(n) \log_2 \frac{1}{P(n)},$$
which, for this geometric distribution, is $\sum_{n \ge 1} 2^{-n} \cdot n = 2$ as well.

Here's another integer distribution and the ranges and prefixes q_compress came up with:
q_compress solves 2 subproblems to choose good compression parameters: choosing the ranges, and choosing the prefixes for those ranges.
It actually makes more sense to explain choosing the prefixes first, so we'll start there.
After compression, we'll just have a long list of bits, but we need a way of knowing when a prefix ends and when an offset begins. The most efficient way to do this is with a binary tree of prefixes. Consider this binary tree:
We'll follow the tree as we read bits until we reach a leaf node. For instance, say we read in the bits 01011. Starting at the root node, we follow the first bit 0 to the left node, which is a leaf, and use its range (3 to 6). This range contains 4 numbers, so we just need to read the next 2 bits (10) in binary to get the offset of 2, which corresponds to the number $3 + 2 = 5$.
Now we've decoded one number, and the bits 11 remain. We start again at the root node, follow the 1 to the right node, then another 1 to the bottom middle node, which is a leaf. We use its range (9 to 9), and since that range only contains 1 number, we can return the number 9 without reading any offset bits. So the bits 01011 decoded into the numbers 5 and 9 for this compression scheme.
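To make that decoding walk concrete, here's a small Rust sketch of a prefix tree for the same three-range scheme. The data structures are made up for illustration and aren't q_compress's:

```rust
// Prefix tree for: prefix 0 -> 3..=6, 10 -> 7..=8, 11 -> 9..=9.
enum Node {
    Leaf { lower: u64, offset_bits: u32 },
    Branch(Box<Node>, Box<Node>), // children for bit 0 and bit 1
}

fn decode(bits: &[bool], tree: &Node) -> Vec<u64> {
    let mut numbers = Vec::new();
    let mut i = 0;
    while i < bits.len() {
        // Walk the tree until we hit a leaf.
        let mut node = tree;
        while let Node::Branch(zero, one) = node {
            node = if bits[i] { &**one } else { &**zero };
            i += 1;
        }
        // Read the leaf's offset bits (most significant first) and add the lower bound.
        if let Node::Leaf { lower, offset_bits } = node {
            let mut offset = 0u64;
            for _ in 0..*offset_bits {
                offset = (offset << 1) | (bits[i] as u64);
                i += 1;
            }
            numbers.push(lower + offset);
        }
    }
    numbers
}

fn main() {
    let tree = Node::Branch(
        Box::new(Node::Leaf { lower: 3, offset_bits: 2 }), // prefix 0: range 3..=6
        Box::new(Node::Branch(
            Box::new(Node::Leaf { lower: 7, offset_bits: 1 }), // prefix 10: range 7..=8
            Box::new(Node::Leaf { lower: 9, offset_bits: 0 }), // prefix 11: range 9..=9
        )),
    );
    let bits = [false, true, false, true, true]; // 01011
    println!("{:?}", decode(&bits, &tree)); // [5, 9]
}
```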
Choosing the optimal such tree (given our ranges) is a solved problem known as Huffman coding. Since we know the proportion of data that falls into each range, we can put uncommon ranges deep into the tree with a long prefix and describe common ranges easily with a short prefix. Huffman coding uses the minimum number of bits on average by recursively grouping the two least likely nodes (or ranges) under a new node, until all ranges are part of one tree. There are good visualizations of this available. Note that the resulting tree will be a binary search tree over prefixes, but probably not a binary search tree over our numbers.
There's a nice property of Huffman coding I'll refer to later: if each range constitutes a fraction $1/2^{k_i}$ of the data, where the $k_i$ are integers, then the Huffman tree will give each range a prefix of length $k_i$. This Huffman tree will also be perfectly efficient. So if half our numbers fall in the range 3 to 6, a quarter fall in the range 7 to 8, and a quarter fall in the range 9 to 9, we'll get the binary tree above, and it will achieve the Shannon entropy limit.
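Here's a rough Rust sketch of the Huffman construction described above, repeatedly merging the two least likely ranges under a new parent. The representation is illustrative, not q_compress's internals:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

#[derive(Debug)]
enum Tree {
    Leaf(usize),                  // index of a range
    Branch(Box<Tree>, Box<Tree>), // children for bit 0 and bit 1
}

// Build a Huffman tree from the count of numbers falling into each range.
fn huffman(counts: &[u64]) -> Tree {
    assert!(!counts.is_empty());
    // Min-heap keyed on count; ties broken arbitrarily.
    let mut heap: BinaryHeap<Reverse<(u64, usize)>> = BinaryHeap::new();
    let mut nodes: Vec<Option<Tree>> = Vec::new();
    for (i, &c) in counts.iter().enumerate() {
        heap.push(Reverse((c, nodes.len())));
        nodes.push(Some(Tree::Leaf(i)));
    }
    // Repeatedly merge the two least likely nodes under a new parent.
    while heap.len() > 1 {
        let Reverse((c0, i0)) = heap.pop().unwrap();
        let Reverse((c1, i1)) = heap.pop().unwrap();
        let parent = Tree::Branch(
            Box::new(nodes[i0].take().unwrap()),
            Box::new(nodes[i1].take().unwrap()),
        );
        heap.push(Reverse((c0 + c1, nodes.len())));
        nodes.push(Some(parent));
    }
    let Reverse((_, root)) = heap.pop().unwrap();
    nodes[root].take().unwrap()
}

fn main() {
    // Half the data in range 0, a quarter each in ranges 1 and 2,
    // mirroring the 3-to-6, 7-to-8, 9-to-9 example above.
    println!("{:?}", huffman(&[50, 25, 25]));
}
```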
First of all, why use ranges at all when we could build a Huffman coding over all numbers that appear in the dataset? The reason is that serializing the Huffman coding and ranges also requires data, so if we have too many distinct numbers, all that metadata would require more bits than the original dataset. Ranges offer a good intermediate where we can build a small Huffman coding that describes most of the detail of the distribution.
Now consider the following distribution. How can we most concisely describe it with ranges?
Here are three options, A, B, C, along with the Huffman trees they would require:
(figure: the ranges and corresponding Huffman trees for options A, B, and C)
If we assume describing a range and its prefix takes 100 bits of metadata, here is the bit size of each option:
| | metadata bits | prefix bits | offset bits | total bits |
|---|---|---|---|---|
| A | 100 | 0 | 2,200 | 2,300 |
| B | 200 | 800 | 1,200 | 2,200 |
| C | 300 | 1,200 | 800 | 2,300 |
So option B is most efficient. It saved 100 bits over option A by using fewer bits to describe the offset of high-likelihood numbers (5 and 6). And it saved 100 bits over option C by using one fewer range; splitting 5 and 6 into different ranges cost just as many prefix bits as it saved in offset bits, so C effectively only increased metadata size.
As one might glean from this example, it is most efficient to form ranges over numbers with approximately equal frequency. That's where quantiles come in. Most distributions that appear in real life have fairly smooth probability density functions, so if we chunk the distribution into quantiles each covering an equal share of the data, we can rely on each quantile having a fairly uniform probability density AND constituting a meaningful proportion of the data.
In quantile compression, we choose $2^k$ quantile ranges for some integer $k$. For instance, for $k = 2$ with the distribution above, $2^k = 4$, so we would start with a range for each quarter of the data (1 to 2, 3 to 4, 5 to 5, and 6 to 6). The library then greedily combines adjacent ranges if doing so will reduce bit size, so it would merge those quarters into 1 to 4 and 5 to 6, equivalent to option B above. By using a number of quantiles that is a power of 2, we improve the Huffman tree on average, due to the nice efficiency property for powers of 2 that I mentioned earlier.
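Here's a rough sketch of that greedy merging step, using a simplified local cost model (a constant metadata cost per range, an idealized $-\log_2 p$ prefix cost, and $\lceil \log_2(\text{size}) \rceil$ offset bits). The constants and cost model are my illustrative assumptions, not q_compress's actual heuristics:

```rust
// A simplified greedy merge of adjacent quantile ranges.
#[derive(Clone, Copy, Debug)]
struct Range {
    lower: i64,
    upper: i64, // inclusive
    count: u64, // how many numbers fall in this range
}

const METADATA_BITS: f64 = 100.0; // assumed cost of describing one range

// Approximate bits needed to encode this range's share of the data.
fn bit_cost(r: &Range, total: u64) -> f64 {
    let p = r.count as f64 / total as f64;
    let offset_bits = ((r.upper - r.lower + 1) as f64).log2().ceil();
    METADATA_BITS + r.count as f64 * (-p.log2() + offset_bits)
}

fn merge_ranges(mut ranges: Vec<Range>, total: u64) -> Vec<Range> {
    let mut i = 0;
    while i + 1 < ranges.len() {
        let merged = Range {
            lower: ranges[i].lower,
            upper: ranges[i + 1].upper,
            count: ranges[i].count + ranges[i + 1].count,
        };
        let before = bit_cost(&ranges[i], total) + bit_cost(&ranges[i + 1], total);
        if bit_cost(&merged, total) < before {
            ranges[i] = merged;
            ranges.remove(i + 1); // keep trying to extend the merged range
        } else {
            i += 1;
        }
    }
    ranges
}

fn main() {
    // Quarters of the example distribution: 1-2, 3-4, 5-5, 6-6 with 100 numbers each.
    let quarters = vec![
        Range { lower: 1, upper: 2, count: 100 },
        Range { lower: 3, upper: 4, count: 100 },
        Range { lower: 5, upper: 5, count: 100 },
        Range { lower: 6, upper: 6, count: 100 },
    ];
    // Merges down to 1-4 and 5-6, like option B above.
    println!("{:?}", merge_ranges(quarters, 400));
}
```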
Let's say we have a range from 0 to 5. That's 6 possible offsets, so how can we encode the offset as efficiently as possible? We could use 3 bits, but that covers $2^3 = 8$ possibilities instead.
The best approach is to conditionally use 2 or 3 bits depending on exactly which offset we need. A computationally efficient way to do this for an offset between $0$ and $r$ (inclusive) with $2^k \le r < 2^{k+1}$ is to initially encode the $k$ least significant digits, representing a number $m$.

Then if $m > r - 2^k$, we know that the most significant digit won't distinguish between two plausible offsets, and we can stop with the digits we have so far. But if $m \le r - 2^k$, we need that final, $(k+1)$st digit to distinguish. Overall, we can encode $2^{k+1} - r - 1$ of the integers with $k$ bits, and the remaining $2(r - 2^k + 1)$ will need $k + 1$ bits. The average bit cost of encoding an offset is a funny-looking bumpy log plot, equalling $\log_2(r + 1)$ whenever that comes out to an integer:
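Here's a small Rust sketch of this conditional offset encoding, under the same $r$ and $k$ conventions as above (illustrative, not q_compress's exact code):

```rust
// Encode an offset x in 0..=r using k or k+1 bits, where 2^k <= r < 2^(k+1).
fn encode_offset(x: u64, r: u64, bits: &mut Vec<bool>) {
    if r == 0 {
        return; // only one possible offset; no bits needed
    }
    let k = 63 - r.leading_zeros(); // k such that 2^k <= r < 2^(k+1)
    // Write the k least significant digits (least significant first, for simplicity).
    for i in 0..k {
        bits.push((x >> i) & 1 == 1);
    }
    let m = x & ((1 << k) - 1); // the number those k digits represent
    // Only if both m and m + 2^k are plausible offsets do we need the final digit.
    if m <= r - (1 << k) {
        bits.push((x >> k) & 1 == 1);
    }
}

fn main() {
    // Offsets 0..=5 (r = 5): offsets 2 and 3 take 2 bits, the rest take 3.
    for x in 0..=5u64 {
        let mut bits = Vec::new();
        encode_offset(x, 5, &mut bits);
        println!("offset {} -> {} bits", x, bits.len());
    }
}
```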
The most common compression algorithms people use on columns of numerical data are gzip and Snappy, often combined with Parquet's simple compression scheme. Gzip is considered to compress very well, but slowly. Snappy doesn't compress as well, but does so much faster. They both decompress fairly quickly. They are both based on LZ77, which means they're good at compressing long strings with repeating byte patterns. But at the task of compressing fixed-length numbers, they leave plenty of room for improvement. Here is the compression ratio (uncompressed file size / compressed file size) for each algorithm on various distributions, along with the theoretical ideal:
I used a mild setting for the quantile compression level (the $k$ from earlier), which creates up to $2^k$ ranges.
For the heavy-tailed integers (a Lomax distribution) and standard normal floats, quantile compression outperforms the competition and achieves nearly the theoretical limit.
Standard normal floats is the only floating point distribution I included here, and you may notice that its achievable compression ratio is much lower than the others. This is because of the devilish nature of floating point numbers. For 64 bit floats, 1 bit describes the sign ($+$ or $-$), 11 bits describe the exponent (e.g. an exponent of 12 describes a number with magnitude between $2^{12}$ and $2^{13}$), and 52 "fractional" bits describe where the number falls between that power of 2 and the next. One implication of this is that those fractional bits are usually all important. Even for a uniform distribution between 1 and 2, a compression ratio of only $64 / 52 \approx 1.23$ can be achieved: all 52 fractional bits are needed, while the 12 sign and exponent bits can be compressed away. And consider the mind-boggling fact that there are roughly 1000 times more floating point numbers between 0 and 1 than there are between 1 and 2. Only for sparse or extremely narrow floating point distributions can a substantial compression ratio be achieved. I'm not even going to get into the weirdness of floating point infinities, NaNs, and partial ordering.
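If you want to poke at this yourself, here's a tiny Rust snippet that splits a 64-bit float into the sign, exponent, and fractional bits described above:

```rust
// Illustrating the 64-bit float layout: 1 sign bit, 11 exponent bits, 52 fractional bits.
fn main() {
    let x: f64 = 1.75;
    let bits = x.to_bits();
    let sign = bits >> 63;
    let exponent = ((bits >> 52) & 0x7ff) as i64 - 1023; // remove the exponent bias
    let fraction = bits & ((1u64 << 52) - 1);
    // 1.75 = (1 + 0.75) * 2^0, so the exponent is 0 and the fraction encodes 0.75.
    println!("sign: {}, exponent: {}, fraction: {:052b}", sign, exponent, fraction);
}
```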
For sparse integers (99% 0's and 1% 1's), quantile compression again outperforms the competition. It uses a special mode where the prefix for 0 is followed by a (variable-length) integer describing how many repetitions follow before the next 1 appears. There may still be ways I could improve the implementation to achieve nearly the theoretical limit, but this sort of data can already be compressed to quite a small size, so it's not a sticking point.
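Here's a rough sketch of that idea for a column of 0's and 1's: rather than writing every 0, we record how many 0's repeat before each 1. The representation below is an illustrative stand-in for the actual bit-level format:

```rust
// Run lengths of 0's between 1's; in the real format each count would be
// written as a variable-length integer after the prefix for 0.
fn zero_run_lengths(data: &[u8]) -> Vec<usize> {
    let mut runs = Vec::new();
    let mut zeros = 0usize;
    for &x in data {
        if x == 0 {
            zeros += 1;
        } else {
            runs.push(zeros);
            zeros = 0;
        }
    }
    runs.push(zeros); // trailing run of 0's
    runs
}

fn main() {
    let data = [0, 0, 0, 1, 0, 0, 1, 0];
    println!("{:?}", zero_run_lengths(&data)); // [3, 2, 1]
}
```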
The "dollars and cents" dataset is something I synthesized to look like retail store prices. Dollars and cents are kept as separate integer columns. Dollars follow a lomax distribution with and median 5, and cents follow a wacky distribution where amounts like 99¢ are more common:
Another side note: I tried to incorporate empirical data from New York Public Library's fascinating menu dataset dating back to 1840. But the data turned out uninteresting, compression-wise, because most of their restaurants were too upscale to use prices like 99¢.
Quantile compression does a little imperfectly on dollars and cents since the wacky cents distribution isn't smooth, but it beats the competition nonetheless. I also devised a total cents distribution via $100 \times \text{dollars} + \text{cents}$. I chose total cents as an antagonistic distribution for quantile compression, since it has high-frequency information that will partly be missed by the quantiles. Gzip and Snappy should theoretically be able to eke out some compression from that high-frequency information. But to my surprise, quantile compression was still efficient enough to outperform them:
Even in this scenario designed specifically to benefit other methods, quantile compression was the winner.