Graph all the things
A friend recently released a beautiful dataset of air quality, made available as a zstd-compressed Parquet file. I was able to losslessly reduce it from 14.9MiB to 4.3MiB via Quantile Compression (QCompress) - a 71% reduction in file size. That's 3.45x the compression ratio!
QCompress arguably offers the biggest cost-reduction opportunity in structured data storage since RCFile in 2011.
QCompress converts a single, flat, non-nullable column into a compressed .qco file.
It supports integer, float, timestamp, and boolean data types.
Rather than replacing multi-column, nested, nullable formats like Parquet and ORC, it has a likely future as an encoder for them.
QCompress achieves its remarkable compression ratio by specializing.
Think of it like the .png of numerical data. No one uses text-oriented formats like .gzip to compress images, so why do we still use them on the exabytes of structured numerical data across the world?
There are two tricks that make .qco so powerful: one familiar and one novel.
The familiar one is delta encoding. Each row in the air quality dataset is a measurement, taken only a few seconds after the previous measurement. As you might expect, it's easier to compress the differences (1st-order deltas) between consecutive timestamps than to compress the timestamps themselves. In some cases, it may be even more effective to compress the differences between those differences (2nd-order deltas, AKA deltas-of-deltas). QCompress innovates by automatically detecting the best delta encoding order for each column.
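To make the two delta orders concrete, here is a toy sketch (not QCompress's actual implementation; the timestamps are made up). Roughly constant spacing between timestamps makes the 1st-order deltas small, and the 2nd-order deltas nearly zero:

```rust
// Toy sketch of 1st- and 2nd-order delta encoding (illustration only).
fn deltas(xs: &[i64]) -> Vec<i64> {
    xs.windows(2).map(|w| w[1] - w[0]).collect()
}

// Delta encoding is lossless: the original column is recoverable
// from its first value plus the deltas.
fn undo_deltas(first: i64, ds: &[i64]) -> Vec<i64> {
    let mut out = vec![first];
    for d in ds {
        out.push(out.last().unwrap() + d);
    }
    out
}

fn main() {
    // Hypothetical epoch-second timestamps, roughly 5 seconds apart
    let ts = vec![1_657_000_000_i64, 1_657_000_005, 1_657_000_011, 1_657_000_016];
    let d1 = deltas(&ts); // 1st-order deltas: [5, 6, 5] -- small numbers
    let d2 = deltas(&d1); // 2nd-order deltas: [1, -1] -- even smaller
    assert_eq!(undo_deltas(ts[0], &d1), ts);
    println!("1st-order: {:?}, 2nd-order: {:?}", d1, d2);
}
```

Which order wins depends on the data: for perfectly regular sampling the 2nd-order deltas collapse to zeros, while for noisy spacing they can be worse than 1st-order ones, which is why automatic order detection matters.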
Delta encoding is common in other applications (e.g. the time series DB Gorilla uses mixed 1st- and 2nd-order deltas), but Parquet and ORC have surprisingly bad support for it.
If we do take the deltas for each column and create a new .zstd.parquet file, we already get down to 5.2MiB.
Using a Parquet-MR writer to leverage the new delta encoding functionality actually does worse, at 8.4MiB, because of Parquet-MR's poor encoding-selection heuristics.
That's a huge improvement from 14.9MiB, but still not on par with the .qco version's 4.3MiB.
QCompress's second, novel technique is to work with numerical ranges rather than treating bytes as tokens, giving it that extra edge. I discussed this in depth in a previous post. In every nontrivial real-world or synthetic example I've looked at, QCompress improves compression ratio over any (properly delta-encoded) alternative, typically by 20-30%.
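The intuition behind the range idea can be sketched like this (a hedged illustration of the concept, not QCompress's real format; the range boundaries and function are hypothetical): each number becomes a (range id, offset) pair, where frequent ranges get short entropy codes and the offset within a range [lower, upper] needs only ceil(log2(upper - lower + 1)) raw bits.

```rust
// Illustration only: how many raw bits an offset within a numerical
// range [lower, upper] requires. Narrow, frequent ranges mean most
// values cost just a short range code plus a few offset bits.
fn offset_bits(lower: i64, upper: i64) -> u32 {
    let max_offset = (upper - lower) as u64; // offsets span 0..=max_offset
    if max_offset == 0 {
        0
    } else {
        64 - max_offset.leading_zeros()
    }
}

fn main() {
    // Hypothetical 1st-order timestamp deltas, almost all 4..=7 seconds:
    // each one costs a short range code plus only 2 offset bits,
    // instead of 8 raw bytes for an i64.
    assert_eq!(offset_bits(4, 7), 2);
    // A byte-sized range still needs only 8 bits
    assert_eq!(offset_bits(0, 255), 8);
    println!("offset bits for [4, 7]: {}", offset_bits(4, 7));
}
```

A byte-token compressor like gzip can't exploit this, because a numerically narrow range can still scatter across many distinct byte patterns.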
The best way to get involved is through open-source contribution! I know of people using QCompress in industry already, but open source will always bring the highest impact to the largest audience. There is work to be done both in QCompress itself and in Parquet and ORC. If you would like to collaborate, you can join the Discord.
Here is an example of installing and using the QCompress CLI to compress the air quality dataset's timestamp column into a file out.qco (requires Rust):

```shell
% cargo install q_compress_cli
% qcompress compress \
  --parquet devinrsmith-air-quality.20220714.zstd.parquet \
  --col-name timestamp \
  out.qco
```
Thank you to my friend at Deephaven for collecting and releasing the air quality dataset!
Note: single-column Parquet sizes were measured as the column's size within the multi-column file (e.g. via parquet-tools), rather than by turning each column into its own Parquet file with extra overhead. This is the fairer, more lightweight option.
Update: I had forgotten that Parquet V2 introduced a delta encoding, so I reworded some things and added another file size to compare. I also added the last couple of pedantic details.