Reducing Parquet Size 71% with Quantile Compression

2022-07-17
interests: compression, data engineering

A friend recently released a beautiful dataset of air quality, made available as a zstd-compressed Parquet file. I was able to losslessly reduce it from 14.9MiB to 4.3MiB via Quantile Compression (QCompress) - a 71% reduction in file size. That's 3.45x the compression ratio!

QCompress arguably offers the biggest cost-reduction opportunity in structured data storage since RCFiles in 2011.

[Bar chart: data size of each column in .zstd.parquet vs .qco. The .qco columns are always smaller, often by a large margin.]

QCompress converts a single, flat, non-nullable column to .qco format. It supports integer, float, timestamp, and boolean data types. Rather than replacing multi-column, nested, nullable formats like Parquet and ORC, it has a likely future as an encoder for them:

[Diagram: QCompress as an encoder, not a codec, within popular multi-column formats.]

Why is QCompress so effective?

QCompress achieves its remarkable compression ratio by specializing. Think of it like the .png of numerical data. No one uses .zstd, .snappy, or .gzip (formats designed for text-like data) to compress images, so why do we still use them on the exabytes of structured numerical data across the world?

There are two tricks that make .qco so powerful: one familiar and one novel.

The familiar one is delta encoding. Each row in the air quality dataset is a measurement, taken only a few seconds after the previous measurement. As you might expect, it's easier to compress the differences (1st-order deltas) between consecutive timestamps than to compress the timestamps themselves. In some cases, it may be even more effective to compress the differences between those differences (2nd-order deltas, AKA deltas-of-deltas). QCompress innovates by automatically detecting the best delta encoding order for each column.
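
To make this concrete, here is a toy sketch (not QCompress's actual detection logic) of how a best delta order could be chosen: apply orders 0 through 2 and keep whichever yields the smallest total bit width, a cheap proxy for compressed size. The data below is hypothetical, shaped like timestamps ticking up a few seconds at a time.

fn deltas(nums: &[i64]) -> Vec<i64> {
    nums.windows(2).map(|w| w[1] - w[0]).collect()
}

// rough proxy for compressed size: magnitude bits plus a sign bit per value
fn bit_cost(nums: &[i64]) -> u64 {
    nums.iter().map(|&x| 65 - x.unsigned_abs().leading_zeros() as u64).sum()
}

fn main() {
    // hypothetical timestamps, ~3 seconds apart
    let ts: Vec<i64> = (0..1000).map(|i| 1_650_000_000 + 3 * i + (i % 2)).collect();

    let mut best_order = 0;
    let mut best_cost = bit_cost(&ts);
    let mut current = ts;
    for order in 1..=2 {
        current = deltas(&current);
        let cost = bit_cost(&current);
        if cost < best_cost {
            best_order = order;
            best_cost = cost;
        }
    }
    println!("best delta order: {} ({} total bits)", best_order, best_cost);
}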

Delta encoding is common in other applications (e.g. the time series DB Gorilla uses mixed 1st- and 2nd-order deltas), but Parquet and ORC have surprisingly bad support for it:

  • They only support integers (including some timestamps, but notably excluding floats).
  • They only support 1st-order deltas.
  • Parquet-MR has awful heuristics for deciding whether to use delta encoding, so it often chooses inferior dictionary encoding or applies deltas in cases where they increase file size by up to 50%.
  • ORC only uses delta encoding for monotonic sequences (so none of the air particle columns in this dataset).

If we manually take the deltas for each column and create a new .zstd.parquet file, we already get down to 5.2MiB. That's a huge improvement from 14.9MiB, but still not on par with the .qco version's 4.3MiB. Using a Parquet-MR writer to leverage the new delta encoding functionality actually does worse, at 8.4MiB, because of those bad heuristics I mentioned.

QCompress's second, novel technique is to work with numerical ranges rather than treating bytes as tokens, giving it that extra edge. I discussed this in depth in a previous post. In every nontrivial real-world or synthetic example I've looked at, QCompress improves compression ratio over any (properly delta-encoded) alternative, typically by 20-30%.
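
For intuition, here is a toy model of range-based encoding (a simplified sketch, not the real implementation): split the values into a few equal-count ranges, then charge each value a short range prefix plus just enough offset bits to pin it down within its range. Tight ranges need few offset bits, which is where the savings come from, especially on skewed data like the hypothetical column below.

fn offset_bits(span: u64) -> u32 {
    if span == 0 { 0 } else { 64 - span.leading_zeros() }
}

fn main() {
    // hypothetical skewed column: 99% small values, 1% large outliers
    let nums: Vec<i64> = (0..10_000)
        .map(|i| if i % 100 == 0 { i * i } else { i % 100 })
        .collect();

    let mut sorted = nums.clone();
    sorted.sort_unstable();
    let n = sorted.len();

    // quantile boundaries for 4 equal-count ranges
    let bounds: Vec<(i64, i64)> = (0..4)
        .map(|r| (sorted[r * n / 4], sorted[((r + 1) * n / 4).min(n - 1)]))
        .collect();

    // cost per value: 2-bit range prefix + offset bits within its range
    let total_bits: u64 = nums.iter().map(|&x| {
        let (lo, hi) = *bounds.iter()
            .find(|&&(lo, hi)| x >= lo && x <= hi)
            .unwrap();
        2 + offset_bits((hi - lo) as u64) as u64
    }).sum();

    println!("{:.1} bits/number vs 64 raw", total_bits as f64 / n as f64);
}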

How can we start leveraging QCompress?

The best way is through open source contribution! I know of people already using it in industry, but open source will always bring the highest impact to the largest audience. There is work to be done both in QCompress itself and in Parquet and ORC. If you would like to collaborate, you can join the Discord.

If you can't wait for the slow open source adoption cycle, QCompress is available as a Rust library with basic JVM bindings, as well as a CLI.
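
For the Rust library, usage looks roughly like this, based on the q_compress crate's auto_compress/auto_decompress helpers (the exact API may differ by version, so check the crate docs):

use q_compress::{auto_compress, auto_decompress, DEFAULT_COMPRESSION_LEVEL};

fn main() {
    // stand-in for a real column, e.g. timestamps a few seconds apart
    let nums: Vec<i64> = (0..100_000).map(|i| 1_650_000_000 + 3 * i).collect();

    // compress into .qco bytes, then recover the numbers losslessly
    let bytes: Vec<u8> = auto_compress(&nums, DEFAULT_COMPRESSION_LEVEL);
    let recovered: Vec<i64> = auto_decompress(&bytes).expect("invalid .qco bytes");

    assert_eq!(recovered, nums);
    println!("compressed {} numbers into {} bytes", nums.len(), bytes.len());
}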

Example of installing and using the QCompress CLI to compress the air quality dataset's timestamp column into a file out.qco (requires Rust):

% cargo install q_compress_cli
% qcompress compress \
  --parquet devinrsmith-air-quality.20220714.zstd.parquet \
  --col-name timestamp \
  out.qco

Acknowledgement

Thank you to my friend at Deephaven for collecting and releasing the air quality dataset!

Pedantic Details

  • I measure Parquet column size within the original Parquet file (via parquet-tools), rather than by turning each column into its own Parquet file with extra overhead. This is the fairer and more lightweight option.
  • About 99.98% of this Parquet file comes from column data, so there's no need to correct for the (valuable) metadata Parquet includes.
  • Parquet and ORC are notoriously slow open source projects, so there's no guarantee they will ever adopt QCompress. It almost makes me want to introduce a new competing format.
  • QCompress actually has another novel innovation: automatic GCD detection. E.g. if all the numbers in a certain range are congruent modulo 7, QCompress will detect that and leverage it (sketched below).
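
For intuition, GCD detection amounts to something like this toy sketch (not the crate's actual internals): if the differences from the smallest value all share a common divisor, each value can be stored as a much smaller multiple of it.

fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let t = a % b;
        a = b;
        b = t;
    }
    a
}

fn main() {
    // hypothetical values, all congruent to 3 modulo 7
    let nums = [3_i64, 17, 38, 108, 703];
    let base = *nums.iter().min().unwrap();
    let g = nums.iter().fold(0_u64, |g, &x| gcd(g, (x - base) as u64));
    // g == 7, so each value could be stored as (x - base) / g,
    // shrinking the range (and thus the offset bits) by a factor of 7
    println!("detected gcd: {}", g);
}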

Update: 2022-07-17

I had forgotten that Parquet V2 introduced a delta encoding, so I reworded some things and added another file size comparison. I also added the last couple of pedantic details.