Graph all the things
analyzing all the things you forgot to wonder about
2024-01-02
interests: compression, data engineering (explained for beginners)
Parquet is an extremely popular format for storing tabular data, used at many major tech companies. I don't have a source for this, but I'm pretty confident there's over an exabyte of data (10^18 bytes) stored in Parquet. Countless millions of dollars are spent storing it, transferring it over the network, and processing it each year. And yet Parquet does somewhat poorly at compressing numerical columns.
I forked the Rust implementation of Parquet to demonstrate the Parquet we could have: one with excellent compression for numerical columns. I did this by adding pcodec as an encoding.
I benchmarked against the numerical columns of 3 public, real-world datasets (methodology at bottom):
I argue that compression ratio is usually the most important metric by far. Read throughput is often bottlenecked on network or disk IO, in which case compression ratio is the only metric that matters. Even when latency is the primary concern, compression ratio can matter more than decompression speed, since it reduces the time spent fetching the compressed data. Plus, better compression decreases storage costs.
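To put some purely hypothetical numbers on the latency point: say the network delivers 1 GB/s, decompression runs at 4 GB/s, and a query reads 8 GB of raw column data compressed at ratio r. The end-to-end time is roughly

$$ t \approx \frac{8\,\mathrm{GB}/r}{1\,\mathrm{GB/s}} + \frac{8\,\mathrm{GB}}{4\,\mathrm{GB/s}}. $$

At r = 2 that's about 4 + 2 = 6 seconds; at r = 4 it's about 2 + 2 = 4 seconds, even with no change in decompression speed at all.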
In these benchmarks, pco's only weak spot is compression speed, and I have plans to improve that in the future.
Given pcodec, hacking it into Parquet wasn't terribly complicated. This is not at all meant to be a good or fully-featured implementation, just a simple proof of concept.
To Parquet, pco is a type of encoding, as opposed to a compression. I've mentioned this detail in a previous post about pco's predecessor. This is because pco cares about the data type it's encoding. It goes from (a sequence of numbers -> bytes), as opposed to just (bytes -> bytes).
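To make that shape concrete, here's a minimal sketch of round-tripping one numerical column through pco's standalone API. I'm assuming the simpler_compress / simple_decompress helpers and the DEFAULT_COMPRESSION_LEVEL constant as they existed around pco 0.1; the exact names may differ in other versions.

```rust
use pco::standalone::{simple_decompress, simpler_compress};
use pco::DEFAULT_COMPRESSION_LEVEL;

fn main() {
    // pco starts from typed numbers (here i64), not opaque bytes.
    let nums: Vec<i64> = (0..100_000i64).map(|i| 1_000 * i + i % 7).collect();

    // numbers -> bytes
    let bytes: Vec<u8> =
        simpler_compress(&nums, DEFAULT_COMPRESSION_LEVEL).expect("compression failed");

    // bytes -> numbers; the caller supplies the number type again.
    let recovered: Vec<i64> = simple_decompress(&bytes).expect("decompression failed");
    assert_eq!(recovered, nums);
}
```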
Encodings have other implications too, like how Parquet decides to split data pages. A fully-featured integration would require different treatment from Parquet's other encodings on this matter, but that's beyond the scope of this blog post.
Benchmarks were done on a single Apple M3 performance core.
For Parquet I used format version 2, library version 49.0.0. I took a quick look at version 1, but (as expected) version 2 was better in most cases. I used the default settings for everything except chunk size, which I tuned after some experimentation to find what worked best for zstd on these datasets.
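Roughly, the writer configuration looks something like the following with the Rust parquet crate (treating chunk size as the max row group size and using the crate's default zstd level, both of which are simplifications; the specific size below is just a placeholder):

```rust
use parquet::basic::{Compression, ZstdLevel};
use parquet::file::properties::{WriterProperties, WriterVersion};

fn main() {
    // Sketch of the writer configuration: format version 2, zstd compression,
    // and an explicit row group ("chunk") size; everything else left at defaults.
    let _props = WriterProperties::builder()
        .set_writer_version(WriterVersion::PARQUET_2_0)
        .set_compression(Compression::ZSTD(ZstdLevel::default()))
        // Placeholder size; in practice this gets tuned per dataset.
        .set_max_row_group_size(1 << 20)
        .build();
}
```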
For pco I used the default compression configuration.
For details about the datasets, see the pcodec repo's benchmarks.
I was asked for some benchmarks comparing the BYTE_STREAM_SPLIT encoding, which applies to floats, so I did another comparison, pictured below. Only the Taxi dataset had floats, so that's what I used. I threw in the PLAIN encoding too. These are both different from the default of dictionary encoding, which is treated somewhat specially. Unfortunately, arrow-rs doesn't support BYTE_STREAM_SPLIT yet, so I used pyarrow and could only make a fair comparison for compression ratio.
It may be surprising to some that BYTE_STREAM_SPLIT does worse than PLAIN in this case. The reason is that most columns in the Taxi dataset are decimal-ish: most of their values are approximately multiples of 0.01 or some other base. As a result, there are obvious patterns in their mantissa bytes that zstd can exploit. But if we split the floats into 8 separate streams, 1 for each byte, zstd can no longer exploit the correlation between mantissa bytes.
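As a toy illustration (not the actual Parquet code path), splitting an f64 column byte-wise looks something like this: the k-th byte of every value gets gathered into the k-th stream, so bytes that were adjacent within one float end up far apart in the encoded buffer.

```rust
// Toy sketch of the BYTE_STREAM_SPLIT idea for f64 values: gather the k-th
// byte of every value into the k-th of 8 streams, concatenated back to back.
fn byte_stream_split(values: &[f64]) -> Vec<u8> {
    let n = values.len();
    let mut out = vec![0u8; n * 8];
    for (i, v) in values.iter().enumerate() {
        for (k, byte) in v.to_le_bytes().iter().enumerate() {
            // Byte k of value i lands in stream k at offset i.
            out[k * n + i] = *byte;
        }
    }
    out
}

fn main() {
    // Decimal-ish values like these share structure across their mantissa
    // bytes when laid out plainly; after the split, the bytes of any single
    // value are n positions apart.
    let fares: Vec<f64> = (0..4).map(|i| 0.01 * i as f64).collect();
    println!("{:?}", byte_stream_split(&fares));
}
```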