Graph all the things
analyzing all the things you forgot to wonder about
2023-12-31
interests: compression, data engineering
I've made an upgrade to Quantile Compression, and I'm calling it pcodec, or pco ("pico") for short.
Just like its predecessor, pco is a format and algorithm for compression of numerical sequences. Columnar and time series data are the most obvious candidates. Pco's biggest improvement over Quantile Compression is in decompression speed, making it competitive in many more regimes. On my machine, it decompresses 2-4GB/s on a single core. Compared to Quantile Compression, it typically gets
Both .qco and .pco allow non-breaking format changes. I could have made .qco compress to a new format as long as the new decompression code could handle both old and new formats. I've made non-breaking changes several times already, but I opted to break in this case because the magnitude of the change was so large. Almost none of the original .qco decompression code could be reused for the new format.
The changes were so large because I've learned a lot about CPU performance in the 2.5 years since I started Quantile Compression.
Here's a comparison of the .qco format on the left vs .pco on the right:
The major format changes include:
Keep the frequently-accessed state of your program small so that it fits in cache. Pco mostly keeps its state under 32kB, which is small enough to fit in most L1 caches.
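As a rough illustration (the struct below is hypothetical, not pco's real state), a budget like that can even be enforced at compile time:

use std::mem::size_of;

// Hypothetical decompressor state, sized to stay well under a 32kB L1 data cache.
struct DecompressState {
  bin_lowers: [u64; 256],   // 2kB of bin lower bounds
  lookup_table: [u16; 4096], // 8kB decoding lookup table
  scratch: [u32; 512],      // 2kB of per-batch scratch space
}

// compile-time check that the hot state fits comfortably in L1
const _: () = assert!(size_of::<DecompressState>() <= 32 * 1024);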
Do fewer passes over your data. This will reduce how many times your program needs to pull the data out of main memory/disk/wherever. And a related phenomenon: would-be performance improvements to your code won't help if you're bottlenecked on memory bandwidth. E.g. if your memory bandwidth is slower than serial addition, SIMD addition won't speed up the addition of large vectors. SIMD is most useful when you have multiple operations to do on your data while it's held in registers or a fast cache.
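For example (an illustrative sketch, not pco's code), fusing a delta-decoding pass and a scaling pass into one loop halves the number of trips over the data:

// two passes: the intermediate Vec gets written to memory and re-read
fn two_passes(deltas: &[i64], scale: i64) -> Vec<i64> {
  let mut acc = 0i64;
  let cumulative: Vec<i64> = deltas
    .iter()
    .map(|&d| {
      acc = acc.wrapping_add(d);
      acc
    })
    .collect();
  cumulative.iter().map(|&x| x.wrapping_mul(scale)).collect()
}

// one pass: each element is decoded and scaled while it's still in a register
fn one_pass(deltas: &[i64], scale: i64) -> Vec<i64> {
  let mut acc = 0i64;
  deltas
    .iter()
    .map(|&d| {
      acc = acc.wrapping_add(d);
      acc.wrapping_mul(scale)
    })
    .collect()
}

fn main() {
  assert_eq!(one_pass(&[1, 1, 1], 10), two_passes(&[1, 1, 1], 10));
}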
If you have a register-sized variable that gets modified in your hot loop, ensure it gets held in a register. In some cases, code like this will actually incur instructions to load and store self.foo on each iteration:
fn do_stuff(&mut self, data: &[...]) {
  for x in data {
    // do something modifying self.foo
  }
}
Even L1 cache has a latency of several cycles, so it can be more efficient to use an explicit stack variable for foo and write it back at the end:
fn do_stuff(data: ..., state: &mut State) {
  let mut foo = state.foo;
  for x in data {
    // do something modifying foo
  }
  state.foo = foo;
}
Make the sizes of your data elements nice. Situation 1: your hot loop does random access on a Vec<Foo> where each Foo is 7 bytes. Your program might calculate i * 7 to get the memory address it needs, which adds a few cycles of latency to the loop. You can probably improve performance by changing your struct's alignment to 8 bytes, allowing your processor to effectively compute i << 3 instead; shifting is faster than multiplication. Situation 2: you're looping over j and reading a[j] and b[j] on each iteration, where a holds 32-bit ints and b holds 64-bit ints. Your program might increment the memory address for each of these separately, since the increment is different. If it's an option, you may see a small speedup by changing a and b to hold the same data type, saving your processor an instruction on every iteration.
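A sketch of situation 1 (Foo7 and Foo8 are made-up names): padding the 7-byte element out to 8 bytes lets the indexing math compile to a shift instead of a multiply.

use std::mem::size_of;

// 7 bytes per element: indexing a slice costs an i * 7 address computation
#[repr(C, packed)]
struct Foo7 {
  bytes: [u8; 7],
}

// padded to 8 bytes: the address computation becomes i << 3
#[repr(C)]
struct Foo8 {
  bytes: [u8; 7],
  _pad: u8,
}

fn main() {
  assert_eq!(size_of::<Foo7>(), 7);
  assert_eq!(size_of::<Foo8>(), 8);
}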
Look at the assembly code. This is the most important learning, because it generates new learnings and informs you of whether your code has some of the above maladies. This has happened to me several times: I look at the Rust code, think it couldn't get any faster, check the assembly code, realize it's wasting a few instructions on something silly, tweak the Rust code, and see a performance improvement. It's a bit laborious, and it requires learning to read assembly, but it's essential to answer my "why" questions. And the good news is that you only need to read assembly, never write it. Once I learned what I needed for x86_64, it was pretty easy to adjust to AArch64.
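One small trick that helps with this (a general technique, not something specific to pco, and decode_batch below is just a stand-in): temporarily mark the function you're tuning #[inline(never)] so it shows up as its own labeled block in the compiler's assembly output, e.g. from rustc's --emit asm.

// easy to find by name in the disassembly while tuning; remove the attribute afterward
#[inline(never)]
fn decode_batch(offsets: &[u32], lower: u32, dst: &mut Vec<u32>) {
  dst.extend(offsets.iter().map(|&off| lower.wrapping_add(off)));
}

fn main() {
  let mut dst = Vec::new();
  decode_batch(&[1, 2, 3], 100, &mut dst);
  assert_eq!(dst, vec![101, 102, 103]);
}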
I mentioned a few more performance things, including branching, in this reddit post.
Many of the reasons are related to some of the format changes I mentioned above:
As I simplified decompression, reading and understanding the assembly became more tractable, allowing me to invest in the previous section's learnings.
During early development of Quantile Compression I encountered a challenge: millisecond timestamps in a microsecond-precise data type. They were all multiples of 1000, and any good codec should exploit that. So I made an amendment to the formula for a number. Instead of

  x = bin.lower + offset

we now had

  x = bin.lower + offset * gcd

This posed one new problem and failed to fix an old one. The new problem: multiplication is kinda slow. I worked around this by making specialized code: one loop for when all the GCDs are 1, and another for when they aren't (via a Rust generic type). You only pay for multiplication if you use it.
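A minimal sketch of that kind of specialization (sketched here with a const generic; qco used a generic type parameter, and its actual code differed): the compiler emits one loop with the multiply and one without.

// the GCD == 1 case compiles to a loop with no multiplication at all
fn decode_batch<const USE_GCD: bool>(lowers: &[u64], offsets: &[u64], gcd: u64, dst: &mut Vec<u64>) {
  for (&lower, &offset) in lowers.iter().zip(offsets) {
    let x = if USE_GCD {
      lower + offset * gcd
    } else {
      lower + offset
    };
    dst.push(x);
  }
}

// callers pick the specialization once, outside the hot loop
fn decode(lowers: &[u64], offsets: &[u64], gcd: u64, dst: &mut Vec<u64>) {
  if gcd == 1 {
    decode_batch::<false>(lowers, offsets, gcd, dst);
  } else {
    decode_batch::<true>(lowers, offsets, gcd, dst);
  }
}

fn main() {
  let mut dst = Vec::new();
  decode(&[100, 200], &[3, 4], 1000, &mut dst);
  assert_eq!(dst, vec![3100, 4200]);
}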
The old problem: decimal floats still got bad compression. Sometimes people compress floats that are all approximately(!) multiples of (say) 0.01. The GCD approach couldn't handle that.
For pco, I found an approach that kept the code simple, handled both integer and float multipliers, and can be extended to many new types of special data if I become aware of them. Instead of changing the classic formula for combining offsets and bin lower bounds, we decompose each number into multiple latent variables. The new formula for a number with a GCD becomes

  x = gcd * l0 + l1

where l0 and l1 can each be obtained by the classic formula:

  l = bin.lower + offset

And I've already mentioned the new formula for float multiples in a previous section. Very often, a latent variable is trivial, as l1 is in the case of microsecond-precise millisecond timestamps.
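Concretely (an illustrative example, not pco's API), decomposing a microsecond-precision timestamp that is secretly a millisecond timestamp gives a remainder latent that is always zero:

// x = gcd * l0 + l1
fn split(x: u64, gcd: u64) -> (u64, u64) {
  (x / gcd, x % gcd)
}

fn main() {
  let (l0, l1) = split(1_700_000_000_123_000, 1000);
  assert_eq!((l0, l1), (1_700_000_000_123, 0)); // l1 is trivially 0
}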
Instead of fancy generics, pco turns these off with an if check.
Thanks to batching, checks like this can run outside of the hot loop without hurting performance.
And while Quantile Compression had hard-coded the special case of GCDs, pco is robust to future change. Adding a new mode requires no meaningful change to the format or data structures.
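To give a feel for why (a hypothetical sketch, not pco's actual types): a mode only decides how latent variables recombine into a number, so a new mode is just another arm in a match, while bins, offsets, and the rest of the format stay untouched.

enum Mode {
  Classic,      // x = l0
  IntMult(u64), // x = gcd * l0 + l1
  // a future mode slots in here with its own recombination rule
}

fn join_latents(mode: &Mode, l0: u64, l1: u64) -> u64 {
  match mode {
    Mode::Classic => l0,
    Mode::IntMult(gcd) => gcd * l0 + l1,
  }
}

fn main() {
  assert_eq!(join_latents(&Mode::IntMult(1000), 1_700_000_000_123, 0), 1_700_000_000_123_000);
}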