BetterBufRead: Faster Reads in Rust

2024-11-13
interests: fast code, Rust

I made a Rust crate better_io that allows zero-copy reads, and I used it to increase decompression speed in the numerical compression library Pcodec by several %.

Suppose your library needs to read contiguous slices of n bytes, and you want to support any kind of data source. Here are some things you might try:

The Read trait

First, you might try to manage a buffer of length ≥ n, accept a Read, and use it to fill that buffer. Read allows you to do this:

pub trait Read {
  fn read(&mut self, buf: &mut [u8]) -> Result<usize>;
}

But if a user's Read was already an in-memory byte slice, you'll pay an unnecessary copy from self to buf! Plus, the semantics of Read are lower-level than you want; it might not fill the buffer completely, so you'll need to keep track and call read until it's actually full[1].
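
To make that bookkeeping concrete, here's a minimal sketch of the loop you end up writing around read (the helper and its name are mine, not from any library):

use std::io::{Read, Result};

fn fill_exactly<R: Read>(src: &mut R, buf: &mut [u8]) -> Result<usize> {
  let mut filled = 0;
  while filled < buf.len() {
    // read() may hand us anywhere from 0 to buf.len() - filled bytes
    let n = src.read(&mut buf[filled..])?;
    if n == 0 {
      break; // EOF: the source had fewer bytes than we wanted
    }
    filled += n;
  }
  Ok(filled)
}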

Fabien Giesen made the exact same observations about this API in C++ and came up with a similar solution to the one I will propose here.

So Read isn't very useful for data consumers who want to handle in-memory data efficiently, but that's not to say Read is bad! It's an excellent trait to describe the true behavior of files and network streams! All we need is a little middleware to make it nice for the consumer.

The BufRead trait

Next, you might try to accept a BufRead and use its internal buffer instead of writing to an extra buffer. BufRead allows you to do this:

pub trait BufRead: Read {
  fn fill_buf(&mut self) -> Result<&[u8]>;
  fn consume(&mut self, amt: usize);
}

This trait's fill_buf method hands you zero-copy byte slices from both in-memory and streamed data. The only[2] problem is: BufRead implementations such as BufReader refuse to cycle/refill their buffers until completely empty, so there's often no way to get the next n bytes contiguously. The result from fill_buf can be just 1 byte long, and there's nothing you can do about it.
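
To see this concretely, here's a small demo with std's BufReader (the capacity of 8 bytes is just for illustration; the real default is 8KiB):

use std::io::BufRead;

fn main() {
  let data = vec![0u8; 1024];
  let mut reader = std::io::BufReader::with_capacity(8, &data[..]);
  reader.fill_buf().unwrap(); // internal buffer now holds 8 bytes
  reader.consume(7);          // suppose we decoded something 7 bytes long
  // the buffer isn't empty yet, so BufReader refuses to refill it:
  let remaining = reader.fill_buf().unwrap();
  assert_eq!(remaining.len(), 1); // 1 contiguous byte, despite 1016 more in the source
}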

In my opinion, BufRead is a bad abstraction. Our data sources are perfectly capable of giving us zero-copy contiguous reads, but BufRead doesn't empower the data consumer.

The solution: BetterBufRead

BetterBufRead essentially offers:

pub trait BetterBufRead {
  fn fill_or_eof(&mut self, n_bytes: usize) -> Result<()>;
  fn buffer(&self) -> &[u8];
  fn consume(&mut self, n_bytes: usize);
  // ...
}

Okay, this looks pretty much the same as BufRead, except we decoupled fill_buf into fill_or_eof and buffer. But the crucial difference is its contract: implementations of BetterBufRead must produce a buffer of at least n_bytes contiguous bytes after fill_or_eof whenever possible. They are only allowed to shortchange us once they hit the EOF (end of file).

This contract fulfills all our zero-copy, contiguous ambitions.
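
For example, a consumer that needs 4 contiguous bytes might look like this (a minimal sketch: the function and its name are mine, I'm assuming the Result here is std::io::Result, and only the three trait methods shown above are used):

use better_io::BetterBufRead;

fn read_magic_number<R: BetterBufRead>(src: &mut R) -> std::io::Result<[u8; 4]> {
  src.fill_or_eof(4)?;
  let buf = src.buffer();
  if buf.len() < 4 {
    // per the contract, the only way this can happen is EOF
    return Err(std::io::ErrorKind::UnexpectedEof.into());
  }
  let magic = [buf[0], buf[1], buf[2], buf[3]];
  src.consume(4);
  Ok(magic)
}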

Two more subtleties I'm proud of:

  • BetterBufRead makes it easy to mutate its capacity or to losslessly create new instances from other buffered readers. You'd think that would be an obvious requirement, since different code paths will require different read lengths, but neither BufRead nor its implementation BufReader has a way to do these things[3]. As a result, BufReader users sometimes create BufReader<BufReader<R>> and so forth to achieve their desired buffering (see the sketch after this list). And that's obviously an unnecessary copy!
  • BetterBufRead encourages zero-copy behavior at compile time. You may have noticed that BufRead is a subtrait of Read, allowing users to create those BufReader<BufReader<...>>s if they want. In the BetterBufRead philosophy, this is a mistake, so BetterBufRead does not implement Read. This makes it impossible to construct a BetterBufReader of a BetterBufReader.
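
For reference, here's the kind of std-only workaround the first bullet is talking about: partway through decoding, some code path suddenly needs a wider contiguous window, and since BufReader can't change its capacity, the only option is to wrap it in another BufReader (the function and its name are mine):

use std::io::{BufReader, Read};

fn widen<R: Read>(small: BufReader<R>) -> BufReader<BufReader<R>> {
  // an extra layer of buffering, and an extra copy for every byte read
  BufReader::with_capacity(1024, small)
}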

Why do you want contiguous slices of n bytes anyway?

To be fast. It's always to be fast.

Compression libraries (and the like) tend to decode many tiny values. Imagine we want to decode 100 bit-packed integers in a loop, and each one is randomly between 1 and 57 bits long. The total number of bytes we need is unpredictable.

You might not like it, but here's an efficient way to read such an integer:

/// Reads the `bit_len` bits starting `bit_idx` bits into `src` and returns
/// them as a u64.
///
/// Safety: the caller must ensure `bit_idx / 8 + 8 <= src.len()`; the result
/// is only meaningful when `bit_len <= 57`.
unsafe fn read_u64_at(
  src: &[u8],
  bit_idx: usize,
  bit_len: u32,
) -> u64 {
  let byte_idx = bit_idx / 8;
  let bits_past_byte = bit_idx % 8;
  // an unaligned 8-byte read; [u8; 8] has alignment 1, so this is allowed
  let raw_bytes = *(src.as_ptr().add(byte_idx) as *const [u8; 8]);
  let byte_aligned_u64 = u64::from_le_bytes(raw_bytes);
  // keep only the lowest bit_len bits after shifting into place
  let mask = (1 << bit_len) - 1;
  (byte_aligned_u64 >> bits_past_byte) & mask
}

This has the remarkable property of being shorter in assembly than in source code.

For each integer, we could check if we have enough bytes remaining and perform a read if not, but that branching would slow us down tremendously. Instead, we read enough bytes upfront to guarantee we can decode all the integers (that's 7 + ⌈100 * 57 / 8⌉ = 720 bytes if you care). Having those contiguous bytes, we need not branch, and we can safely unleash our horribly unsafe code while staying segfault-free.
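
Putting it together, here's a sketch (not Pcodec's actual decoder) of the branch-free loop this buys us, using the read_u64_at above and assuming every bit length is between 1 and 57:

fn decode_bit_packed(src: &[u8], bit_lens: &[u32]) -> Vec<u64> {
  let total_bits: usize = bit_lens.iter().map(|&l| l as usize).sum();
  // one bounds check for the whole batch, per the 7 + ceil(bits / 8) rule
  assert!(src.len() >= 7 + (total_bits + 7) / 8);

  let mut bit_idx = 0;
  let mut dst = Vec::with_capacity(bit_lens.len());
  for &bit_len in bit_lens {
    // SAFETY: the assert above guarantees every 8-byte read stays in bounds
    dst.push(unsafe { read_u64_at(src, bit_idx, bit_len) });
    bit_idx += bit_len as usize;
  }
  dst
}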

What about writes?

Intriguingly, there is no BufWrite in the Rust stdlib. Nor have I implemented a BetterBufWrite yet.

But I don't think there's any fundamental reason for that - one could design a buffered writer trait that provides a mutable destination byte slice. The only reason I haven't done this is that Pco didn't need it; compression is about 10x slower than decompression, so no one would notice a zero-copy speedup. The only thing standing between BetterBufWrite and existence is a use case.
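
To make the idea concrete, here's a purely hypothetical sketch of what such a trait could look like (nothing like this exists in better_io today; every name and signature here is made up):

pub trait BetterBufWrite {
  // ensure at least n_bytes of buffer space are available, flushing
  // already-written bytes to the underlying sink if necessary
  fn make_room(&mut self, n_bytes: usize) -> std::io::Result<()>;
  // a mutable destination slice to encode into directly, zero-copy
  fn buffer_mut(&mut self) -> &mut [u8];
  // mark the first n_bytes of the buffer as written
  fn advance(&mut self, n_bytes: usize);
}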

Wisdom?

Good APIs give control to the user in the ways that matter while abstracting away the rest. Read and BetterBufRead both do this in their own ways. BufRead could be better.

Pedantic Clarifications

Q: You aren't getting n contiguous bytes at EOF anyway, so what do you do then?

A: Since EOF happens only once, we make a one-time exception and copy the final incomplete bytes into a large enough buffer. Normally this is something like the last 1kB of a 1MB+ file, so I guess you could call it ε-copy. As with most tricks like this, I'm not the first person to come up with it, and of course there's another Fabien Giesen blog post to prove it.
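
In code, the trick amounts to something as small as this (a sketch; the names are mine):

fn pad_tail(tail: &[u8], n_bytes_needed: usize) -> Vec<u8> {
  // one copy of the last, incomplete chunk into a buffer big enough for the
  // usual unchecked reads; everything before it stayed zero-copy
  let mut padded = vec![0u8; n_bytes_needed.max(tail.len())];
  padded[..tail.len()].copy_from_slice(tail);
  padded
}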

[1] There is a read_exact function, but if you hit the end of the file, you'll be in a pickle. It can leave the buffer full of garbage and return an error, losing not only those final bytes but also the current state of the Read.

[2] Well, that and the other problems I mention later. Saying "only" just sounded good.

[3] In particular, you can destroy a BufReader<R> without losing information, but you can't recreate one. Extracting the original Read discards the buffer, though you can copy the buffer out ahead of time to retain full information. It's just that BufReader provides no way to create a new instance from a pre-populated buffer. BetterBufReader offers one, though!