Graph all the things
analyzing all the things you forgot to wonder about
2024-11-13
interests: fast code, Rust
I made a Rust crate, `better_io`, that allows zero-copy reads, and I used it to increase decompression speed in the numerical compression library Pcodec by several percent.
Suppose your library needs to read contiguous slices of bytes, and you want to support any kind of data source. Here are some things you might try:
First, you might try to manage a buffer of length `n` as a `Vec<u8>`, accept a `Read`, and use it to populate that vector. `Read` allows you to do this:

```rust
pub trait Read {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize>;
}
```
But if a user's `Read` was already an in-memory byte slice, you'll pay an unnecessary copy from `self` to `buf`! Plus, the semantics of `Read` are lower-level than you want; it might not fill the buffer completely, so you'll need to keep track and call `read` until it's actually full¹.
Fabien Giesen made the exact same observations about this API in C++ and came up with a similar solution to the one I will propose here.
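For the record, that bookkeeping looks something like this minimal sketch (my own illustration, not Pco's code; it's essentially a `read_exact` that tolerates EOF, sidestepping the pickle described in the footnote):

```rust
use std::io::{ErrorKind, Read, Result};

/// Call `read` until `buf` is full or the source hits EOF.
/// Returns the number of bytes actually read.
fn read_to_fill<R: Read>(src: &mut R, buf: &mut [u8]) -> Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match src.read(&mut buf[filled..]) {
            Ok(0) => break, // EOF: the caller gets a short count, not garbage
            Ok(n) => filled += n,
            // Spurious interruptions are allowed by the `Read` contract.
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}
```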
So `Read` isn't very useful for data consumers who want to handle in-memory data efficiently, but that's not to say `Read` is bad! It's an excellent trait for describing the true behavior of files and network streams! All we need is a little middleware to make it nice for the consumer.
Next, you might try to accept a `BufRead` and use its internal buffer instead of writing to an extra buffer. `BufRead` allows you to do this:

```rust
pub trait BufRead: Read {
    fn fill_buf(&mut self) -> Result<&[u8]>;
    fn consume(&mut self, amt: usize);
}
```
This trait's `fill_buf` method hands you zero-copy byte slices from both in-memory and streamed data. The only² problem is: `BufRead` implementations such as `BufReader` refuse to cycle/refill their buffers until completely empty, so there's often no way to get the next bytes contiguously. The result from `fill_buf` can be just 1 byte long, and there's nothing you can do about it.
In my opinion, `BufRead` is a bad abstraction. Our data sources are perfectly capable of giving us zero-copy contiguous reads, but `BufRead` doesn't empower the data consumer.
`BetterBufRead` essentially offers:

```rust
pub trait BetterBufRead {
    fn fill_or_eof(&mut self, n_bytes: usize) -> Result<()>;
    fn buffer(&self) -> &[u8];
    fn consume(&mut self, n_bytes: usize);
    ...
}
```
Okay, this looks pretty much the same as `BufRead`, except we've decoupled `fill_buf` into `fill_or_eof` and `buffer`. But the crucial difference is its contract: implementations of `BetterBufRead` must produce a buffer of at least `n_bytes` contiguous bytes after `fill_or_eof` whenever possible. They are only allowed to shortchange us once they hit EOF (end of file). This contract fulfills all our zero-copy, contiguous ambitions.
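For instance, here's how a consumer might lean on that contract. This is my own sketch against the trait as quoted above, not a helper shipped by `better_io`:

```rust
use std::io::{Error, ErrorKind, Result};

use better_io::BetterBufRead;

// Decode a little-endian u64 straight out of the reader's buffer.
fn read_u64_le(src: &mut impl BetterBufRead) -> Result<u64> {
    src.fill_or_eof(8)?;
    let buf = src.buffer();
    // By the contract, a short buffer here can only mean EOF.
    if buf.len() < 8 {
        return Err(Error::new(ErrorKind::UnexpectedEof, "hit EOF"));
    }
    // Zero-copy: no intermediate buffer, just a view into `src`.
    let value = u64::from_le_bytes(buf[..8].try_into().unwrap());
    src.consume(8);
    Ok(value)
}
```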
Two more subtleties I'm proud of:
- `BetterBufRead` allows easily mutating its capacity or creating new instances based on other buffered readers, losslessly. You'd think that would be an obvious requirement, since different code paths will require reads of different lengths, but neither `BufRead` nor its implementation `BufReader` has a way to do these things³. As a result, `BufReader` users sometimes create `BufReader<BufReader<R>>` and so forth to achieve their desired buffering. And that's obviously an unnecessary copy! (A sketch of what this losslessness could look like follows this list.)
- `BetterBufRead` encourages zero-copy behavior at compile time. You may have noticed that `BufRead` implements `Read`, allowing the user to create those `BufReader<BufReader<...>>`s if they want. In the `BetterBufRead` philosophy, this is a mistake, so `BetterBufRead` does not implement `Read`. This makes it impossible to construct a `BetterBufReader` of a `BetterBufReader`.
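Here's the promised sketch of what lossless decomposition and recreation could look like. The struct and method names are hypothetical, purely for illustration, and not `better_io`'s actual API:

```rust
use std::io::Read;

// Hypothetical buffered reader, purely for illustration.
struct SketchBufReader<R: Read> {
    buffered: Vec<u8>, // bytes pulled from `inner` but not yet consumed
    inner: R,
}

impl<R: Read> SketchBufReader<R> {
    // Lossless decomposition: nothing gets thrown away.
    fn into_parts(self) -> (Vec<u8>, R) {
        (self.buffered, self.inner)
    }

    // Lossless recreation from a pre-populated buffer. `std`'s
    // `BufReader` has no constructor like this, hence the nesting.
    fn from_parts(buffered: Vec<u8>, inner: R) -> Self {
        Self { buffered, inner }
    }
}
```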
Why go to all this trouble? To be fast. It's always to be fast.
Compression libraries (and the like) tend to decode many tiny values. Imagine we want to decode 100 bit-packed integers in a loop, and each one is randomly between 1 and 57 bits long. The total number of bytes we need is unpredictable.
You might not like it, but here's an efficient way to read such an integer:
```rust
/// Reads a `bit_len`-bit integer starting at bit `bit_idx` of `src`.
///
/// Safety: `src` must contain at least `bit_idx / 8 + 8` bytes.
/// Correctness also requires `bit_idx % 8 + bit_len <= 64`.
unsafe fn read_u64_at(
    src: &[u8],
    bit_idx: usize,
    bit_len: u32,
) -> u64 {
    let byte_idx = bit_idx / 8;
    let bits_past_byte = bit_idx % 8;
    // Unaligned 8-byte load; fine because `[u8; 8]` has alignment 1.
    let raw_bytes = *(src.as_ptr().add(byte_idx) as *const [u8; 8]);
    let byte_aligned_u64 = u64::from_le_bytes(raw_bytes);
    let mask = (1 << bit_len) - 1;
    (byte_aligned_u64 >> bits_past_byte) & mask
}
```
This has the remarkable property of being shorter in assembly than in source code.
For each integer, we could check whether we have enough bytes remaining and perform a read if not, but that branching would slow us down tremendously. Instead, we read enough bytes upfront to guarantee we can decode all the integers (that's ⌊99 · 57 / 8⌋ + 8 = 713 bytes if you care: the last integer starts no later than bit 99 · 57, and each read touches 8 bytes). Having those contiguous bytes, we need not branch, and we can safely unleash our horribly unsafe code while staying segfault-free.
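Put together, the decode loop can hoist all bounds checking into a single upfront assertion. This sketch (my illustration, not Pco's actual loop) uses `read_u64_at` from above:

```rust
// Branch-free decoding of 100 bit-packed integers. `bit_lens` holds
// each integer's width in bits (1..=57); `src` was filled upfront.
fn decode_100(src: &[u8], bit_lens: &[u32; 100]) -> Vec<u64> {
    // One check for the whole batch instead of one per integer.
    assert!(src.len() >= 99 * 57 / 8 + 8); // = 713
    let mut bit_idx = 0;
    let mut out = Vec::with_capacity(100);
    for &bit_len in bit_lens {
        // Sound thanks to the assert above.
        out.push(unsafe { read_u64_at(src, bit_idx, bit_len) });
        bit_idx += bit_len as usize;
    }
    out
}
```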
Intriguingly, there is no `BufWrite` in the Rust stdlib. Nor have I implemented a `BetterBufWrite` yet. But I don't think there's any fundamental reason for that - one could design a buffered writer trait that provides a mutable destination byte slice. The only reason I haven't done this is that Pco didn't need it; compression is about 10x slower than decompression, so no one would notice a zero-copy speedup. The only thing standing between `BetterBufWrite` and existence is a use case.
Good APIs give control to the user in the ways that matter while abstracting away the rest. `Read` and `BetterBufRead` both do this in their own ways. `BufRead` could be better.
Q: You aren't getting contiguous bytes at EOF anyway, so what do you do then?
A: Since EOF happens only once, we make a one-time exception and copy the final incomplete bytes to a large enough buffer. Normally this is something like the last 1kB out of a 1MB+ file, so I guess you could call it 0.001-copy. With tricks like this, I'm never the first person to come up with them, and of course there's another Fabien Giesen blog post to prove it.
¹There is a `read_exact` function, but if you hit the end of the file, you'll be in a pickle. It can leave the buffer full of garbage and return an error, losing not only those final bytes but also the current state of the `Read`.
²Well, that and the other side problems I mention later. Saying "only" just sounded good.
³In particular, you can destroy a `BufReader<R>` without losing information, but you can't recreate one. Extracting the original `Read` is an information-destroying method, but you can copy the buffer ahead of time to retain full information. It's just that `BufReader` provides no way to create a new instance from a pre-populated buffer. `BetterBufReader` offers one, though!