Graph all the things
analyzing all the things you forgot to wonder about
2024-11-15
interests: programming, Rust
I've created a monster: a macro that defines two more macros that define enums and match them. But it's saved me about 1000 lines of code, and the more I think about it, the more convinced I am it's the right solution. I reckon it could be useful for various other numerical and dataframe libraries too, e.g. Polars.
The library is dtype_dispatch, but let's start with the story.
About 3.5 years ago, I started working on PancakeDB.
This involved generic (specialized) code over data types, like i32, f32, String, etc., and that was no problem.
Writing functions like fn compress_generic<T: DataTypeTrait>(column: Vec<T>) -> Vec<u8> {...} was easy enough.
But I hit a snag whenever I needed to process data of some compile-time-unknown type.
For instance, if a user hands me a row to write to the database, what type does its first field have?
To handle all the cases, it should probably be an enum like
#[derive(...)]
enum Field {
I32(i32),
F32(f32),
String(String),
...
}
That's good and well, but going between this dynamic type and a specialized one is tedious. I ended up with match clauses everywhere:
#[derive(...)]
enum DataType {
I32,
F32,
String,
...
}
#[derive(...)]
enum Column {
I32(Vec<i32>),
F32(Vec<f32>),
String(Vec<String>),
...
}
fn to_columns(rows: Vec<Vec<Field>>, schema: Vec<DataType>) -> Vec<Column> {
let mut columns: Vec<Column> = schema.iter().map(|dtype|
match dtype {
DataType::I32 => Column::I32(Vec::new()),
DataType::F32 => Column::F32(Vec::new()),
DataType::String => Column::String(Vec::new()),
...
}
).collect();
// transpose the rows into columns
for row in rows {
for (field, column) in row.into_iter().zip(columns.iter_mut()) {
match (field, column) {
(Field::I32(value), Column::I32(values)) => values.push(value),
(Field::F32(value), Column::F32(values)) => values.push(value),
(Field::String(value), Column::String(values)) => values.push(value),
...,
_ => panic!("shiver me timbers, the data types didn't match")
}
}
}
columns
}
fn compress_generic<T: DataTypeTrait>(column: Vec<T>) -> Vec<u8> {...}
fn compress(column: Column) -> Vec<u8> {
match column {
Column::I32(values) => compress_generic(values),
Column::F32(values) => compress_generic(values),
Column::String(values) => compress_generic(values),
...
}
}
Something like that! Match clauses everywhere! Extremely tedious to work with.
Eventually I closed shop on PancakeDB (entirely because of the match clauses, of course), but I would occasionally still encounter this problem while writing dynamically-typed FFI libraries for Pcodec.
I found some unsatisfactory tricks to reduce boilerplate, using true dynamic dispatch with Box<dyn ...> in places and lousy macros in others, but it was still a pain.
And ~2 months ago I decided to use dynamic types internally in Pcodec, both to support new features and to reduce binary size (each unnecessarily specialized function is duplicated several times in the machine code, you know).
It was the last straw.
I reached the breaking point and searched for a better solution, but found none.
I briefly hoped enum_dispatch would save me, but discovered otherwise; it mostly behaves like a stack-allocated alternative to Box<dyn ...>.
Neither of these can do type-joining operations like the transpose above, nor can they downcast from dynamic type to generic type¹.
And as a code readability enjoyer, I was bothered that these techniques force so much logic into traits and away from the main routines.
So I got real with the problem and found a solution.
Somehow I was able to build a prototype in about 100 lines of macro_rules! (no procedural macros).
Abominably, the entire crate is a single macro that takes your data types and generates two more macros: one that defines enums over those types and one that matches on them.
Here's what all that imaginary PancakeDB code would look like now:
dtype_dispatch::build_dtype_macros!(
define_an_enum,
match_an_enum,
DataTypeTrait,
{
I32 => i32,
F32 => f32,
String => String,
...
},
);
type Single<L> = L;
define_an_enum!(#[derive(...)] Field(Single));
define_an_enum!(#[derive(...)] DataType);
define_an_enum!(#[derive(...)] Column(Vec));
fn to_columns(rows: Vec<Vec<Field>>, schema: Vec<DataType>) -> Vec<Column> {
let mut columns: Vec<Column> = schema.iter().map(|dtype|
match_an_enum!(dtype,
DataType<T> => { Column::new(Vec::<T>::new()).unwrap() }
)
).collect();
// transpose the rows into columns
for row in rows {
for (field, column) in row.into_iter().zip(columns.iter_mut()) {
match_an_enum!(field,
Field<T>(value) => {
let values = column.downcast_mut::<T>()
.expect("shiver me timbers, the data types didn't match");
values.push(value);
}
)
}
}
columns
}
fn compress_generic<T: DataTypeTrait>(column: Vec<T>) -> Vec<u8> {...}
fn compress(column: Column) -> Vec<u8> {
match_an_enum!(column,
Column<T>(values) => { compress_generic(values) }
)
}
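To break this down:
We call dtype_dispatch::build_dtype_macros! to define our two macros, define_an_enum and match_an_enum, given the data type name => type mapping.
We use define_an_enum! to define the enums Field, DataType, and Column.
We use match_an_enum! wherever we need to get from a dynamic value to specialized code. For example,
match_an_enum!(column,
Column<T>(values) => { compress_generic(values) }
)
once expanded becomes
match column {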
Column::I32(values) => {
type T = i32;
compress_generic(values)
},
Column::F32(values) => {
type T = f32;
compress_generic(values)
}
Column::String(values) => {
type T = String;
compress_generic(values)
}
...
}
No repeated code! Easier to read and edit! We can trivially add new data types! In Pcodec I now have 9 data types and 47 matches, so these macros keep me sane.
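To make the expansion above concrete, here is a minimal hand-written sketch of a match macro that binds a type alias per variant. It is not the code dtype_dispatch generates: match_column and len_and_elem_size are made-up names, the variant list is hardcoded rather than produced by an outer macro, and the call syntax is simplified so the caller passes the alias name (T) and the binding name explicitly.

```rust
use std::mem::size_of;

enum Column {
    I32(Vec<i32>),
    F32(Vec<f32>),
    String(Vec<String>),
}

// Hypothetical, hand-written stand-in for a generated match macro.
// The caller supplies the alias name (`$t`) and the binding name (`$inner`),
// so the body can refer to both.
macro_rules! match_column {
    ($value:expr, $t:ident, $inner:ident => $body:block) => {
        match $value {
            Column::I32($inner) => { type $t = i32; $body }
            Column::F32($inner) => { type $t = f32; $body }
            Column::String($inner) => { type $t = String; $body }
        }
    };
}

// One body, written once, specialized for every variant.
fn len_and_elem_size(column: &Column) -> (usize, usize) {
    match_column!(column, T, values => {
        (values.len(), size_of::<T>())
    })
}
```

The body is pasted into every arm, so inside it T is a real compile-time type and values has the concrete Vec type for that arm.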
To be clear, dtype_dispatch isn't better than Box<dyn ...> or enum_dispatch, but it solves a different problem:
| | Box<dyn> | enum_dispatch | dtype_dispatch |
|---|---|---|---|
| convert generic -> dynamic | ✅ | ❌¹ | ✅ |
| convert dynamic -> generic | ❌ | ❌¹ | ✅ |
| call trait fns directly | ⚠️² | ✅ | ❌ |
| match with type information | ❌ | ❌ | ✅ |
| stack allocated | ❌ | ✅ | ✅ |
| variant type requirements | trait impl | trait impl | container<trait impl> |
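For a feel of what the two "convert" rows mean, here is one way the generic <-> dynamic conversions could be written by hand with std::any::Any. This is only a sketch under my own assumptions: the Option return types and the two-variant Column are chosen for illustration, and I'm not claiming this is how Column::new or downcast_mut work inside dtype_dispatch.

```rust
use std::any::Any;

#[derive(Debug, PartialEq)]
enum Column {
    I32(Vec<i32>),
    F32(Vec<f32>),
}

impl Column {
    // generic -> dynamic: pick the variant whose element type is T,
    // or return None if T is not one of the supported data types.
    fn new<T: 'static>(values: Vec<T>) -> Option<Column> {
        // Wrapping in Option lets us move the Vec out through a &mut dyn Any.
        let mut slot = Some(values);
        let any: &mut dyn Any = &mut slot;
        if let Some(v) = any.downcast_mut::<Option<Vec<i32>>>() {
            return v.take().map(Column::I32);
        }
        if let Some(v) = any.downcast_mut::<Option<Vec<f32>>>() {
            return v.take().map(Column::F32);
        }
        None
    }

    // dynamic -> generic: succeeds only if the stored variant holds Vec<T>.
    fn downcast_mut<T: 'static>(&mut self) -> Option<&mut Vec<T>> {
        let any: &mut dyn Any = match self {
            Column::I32(values) => values,
            Column::F32(values) => values,
        };
        any.downcast_mut::<Vec<T>>()
    }
}
```

Writing even this much by hand means one more arm per data type in every such function, which is exactly the boilerplate the macros are meant to remove.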
There are some cursed limitations in dtype_dispatch due to the cursed implementation.
For instance, each defined enum must take exactly one #[] clause at the moment.
That's because it's rather hard to use repeating groups in a nested macro, so I don't.
But with a little motivation, these rough edges could be smoothed out.
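Here is a stripped-down sketch of the macro-that-defines-a-macro shape, to show where the "exactly one #[] clause" restriction can come from. The names build_enum_definer and the simplified argument list are hypothetical, not the crate's real interface. The point is that a repetition like $(...)+ written inside the outer macro's body gets expanded by the outer macro, so the generated macro can't easily declare a repetition of its own (such as "zero or more attributes") and settles for a single #[$attr:meta].

```rust
// Hypothetical outer macro: takes the name of the macro to generate and a
// `Variant => type` mapping. Not the dtype_dispatch implementation.
macro_rules! build_enum_definer {
    ($definer:ident, { $($variant:ident => $t:ty,)+ }) => {
        // `$attr` and `$name` are not bound by the outer macro, so they pass
        // through untouched and become metavariables of the generated macro.
        macro_rules! $definer {
            (#[$attr:meta] $name:ident) => {
                #[$attr]
                enum $name {
                    // This repetition is expanded by the OUTER macro.
                    $($variant($t),)+
                }
            };
        }
    };
}

build_enum_definer!(define_an_enum, {
    I32 => i32,
    F32 => f32,
});

// Exactly one attribute, then the enum name.
define_an_enum!(#[derive(Debug, PartialEq)] Field);
```

After these two invocations, Field is an ordinary enum with variants I32(i32) and F32(f32).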
TLDR, if you work with data types, this might ease your pain.
¹ Although enum_dispatch supports From and TryInto, it only works for concrete types (not in generic contexts).
² Trait objects can only dispatch to functions that can be put in a vtable, which is annoyingly restrictive. For instance, traits with generic associated functions can't be put in a Box<dyn>.