dtype_dispatch: a most beautiful hack

2024-11-15
interests: programming, Rust

I've created a monster: a macro that defines two more macros that define enums and match them. But it's saved me about 1000 lines of code, and the more I think about it, the more convinced I am it's the right solution. I reckon it could be useful for various other numerical and dataframe libraries too, e.g. Polars.

The library is dtype_dispatch, but let's start with the story.

About 3.5 years ago, I started working on PancakeDB. This involved generic (specialized) code over data types, like i32, f32, String, etc., and that was no problem. Writing functions like fn compress_generic<T: DataTypeTrait>(column: Vec<T>) -> Vec<u8> {...} was easy enough. But I hit a snag whenever I needed to process data of some compile-time-unknown type. For instance, if a user hands me a row to write to the database, what type does its first field have? To handle all the cases, it should probably be an enum like

#[derive(...)]
enum Field {
  I32(i32),
  F32(f32),
  String(String),
  ...
}

That's all well and good, but going between this dynamic type and a specialized one is tedious. I ended up with match clauses everywhere:

#[derive(...)]
enum DataType {
  I32,
  F32,
  String,
  ...
}

#[derive(...)]
enum Column {
  I32(Vec<i32>),
  F32(Vec<f32>),
  String(Vec<String>),
  ...
}

fn to_columns(rows: Vec<Vec<Field>>, schema: Vec<DataType>) -> Vec<Column> {
  let mut columns: Vec<Column> = schema.iter().map(|dtype|
    match dtype {
      DataType::I32 => Column::I32(Vec::new()),
      DataType::F32 => Column::F32(Vec::new()),
      DataType::String => Column::String(Vec::new()),
      ...
    }
  ).collect();

  // transpose the rows into columns
  for row in rows {
    for (field, column) in row.into_iter().zip(columns.iter_mut()) {
      match (field, column) {
        (Field::I32(value), Column::I32(values)) => values.push(value),
        (Field::F32(value), Column::F32(values)) => values.push(value),
        (Field::String(value), Column::String(values)) => values.push(value),
        ...,
        _ => panic!("shiver me timbers, the data types didn't match")
      }
    }
  }

  columns
}

fn compress_generic<T: DataTypeTrait>(values: Vec<T>) -> Vec<u8> {...}

fn compress(column: Column) -> Vec<u8> {
  match column {
    Column::I32(values) => compress_generic(values),
    Column::F32(values) => compress_generic(values),
    Column::String(values) => compress_generic(values),
    ...
  }
}

Something like that! Match clauses everywhere! Extremely tedious to work with.

Eventually I closed shop on PancakeDB (entirely because of the match clauses, of course), but I would occasionally still encounter this problem while writing dynamically-typed FFI libraries for Pcodec. I found some unsatisfactory tricks to reduce boilerplate, using true dynamic dispatch with Box<dyn ...> in places and lousy macros in others, but it was still a pain. And ~2 months ago I decided to use dynamic types internally in Pcodec, both to support new features and to reduce binary size (each unnecessarily specialized function gets duplicated several times in the machine code, you know). It was the last straw.
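
To give a flavor of the Box<dyn ...> trick, here's a minimal sketch with hypothetical names (not PancakeDB's or Pcodec's actual code): std::any::Any lets you downcast a type-erased column back to a concrete Vec<T>, but only at call sites that already know T, so dispatching on a runtime data type still costs a hand-written match, and the trait bounds you actually cared about are gone.

use std::any::Any;

// a type-erased column: the concrete Vec<T> is hidden behind `dyn Any`
fn push_value<T: 'static>(column: &mut dyn Any, value: T) {
  let values = column
    .downcast_mut::<Vec<T>>()
    .expect("data types didn't match");
  values.push(value);
}

fn main() {
  let mut column: Box<dyn Any> = Box::new(Vec::<i32>::new());
  // fine here, because this call site knows T = i32 at compile time;
  // picking T from a runtime data type still needs a match per type
  push_value(&mut *column, 5_i32);
}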

I reached the breaking point and searched for a better solution, but found none. I briefly hoped enum_dispatch would save me, but discovered otherwise; it mostly behaves like a stack-allocated alternative to Box<dyn ...>. Neither of these can do type-joining operations like the transpose above, nor can they downcast from dynamic type to generic type¹. And as a code readability enjoyer, I was bothered that these techniques force so much logic into traits and away from the main routines.

So I got real with the problem and found a solution. Somehow I was able to build a prototype in about 100 lines of macro_rules! (no procedural macros). Abominably, the entire crate is a single macro that takes your data types and generates two more macros:

  • one to define enums of arbitrary containers of your data types, and
  • the other to match those enums with concrete type information(!!!)

Here's what all that imaginary PancakeDB code would look like now:

dtype_dispatch::build_dtype_macros!(
  define_an_enum,
  match_an_enum,
  DataTypeTrait,
  {
    I32 => i32,
    F32 => f32,
    String => String,
    ...
  },
);

type Single<L> = L;
define_an_enum!(#[derive(...)] Field(Single));
define_an_enum!(#[derive(...)] DataType);
define_an_enum!(#[derive(...)] Column(Vec));

fn to_columns(rows: Vec<Vec<Field>>, schema: Vec<DataType>) -> Vec<Column> {
  let mut columns: Vec<Column> = schema.iter().map(|dtype|
    match_an_enum!(dtype,
      DataType<T> => { Column::new(Vec::<T>::new()).unwrap() }
    )
  ).collect();

  // transpose the rows into columns
  for row in rows {
    for (field, column) in row.into_iter().zip(columns.iter_mut()) {
      match_an_enum!(field,
        Field<T>(value) => {
          let values = column.downcast_mut::<T>()
            .expect("shiver me timbers, the data types didn't match");
          values.push(value);
        }
      )
    }
  }

  columns
}

fn compress_generic<T: DataTypeTrait>(values: Vec<T>) -> Vec<u8> {...}

fn compress(column: Column) -> Vec<u8> {
  match_an_enum!(column,
    Column<T>(values) => { compress_generic(values) }
  )
}

To break this down,

  • We use dtype_dispatch::build_dtype_macros! to define our two macros define_an_enum and match_an_enum, given the data type name => type mapping.
  • We then use the first macro to define Field, DataType, and Column (see the sketch after this list for roughly what the generated Column gives us).
  • And finally we use the second macro to generate all our match statements for us. E.g.
    match_an_enum!(column,
      Column<T>(values) => { compress_generic(values) }
    )
    once expanded becomes
    match column {
      Column::I32(values) => {
        type T = i32;
        compress_generic(values)
      },
      Column::F32(values) => {
        type T = f32;
        compress_generic(values)
      }
      Column::String(values) => {
        type T = String;
        compress_generic(values)
      }
      ...
    }
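
For reference, the first macro's output is roughly the hand-written Column enum from earlier plus some helper methods, which is what the Column::new and downcast_mut calls above rely on. This is a sketch inferred from how Column is used here, not the macro's literal expansion (the Option return types are my guess, based on the .unwrap()/.expect() calls):

enum Column {
  I32(Vec<i32>),
  F32(Vec<f32>),
  String(Vec<String>),
  ...
}

impl Column {
  // wrap a concrete container, if T is one of the registered data types
  fn new<T: DataTypeTrait>(values: Vec<T>) -> Option<Self> {...}
  // recover the concrete container when the caller knows T
  fn downcast_mut<T: DataTypeTrait>(&mut self) -> Option<&mut Vec<T>> {...}
}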

No repeated code! Easier to read and edit! We can trivially add new data types! In Pcodec I now have 9 data types and 47 matches, so these macros keep me sane.

To be clear, dtype_dispatch isn't better than Box<dyn ...> or enum_dispatch, but it solves a different problem:

|                              | Box<dyn>   | enum_dispatch | dtype_dispatch         |
|------------------------------|------------|---------------|------------------------|
| convert generic -> dynamic   | ✅         | ❌¹           | ✅                     |
| convert dynamic -> generic   | ❌         | ❌¹           | ✅                     |
| call trait fns directly      | ⚠️²        | ✅            | ❌                     |
| match with type information  | ❌         | ❌            | ✅                     |
| stack allocated              | ❌         | ✅            | ✅                     |
| variant type requirements    | trait impl | trait impl    | container<trait impl>  |

There are some cursed limitations in dtype_dispatch due to the cursed implementation. For instance, each defined enum must take exactly one #[...] attribute at the moment. That's because it's rather hard to use repeating groups in a nested macro, so I don't. But with a little motivation, these rough edges could be smoothed out.
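
For example (using Clone as a stand-in for the elided derives, purely for illustration):

// one attribute: fine
define_an_enum!(#[derive(Clone)] Column(Vec));

// zero attributes or two: currently rejected
// define_an_enum!(Column(Vec));
// define_an_enum!(#[derive(Clone)] #[non_exhaustive] Column(Vec));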

TLDR, if you work with data types, this might ease your pain.

Pedantic Details

¹ Although enum_dispatch supports From and TryInto, these only work for concrete types (not in generic contexts).

² Trait objects can only dispatch to functions that can be put in a vtable, which is annoyingly restrictive. For instance, traits with generic associated functions can't be put in a Box<dyn>.
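
A minimal sketch of that restriction (hypothetical trait, not Pcodec's actual API):

trait Codec {
  // a generic method has no single vtable entry, so this trait is not
  // object-safe (not "dyn compatible")
  fn compress_all<T: Ord>(&self, values: Vec<T>) -> Vec<u8>;
}

// error[E0038]: `Codec` cannot be made into a trait object, so this won't compile:
// fn boxed(codec: Box<dyn Codec>) -> Vec<u8> { codec.compress_all(vec![1, 2, 3]) }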