Python OOM dataframes

adding Daft


Prehistory

Oops, that’s me, prehistoric coughs

So I recently read about a benchmark of FireDucks, a pandas-like, lazy, blazingly fast dataframe library (yeah, all the buzzwords are there).

Initial reviews are pretty good, claiming it’s really indecently fast:

This ran in 1 second (with ._evaluate added) on the 55 million row dataset. The classic pandas version took 8 minutes!

However, this GIF sparked a lot of controversy, despite Avi Chawla's comprehensive overview.

The folks at FireDucks have also published a comparison with DuckDB and Polars, plus a db-benchmark dashboard:

Dundundun DAFTMAN

I recently read about another dataframe library, Daft, and decided to add it to the comparison using the provided Colab, since that's fairly simple to do for a single benchmark.

Behold, the results from an infinitely bugged Colab:

Or, mean time for DULocation and PULocation, respectively:

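| Library   | Mean, s |
|-----------|---------|
| Fireducks | 4.099   |
| Daft      | 5.424   |
| Polars    | 6.155   |

| Library   | Mean, s |
|-----------|---------|
| Fireducks | 4.26    |
| Polars    | 6.338   |
| Daft      | 5.268   |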
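For reference, mean timings like the ones above can be collected with a tiny harness along these lines. This is only a sketch in plain Python, not the code from the Colab: `mean_runtime`, `groupby_mean`, the toy rows and the `fare` column are all made up for illustration. The one important bit is that each timed callable has to force evaluation itself, otherwise a lazy library gets timed for building a query plan rather than running it.

```python
import statistics
import time
from collections import defaultdict


def mean_runtime(fn, repeats=5):
    """Return the mean wall-clock time (seconds) of calling fn() `repeats` times."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # for lazy libraries, fn must materialize the result itself
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


def groupby_mean(rows, key, value):
    """Toy stand-in for a dataframe group-by: mean of `value` per `key`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[value]
        counts[row[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Hypothetical toy data; the real benchmark uses a much larger dataset.
rows = [{"PULocation": i % 10, "fare": float(i)} for i in range(100_000)]

# One entry per library under test; here there is only the toy implementation.
candidates = {
    "toy groupby": lambda: groupby_mean(rows, "PULocation", "fare"),
}

results = {name: mean_runtime(fn) for name, fn in candidates.items()}
for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.3f} s")
```

To compare real libraries, each one would get its own entry in `candidates`, wrapping the same query and ending with whatever call that library uses to materialize the result.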

So:

  • Yeah, FireDucks seems to be the fastest, but it's also not open-source:

By providing the beta version of FireDucks free of charge and enabling data scientists to actually use it, NEC will work to improve its functionality while verifying its effectiveness, with the aim of commercializing it within FY2024.
https://www.nec.com/en/press/202310/global_20231019_01.html

💡
Daft also exposes a SQL interface which interoperates closely with the DataFrame interface, allowing you to express data transformations and queries on your tables as SQL strings.

So I'll try doing things with Daft in the near future!

