I have found that DuckDB still has a few edge cases that can make it blow up unexpectedly - for example, loading a dataframe whose row count exactly matches the sample size used for type inference, or columns that are entirely null/None.
Polars has never failed me, and it is the fastest option for the processing I do - roughly half the runtime of the other approaches I have tried (pandas, Arrow, DuckDB) on the same tasks.
For bonus points, try ibis with polars as the backend rather than DuckDB. Bliss!
The ibis syntax is also much cleaner - closer to dplyr.
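In case it helps anyone, here is a minimal sketch of what that swap looks like, assuming the Polars backend extra is installed (`pip install 'ibis-framework[polars]'`); the Parquet path and column names are hypothetical placeholders, not anything from the post.

```python
import ibis

# Connect with the Polars backend instead of the default DuckDB one.
con = ibis.polars.connect()

# Hypothetical file and column names - replace with your own data.
t = con.read_parquet("events.parquet", table_name="events")

# dplyr-flavoured, chainable expression syntax; nothing executes yet.
expr = (
    t.filter(t.value > 0)
     .group_by("category")
     .agg(total=t.value.sum(), rows=t.value.count())
     .order_by("category")
)

# Execution happens only when you materialize the result.
print(expr.to_pandas())
```

The nice part is that the expression code stays the same if you later point `connect()` at DuckDB or another backend; only the connection line changes.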
Good post! I will try Polars for my master's degree project. I am currently using Spark, but only for one large dataset, so maybe Polars is a better fit.