Discussion about this post

James Corbett

This article is misleading on two fronts:

1. You tested column pruning, so the data you actually accessed would have been 2 columns plus the Parquet file metadata, which would probably fit in memory even without streaming.

2. Most of the processing time would be I/O-bound on S3, and the access patterns, simultaneous connection limits, etc. would have more of an impact than any processing code.

Love that you went through the pain of trying the different systems, but I'd like to see an actual larger-than-memory query.

Hampus Londögård

Hi, you could test combining DuckDB and Polars, as both use Arrow.

Do the I/O with DuckDB (which supports deletion vectors) and then transform with Polars. Perhaps a good combo?
