I was recently working on a Polars data pipeline that processed a “larger than memory” dataset. The pipeline was extremely fast and handled a large dataset on a small instance without much memory, and it got me thinking about streaming data and memory consumption.
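For context, a pipeline like that leans on Polars’ lazy API and its streaming engine, which processes data in batches rather than materializing everything at once. Here is a minimal sketch of the pattern (the file name and column names are hypothetical, not from the actual pipeline):

```python
import polars as pl

# Lazily scan the file -- nothing is read into memory yet,
# Polars just builds a query plan.
lazy_frame = (
    pl.scan_parquet("events.parquet")  # hypothetical dataset
    .filter(pl.col("status") == "complete")
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("total_amount"))
)

# Execute with the streaming engine, which works through the
# data in chunks instead of loading the whole dataset.
result = lazy_frame.collect(streaming=True)
print(result.head())
```

Because the scan is lazy and the collect is streaming, peak memory stays roughly proportional to the batch size rather than the dataset size, which is what makes a small instance viable.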
Reducing memory pressure is an important concept in Data Engineering. Memory consumption plays a big part in building cost-effective and scalable data processing pipelines.
It doesn’t matter whether you’re using Python or Rust, writing big code or little code; at some point we should all stop and think about how the code we write to process data actually uses memory.