4 Comments

As data engineers, most of our functions deal with dataframes. It'd be great to cover how to do testing with Polars or PySpark. To me, the added complexity of dataframes makes testing most data pipelines very impractical.
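One common pattern that keeps dataframe tests practical is to run the transformation on a tiny in-memory fixture and compare against an expected frame with the library's equality helper. A minimal sketch using pandas for illustration (the function `add_total` and its columns are hypothetical; the same pattern works with `polars.testing.assert_frame_equal` for Polars, or the `chispa` library for PySpark):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def add_total(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical transformation under test: derive 'total' from price * qty
    out = df.copy()
    out["total"] = out["price"] * out["qty"]
    return out

def test_add_total():
    # Small in-memory fixture instead of a real table or file
    df = pd.DataFrame({"price": [2.0, 3.0], "qty": [1, 4]})
    expected = pd.DataFrame(
        {"price": [2.0, 3.0], "qty": [1, 4], "total": [2.0, 12.0]}
    )
    # Fails with a column-by-column diff if the frames differ
    assert_frame_equal(add_total(df), expected)

test_add_total()
```

The key design choice is keeping the transformation as a pure dataframe-in, dataframe-out function, so the test never needs a cluster or a live data source.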


How about "data-aware" tests, where we are not testing functionality per se but the data flow itself? E.g., if I'm ingesting 100 records, do I have those records in my final data-delivery layer, or am I missing some? Any good approaches?
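One common approach is a reconciliation check: compare record counts or key sets between the ingestion layer and the delivery layer, rather than testing transformation logic. A minimal sketch in Python (all names here are hypothetical placeholders; in practice `source_ids` and `delivery_ids` would come from queries against each layer):

```python
def missing_keys(source_ids: set, delivery_ids: set) -> set:
    # Records present at ingestion but absent from the delivery layer
    return source_ids - delivery_ids

def check_completeness(source_ids: set, delivery_ids: set) -> None:
    # Fail loudly, naming the lost keys, if anything went missing in flight
    lost = missing_keys(source_ids, delivery_ids)
    assert not lost, f"{len(lost)} records missing downstream: {sorted(lost)}"

# If 100 records were ingested, all 100 keys should appear downstream
check_completeness(set(range(100)), set(range(100)))  # passes

# A dropped record shows up as a concrete key, not just a count mismatch
assert missing_keys({1, 2, 3}, {1, 3}) == {2}
```

Comparing key sets instead of bare counts is slightly more work but tells you *which* records were lost, which makes the failure actionable. Frameworks like Great Expectations package this kind of check (row counts, null rates, uniqueness) as declarative expectations run against live data.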


Here you can find a project template with unit tests and CI/CD for a pipeline on Databricks/PySpark: https://github.com/andre-salvati/databricks-template
