As data engineers, most of our functions deal with dataframes. It’d be great to cover how to do testing with Polars or PySpark. To me, the added complexity of dataframes makes testing most data pipelines very impractical.
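For what it's worth, a dataframe unit test can stay pretty lightweight. Here's a minimal sketch with Polars and its built-in `assert_frame_equal` helper (the `add_total` function and column names are invented for illustration):

```python
import polars as pl
from polars.testing import assert_frame_equal


def add_total(df: pl.DataFrame) -> pl.DataFrame:
    # Hypothetical transformation under test: derive a total column.
    return df.with_columns((pl.col("price") * pl.col("qty")).alias("total"))


def test_add_total():
    # Tiny in-memory input and the frame we expect back.
    input_df = pl.DataFrame({"price": [2.0, 3.5], "qty": [3, 2]})
    expected = input_df.with_columns(pl.Series("total", [6.0, 7.0]))

    assert_frame_equal(add_total(input_df), expected)
```

The trick is keeping each transformation a pure dataframe-in, dataframe-out function, so the test never needs to touch real storage.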
How about "data-aware" tests where we are not testing functionality per se but the data flow itself, e.g if I'm ingesting 100 records do I have these records in my final data-delivery layer or am I missing something? Any good approaches?
Here's a project template with unit tests and CI/CD for a Databricks/PySpark pipeline: https://github.com/andre-salvati/databricks-template
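Not taken from that repo, but for anyone wondering what a local PySpark unit test typically looks like: a session-scoped SparkSession fixture plus a small dataframe assertion, roughly like this (the transform and column names are invented for illustration):

```python
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window


@pytest.fixture(scope="session")
def spark():
    # Local SparkSession so tests run in CI without a Databricks cluster.
    session = SparkSession.builder.master("local[2]").appName("tests").getOrCreate()
    yield session
    session.stop()


def dedupe_latest(df, key_col, ts_col):
    # Hypothetical transform under test: keep the latest row per key.
    w = Window.partitionBy(key_col).orderBy(F.col(ts_col).desc())
    return df.withColumn("_rn", F.row_number().over(w)).filter("_rn = 1").drop("_rn")


def test_dedupe_latest(spark):
    rows = [("a", 1, "old"), ("a", 2, "new"), ("b", 1, "only")]
    df = spark.createDataFrame(rows, ["id", "ts", "value"])

    result = dedupe_latest(df, "id", "ts")

    assert sorted(r.value for r in result.collect()) == ["new", "only"]
```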
nice!