As data engineers, most of our functions deal with dataframes. It’d be great to cover how to do testing with Polars or PySpark. To me, the added complexity of dataframes makes testing most data pipelines very impractical.
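For what it's worth, a dataframe unit test can stay pretty lightweight. Here's a minimal sketch with Polars and its built-in `assert_frame_equal` helper (the `add_total` function and column names are invented for illustration):

```python
import polars as pl
from polars.testing import assert_frame_equal


def add_total(df: pl.DataFrame) -> pl.DataFrame:
    # Hypothetical transformation under test: derive a total column.
    return df.with_columns((pl.col("price") * pl.col("qty")).alias("total"))


def test_add_total():
    # Tiny in-memory input and the frame we expect back.
    input_df = pl.DataFrame({"price": [2.0, 3.5], "qty": [3, 2]})
    expected = input_df.with_columns(pl.Series("total", [6.0, 7.0]))

    assert_frame_equal(add_total(input_df), expected)
```

The trick is keeping each transformation a pure dataframe-in, dataframe-out function, so the test never needs to touch real storage.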
How about "data-aware" tests where we are not testing functionality per se but the data flow itself, e.g if I'm ingesting 100 records do I have these records in my final data-delivery layer or am I missing something? Any good approaches?
Here's a project template with unit tests and CI/CD for a Databricks/PySpark pipeline: https://github.com/andre-salvati/databricks-template
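Not taken from that repo, but for anyone wondering what a local PySpark unit test typically looks like: a session-scoped SparkSession fixture plus a small dataframe assertion, roughly like this (the transform and column names are invented for illustration):

```python
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window


@pytest.fixture(scope="session")
def spark():
    # Local SparkSession so tests run in CI without a Databricks cluster.
    session = SparkSession.builder.master("local[2]").appName("tests").getOrCreate()
    yield session
    session.stop()


def dedupe_latest(df, key_col, ts_col):
    # Hypothetical transform under test: keep the latest row per key.
    w = Window.partitionBy(key_col).orderBy(F.col(ts_col).desc())
    return df.withColumn("_rn", F.row_number().over(w)).filter("_rn = 1").drop("_rn")


def test_dedupe_latest(spark):
    rows = [("a", 1, "old"), ("a", 2, "new"), ("b", 1, "only")]
    df = spark.createDataFrame(rows, ["id", "ts", "value"])

    result = dedupe_latest(df, "id", "ts")

    assert sorted(r.value for r in result.collect()) == ["new", "only"]
```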
nice!