As the years pass by in a dizzying blur of data tools, companies, and new ideas, I’m still taken aback by the lack of Data Quality in the Data World. It just has not become mainstream yet.
Sure, a precious few DQ tools are bouncing around, like Soda or Great Expectations, but they are still fairly new and largely unadopted by the maddened masses of Data Engineers on the whole. It does make a guy wonder, how long can you kick against the goad?
If folks wanted some Data Quality they would put it high on the priority list, allocate resources, and get it done. But they don’t. At least not until they have nothing else to do.
Classic.
We are not going to dive into the entire DQ implementation discussion, but just a subset of that theory, namely Data Validation as an option for Data Engineers looking to dip their parched and weary feet into the cool stream of Data Quality.
You should check out Prefect, the sponsor of the newsletter this week! Prefect is a workflow orchestration tool that gives you observability across all of your data pipelines. Deploy your Python code in minutes with Prefect Cloud.
Recently, Prefect announced the open-source technical preview of Prefect 3.0.
Data teams face a relentless challenge. Modern businesses rely on a complex web of workflows and automations, making it critical to build resilient systems that are easy to trust. Prefect 3.0 is our answer to that challenge, providing a framework for building workflows that are resilient by design.
Data Validation
So if you cannot go full bore with something like Databricks DLT Validations, Great Expectations, or Soda … there are still plenty of ways to implement some sort of Data Validation with a reasonable amount of effort.
We can call this a poor man’s Data Quality.
If you’re using the fancy tools we mentioned before, the sky is the limit for Data Validation and Quality; you can even get alerts when data points drift out of the norm.
But, if you are like me and have been around Data for a long time, we all know that 99% of the Data problems that arise could be solved with very simple Data Validation checks. No rocket science needed.
Schema Validation
WAP Pattern
Uniqueness and Deduplication
NULLs
Schema Validation
With the rise of the Lake House and tools like Delta Lake and Iceberg, schema validation has become the simplest form of Data Validation available to every single Data Engineer on pretty much every platform.
For some reason, schemas get ignored, along with most Data Modeling, yet they provide the most upfront and reliable baseline of control over your data.
If you think about Data Engineering in general, much of what we do is play with Data Types. They are important and have an impact on every aspect of data pipelines and business.
Adding and enforcing schemas against your data can solve many common problems:
Protect against malformed incoming data.
You can reason more firmly about data and needed transformations.
Ensure buggy data writes cannot happen.
One thing is for sure, Data Teams who do not enforce schema validation checks via something like Delta Lake … will end up with dirty and malformed data without a doubt.
You can control many things with schemas beyond the simple “is this a string or a number”:
NULL is acceptable or not
Precision of decimals or floats
Correct date or timestamp formats
Constraints like VALUES IN …
A well-thought-out and written Schema can avoid many common data quality issues … and the best part is that it happens immediately upon data ingest.
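To make that concrete, here is a minimal sketch of schema enforcement with PySpark and Delta Lake. The schema, table names, file path, and CHECK constraint are all made up for illustration, and the exact setup will vary by platform.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DecimalType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Define the schema explicitly instead of relying on inference.
orders_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("amount", DecimalType(10, 2), nullable=True),
    StructField("created_at", TimestampType(), nullable=False),
])

# Fail fast on malformed rows instead of silently ingesting them.
raw_orders = (
    spark.read
    .schema(orders_schema)
    .option("mode", "FAILFAST")
    .json("s3://my-bucket/raw/orders/")  # hypothetical path
)

# Delta Lake rejects writes whose schema does not match the target table.
raw_orders.write.format("delta").mode("append").saveAsTable("bronze.orders")

# A simple constraint pushes a validation rule down onto the table itself.
spark.sql("ALTER TABLE bronze.orders ADD CONSTRAINT amount_positive CHECK (amount >= 0)")
```

With the schema declared up front, a malformed file fails the read, and a write whose columns do not match the table is rejected instead of quietly landing in production.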
Uniqueness and Deduplication
The next most common problem, one that exists in 80% of Data Platforms, is uniqueness and duplication. These are probably the most widespread and insidious issues that can arise, and they cause serious problems.
When it comes to business logic and analytics you can bet your bottom dollar that most of the issues that arise and drive people crazy are because of duplicates and data that is not unique.
Every single data pipeline should include deduplication checks at both the front and the back end of processing. Many times, deduplication should be done before important analytical functions (groupBy, etc.) are called.
Most duplication problems occur because of bad JOINS.
Many times you can solve them with window functions, and most tooling also comes with built-in duplicate checks.
Uniqueness is a tricky topic and can be separated from duplicates because it has more business context wrapped up in it.
Typically a Data Engineer working on any dataset should first spend time understanding what makes a record unique and then dedup on that definition.
If you spend time to dedup and understand the uniqueness of a dataset, most of your problems will disappear.
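As a sketch of that window-function approach, here is one way to dedup on a business key in PySpark. The column names and the “keep the latest record” rule are assumptions for the example; your definition of uniqueness should come from the business.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.table("bronze.orders")  # hypothetical source table

# Assume the business says an order is unique per (order_id, customer_id);
# when duplicates arrive, keep the most recently updated record.
dedup_window = Window.partitionBy("order_id", "customer_id").orderBy(F.col("updated_at").desc())

deduped_orders = (
    orders
    .withColumn("row_num", F.row_number().over(dedup_window))
    .filter(F.col("row_num") == 1)
    .drop("row_num")
)

# A cheap sanity check before any groupBy or join downstream.
assert deduped_orders.count() == deduped_orders.select("order_id", "customer_id").distinct().count()
```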
NULLs
Why is it that the most common data problems seem to be the simplest to deal with and yet the most often passed over? NULLs. Some people like them, some people hate them.
Even Reddit talks about NULLs.
I feel like NULLs, after decades of Data Warehousing, still leave a divided field behind them; people just can’t agree on how to handle them.
Do you allow NULLs to occur in the first place?
Do you think about NULLs when aggregating data?
If you are processing data for some unique business logic, it’s imperative to ensure that you think through the fact that NULLs can or do exist, and what effect they will have on your transformation(s).
Conditional Logic: Apply conditional logic in data transformation processes to handle NULLs (e.g., using SQL COALESCE function or equivalent in other languages).
Aggregation: Handle NULLs appropriately during aggregation to ensure accurate results (e.g., knowing that COUNT(column) skips NULLs while COUNT(*) counts every row in SQL).
Understanding and running tests on how NULLs affect certain common transformations is probably worthy of its own blog post.
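As a small illustration of both points, here is how COALESCE-style conditional logic and the COUNT variants behave in PySpark, using a tiny made-up sales dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data: one sale with a NULL discount.
sales = spark.createDataFrame(
    [("a", 100.0, 10.0), ("b", 250.0, None), ("c", 75.0, 5.0)],
    ["sale_id", "amount", "discount"],
)

# Conditional logic: decide explicitly what a NULL discount means (here, zero).
sales_clean = sales.withColumn("discount", F.coalesce(F.col("discount"), F.lit(0.0)))

# Aggregation: count(col) skips NULLs, count(*) does not -- pick deliberately.
sales.select(
    F.count("*").alias("all_rows"),                 # 3
    F.count("discount").alias("with_discount"),     # 2
    F.avg("discount").alias("avg_ignoring_nulls"),  # averages only the 2 non-NULL rows
).show()
```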
WAP Pattern
This last one, WAP (Write-Audit-Publish), is the amalgamation of all the Data Validations you can think of piled into one “thing.” Surprisingly, there seem to be very few Data Teams that adopt WAP as a way to do Data Validation and Quality checks.
What is WAP?
Write: Ingest data from various sources and prepare it for its final state.
Audit: Validate and ensure the integrity and quality of the data through deduplication, anomaly detection, business logic checks, etc.
Publish: Approve and move the data to production environments like Data Lakes or Data Warehouses.
The benefits of a good WAP pattern are high data quality and integrity, as well as fewer data bugs and problems in production. It also centralizes data quality checks and validations into a single spot.
The main challenge with WAP is the cost of writing, storing, and reading the data an extra time, which means increased compute and storage expenses, particularly in cloud environments.
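To ground the pattern, here is a bare-bones WAP sketch with PySpark and Delta Lake. The staging and production table names, the source path, and the audit rules are invented for the example; a real implementation might use table branches, snapshots, or a separate schema depending on the platform.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Write: land the prepared data in a staging table, not production.
incoming = spark.read.format("delta").load("s3://my-bucket/bronze/orders/")  # hypothetical path
incoming.write.format("delta").mode("overwrite").saveAsTable("staging.orders")

# Audit: run the validation checks against the staged data.
staged = spark.table("staging.orders")
row_count = staged.count()
null_keys = staged.filter(F.col("order_id").isNull()).count()
dupes = row_count - staged.dropDuplicates(["order_id"]).count()

if row_count == 0 or null_keys > 0 or dupes > 0:
    raise ValueError(f"Audit failed: rows={row_count}, null_keys={null_keys}, dupes={dupes}")

# Publish: only audited data reaches the production table.
staged.write.format("delta").mode("append").saveAsTable("prod.orders")
```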
Thoughts
In the end, you don’t always need some fancy Data Quality tool. Most of the time, if poor data quality exists, it’s because the basic steps above were never put in place from the beginning.
Start your Data Engineering projects with a Data Validation mindset: enforce the schema, use WAP patterns, deal with NULLs, and stop duplicates from happening … those things will reduce 99% of DQ issues.