5 Comments
Gil Benghiat

Hi Daniel,

Thank you for the article. We have also seen that many organizations are reluctant to dedicate resources to fixing data quality issues. DataKitchen provides a free and open-source tool, TestGen, that a motivated individual can use to influence their data suppliers. Check it out here: https://info.datakitchen.io/install-dataops-data-quality-testgen-today. The help page has a great introduction: https://docs.datakitchen.io/articles/#!dataops-testgen-help/introduction-to-dataops-testgen.

-- Gil Benghiat (Founder, VP) @ DataKitchen

Neural Foundry

The point about starting with schema and constraints before reaching for SaaS tools is so true. I've seen teams spend months evaluating DQ platforms while they still had everything typed as STRING with nullable columns across the board. Once you nail down proper types and NOT NULL where it matters, you've already solved something like 60% of your DQ headaches.
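To make that concrete, here's the sort of tightening I mean, as a rough sketch in generic SQL DDL (the table and column names are invented for illustration):

    CREATE TABLE orders (
        order_id    BIGINT        NOT NULL PRIMARY KEY,
        customer_id BIGINT        NOT NULL,                          -- was a nullable STRING
        order_total NUMERIC(12,2) NOT NULL CHECK (order_total >= 0), -- proper numeric type plus a sanity bound
        placed_at   TIMESTAMP     NOT NULL,
        coupon_code VARCHAR(20)                                      -- genuinely optional, so nullable is fine
    );

Every column the database itself refuses bad values for is one less check a downstream DQ tool has to run.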

Pipeline to Insights

Hi Daniel, great post. Just to note that OpenMetadata is open source and Collate is its SaaS version 🙏

David Kershaw

Love the comparison grid, Daniel. I think we should also spare a little shift-left love for humble delimited data. I've done some work, and some writing, on SQL-DDL-like schemas for CSVs, Excel, JSONL, etc. Have a look at https://blog.csvpath.org/do-schemas-have-a-place-in-delimited-data and I'd love to know what you think.

John Gayton

This is nitpicking, but typing zip code as an integer in the "good" example (leading-zero zips, ZIP+4, other countries' postal codes) shows that maybe data quality isn't that simple after all. We can squeeze every last bit of performance out of schema definitions and minimized data storage, but it can still backfire in the end. None of that takes away from your point about doing the simple things first. If you can't do the basics, you're certainly not going to do the hard stuff well, and no tool or money magically solves that.
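For anyone reading along, the safer pattern is to keep postal codes as text and constrain the format instead; here's a rough sketch in Postgres-flavored SQL, where the US-only regex is an assumption you'd relax for international addresses:

    CREATE TABLE addresses (
        address_id BIGINT      NOT NULL PRIMARY KEY,
        zip        VARCHAR(10) NOT NULL,        -- text preserves leading zeros: '02134', '02134-0001'
        CHECK (zip ~ '^[0-9]{5}(-[0-9]{4})?$')  -- US ZIP or ZIP+4 only; drop or widen for other countries
    );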