
Automating Data Quality Monitoring: Scaling Beyond Rules with Machine Learning

The world's businesses ingest a combined 2.5 quintillion bytes of data every day. But how much of this vast amount of data--used to build products, power AI systems, and drive business decisions--is poor quality or just plain bad? This practical book shows you how to ensure that the data your organization relies on contains only high-quality records. Most data engineers, data analysts, and data scientists genuinely care about data quality, but they often don't have the time, resources, or understanding to create a data quality monitoring solution that succeeds at scale. In this book, Jeremy Stanley and Paige Schwartz from Anomalo explain how you can use automated data quality monitoring to cover all your tables efficiently, proactively alert on every category of issue, and resolve problems immediately. This book will help

217 pages, Paperback

Published February 13, 2024

Community Reviews

5 stars: 3 (37%)
4 stars: 3 (37%)
3 stars: 2 (25%)
2 stars: 0 (0%)
1 star: 0 (0%)
Displaying 1 - 3 of 3 reviews
Emre Sevinç
178 reviews, 443 followers
March 23, 2025
This book provides the reader with a pretty good overview of data quality. In many large enterprises a data officer or a data architect will not build data quality monitoring from scratch, but rather deploy a tool dedicated to data quality monitoring. Whatever tool you pick, the concepts, challenges and pitfalls described in this book will appear in your data quality journey.

Some parts of the book might read as advertisement because the authors are founders of a company that develops data quality monitoring software. That is not necessarily a big negative; on the contrary, it's good to read about lessons learned. Having said that, I don't find the Python code snippets they've provided very meaningful, although they do illustrate the concepts concretely (in a very simplistic manner, and no more than that).

One thing that's not stressed is the relationship with data catalogs. In a large enterprise data landscape, you'll need some sort of data catalog (and most likely you'll have a mess of unrelated data catalogs built with different technologies, independent of each other), and the data quality software will need to integrate with it.

I can easily recommend this book to data officers, data architects and data engineers working in large enterprise settings or even start-ups and scale-ups that rely on the quality of data that are produced by their data pipelines and workflows.
Amit Ranjan
6 reviews
November 8, 2024
1. Chapters 1, 2, and 3 and the appendix on common data quality issues are relatable.
2. The concepts in Chapters 4 and 5 on using unsupervised ML (like tree-based classification) are interesting but would take a while to implement. These were new to me, and if I could travel back in time, I'd give them a try in my project.
3. The last three chapters, on reducing alert fatigue and integrating with other solutions such as a catalog, are also meaningful.

Overall, with fewer than 200 pages of content, it is a good read and contains a number of (short) data quality anecdotes.
Ethan J.
363 reviews, 11 followers
January 5, 2025
Serves as a good intro, but not much beyond that.