Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

This copy is purchased directly from the publisher or an approved distributor and spiral-bound by a third party. The seller is not affiliated with, endorsed by, or pre-authorized by the publisher or author for the spiral listing.

Machine learning systems are both complex and unique. Complex because they consist of many different components and involve many different stakeholders. Unique because they're data dependent, with data varying wildly from one use case to the next. In this book, you'll learn a holistic approach to designing ML systems that are reliable, scalable, maintainable, and adaptive to changing environments and business requirements.

Author Chip Huyen, co-founder of Claypot AI, considers each design decision (such as how to process and create training data, which features to use, how often to retrain models, and what to monitor) in the context of how it can help your system as a whole achieve its objectives. The iterative framework in this book uses actual case studies backed by ample references.

This book will help you tackle scenarios such as:
- Engineering data and choosing the right metrics to solve a business problem
- Automating the process for continually developing, evaluating, deploying, and updating models
- Developing a monitoring system to quickly detect and address issues your models might encounter in production
- Architecting an ML platform that serves across use cases
- Developing responsible ML systems
Update: My debut novel, Entanglements That Never End, is scheduled for release later in 2025, with an early edition currently available on Kindle. I had a lot of fun writing this story, and I hope you’ll have fun reading it too! ***
I’m Chip Huyen, a writer and computer scientist. I grew up chasing grasshoppers in a small rice-farming village in Vietnam.
I'm interested in AI for storytelling and roleplaying. Previously, I built machine learning tools at NVIDIA and Netflix. I've also founded and sold a company.
I graduated from Stanford, where I taught ML Systems. The lectures became the foundation for the book Designing Machine Learning Systems, which is an Amazon #1 bestseller in AI and has been translated into 10+ languages (very proud)!
My new book AI Engineering (2025) is currently the most read book on the O’Reilly platform. It’s also available on Amazon and Kindle.
In my free time, I travel and write. After high school, I went to Brunei for a 3-day vacation which turned into a 3-year trip through Asia, Africa, and South America. During my trip, I worked as a Bollywood extra, a casino hostess, and a street performer.
As an ML engineer or a Data Scientist, that's exactly what you need to deploy ML models and maintain them in production. I am currently working on an internal ML platform, and the book resonates very well with the discussions that we are having among Data Scientists, managers, and engineers. How do we retrain models? How often? How do we detect data drift and alert on it? Do we need a separate ML platform team to deploy models, or should we demand this from Data Scientists? The book discusses all of this in detail. I like that the emphasis is on principles rather than on tools, although some tools are covered in the book.
One of the most comprehensive books on MLOps. I started reading this book to understand more about model deployment, and I am satisfied with the content.
Some notes for myself:
- It takes time to go from development to production. Setting up CI/CD and automatic updates for models is a tremendous amount of work, since the tools are not quite mature at the moment.
- Try MLflow and compare it with Kubeflow.
- Common pattern: ML workloads do training on GCP or Azure and deployment on AWS.
- Observability is part of monitoring. Metrics, logs, and traces are important, as are dashboards and alerts.
Fantastic book. I would recommend this to people who have a grasp on traditional machine learning algorithms and an understanding of neural networks and want to have the mindset of a machine learning engineer in production.
A very technical, but still accessible and well-written textbook on ML systems. It goes very deep on the infra, almost DevOps stuff, but that was expected (the author is an ML engineer). It is a great complement to conventional data science books, which focus primarily on algorithms and data manipulation.
NOTES
MLOps comes from DevOps, short for development and operations. To operationalize something means to bring it into production, which includes deploying, monitoring, and maintaining it.
Machine learning is an approach to learning complex patterns from existing data and using these patterns to make predictions on unseen data. It is most useful when tasks are repetitive, the cost of wrong predictions is cheap, it's at scale, and the patterns are constantly changing. A relational database isn't an ML system because it doesn't have the capacity to learn the relationship between two columns by itself. ML systems are part code, part data, and part artifacts created from the two.
Machine learning in research vs. in production
- Requirements: state-of-the-art model performance on benchmark datasets vs. different stakeholders with different requirements. For instance, while ensembling can give your ML system a small performance improvement, it tends to make a system too complex to be useful in production.
- Computational priorities: fast training and high throughput vs. fast inference and low latency. When designing ML systems, people who haven't deployed an ML system often make the mistake of focusing too much on the model development part and not enough on the deployment and maintenance part. During model development, we train different models, and each model does multiple passes over the training data. Each trained model then generates predictions on the validation data once to report the scores, and the validation dataset is usually much smaller than the training data. So during model development, training is the bottleneck; once the model is deployed, inference is the bottleneck. Research usually prioritises fast training, whereas production usually prioritises fast inference. To reduce latency in production, you might have to reduce the number of queries you can process on the same hardware at a time. Latency is not a single number but a distribution, so it's better to think in percentiles. Higher percentiles are important to look at because, even though they account for a small percentage of your users, those can sometimes be the most important users.
- Data: in research you mostly work with historical, well-formatted data, whereas in production data is constantly being generated by users, systems, and third parties.
- Fairness: ML algorithms don't predict the future but encode the past, thus perpetuating the biases in the data and more.
- Interpretability and Discussion
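To make the point about percentiles concrete, here is a minimal sketch using the nearest-rank definition of a percentile. The latency values and the `percentile` helper are made up for illustration; they are not taken from the book.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical request latencies in milliseconds: most are fast, one is very slow.
latencies_ms = [100, 102, 100, 100, 99, 104, 110, 90, 3000, 95]

mean_ms = sum(latencies_ms) / len(latencies_ms)  # dominated by the single outlier
p50 = percentile(latencies_ms, 50)  # what a typical request sees
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)  # what the slowest requests see
```

Here the mean (390 ms) describes no actual request, while p50 (100 ms) and p99 (3000 ms) show both the typical case and the tail.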
Requirements
- At the base of every ML project there must be a business objective: creating value for the company.
- Reliability: the system should continue to perform the correct function at the desired level of performance even in the face of adversity, including when the ground truth is not available.
- Scalability: handling resource scaling, but also artifact management.
- Maintainability and adaptability
Data Engineering Fundamentals
Data sources: user input data; system-generated data (logs); internal databases; third-party data.
Data formats: data serialization is the process of converting a data structure or object state into a format that can be stored or transmitted and reconstructed later.
- JSON is human-readable; key-value pair paradigm; text file.
- CSV is row-major: consecutive elements in a row are stored next to each other in memory; good for accessing samples (rows); much faster to write; text file.
- Parquet is column-major and not human-readable: consecutive elements in a column are stored next to each other; good for accessing features (columns); better when you have to do a lot of column-based reads; binary (non-text) file.
For instance, pandas DataFrames are column-major whereas NumPy arrays are row-major by default, so accessing a DataFrame by row is much slower than accessing it by column.
Data models:
- Relational model and normalization. SQL is a declarative language: you specify the data you want but not how to retrieve it (Python, by contrast, is imperative). SQL can be Turing complete.
- NoSQL document model: a document is often a single continuous string, and a document is analogous to a row. The document model doesn't enforce a schema; it shifts the responsibility of assuming a structure from the application that writes the data to the application that reads the data. Compared to the relational model, it is harder and less efficient to execute joins across documents than across tables.
- NoSQL graph model: a graph consists of nodes and edges, where the edges represent the relationships between the nodes. It is faster to retrieve data based on relationships. Fast arrival of data, schemaless.
Data storage engines and processing: transactional and analytical processing
- Online transaction processing (OLTP): transactions need to be processed fast (low latency, high availability) so that they don't keep users waiting. Transactional databases are usually expected to be ACID: atomicity, consistency, isolation, durability. Because each transaction is often processed as a unit separately from other transactions, transactional databases are often row-major.
- Online analytical processing (OLAP).
This distinction is outdated: the separation of transactional and analytical databases was due to limitations of technology, when it was hard to have databases that could handle both kinds of queries efficiently. In these traditional paradigms, storage and processing are tightly coupled: how data is stored is also how data is processed. The term "online" has become overloaded: it might refer to the speed at which your data is processed, or it can mean "in production".
ETL vs. ELT (ELT allows fast arrival of data, since little processing is needed before the data is stored).
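The row-major vs. column-major distinction can be sketched in plain Python, without CSV or Parquet themselves. The ride records and the `to_columns` helper below are hypothetical and only mimic the two layouts; real formats add encoding and compression on top.

```python
# The same table stored two ways (hypothetical ride data).
rows = [  # row-major: each sample's values sit together, as in CSV
    {"ride_id": 1, "distance_km": 2.5, "price": 7.0},
    {"ride_id": 2, "distance_km": 10.0, "price": 21.5},
    {"ride_id": 3, "distance_km": 4.2, "price": 9.8},
]

def to_columns(records):
    """Pivot row-major records into a column-major layout, as in Parquet."""
    return {name: [record[name] for record in records] for name in records[0]}

columns = to_columns(rows)

# Reading one sample touches a single record in the row-major layout ...
second_ride = rows[1]
# ... while reading one feature touches a single list in the column-major layout.
prices = columns["price"]

# Appending a new sample is one write in row-major form,
# but one write per column in column-major form.
new_ride = {"ride_id": 4, "distance_km": 1.1, "price": 4.0}
rows.append(new_ride)
for name, value in new_ride.items():
    columns[name].append(value)
```

This is why row-major layouts suit write-heavy, per-sample workloads, and column-major layouts suit feature-wise reads.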
Modes of dataflow: how do we pass data between different processes that don't share memory?
- Data passing through databases: both processes must be able to access the same database and read from and write to it.
- Data passing through services: process A sends data directly to process B over a network that connects the two. A first sends a request to B that specifies the data it needs, and B returns the requested data through the same network. This is called request-driven, and it works well for systems that rely more on logic than on data. REST (representational state transfer) vs. RPC (remote procedure call); for instance, HTTP-based APIs are often RESTful.
- Data passing through real-time transport: called event-driven, this works better for systems that are data-heavy. Incoming events are stored in in-memory storage before being discarded or moved to more permanent storage. Instead of using databases to broker data, we use in-memory storage; real-time transports can be thought of as in-memory storage for data passing among services. That's because databases are too slow for applications with strict latency requirements. Publish-subscribe vs. message queue (such as Apache Kafka and RabbitMQ).
Batch processing produces static features, leverages MapReduce and Spark, and works on historical data. Stream processing produces dynamic features and relies on the stream computation capacity of real-time transports like Apache Kafka; it is more difficult because the amount of data is unbounded and the data comes in at variable rates and speeds.
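As an illustration of the event-driven mode, here is a toy in-memory publish-subscribe broker. The `InMemoryBroker` class, topic name, and service names are all invented for this sketch; it is not how Kafka or RabbitMQ are implemented, only a picture of the idea that producers write events to in-memory storage and consumers read them independently.

```python
from collections import defaultdict, deque

class InMemoryBroker:
    """Toy real-time transport: events wait in memory until consumers read them."""

    def __init__(self):
        self._queues = defaultdict(dict)  # topic -> {subscriber name: deque of events}

    def subscribe(self, topic, subscriber):
        self._queues[topic].setdefault(subscriber, deque())

    def publish(self, topic, event):
        # The publisher does not know or care which services consume the event.
        for queue in self._queues[topic].values():
            queue.append(event)

    def poll(self, topic, subscriber):
        """Return and discard the next event for this subscriber, or None if empty."""
        queue = self._queues[topic][subscriber]
        return queue.popleft() if queue else None

broker = InMemoryBroker()
broker.subscribe("ride_requested", "pricing_service")
broker.subscribe("ride_requested", "driver_matching_service")
broker.publish("ride_requested", {"ride_id": 1, "distance_km": 2.5})

pricing_event = broker.poll("ride_requested", "pricing_service")
matching_event = broker.poll("ride_requested", "driver_matching_service")
nothing_left = broker.poll("ride_requested", "pricing_service")
```

Each subscriber gets its own copy of the event, and neither service has to send a request to the other.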
Training Data
Sampling
Nonprobability sampling can cause selection bias: convenience, snowball, judgement, and quota sampling.
Random sampling:
- Simple random sampling.
- Stratified sampling divides your population into the groups that you care about and samples from each group separately.
- Weighted sampling: each sample is given a weight that determines the probability of it being selected. It lets you leverage domain expertise, and it helps when your data comes from a different distribution than the true data, by adjusting the weights.
- Reservoir sampling: useful with streaming data.
- Importance sampling: it allows sampling from a distribution when we only have access to another distribution that is similar to the target one.
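Reservoir sampling is the least obvious of these, so here is a minimal sketch of the classic single-pass version (often called Algorithm R). The function name and the example stream are illustrative assumptions; the notes themselves only say the technique is useful for streaming data.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Seeded generator so the sketch is repeatable.
sample = reservoir_sample(range(1_000), k=5, rng=random.Random(0))
short_stream = reservoir_sample(range(3), k=5)
```

The stream is read once and only `k` items are held in memory, which is why it fits data that arrives continuously.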
Labeling
Hand labels: expensive, a threat to data privacy, slow, and non-adaptive. Label ambiguity arises when the data comes from multiple sources and relies on multiple annotators with different levels of expertise.
Natural labels: available when the task has natural ground truth labels (for example, Google Maps or stock price prediction). The feedback loop length is the time it takes from when a prediction is served until the feedback on it is provided.
Handling the lack of labels
- Weak supervision relies on the concept of a labelling function: a function that encodes heuristics to generate labels. It doesn't strictly require hand labels, though a small number of labelled samples helps guide the labelling functions, and the output can be noisy.
- Semi-supervision leverages structural assumptions to generate new labels based on a small set of initial labels. Unlike weak supervision, it requires an initial set of labels. A classic method is self-training: you start by training a model on your existing set of labelled data and use this model to make predictions for unlabelled samples. The perturbation method applies small changes to the training instances to obtain new ones, on the assumption that a small perturbation to a sample shouldn't change its label.
- Transfer learning: a model developed for one task is reused as the starting point for a model on a second task.
- Active learning improves the efficiency of data labelling: you label the samples that are most helpful to your model, such as the ones your model is least certain about, or the ones chosen based on disagreement among multiple candidate models.
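A minimal sketch of weak supervision follows, assuming a toy spam task. The three labelling functions, the label constants, and the majority-vote combiner are all hypothetical; real tools typically combine labelling functions with a more sophisticated model than a plain vote.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1

# Each labelling function encodes one heuristic and may abstain.
def lf_contains_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return SPAM if text.count("!") >= 3 else ABSTAIN

def lf_greets_by_name(text):
    return HAM if text.lower().startswith("hi ") else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_many_exclamations, lf_greets_by_name]

def weak_label(text):
    """Combine the heuristics by majority vote; abstain on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]

labels = [weak_label(text) for text in [
    "FREE prize!!! Click now",
    "Hi Anna, see you at lunch",
    "Meeting moved to 3pm",
    "Hi Bob, free lunch today",
]]
```

The last two messages show where the noise comes from: one is covered by no heuristic, and one triggers conflicting heuristics.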
Class Imbalance
Class imbalance is a problem in classification tasks where there is a substantial difference in the number of samples in each class of the training data (fraud detection, rare diseases, churn prediction). ML models work best with balanced data. Imbalance often means there is insufficient signal for your model to learn to detect the minority classes, and it makes it easier for your model to get stuck in a non-optimal solution by exploiting a simple heuristic instead of learning anything useful about the underlying pattern of the data: if the model learns to always output the majority class, its accuracy is already very high. Class imbalance also leads to asymmetric costs of error: the cost of a wrong prediction on a sample of the rare class might be much higher than that of a wrong prediction on a sample of the majority class. In the real world, class imbalance is the norm: rare events are often more interesting and/or dangerous than regular events, and many tasks focus on detecting those rare events.
- Using the right evaluation metrics: overall accuracy and error rate are insufficient; you need to look at F1, recall, and ROC too.
- Data-level methods (resampling): resampling includes oversampling, which adds more instances from the minority classes, and undersampling, which removes instances of the majority classes. When you resample your training data, never evaluate your model on the resampled data, since it will cause the model to overfit to that resampled distribution.
- Algorithm-level methods: these keep the training data distribution intact but alter the algorithm to make it more robust to class imbalance, mainly by adjusting the loss function. Examples: cost-sensitive learning, where the individual loss function is modified to take into account the different costs of the classes; class-balanced loss; focal loss.
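The claim that accuracy is misleading under imbalance can be checked with a few lines of arithmetic. The dataset below (1% positives) and both sets of predictions are made up for illustration.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive (rare) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical fraud data: 990 normal transactions, 10 fraudulent ones.
y_true = [0] * 990 + [1] * 10

# Model A always predicts the majority class.
always_majority = [0] * 1000
# Model B raises 20 false alarms but catches 8 of the 10 frauds.
detector = [1] * 20 + [0] * 970 + [1] * 8 + [0] * 2

acc_a = accuracy(y_true, always_majority)
acc_b = accuracy(y_true, detector)
_, recall_a, f1_a = precision_recall_f1(y_true, always_majority)
_, recall_b, f1_b = precision_recall_f1(y_true, detector)
```

Model A wins on accuracy (0.99 vs. 0.978) while detecting nothing; recall and F1 rank the two models the other way round.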
Data Augmentation
Data augmentation is a family of techniques used to increase the amount of training data. It is mainly used for medical imaging (changing pixels) and NLP (replacing a word). Simple label-preserving transformations are the simplest technique: randomly modify an image while preserving its label. Perturbation is similar, but it can also be used to trick models into making wrong predictions; adding such noisy samples to the training data can help models recognize the weak spots in their learned decision boundary and improve their performance. Data synthesis trains models on synthesized data.
Feature Engineering
Learned vs. engineered features: the promise of deep learning is that we won't have to handcraft features, since they can potentially be learned and extracted by algorithms; for this reason deep learning is sometimes called feature learning. However, this has not been fully achieved yet.
Handling missing values: there are three types of missing values: missing not at random (MNAR), missing at random (MAR), and missing completely at random (MCAR).
- Deletion, by column or by row, risks losing important information.
- Imputation fills missing values with defaults or with the mean, median, or mode.
Feature scaling: ML models tend to struggle with features that follow a skewed distribution. Apply normalization, standardization, or a log transformation.
Discretization: turning continuous features into discrete ones by quantization or binning, at the risk of losing information.
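The imputation and scaling steps above can be sketched with the standard library alone. The income values and helper names are hypothetical, and in a real pipeline the statistics would be computed on the training split only.

```python
import math
import statistics

def impute_median(values):
    """Fill missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def min_max_scale(values):
    """Normalization into [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization to zero mean and unit variance; assumes non-zero spread."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical skewed feature with one missing value.
incomes = [30_000, None, 45_000, 60_000, 1_000_000]

filled = impute_median(incomes)
log_incomes = [math.log1p(v) for v in filled]  # log transform tames the skew
normalized = min_max_scale(log_incomes)
standardized = standardize(log_incomes)
```

Applying the log transform before scaling keeps the single very large income from squeezing every other value towards zero.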
Encoding Categorical Features In production categories can change, and the model needs to address it. One solution is the hashing trick: use a hash function to generate a hashed value of each category. Feature crossing combines two or more features to generate new features. It is useful to model the nonlinear relationships between features.
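A minimal sketch of the hashing trick and of feature crossing follows. The bucket count, category names, and helper functions are illustrative assumptions. `hashlib` is used instead of Python's built-in `hash()` because the latter is salted per process for strings and would not give stable indices across runs.

```python
import hashlib

def hash_bucket(category, num_buckets=1024):
    """Map any category string, seen or unseen, to a stable bucket index."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def cross(*values):
    """Feature crossing: combine several categorical values into one new category."""
    return "_".join(str(v) for v in values)

known_brand = hash_bucket("brand_a")
# A category that appears only after deployment still gets a valid index.
unseen_brand = hash_bucket("brand_launched_yesterday")

# Crossed features can be hashed the same way.
crossed = cross("single", 2)
crossed_index = hash_bucket(crossed)
```

The trade-off is that two different categories can collide in the same bucket; a larger `num_buckets` makes that less likely at the cost of a wider feature space.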
Data Leakage
Data leakage happens when a form of the label leaks into the set of features used for making predictions, and this same information is not available during inference. Common causes:
- Splitting time-correlated data randomly instead of by time. In many cases data is time-correlated, which means that the time the data is generated affects its label distribution. To prevent future information from leaking into the training process and allowing models to cheat during evaluation, split your data by time instead of randomly whenever possible.
- Scaling before splitting. Do not use the entire dataset to generate global statistics before splitting it into different splits: that leaks the mean and variance of the test samples into the training process, allowing a model to adjust its predictions for the test samples. This information is not available in production, so the model's performance will likely degrade.
- Filling in missing data with statistics from the test split.
- Poor handling of data duplication before splitting.
- Leakage from the data generation process.
To detect data leakage, measure the predictive power of each feature or set of features with respect to the target variable (label). If a feature has unusually high correlation, investigate how this feature is generated and whether the correlation makes sense.
Engineering good features: more features are not always better.
- More features mean more opportunities for data leakage, can cause overfitting, can increase the memory required to serve a model, and can increase inference latency when doing online prediction; useless features become technical debt.
- Often a small number of features accounts for a large portion of the model's feature importance.
- You need to assess how well a feature generalizes.
Model Development and Offline Evaluation
Evaluating ML models
When considering which model to use, it's important to consider its performance, but also its other properties, such as how much data, compute, and time it needs to train, its inference latency, and its interpretability. For example, a simple logistic regression model might have lower accuracy than a complex neural network, but it requires less labelled data, is faster to train, and is easier to deploy.
- Avoid the state-of-the-art trap: don't pick a model just to follow the latest trend.
- Start with the simplest models and use them as a baseline.
- Avoid human biases in selecting models.
- Evaluate good performance now vs. good performance later; think of potential future situations.
- Evaluate trade-offs, such as false positives vs. false negatives, or compute power vs. accuracy.
- Understand the model's assumptions.
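The first two leakage causes can be avoided with the ordering sketched below: split by time first, then compute scaling statistics on the training split only. The records, the `cutoff`, and the helper names are invented for illustration.

```python
import statistics

def split_by_time(records, cutoff):
    """Everything before the cutoff is training data; the rest is held out."""
    train = [r for r in records if r["timestamp"] < cutoff]
    test = [r for r in records if r["timestamp"] >= cutoff]
    return train, test

# Hypothetical daily measurements with a trend over time.
records = [{"timestamp": day, "value": float(10 * day)} for day in range(1, 11)]

train, test = split_by_time(records, cutoff=8)

# Statistics come from the training split only, never from the full dataset.
train_values = [r["value"] for r in train]
train_mean = statistics.fmean(train_values)
train_std = statistics.pstdev(train_values)

def scale(value):
    return (value - train_mean) / train_std

train_scaled = [scale(r["value"]) for r in train]
# The test split is transformed with the training statistics, as it would be in production.
test_scaled = [scale(r["value"]) for r in test]
```

The scaled test values fall outside the range seen in training, which is exactly the shift a production model would face and an honest evaluation should expose.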
Ensembles
Ensembles are less favored in production because they are more complex to deploy and harder to maintain.
- Bagging (bootstrap aggregating) reduces variance and helps avoid overfitting. Instead of training one classifier on the entire dataset, it samples with replacement to create different datasets, called bootstraps, and trains a model on each of them, e.g. random forest.
- Boosting reinforces weak learners: each learner is trained on the same set of samples, but the samples are weighted differently across iterations, e.g. gradient boosting machines or XGBoost.
- Stacking trains base learners on the training data, then creates a meta-learner that combines the outputs of the base learners to produce the final predictions. The meta-learner can be as simple as a heuristic: take the majority vote or the average vote of all the base learners.
Experiment tracking and versioning
You must track pivotal results: loss curve; model performance; predictions/labels; speed; parameters and hyperparameters.
Distributed Training
In some cases a data sample is so large that it can't even fit into memory, and you will have to use something like gradient checkpointing.
Data parallelism: it's now the norm to train ML models on multiple machines. Each worker has its own copy of the whole model and does all the computation necessary for its copy; the problem is how to accurately and effectively accumulate gradients from different machines (synchronous vs. asynchronous).
Model parallelism: different components of the model are trained on different machines. This doesn't mean that the parts of the model on different machines are executed in parallel; that happens with pipeline parallelism.
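Bagging can be sketched with a deliberately tiny base learner. The one-dimensional threshold "stump", the toy dataset, and the number of bootstraps below are assumptions made for this example; a random forest follows the same pattern with decision trees.

```python
import random
from collections import Counter

def bootstrap(dataset, rng):
    """Sample with replacement to get a dataset of the same size."""
    return [rng.choice(dataset) for _ in dataset]

def train_stump(dataset):
    """Fit a 1-D threshold classifier that predicts 1 when x >= threshold."""
    best_threshold, best_correct = None, -1
    for threshold, _ in dataset:
        correct = sum((x >= threshold) == bool(y) for x, y in dataset)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def bagging_predict(thresholds, x):
    """Aggregate the base learners by majority vote."""
    votes = Counter(int(x >= threshold) for threshold in thresholds)
    return votes.most_common(1)[0][0]

rng = random.Random(0)  # seeded so the sketch is repeatable
data = [(x, int(x >= 5)) for x in range(10)]  # toy data: label is 1 when x >= 5

# One stump per bootstrap; each sees a slightly different dataset.
thresholds = [train_stump(bootstrap(data, rng)) for _ in range(25)]

low_prediction = bagging_predict(thresholds, 0)
high_prediction = bagging_predict(thresholds, 9)
```

Individual stumps may place their threshold a little differently depending on which samples their bootstrap happened to contain; the vote averages those differences out.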
AutoML
AutoML is the process of automating the search for ML algorithms to solve real problems.
Soft AutoML: hyperparameter tuning. Hyperparameters are the parameters supplied by users whose values are used to control the learning process. With different values, the same model can give drastically different performance on the same dataset. The goal of hyperparameter tuning is to find the optimal set of hyperparameters for a given model within a search space; the performance of each set is evaluated on a validation set.
Model Offline Evaluation
For certain tasks, it's possible to infer approximate labels in production based on user feedback (natural labels). For others, you might not be able to evaluate the model's performance in production directly and might have to rely on extensive monitoring to detect changes and failures in the ML system's performance.
Baselines: random baseline; simple heuristic; zero rule baseline; human baseline; existing solutions. The model should be good, but also useful.
Evaluation methods:
- Perturbation tests make small changes to the test set to see how these changes affect the model's performance.
- Invariance tests change the sensitive information to see if the outputs change.
- Directional expectation tests; model calibration; confidence measurement; slice-based evaluation.
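The tuning loop described above can be sketched as a plain grid search. The model here is a one-feature ridge regression with a closed-form fit, chosen only because it is short; the data, the search space, and the function names are made up.

```python
def fit_ridge_1d(xs, ys, l2):
    """Closed-form fit of y = w * x with an L2 penalty of strength l2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2)

def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical noisy training data and a separate validation set.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [2.2, 4.1, 6.3, 8.2]
val_x = [5.0, 6.0]
val_y = [10.0, 12.0]

# The search space for the single hyperparameter.
search_space = [0.0, 0.1, 1.0, 10.0]

# Train once per candidate and score each on the validation set, not the training set.
results = {
    l2: mse(fit_ridge_1d(train_x, train_y, l2), val_x, val_y)
    for l2 in search_space
}
best_l2 = min(results, key=results.get)
```

Larger search spaces are usually explored with random or guided search rather than an exhaustive grid, but the evaluate-on-validation step stays the same.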
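An invariance test can be sketched as follows. Both "models" are hypothetical stand-ins written as plain functions, and the applicant records are invented; with a real model you would call its prediction method instead.

```python
def income_only_model(applicant):
    """Hypothetical model that looks only at income."""
    return applicant["income"] >= 50_000

def biased_model(applicant):
    """Hypothetical model that also, wrongly, depends on a sensitive attribute."""
    return applicant["income"] >= 50_000 and applicant["gender"] != "female"

def invariance_failures(model, samples, sensitive_key, alternatives):
    """Return (sample, value) pairs where changing only the sensitive field flips the output."""
    failures = []
    for sample in samples:
        baseline = model(sample)
        for value in alternatives:
            changed = {**sample, sensitive_key: value}
            if model(changed) != baseline:
                failures.append((sample, value))
    return failures

applicants = [
    {"income": 80_000, "gender": "female"},
    {"income": 30_000, "gender": "male"},
]

ok = invariance_failures(income_only_model, applicants, "gender", ["female", "male"])
bad = invariance_failures(biased_model, applicants, "gender", ["female", "male"])
```

An empty list means the outputs did not change; each entry in `bad` names a sample and the substituted value that flipped the prediction.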
Very good!! It's quite hard to find material this complete and well written on MLOps; this book is a treasure! It's also extremely current (published in May 2022), but a very positive point is that the author doesn't focus on specific tools (although in some passages she comments on the available tools and sketches how they work) or on tutorials, and instead emphasizes the concepts and challenges of an end-to-end machine learning system. She brings discussions and examples, both from industry and from her own career, on very important topics such as feature engineering, monitoring and observability, performance, AI ethics, ML platforms, feature stores, batch vs. streaming, model selection, and much more! Just keep in mind that this is NOT a book for beginners: it's very important to have a good understanding of data science and software engineering to get the most out of it. In short, I highly recommend it to anyone who wants to be or become a well-rounded data scientist/ML engineer!
As an ML engineer, this is the best book I have ever read about practical tips that you can use in your daily work. Imagine you are going to build an ML team that is responsible for providing intelligent solutions: you are required not only to figure out what the solution might be, but also to push your solution into production and keep maintaining it. So your role is part DS + ML scientist + ML infra engineer + QE, etc. You will find almost all you need for each step of the workflow. A must-read for any full-stack ML person.
As a Data Leader, I think Chip has written one of the best pieces on Machine Learning Systems. The book gently approaches several interesting aspects of implementing processes to deploy machine learning systems at large scale.
It is very well written and keeps you engaged throughout the whole book. If you are interested in grasping some insights about the necessary processes when dealing with machine learning systems I think you will really enjoy it.
This is like DDIA for ML systems, but it somehow requires even less prerequisite knowledge. It reads super easily, I was able to finish most of the book in a single day. Very strongly recommend
It's a good introductory book, but very high level, introducing concepts in a way that is easy to follow. It could be better if it covered systems a bit more in depth (for instance, going through an actual inference system in a whiteboard-interview style).
Data science is not that hard. Simply clean and annotate a dataset, select one of the available algorithms from the basics like linear regression to the latest transformer architecture neural networks, train and optimize a loss function for accuracy, precision, and/or recall, make some pretty plots, and move on. You did it!
Unfortunately, the company is going to need you to keep doing that every single day, forever.
Deploying and maintaining models in production is machine learning engineering and operations, and it is in fact pretty hard. Designing Machine Learning Systems is a solid introduction about how to go from ad hoc data science to continual learning with machine learning engineering and operations.
The first and foremost issue is one of data shifts. The data coming into any system is continuously evolving, and entropy means that changes are away from the data that the model was trained on. This means that a useful ML product has to be constantly retrained and redeployed, even in the absence of
The second issue is that the platforms and tooling for doing this are apparently not great. Code versioning via Git is solid. Model versioning via some kind of artifact store is okay, but varies by company. Data versioning is likely bad, requiring painstaking reconstruction from a data swamp (like a data lake, but full of sludge). And the ability to maintain a consistent workflow around code, data, models, and compute as a whole is basically non-existent.
This book has a lot of good questions to ask and targets to aim for, especially in the later chapters (I found the first five or so chapters very basic), but fewer good answers, particularly around the key questions of what metrics to monitor and when to refresh models. I guess this is why they pay us.
Billed as a resource for academic ML researchers to transition into industry and understand how models get deployed “in the real world”, this book is an excellent overview of the problem spaces addressed by MLOps/ML Platform teams. The book assumes some familiarity with various ML model families, but largely takes a systems-oriented approach that’s accessible to software engineers.
Different chapters were varying levels of technical, but provided a solid taxonomy of sub-problems, mixed in with some practical advice, overviews of popular vendors/tools and looooooots of paper citations. In a world where LLMs are drastically lowering the knowledge and data barriers to creating viable models, this book demonstrates to engineers who aren’t ML specialists how to incorporate such tech with sufficient nuance and rigor to avoid some of the most common pitfalls.
I got this book as a gift and didn’t know what to expect, but I knew it would be good stuff coming from Huyen. She’s one of the biggest voices in the MLOps scenario right now, and her teaching method is both pleasant and effective (also, her blog is packed with useful content).
The book provides an overview and tips-and-hints regarding every step of a machine learning model lifecycle, but leans toward model serving in production rather than local notebook development.
It covers all the major issues I ran into in my past projects, so I definitely recommend the read if you’re a junior or semi-senior, but even if you’re a seasoned data scientist.
The book is not as dense as Andriy Burkov’s “Machine Learning Engineering”, but despite that, I could learn quite interesting new tricks.
Excellent book on designing machine learning systems! A must-read for any ML or data science practitioner. Most people who get into machine learning start from the academic perspective, and then learn the hard way the challenges of actually putting machine learning into production. The author does a great job of bridging this gap, presenting different types of problems that someone might face when going from training models in a notebook to putting those models into production, and more broadly how machine learning can be used in industry to answer questions. Great read!
As a data scientist in the non-tech industry who has limited data engineering / deployment experience, I know I have much to learn, but I don't even know where to start. Fortunately, I found this book, which covers basics and introduces best industry practices using real examples. I will recommend it to any data scientist who wants to grow into an end-to-end role.
This book gives a comprehensive overview of the types of activities and processes that businesses must put into place in order to use machine learning effectively in their day-to-day operations.
One caveat is that large language models are only mentioned briefly, but this is understandable, as LLMs are a fairly recent development, and they are not yet widely used in business settings.
I would recommend this book to any data scientist or data engineer who develops or deploys ML models. This book expanded my thinking and will improve my work and how ML can contribute positively to business and society.
A great, fresh, and timely book that covers all aspects of machine learning systems in a very accessible style. ML systems are complex, and a lot of factors and skillsets have to be considered. Because of that, a wide variety of people (software/data engineers, data scientists, DevOps, middle/top managers) would benefit from reading this book. The coverage is not too deep, but the author provides enough references for those who want to dig deeper, and this keeps the material well balanced.
Awesome build of a framework for end to end ML systems. Can be improved in parts (e.g. some parts read too high-level to be meaningful even for a framework book, some diagrams need improvement etc.) but overall a great book.
This book provides a comprehensive and holistic approach to designing ML systems that are reliable, scalable, maintainable, and adaptive to changing environments and business requirements.
I found the book's iterative framework to be particularly valuable. It helped me understand how different design decisions, such as how to process and create training data, which features to use, how often to retrain models, and what to monitor, can impact the system as a whole. This framework made it easier for me to align our team's efforts with the organization's objectives and to make informed decisions about our ML systems.
The book also covers a wide range of topics, from engineering data and choosing the right metrics to automating model development, deployment, and updating, developing a monitoring system, and architecting an ML platform that serves across use cases. The author's use of actual case studies and references made it easy to relate the concepts to real-world scenarios.
I also appreciated the author's focus on full-stack machine learning concepts rather than tooling. This helped me understand how to evaluate different tools and how to position them within our ecosystem. The book also covers responsible ML systems, which is an important topic for any ML team.
In conclusion, as an Engineering Manager, I found this book to be an invaluable resource for understanding how to design and implement ML systems end-to-end. I recommend it to any manager or engineer who is looking to leverage ML to solve real-world problems.
Fairly high-level end-to-end overview of the current state of the world in ML, but don't expect to learn anything radically new. I'd recommend this to someone new to DS/ML, but probably not to anyone with >=2 years in the industry.
I’m amazed by how comprehensive this book is. It now lives next to my desk as a reference when I’m working on ML Engineering tasks. Note that it won’t make you an expert on the tools you’d need as an ML Engineer, but it will be a broad overview of what to learn next.
Great overview for everyone who's interested in how ML models are actually run in production. It's fairly high-level, but explains a lot of topics very well, and gives you enough info to do more research on your own in case a specific topic is especially interesting. Very easy to read, very little prior knowledge needed. It does have some amount of probability theory, but it is well explained, so a commoner like me could follow. If you're interested in becoming an ML engineer, the book will give you a great understanding of all the steps surrounding actual model development: how to deploy, how (and what) to monitor, when and how to retrain, etc.
Chip is a great educator. I was curious why my feeds were blowing up with reviews of her book and took a while to get to reading it. I could not put it down until I completed it. All chapters are thoroughly researched especially the ones on feature engineering, training data, data distribution shifts and monitoring. The real-life examples and anecdotes are the most useful for anyone entering this field. The book still reads fresh after 2 years and I am looking forward to her next book which will release shortly.
This book looks like a student project to me. The structure of the contents seems uncontained: concepts and topics leak into each other, and as a result you get confused about what you are reading at the moment. I was hoping to see a conclusive architecture of a machine learning system, or something that resembled one, but nah, let's describe each technology in a sentence and get away with it.