Kindle Notes & Highlights
by Chip Huyen
Read between January 5 - January 20, 2025
Progress in the last decade shows that the success of an ML system depends largely on the data it was trained on. Instead of focusing on improving ML algorithms, most companies focus on managing and improving their data.
Garbage in and garbage out. If an ML engineer is not obsessed about data quality (e.g. feature/label correctness, dataset distributional properties, etc), it's a clear indication that they don't understand how ML actually works.
A repository for storing structured data is called a data warehouse. A repository for storing unstructured data is called a data lake.
Samples with higher weights affect the loss function more. Changing sample weights can change your model’s decision boundaries significantly,
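A minimal sketch of the point about sample weights, assuming a weighted binary cross-entropy loss. The function name `weighted_log_loss` and the toy labels, predictions, and weights are hypothetical and chosen only for illustration; they are not from the book.

```python
import math


def weighted_log_loss(y_true, p_pred, weights):
    """Weighted mean of per-sample binary cross-entropy (illustrative only)."""
    total = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        per_sample = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += w * per_sample  # a larger weight scales this sample's contribution
    return total / sum(weights)


# Toy data: the third sample is predicted poorly (true label 1, predicted 0.3).
y_true = [1, 0, 1]
p_pred = [0.9, 0.2, 0.3]

uniform = weighted_log_loss(y_true, p_pred, [1.0, 1.0, 1.0])
upweighted = weighted_log_loss(y_true, p_pred, [1.0, 1.0, 5.0])

# Upweighting the poorly predicted sample makes it dominate the loss.
print(uniform, upweighted)
```

With the third sample weighted five times as heavily, its error accounts for most of the loss, so an optimizer minimizing this loss would be pushed harder toward fitting that sample.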
According to Krizhevsky et al. in their legendary AlexNet paper, “The transformed images are generated in Python code on the CPU while the GPU is training on the previous batch of images. So these data augmentation schemes are, in effect, computationally free.”
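The overlap described in that quote can be sketched as a producer-consumer pattern: one thread prepares augmented batches while the main loop consumes the previous one. This is an assumed illustration, not the AlexNet implementation. The names `augment`, `producer`, and `train` are hypothetical, the "augmentation" is a simple reversal standing in for a real image transform, and the "training step" just records the batch.

```python
import queue
import threading


def augment(batch):
    # Hypothetical stand-in for a real transform (e.g. a horizontal flip):
    # reverse each sample in the batch.
    return [sample[::-1] for sample in batch]


def producer(batches, out_queue):
    # Runs on a background thread and prepares batches ahead of the consumer.
    for batch in batches:
        out_queue.put(augment(batch))
    out_queue.put(None)  # sentinel: no more batches


def train(batches):
    # A bounded queue lets the producer stay at most two batches ahead.
    q = queue.Queue(maxsize=2)
    worker = threading.Thread(target=producer, args=(batches, q), daemon=True)
    worker.start()

    seen = []
    while True:
        batch = q.get()
        if batch is None:
            break
        seen.append(batch)  # stand-in for a training step on the accelerator
    worker.join()
    return seen


result = train([[[1, 2, 3]], [[4, 5, 6]]])
print(result)
```

In CPython, threads only overlap usefully when the consumer spends its time outside the interpreter (for example, waiting on a GPU), which is the situation the quote describes; for CPU-bound augmentation in pure Python, separate processes are the more common choice.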
Frederick P. Brooks, “What one programmer can do in one month, two programmers can do in two months.”