This book is a real mess, to the point where you could question whether it should be called a book at all. It needs serious editing, which is not surprising since it’s a work in progress by a data scientist, freely available on GitHub. Expect placeholders, typos, and broken links here and there.
Nonetheless, it gave me a few good insights. I’m still early in my journey of learning data science and consider myself a beginner.
The author has a data platform blueprint, made of five loosely coupled areas: Connect, Buffer, Processing Frameworks, Store and Visualize.
It is simple, probably just common sense for seasoned data engineers, but at least it helped me grasp these areas and their challenges.
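As a rough illustration of what "loosely coupled" could mean here, below is a minimal Python sketch of the five areas as separate steps that only exchange plain data. It is not taken from the book: the function names, the sensor events, and the in-memory dict standing in for a database are all assumptions made up for the example.

```python
from collections import deque
from statistics import mean


def connect(raw_events):
    """Connect + Buffer: accept incoming events and queue them up."""
    buffer = deque()
    for event in raw_events:
        buffer.append(event)
    return buffer


def process(buffer):
    """Processing: drain the buffer and average the readings per sensor."""
    readings = {}
    while buffer:
        event = buffer.popleft()
        readings.setdefault(event["sensor"], []).append(event["value"])
    return {sensor: mean(values) for sensor, values in readings.items()}


def store(results, database):
    """Store: persist the results (a dict stands in for a real database)."""
    database.update(results)
    return database


def visualize(database):
    """Visualize: turn the stored values into human-readable lines."""
    return [f"{sensor}: {avg:.1f}" for sensor, avg in sorted(database.items())]


# Hypothetical input data, invented for this sketch.
events = [
    {"sensor": "a", "value": 1.0},
    {"sensor": "b", "value": 4.0},
    {"sensor": "a", "value": 3.0},
]
database = {}
report = visualize(store(process(connect(events)), database))
print(report)
```

Each step only depends on the shape of the data it receives, so in principle any one of them could be swapped for a real tool (a message queue for the buffer, a real database for the store) without touching the others.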
The author later warns about ML in production at scale: “Doing machine learning in production is very different than for proof of concepts or in education. One of the hardest parts is keeping models updated.” This is not much help, as the book offers no solution here (what about Airflow?), but at least I have been warned.
The book finishes with a collection of links covering data science usage at several famous companies (Airbnb, Netflix...), public cloud offerings (at least from the usual suspects, i.e. GCP, AWS, Azure), public data sets, and finally data scientist interview questions (most of which I personally found a bit too simple, but I loved the idea of covering this subject).