Putting predictive models into production is one of the most direct ways that data scientists can add value to an organization. By learning how to build and deploy scalable model pipelines, data scientists can own more of the model production process and more rapidly deliver data products. This book provides a hands-on approach to scaling up Python code to work in distributed environments in order to build robust pipelines. Readers will learn how to set up machine learning models as web endpoints, serverless functions, and streaming pipelines using multiple cloud environments. It is intended for analytics practitioners with hands-on experience with Python libraries such as Pandas and scikit-learn, and it focuses on scaling up prototype models to production.

From startups to trillion-dollar companies, data science is playing an important role in helping organizations maximize the value of their data. This book helps data scientists level up their careers by taking ownership of data products, with applied examples that demonstrate how to:

- Translate models developed on a laptop to scalable deployments in the cloud
- Develop end-to-end systems that automate data science workflows
- Own a data product from conception to production

The accompanying Jupyter notebooks provide examples of scalable pipelines across multiple cloud environments, tools, and libraries (github.com/bgweber/DS_Production).

Book Contents

Here are the topics covered by Data Science in Production:

Chapter 1: Introduction - This chapter motivates the use of Python, discusses the discipline of applied data science, presents the data sets, models, and cloud environments used throughout the book, and provides an overview of automated feature engineering.

Chapter 2: Models as Web Endpoints - This chapter shows how to use web endpoints for consuming data and hosting machine learning models as endpoints using the Flask and Gunicorn libraries. We'll start with scikit-learn models and also set up a deep learning endpoint with Keras. (A minimal sketch of this pattern follows the contents list.)

Chapter 3: Models as Serverless Functions - This chapter builds on the previous chapter and shows how to set up model endpoints as serverless functions using AWS Lambda and GCP Cloud Functions.

Chapter 4: Containers for Reproducible Models - This chapter shows how to use containers for deploying models with Docker. We'll also explore scaling up with ECS and Kubernetes, and building web applications with Plotly Dash.

Chapter 5: Workflow Tools for Model Pipelines - This chapter focuses on scheduling automated workflows using Apache Airflow. We'll set up a pipeline that pulls data from BigQuery, applies a model, and saves the results.

Chapter 6: PySpark for Batch Modeling - This chapter introduces readers to PySpark using the community edition of Databricks. We'll build a batch model pipeline that pulls data from a data lake, generates features, applies a model, and stores the results to a NoSQL database.

Chapter 7: Cloud Dataflow for Batch Modeling - This chapter introduces the core components of Cloud Dataflow and implements a batch model pipeline for reading data from BigQuery, applying an ML model, and saving the results to Cloud Datastore.

Chapter 8: Streaming Model Workflows - This chapter introduces readers to Kafka and PubSub for streaming messages in a cloud environment.
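To give a flavor of the model-as-endpoint pattern from Chapter 2, here is a minimal sketch of a Flask scoring service. It assumes a scikit-learn classifier has already been trained and pickled to a hypothetical model.pkl file; the route name and the feature format are illustrative assumptions, not the book's exact code.

import pickle

import flask
import pandas as pd

app = flask.Flask(__name__)

# Load the trained model once at startup rather than per request.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON object mapping feature names to values.
    features = pd.DataFrame([flask.request.get_json()])
    score = model.predict_proba(features)[0][1]
    return flask.jsonify({"prediction": float(score)})

if __name__ == "__main__":
    # For production, serve with Gunicorn instead of the dev server:
    #   gunicorn --bind 0.0.0.0:5000 app:app
    app.run(host="0.0.0.0", port=5000)

A client could then POST feature values as JSON to /predict and get back a probability, which is the basic shape of the endpoints the book builds out.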
Here's a dark secret about myself as a data scientist: I'm really good with a Jupyter Notebook. It's a great interactive development environment, particularly for data-heavy work that needs a lot of eyes on the data, and it helps me deal with everyday problems like figuring out why a column in my data is dirty, or how to get my tick marks to line up properly on a graph. You can do a lot with Jupyter Notebooks. My own Project Firemind was done entirely in notebooks running on a ridiculous AWS deep learning instance.
But then you get to the real world, and it turns out the real world doesn't run on notebooks. Business wants you to do something every day. Business wants it to scale. Business wants lots of uptime. And Jupyter Notebooks don't do that. Data Science in Production is a decent introduction to going from a trainee data scientist working in notebooks to a real data scientist working with models hosted on scaling clusters behind web endpoints. Weber is a data scientist at Zynga, so he knows his stuff. The book is mostly focused on applications, with specific tips on using AWS and Google Cloud Platform. Cloud tech is changing pretty quickly, so I'm sure the specific implementations will date, but this is a solid book of examples if you want to take the next step as a professional.
And one note: I bought this book on Kindle and fought with the layout the whole time. The book is well laid out, but in a way designed for fixed vertical pages rather than flowing text, which makes sense for a programming book. You should get it as a PDF.
I rated this book 5 stars because of Ben's no-nonsense approach to the topic. Every chapter is straight to the point and clearly explained in a stepwise fashion. It's a good reference for data scientists looking to understand the "how" and "why" of deploying their models to production.