Agile Data Science 2.0: Building Full-Stack Data Analytics Applications with Spark

Data science teams looking to turn research into useful analytics applications require not only the right tools, but also the right approach if they’re to succeed. With the revised second edition of this hands-on guide, up-and-coming data scientists will learn how to use the Agile Data Science development methodology to build data applications with Python, Apache Spark, Kafka, and other tools. Author Russell Jurney demonstrates how to compose a data platform for building, deploying, and refining analytics applications with Apache Kafka, MongoDB, ElasticSearch, d3.js, scikit-learn, and Apache Airflow. You’ll learn an iterative approach that lets you quickly change the kind of analysis you’re doing, depending on what the data is telling you. Publish data science work as a web application, and effect meaningful change in your organization.

349 pages, Paperback

Published July 18, 2017

30 people are currently reading
110 people want to read

About the author

Russell Jurney

6 books, 3 followers

Ratings & Reviews

Community Reviews

5 stars: 8 (15%)
4 stars: 21 (40%)
3 stars: 14 (26%)
2 stars: 9 (17%)
1 star: 0 (0%)
Displaying 1 - 10 of 10 reviews
Rebecca Bilbro
Author, 2 books, 4 followers
March 7, 2018
Favorite quotes:
- "In data science, by contrast to software engineering, code shouldn't always be good; it should be eventually good."
- "In Agile Data Science, we value generalists over specialists...Examples of good Agile Data Science team members include: Designers who deliver working CSS; Web developers who build entire applications and understand the user interface and user experience; Data scientists capable of both research and building web services and applications; Researchers who check in working source code, explain results, and share intermediate data; Product managers able to understand the nuances in all areas"
- "In data products, the data is ruthlessly opinionated. Whatever we wish the data to say, it is unconcerned with our own opinions. It says what it says. This means the waterfall model has no application. It also means that mocks are an insufficient blueprint to establish consensus in software teams."
- "Extracted features from unstructured data get cleaned only in the harsh light of day, as users consume them and complain; if you can't ship your features as you extract them, you're in a state of free fall. The hardest part of building data products is pegging entity and feature extraction to products smaller than your ultimate vision. This is why schemas must start as blobs of unstructured text and evolve into structured data only as features are extracted.
Features must be exposed in some product form as they are created, or they will never achieve a product-ready state. Derived data that lives in the basement of your product is unlikely to shape up. It is better to create entity pages to bring entities up to a "consumer-grade" form, to incrementally improve these entities, and to progressively combine them than to try to expose myriad derived data in a grand vision from the get-go."
- "Rare is the chart that tells a story. This is because most people make a chart and move on… when in reality, you have to iteratively create and improve charts to achieve useful visualizations... You can create charts in an ad hoc way at first, but as you progress, your workflow should become increasingly automated and reproducible."
- "Agile Data Science is an approach to data science centered around web application development. It asserts that the most effective output of the data science process suitable for effecting change in an organization is the web application. It asserts that application development is a fundamental skill of a data scientist. Therefore, doing data science becomes about building applications that describe the applied research process: rapid prototyping, exploratory data analysis, interactive visualization, and applied machine learning."
4 reviews
May 19, 2018
I really wanted to love this book. The concept of walking through a quite elaborate example is excellent. I think it shows many of the pitfalls and iterations you need to go through from start to finished data product.

However, as one of the other reviews also points out, the code examples in there are unfinished and the instructions are hard to follow. For me it started in chapter 4, where you are asked to run the first piece of code to convert data. I wasn't sure where to run it, and it took me a while to figure out it was supposed to be executed in Spark. Probably just me being slow.
Once I booted Spark up it would still not run, as there was a piece of initialization data missing, which I only found once I looked in another branch on git. That was just the beginning. Once I got to connecting Flask and Mongo it really went downhill, as the versions weren't compatible.

So if the code gets some polish and there is a ready-to-use image for Vagrant and EC2 that works, then it will be a 5-star book.

1 review
May 8, 2018
It's a good book for programmers or beginner-level data scientists who want to learn data science concepts and the popular tool set in the industry.
However, the book has not kept up with the latest software versions, so some of the code in the book does not work as expected (too bad). That puts the quality of the book below standard.
For example, it uses the pyElasticsearch package, which only supports ElasticSearch versions below 2.0.
It's an OK book that needs more polish, in my opinion.
Alex
9 reviews, 1 follower
April 9, 2019
This book describes Russell's perspectives on good data science workflow using an agile methodology. He walks through a project about airline flight data in great detail and shows off some really neat tricks for building web apps and doing predictive analytics at scale. I would describe the material as intermediate level, where the reader should already be familiar with the data science ecosystem.

I loved chapter 2, which introduces the technology stack. It's awesome to see minimal working snippets from a whole lineup of open source tools that comprise Russell's pipeline. In particular, the Airflow section here is quite nice. In the remainder of the book, we see how to pull these technologies together.

As others have mentioned, the code is not 100% plug and play. This is hardly surprising given the quickly evolving nature of open source, and particularly how new his tools are. Sure, you could pick up a book on MySQL and run most of the code without issue, but Russell is working with much newer (and frankly more interesting) technologies. For my part, I am not running any code from the book but just reading through and noting code that will make great reference later. A couple of issues that did bother me were the occasional typo, repeated code block, or missing attention to detail in presentation. But don't let that stop you from checking out this excellent book.
Kyle Dinges
401 reviews, 11 followers
January 14, 2021
I'm not sure anyone comes to Goodreads for textbook reviews, but I read it so I'm reviewing it briefly...

3.5 stars. It's a good use case for getting a web app up and running that includes a machine learning model and scales easily. The title isn't kidding when it says full-stack: the case here leverages MongoDB, ElasticSearch, Kafka, Airflow, numerous Python libraries (Flask, scikit-learn, etc.), and more, all with Spark (through PySpark) as the primary engine.

It's probably most useful as a primer on building a Spark-based machine learning model and, most importantly, deploying it. I'd say it requires a cursory-to-intermediate understanding of most of the technologies included. It's getting a bit long in the tooth now that it's almost 5 years old, but if you know enough to follow along, you probably know where any potential deprecations lie.

I thought it was helpful for those with an intermediate Data Science background.
Jose Manuel
241 reviews, 4 followers
December 20, 2017
Impressive. Although R is my main choice and this book uses Python, its approach, which focuses on the "scientific" part of the data scientist's work, is clearly the right one. The first chapters describe my day-to-day so accurately that I was genuinely moved. Its advice to keep things as simple and scalable as possible, to focus on people more than on processes, and to release results quickly and continuously throughout the process is advice I apply every day and recommend to anyone who works in this field.
Vaidas
118 reviews, 4 followers
August 24, 2018
Interesting ideas and quite detailed explanation of the implementation.
I read this mainly for the description of the process and for hints on how one might actually go about implementing all the steps. The book is clear on these points and therefore gets 5 stars. If one actually went on and ran the code, probably something might not really work - but that's software :)
I am actually into these ideas and will do my best to get the DS process at my current employer as close as is practical to this.
Russell Jurney
26 reviews, 4 followers
March 5, 2022
As of February 2022, the code examples on GitHub are fully updated, and there is now a simple Dockerfile you can run and Jupyter notebooks for all examples. This greatly enhances the book, but if you don't know to look in the GitHub repository for this information, you are likely to be frustrated.
29 reviews, 1 follower
April 8, 2021
A bit dated and rough around the edges but still excellent.
Joe
445 reviews, 18 followers
October 2, 2019
Not bad at illustrating the concepts, but a bit too specific to the technology stack that was mentioned in the book. I thought this was helpful for data scientists to understand different steps in the process that they don't always see (DevOps, etc.).

The author's definition of "data science" (page 4) is more similar to "big data" than "statistics," so beware that you're not going to get a lot of stats out of this book.