Spark: The Definitive Guide: Big Data Processing Made Simple

Learn how to use, deploy, and maintain Apache Spark with this comprehensive guide, written by the creators of the open-source cluster-computing framework. With an emphasis on improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break down Spark topics into distinct sections, each with unique goals.

You'll explore the basic operations and common functions of Spark's structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and scenarios for employing MLlib, Spark's scalable machine-learning library.


Get a gentle overview of big data and Spark
Learn about DataFrames, SQL, and Datasets--Spark's core APIs--through worked examples
Dive into Spark's low-level APIs, RDDs, and execution of SQL and DataFrames
Understand how Spark runs on a cluster
Debug, monitor, and tune Spark clusters and applications
Learn the power of Structured Streaming, Spark's stream-processing engine
Learn how you can apply MLlib to a variety of problems, including classification or recommendation

603 pages, Paperback

Published April 3, 2018

303 people are currently reading
565 people want to read

About the author

Bill Chambers

38 books, 1 follower

Ratings & Reviews

Community Reviews

5 stars: 110 (39%)
4 stars: 113 (40%)
3 stars: 46 (16%)
2 stars: 7 (2%)
1 star: 3 (1%)
Displaying 1 - 30 of 31 reviews
Vicki
531 reviews, 241 followers
December 17, 2021
This plus Learning Spark are critical to working with the framework IMO.
105 reviews, 47 followers
June 13, 2020
Unnecessarily long. Could have been written in 40 pages. You don't have to explain things again and again, we get it, it's easier than Hadoop.

Since it's so low level, it didn't help me much at all. I have to write code with annotations anyway, and in a collaborative, structured project there's no need to hunt for optimizations in Spark. (Everyone uses HiveQL anyway, but shhhh! Don't let the buyers know that.)
Joaquín Chemile
99 reviews, 4 followers
February 8, 2020
An interesting book. It isn't bad. But the main problem is that it tries to cover multiple complex topics: from machine learning, building ETLs, Spark packages, and production applications to streaming. That makes for a book halfway between a cookbook and a theoretical text.

The overview in the first chapters is very interesting, although a little out of date, judging by the specialized blogs and the Spark API itself that I have been consulting.

I think it would be more useful to have two books: one covering the design of data-consuming applications from a practical angle, serving as the Spark cookbook for the book "Designing Data Intensive Applications"; and another covering the available machine learning libraries. I would remove the Streaming part (Part V) in order to go deeper into the Low-Level APIs (Part III).

If you are interested in learning about data processing with Spark:
I. Gentle Overview of Big Data and Spark
II. Structured APIs—DataFrames, SQL, and Datasets
IV. Production Applications


If you are interested in understanding how machine learning algorithms are implemented in Spark (this is not a book for learning from scratch, but for learning how to do it in Spark):
I. Gentle Overview of Big Data and Spark
VI. Advanced Analytics and Machine Learning
Dylan Meeus
32 reviews, 1 follower
January 29, 2021
Definitely a good introduction to Spark.

I've used this book (along with "Learning Spark") to pass my Spark Databricks Certification exam. In Chapter I, II and IV it covers enough ground to get a good grip on the architecture and "how it works", but, as with anything code related, the best way to really learn the DataSet / DataFrame API is to just start using it.

There was a bit of repetition that perhaps wasn't needed though.

+1 for having Scala and Python examples!

What could definitely be better are some of the architecture drawings. Using just random shapes and assigning them in text as "executor" / "driver" is much less clear than just labeling them explicitly. Also, some of the figures had no legend. Stuff like that :-)
Larry
756 reviews, 4 followers
August 30, 2021
I'm going to mark this one as 'finished'. There are additional sections on streaming Spark and machine learning that I'm probably not going to work through in detail.

Excellent resource. This is my primary Spark reference. I really got a lot out of the examples on GitHub.

The only real drawback of this book is that it is way overdue for a new edition. It's a 2018 book that covers Spark 2.0. Spark 3.0 was released last year with a lot of important changes, and we are now on 3.1.
Mikhail Filatov
363 reviews, 17 followers
June 27, 2022
It's strange to see architecture diagrams of worse quality than those in "system design" whiteboard interviews: odd circles and squares in different colors (the book is not printed in color).
Also, a question for the O'Reilly editors sneaked into the main text.
And overall, much of the focus was on the APIs, streaming, and ML, with no end-to-end scenario for big data, no discussion of how Spark can work together with data lakes, etc.
Breno Ferreira
105 reviews
August 4, 2025
A solid introduction to Apache Spark, covering its architecture and the practical use of its APIs (with examples in both Python and Scala). Readers with some prior knowledge of distributed systems will go through some sections more easily. While the book does a good job explaining core concepts, it would have benefited from a deeper exploration of performance optimizations and troubleshooting techniques.
2 reviews
July 25, 2023
Comprehensive, with its own shortcomings: engineering best practices are covered too shallowly. It touches very lightly on cluster management and Spark UI inspection (a daily chore for most). It is repetitive and cross-references too much (while saying "due to space limitations go read xxx"). The architectural figures are arbitrarily drawn.
Might be helpful for those who are newly exposed to Spark.
Kalyan Tirunahari
126 reviews
October 16, 2019
Lives up to the expectations of the title. For all levels of readers, but familiarity with Scala or Python is needed. Examples are given in both languages. Tuning and optimization should have been covered in more detail.
Nameer Metori
1 review, 1 follower
March 7, 2023
Although I did not cover all the chapters/parts (the streaming and ML parts were mostly skipped), it was a decent book to follow and learn from.
The DataFrame API coverage, which was my main reason for studying this book, was nicely organized and well connected.
Diego Gomez
31 reviews
October 21, 2024
The go-to resource for anyone looking to get started with Apache Spark.

While it's a comprehensive and detailed guide, the book is quite long and serves more as a reference to revisit than a straight-through read.
Denis
18 reviews
May 15, 2025
Like any technical book, you don’t have to read it from cover to cover.
Some of its chapters stand on their own, helping to build intuition without getting bogged down in syntax details, which is useful. Plus, it’s written well enough that it’s worth giving it a try.
f1yegor
9 reviews, 1 follower
September 14, 2018
Broad. I used it to systematize my knowledge for the Databricks certification and the Spark 2.2 update.
2 reviews, 1 follower
February 5, 2019
There are a lot of typos in this book.
Vishal Goel
63 reviews, 28 followers
July 28, 2023
A nice book for a beginner. They cover a lot of ground and I especially liked the Spark Streaming part at the end of the book but they do not go very deep into each concept.
11 reviews
September 28, 2024
A book is a book, and I read this.

It was very informative and valuable for the work I am doing. Would recommend to anyone starting out with Spark and Scala.
221 reviews, 12 followers
November 30, 2018
This has to be the most poorly edited book I've ever read. Some examples: there is a figure with boxes that represent two different kinds of components. The way to tell the components apart is by their shading. However, the shading for all the boxes in the figure is exactly the same. There are long running code examples that could not possibly compile and that violate basic principles of Scala programming (e.g. case classes treated as mutable objects). And there are TODO-style notes still present in the text, such as something like "talk to the people at O'Reilly to get the names of some books to fill out this section."

Many sections of the book provide only surface level treatment of a topic, or leave out discussion of topics entirely as "outside the scope of this book." Not that there is anything wrong with this in general, but definitely not what I expect from a book entitled "The Definitive Guide."

Setting aside these two flaws, this is an excellent book. I sincerely hope they do more proofing and editing for the next edition.
Alex Ott
Author 3 books, 207 followers
December 31, 2019
Really between 4 & 5 stars because of some discrepancies in examples, etc.

But it's a really good book about the current version of Spark (2.2, with some mentions of 2.3). The book concentrates mostly on DataFrames, in contrast with other Spark books that mostly talk about RDDs.

A lot of useful information, including Structured Streaming, machine learning, and even a short description of GraphFrames.

Highly recommended.
Gourav Sengupta
15 reviews, 3 followers
January 5, 2019
If you can use additional data sets from the internet, then this makes for brilliant reading. The examples are just introductory, so using additional data sets to work through different scenarios will really help.
Gavin
Author 2 books, 560 followers
July 27, 2018
It's fine, covers everything shallowly. The API changes so frequently that you probably need this book: 95% of the Google hits for a given Spark feature are now either wrong or suboptimal.
1 review
November 23, 2018
A must-have for understanding the underlying mechanisms of Spark, but take into account that all major APIs are continuously changing, so always consider the version.
LIUF
30 reviews, 2 followers
March 9, 2021
This is a good entry-level book for learning Spark SQL. After finishing this book, I could write satisfactory Spark DataFrame code for production use.