Learn how to use, deploy, and maintain Apache Spark with this comprehensive guide, written by the creators of the open-source cluster-computing framework. With an emphasis on improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break down Spark topics into distinct sections, each with unique goals.
You'll explore the basic operations and common functions of Spark's structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and scenarios for employing MLlib, Spark's scalable machine-learning library.
- Get a gentle overview of big data and Spark
- Learn about DataFrames, SQL, and Datasets (Spark's core APIs) through worked examples
- Dive into Spark's low-level APIs, RDDs, and the execution of SQL and DataFrames
- Understand how Spark runs on a cluster
- Debug, monitor, and tune Spark clusters and applications
- Learn the power of Structured Streaming, Spark's stream-processing engine
- Learn how you can apply MLlib to a variety of problems, including classification and recommendation
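As a taste of the structured APIs the blurb highlights, here is a minimal sketch of the same aggregation written twice, once with the DataFrame API and once in Spark SQL; the CSV path and the "dest" column are invented for illustration, not taken from the book:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Minimal sketch; the input path and the "dest" column are hypothetical.
val spark = SparkSession.builder().appName("example").master("local[*]").getOrCreate()
val df = spark.read.option("header", "true").csv("/tmp/flights.csv")

// The aggregation expressed with the DataFrame API...
df.groupBy("dest").agg(count("*").as("flights")).orderBy(desc("flights")).show(5)

// ...and the same logic expressed in SQL against a temporary view.
df.createOrReplaceTempView("flights")
spark.sql("""SELECT dest, COUNT(*) AS flights
             FROM flights GROUP BY dest
             ORDER BY flights DESC LIMIT 5""").show()

spark.stop()
```

Both forms go through the same Catalyst optimizer, which is much of the point of the structured APIs.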
Unnecessarily long. Could have been written in 40 pages. You don't have to explain things again and again; we get it, it's easier than Hadoop.
Anyway, since it's so low level, it didn't help me much at all. I have to write code with annotations, so the low-level understanding wasn't useful; in a collaborative, structured project there's no need to hunt for optimizations in Spark. (Everyone uses HiveQL anyway, but shhhh! Don't let the buyers know that.)
An interesting book. It's not bad. But the main problem is that it tries to cover multiple complex topics: from machine learning, building ETLs, Spark packages, and productionizing applications, to streaming. That leaves it halfway between a cookbook and a theoretical book.
The overview in the first chapters is very interesting, although a bit outdated, judging by the specialized blogs I've been consulting and the Spark API itself.
I think it would be more useful to have two books: one dealing with the design of data-consuming applications from a practical angle, working as the Spark cookbook companion to "Designing Data-Intensive Applications"; and another covering the available machine learning libraries. I would drop the Streaming part (Part V) in order to go deeper into the Low-Level APIs (Part III).
If you're interested in learning about data processing with Spark:
I. Gentle Overview of Big Data and Spark
II. Structured APIs—DataFrames, SQL, and Datasets
IV. Production Applications
If you're interested in understanding how machine learning algorithms are implemented in Spark (it's not a book for learning ML from scratch, but for learning how to do it in Spark):
I. Gentle Overview of Big Data and Spark
VI. Advanced Analytics and Machine Learning
I've used this book (along with "Learning Spark") to pass my Databricks Spark certification exam. In Parts I, II, and IV it covers enough ground to get a good grip on the architecture and "how it works", but, as with anything code related, the best way to really learn the Dataset/DataFrame API is to just start using it.
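In that spirit, here is a tiny, self-contained sketch of the typed Dataset API to start from; the Taxi case class and the sample rows are invented:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical typed-Dataset example; Taxi and the sample rows are made up.
case class Taxi(borough: String, fare: Double)

object DatasetDemo extends App {
  val spark = SparkSession.builder().appName("ds-demo").master("local[*]").getOrCreate()
  import spark.implicits._

  val rides = Seq(Taxi("Queens", 12.5), Taxi("Brooklyn", 8.0), Taxi("Queens", 20.0)).toDS()

  // Typed transformations: the lambdas are checked at compile time,
  // unlike string column names in the untyped DataFrame API.
  rides.filter(_.fare > 10.0)
       .groupByKey(_.borough)
       .count()
       .show()

  spark.stop()
}
```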
There was a bit of repetition that perhaps wasn't needed though.
+1 for having Scala and Python examples!
What could definitely be better are some of the architecture drawings. Using random shapes and identifying them in the text as "executor" / "driver" is much less clear than just labeling them explicitly. Also, some of the figures had no legend. Stuff like that :-)
I'm going to mark this one as 'finished'. There are additional sections on streaming Spark and machine learning that I'm probably not going to work through in detail.
Excellent resource. This is my primary Spark reference. I really got a lot out of the examples on GitHub.
The only real drawback of this book is that it is way overdue for a new edition. It's a 2018 book that covers Spark 2.0. Spark 3.0 was released last year with a lot of important changes, and we are now on 3.1.
It's strange to see architecture diagrams of worse quality than "system design" whiteboard interviews: odd circles and squares in different colors (and the book is not printed in color). Also, a question for the O'Reilly editors sneaked into the main text. And overall, a lot of the focus is on the APIs, Streaming, and ML, with no end-to-end Big Data scenario, such as how Spark can work together with data lakes.
A solid introduction to Apache Spark, covering its architecture and the practical use of its APIs (with examples in both Python and Scala). Readers with some prior knowledge of distributed systems will go through some sections more easily. While the book does a good job explaining core concepts, it would have benefited from a deeper exploration of performance optimizations and troubleshooting techniques.
Comprehensive, with its own shortcomings: engineering best practices are too shallow; cluster management and Spark UI inspection (a daily chore for most) are touched on very lightly; it is repetitive and cross-references too much (while saying "due to space limitations, go read xxx"); and the architectural figures look arbitrarily drawn. Might be helpful for those who are newly exposed to Spark.
Lives up to the expectations set by the title. For all levels of readers, but familiarity with Scala or Python is needed. Examples are in both languages. Tuning and optimization should have been covered in more detail.
Although I did not cover all the chapters/parts (the Streaming and ML parts were mainly skipped), it was a decent book to follow and learn from. The DataFrame API material was nicely organized and well correlated, which was my main goal in studying this book.
Like any technical book, you don't have to read it from cover to cover. Some of its chapters stand on their own, helping to build intuition without getting bogged down in syntax details, which is useful. Plus, it's well written enough that it's worth giving it a try.
A nice book for a beginner. The authors cover a lot of ground, and I especially liked the Spark Streaming part at the end of the book, but they do not go very deep into each concept.
This has to be the most poorly edited book I've ever read. Some examples: there is a figure with boxes that represent two different kinds of components, where the only way to tell the components apart is by their shading; however, the shading for all the boxes in the figure is exactly the same. There are long-running code examples that could not possibly compile and that violate basic principles of Scala programming (e.g., case classes treated as mutable objects). And there are TODO-style notes still present in the text, something like "talk to the people at O'Reilly to get the names of some books to fill out this section."
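To make the case-class complaint concrete, here is a hypothetical sketch of my own (not an excerpt from the book): case class fields are immutable vals, so code that tries to reassign them simply won't compile, and the idiomatic fix is to derive a new instance with copy.

```scala
// Hypothetical illustration of "case classes treated as mutable objects".
case class Flight(origin: String, dest: String, count: Long)

val f = Flight("SEA", "JFK", 10)

// Treating the case class as mutable does not compile,
// because its fields are immutable vals:
// f.count = f.count + 1   // error: reassignment to val

// Idiomatic Scala: build a modified copy instead.
val updated = f.copy(count = f.count + 1)
println(updated) // Flight(SEA,JFK,11)
```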
Many sections of the book provide only surface-level treatment of a topic, or leave out discussion of topics entirely as "outside the scope of this book." Not that there is anything wrong with this in general, but it is definitely not what I expect from a book entitled "The Definitive Guide."
Setting aside these two flaws, this is an excellent book. I sincerely hope they do more proofing and editing for the next edition.
Really between 4 & 5 stars because of some discrepancies in examples, etc.
But it's a really good book about the current version of Spark (2.2, with some mentions of 2.3). The book mostly concentrates on DataFrames, in contrast with other Spark books that mostly talk about RDDs (a contrast sketched after this review).
A lot of useful information, including Structured Streaming, machine learning, and even a short description of GraphFrames.
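To sketch that DataFrame-versus-RDD contrast (the word-count data here is invented): the RDD version spells out each transformation by hand, while the DataFrame version is declarative and goes through the Catalyst optimizer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("contrast").master("local[*]").getOrCreate()
import spark.implicits._

val lines = Seq("spark is fast", "spark is fun")

// RDD style: explicit functional steps, opaque to the optimizer.
val rddCounts = spark.sparkContext
  .parallelize(lines)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
rddCounts.collect().foreach(println)

// DataFrame style: declarative operators planned by Catalyst.
lines.toDF("line")
  .select(explode(split(col("line"), " ")).as("word"))
  .groupBy("word")
  .count()
  .show()

spark.stop()
```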
If you can use additional data sets from the internet, then this makes for brilliant reading. The examples are just introductory, so working out different scenarios with additional data sets will really pay off.
It's fine, covers everything shallowly. The API changes so frequently that you probably need this book: 95% of the Google hits for a given Spark feature are now either wrong or suboptimal.
A must-have for understanding the root mechanisms of Spark, but take into account that all the major APIs are continuously changing, so always check which version you're on.