High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark

Apache Spark is amazing when everything clicks. But if you haven't seen the performance improvements you expected, or still don't feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.

Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you'll also learn how to make it sing.

With this book, you'll explore:


How Spark SQL's new interfaces improve performance over Spark's RDD data structure
The choice between data joins in Core Spark and Spark SQL
Techniques for getting the most out of standard RDD transformations
How to work around performance issues in Spark's key/value pair paradigm
Writing high-performance Spark code without Scala or the JVM
How to test for functionality and performance when applying suggested improvements
Using Spark MLlib and Spark ML machine learning libraries
Spark's Streaming components and external community packages

358 pages, ebook

Published May 25, 2017

90 people are currently reading
294 people want to read

About the author

Holden Karau

14 books, 20 followers

Ratings & Reviews

Community Reviews

5 stars: 30 (23%)
4 stars: 72 (56%)
3 stars: 19 (14%)
2 stars: 7 (5%)
1 star: 0 (0%)
Displaying 1 - 15 of 15 reviews
Bugzmanov
235 reviews, 100 followers
December 7, 2017
Up to chapter seven, the book is superb and deserves 4-5 stars for being thorough and providing good insights into Spark internals.

I especially enjoyed "Chapter 6. Working with Key/Value Data", which shows an iterative approach to designing a computational pipeline, laying out every pitfall and issue one can encounter and providing ways to overcome them. It is a very good illustration of the point that the most straightforward and readable solution will not necessarily perform well, or even work, in a real distributed big data environment.

Unfortunately, starting from chapter 7 it's just space-filling garbage, not worth reading. I guess it could have been 4-5 awesome blog posts, but as it stands it's one mediocre book.
Vishwanath
45 reviews, 8 followers
November 15, 2018
Going to keep going back to this one for debugging concepts, resource allocation and effective transformations.
Leo
336 reviews, 26 followers
October 16, 2022
Update on re-read: Sadly, this book (like most "framework" tech books) aged quickly and poorly. Some chapters are still very interesting, and the appendix with debugging and tuning advice still makes sense, but a lot of the content is now irrelevant. The chapters about ML and streaming can be safely skipped.
I'd really love to see an updated edition, with more focus on under-the-hood details and tuning.

Nice book, though it shouldn't be read like a textbook. Treat it more like documentation: open the chapter you're interested in right now and apply the advice you just read. Maybe skim the book the first time, without going into details; then, when you run into Spark-related troubles or questions, you'll know where to look for the answer.
14 reviews
February 26, 2023
The typography and flow can be hard to follow at times. The book mentions DataFrames, Datasets, and RDDs but doesn't spend enough time explaining how they relate. Then it starts throwing these terms around in casual explanations, which I found hard to follow. That kind of writing style significantly slowed down my reading. There are bits and pieces of gems I picked up along the way up to page 55, though.

Also, I was looking for a book that goes into more of Spark's implementation details, but this one mostly scratches the surface with high-level summaries.
Kunal Tiwary
1 review
May 24, 2020
A good read for learning some of the intricacies behind why things don't work the way you expect in Spark.
It gives good direction for troubleshooting performance bottlenecks and lays out general principles.
It could have been better if it covered more of the DataFrame and Dataset concepts, which is where things currently stand.
Danyel Lawson
98 reviews, 1 follower
January 10, 2019
Helps with understanding how Spark works internally. You really need to understand the YARN and Spark cluster parameters to get Spark to perform reliably on bigger jobs with lots of skew. This book doesn't get into that. It only deals with the jobs themselves.
Yuriy Mashtalir
4 reviews, 2 followers
February 9, 2024
Even though the ML and streaming parts went out of date very quickly, this is a great reference to keep in a library. The first half of the book is extremely helpful! Following Holden on social media and at conferences allowed me to find valuable tips and tricks.
13 reviews
December 25, 2018
Probably one of the best books about Spark that I have ever read. It provides a very helpful understanding of Spark's internals and optimization.
Yingting
6 reviews
November 19, 2019
This is a godsend of a book on how Spark works and on optimizing and tuning Spark, with in-depth technical details about Spark.
It is where I would look when tuning Spark.
Łukasz Słonina
124 reviews, 26 followers
November 9, 2020
Very technical; it reads more like documentation. I was expecting more context for each problem and solution.
Larry
769 reviews, 2 followers
September 18, 2021
I've read the part I'm gonna read.
Indispensable handbook of Spark performance.
This 2017 book is really overdue for an update. 4 years is an eternity in this world.
Raman Yelianevich
2 reviews
April 7, 2023
The chapters about RDDs (operations and joins) are quite good, as are the examples. However, the part on Dataset and DataFrame is too basic and repeats the documentation.