High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark

Rating: 3.98 (15 reviews)
ISBN: 9781491943151

Holden Karau

Apache Spark is amazing when everything clicks. But if you haven't seen the performance improvements you expected, or still don't feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.

Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you'll also learn how to make it sing.

With this book, you'll explore:

How Spark SQL's new interfaces improve performance over Spark's RDD data structure
The choice between data joins in Core Spark and Spark SQL
Techniques for getting the most out of standard RDD transformations
How to work around performance issues in Spark's key/value pair paradigm
Writing high-performance Spark code without Scala or the JVM
How to test for functionality and performance when applying suggested improvements
Using Spark MLlib and Spark ML machine learning libraries
Spark's Streaming components and external community packages

Genres: Technology, Computer Science, Technical, Programming, Computers, Unfinished, Nonfiction

358 pages, ebook

Published May 25, 2017

90 people are currently reading

294 people want to read

About the author

Holden Karau

14 books, 20 followers

Community Reviews

5 stars: 30 (23%)
4 stars: 72 (56%)
3 stars: 19 (14%)
2 stars: 7 (5%)
1 star: 0 (0%)

Bugzmanov

235 reviews, 100 followers

December 7, 2017

Up to chapter seven the book is superb and deserves 4-5 stars for being thorough and providing good insights into Spark internals.

I especially enjoyed "Chapter 6. Working with Key/Value Data", which showed an iterative approach to designing a computational pipeline, laying out every pitfall and issue one can encounter and providing an approach to overcome them. It is a very good illustration of the point that the most straightforward and readable solution will not necessarily perform, or even work, well in a real distributed big data environment.

Unfortunately, starting from chapter 7 it's just space-filling garbage, not worth reading. I guess it could be 4-5 awesome blog posts, but now it's one mediocre book.

2017

Vishwanath

45 reviews, 8 followers

November 15, 2018

Going to keep going back to this one for debugging concepts, resource allocation and effective transformations.

Leo

336 reviews, 26 followers

October 16, 2022

Update on re-read: Sadly, this book (like most "framework" tech books) aged quickly and poorly. Some chapters are still very interesting, and the appendix with debugging and tuning advice still makes sense, but a lot of the content is now irrelevant. The chapters about ML and streaming can be safely skipped.
I'd really love to see an updated edition, plus more concentration on under-the-hood things and tuning.

Nice book, though it shouldn't be read like a textbook. Treat it more like documentation: open the chapter you're interested in right now and use the advice you just read. Maybe read the book briefly the first time, without going into details, so that when you have Spark-related troubles or questions you'll know where to look for the answer.

Yunjiang Jiang

14 reviews

February 26, 2023

The typography and flow can be hard to follow at times. The book mentions DataFrames, Datasets, and RDDs but doesn’t spend enough time explaining how they relate. Then it starts throwing these terms into casual explanations, which I found hard to follow. That kind of writing style significantly slowed down my reading. There are bits and pieces of gems I picked up along the way up to page 55, though.

Also, I was looking for a book that goes into more of Spark's implementation details, but this one seems to scratch the surface with high-level summaries for the most part.

Kunal Tiwary

1 review

May 24, 2020

A good read for learning some of the intricacies when things don't work the way you expect in Spark.
It gives good direction for troubleshooting performance bottlenecks and lays out general principles.
It could have been better if it talked more about the DataFrame and Dataset concepts, which is where things are currently headed.

Danyel Lawson

98 reviews, 1 follower

January 10, 2019

Helps with understanding how Spark works internally. You really need to understand the YARN and Spark cluster parameters to get Spark to perform reliably with bigger jobs with lots of skew. This book doesn't get into that. It only deals with the jobs themselves.

Yuriy Mashtalir

4 reviews, 2 followers

February 9, 2024

Even though the ML and streaming parts went out of date very quickly, this is a great reference to keep in a library. The first half of the book is extremely helpful! Following Holden on social media and at conferences allowed me to find valuable tips and tricks.

Alex Ott

Author, 3 books, 208 followers

November 9, 2017

Packed with a lot of useful information about Spark...

big-data

Frank Palardy

Author, 3 books, 6 followers

December 9, 2017

This info is in other books but it helped me to read it again.

java-jvm

Bao Nguyen

13 reviews

December 25, 2018

Probably one of the best books about Spark that I have ever read. It provides a very helpful understanding of Spark internals and optimization.

Yingting

6 reviews

November 19, 2019

This book is a godsend for understanding how Spark works and for optimizing and tuning Spark. It has in-depth technical details about Spark.
It is where I would look when tuning Spark.

Łukasz Słonina

124 reviews, 26 followers

November 9, 2020

Very technical; it reads more like documentation. I was expecting more context for each problem/solution.

Larry

769 reviews, 2 followers

September 18, 2021

I've read the part I'm gonna read.
Indispensable handbook of Spark performance.
This 2017 book is really overdue for an update. 4 years is an eternity in this world.

Raman Yelianevich

2 reviews

April 7, 2023

Chapters about RDD (operations and joins) are quite good, as well as examples. However, the part with Dataset and DataFrame is too basic and repeats the documentation.