Apache Spark is amazing when everything clicks. But if you haven't seen the performance improvements you expected, or still don't feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.
Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you'll also learn how to make it sing.
With this book, you'll explore:
How Spark SQL's new interfaces improve performance over SQL's RDD data structure
The choice between data joins in Core Spark and Spark SQL
Techniques for getting the most out of standard RDD transformations
How to work around performance issues in Spark's key/value pair paradigm
Writing high-performance Spark code without Scala or the JVM
How to test for functionality and performance when applying suggested improvements
Using Spark MLlib and Spark ML machine learning libraries
Spark's Streaming components and external community packages
Up to chapter seven the book is superb and deserves 4-5 stars for being thorough and providing good insights into Spark internals.
I especially enjoyed "Chapter 6. Working with Key/Value Data", which showed an iterative approach to designing a computational pipeline, laying out every pitfall and issue one can encounter and providing approaches to overcome them. It is a very good illustration of the point that the most straightforward and readable solution will not necessarily perform well, or even work, in a real distributed big data environment.
Unfortunately, starting from chapter 7 it's just space-filling garbage, not worth reading. I guess it could be 4-5 awesome blog posts, but now it's one mediocre book.
Upd. on re-read: Sadly, this book (like most "framework" tech books) aged quickly, and poorly. Some chapters are still very interesting, and the appendix with debugging / tuning advice makes sense, but a lot of the content is currently irrelevant. The chapters about ML and streaming can be safely skipped. I'd really love to see an updated edition, plus a concentration on under-the-hood things and tuning.
Nice book, though it shouldn't be read like a textbook; it's more like documentation, where you open the chapter you're interested in right now and use the advice you just read. Maybe read the book briefly the first time, without going into details, and when you have any Spark-related troubles or questions, you'll know where to look for the answer.
The typography and flow can be hard to follow at times. The book mentions DataFrames, Datasets, and RDDs but doesn't spend enough time explaining how they relate. Then it starts throwing these terms into casual explanations, which I find hard to follow. That kind of writing style significantly slowed down my reading. There are bits and pieces of gems I picked up along the way up to page 55, though.
Also, I was looking for a book that goes into more of Spark's implementation details, but the book seems to scratch the surface with high-level summaries for the most part.
A good read for learning some of the intricacies when things don't work the way you expect in Spark. It gives good direction for troubleshooting performance bottlenecks and exposes general principles. It could have been better if it talked more about the DataFrame and Dataset concepts, which is where things currently are.
Helps with understanding how Spark works internally. You really need to understand the YARN and Spark cluster parameters to get Spark to perform reliably with bigger jobs with lots of skew. This book doesn't get into that. It only deals with the jobs themselves.
Even though the ML and streaming parts went out of date very quickly, this is a great reference to keep in a library. The first half of the book is extremely helpful! Following Holden on social media and at conferences allowed me to find valuable tips and tricks.
This is a godsend of a book on how Spark works and on optimizing and tuning Spark. The in-depth technical details about Spark. Where I would be looking to tune Spark.
I've read the part I'm gonna read. Indispensable handbook of Spark performance. This 2017 book is really overdue for an update. 4 years is an eternity in this world.
Chapters about RDDs (operations and joins) are quite good, as are the examples. However, the part on Dataset and DataFrame is too basic and repeats the documentation.