Apache Spark's speed, ease of use, sophisticated analytics, and multilanguage support make practical knowledge of this cluster-computing framework a required skill for data engineers and data scientists. With this hands-on guide, anyone looking for an introduction to Spark will learn practical algorithms and examples using PySpark. In each chapter, author Mahmoud Parsian shows you how to solve a data problem with a set of Spark transformations and algorithms. You'll learn how to tackle problems involving ETL, design patterns, machine learning algorithms, data partitioning, and genomics analysis. Each detailed recipe includes PySpark algorithms using the PySpark driver and shell script. With this book, you
I read only the first 5 chapters; the rest was not of much interest to me. The book starts with a really simple and fundamental problem, which is nice. The author provides a couple of different solutions, compares them in terms of performance, and also investigates how they actually work under the hood.
But that was all. There are no other problems like that. The rest mostly teaches you the functionality of the Spark RDD API. I am a little bit disappointed by this approach. I was expecting the author to provide new (and even more interesting) problems in each chapter.
That way, the book would be 5/5. Now it's just 3/5.
helpful overview, but a little uneven. some kind of questionable python in places. a little confused in how much math to bring to the table—for example, makes a surprising amount of reference to group and category theory, but without a whole lot of discussion for the uninitiated.
spends more time on RDDs than DataFrames, which is a bit unfortunate.
overall, probably worth skimming if you’re new to the space. may be more useful in tackling specific problems…