Learn how to use, deploy, and maintain Apache Spark with this comprehensive guide, written by the creators of the open-source cluster-computing framework. With an emphasis on improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break down Spark topics into distinct sections, each with unique goals.
You'll explore the basic operations and common functions of Spark's structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and scenarios for employing MLlib, Spark's scalable machine-learning library.
- Get a gentle overview of big data and Spark
- Learn about DataFrames, SQL, and Datasets (Spark's core APIs) through worked examples
- Dive into Spark's low-level APIs, RDDs, and the execution of SQL and DataFrames
- Understand how Spark runs on a cluster
- Debug, monitor, and tune Spark clusters and applications
- Learn the power of Structured Streaming, Spark's stream-processing engine
- Learn how you can apply MLlib to a variety of problems, including classification and recommendation
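As a taste of the structured APIs the blurb highlights, here is a minimal sketch of the same aggregation written twice, once with the DataFrame API and once in Spark SQL; the CSV path and the "dest" column are invented for illustration, not taken from the book:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Minimal sketch; the input path and the "dest" column are hypothetical.
val spark = SparkSession.builder().appName("example").master("local[*]").getOrCreate()
val df = spark.read.option("header", "true").csv("/tmp/flights.csv")

// The aggregation expressed with the DataFrame API...
df.groupBy("dest").agg(count("*").as("flights")).orderBy(desc("flights")).show(5)

// ...and the same logic expressed in SQL against a temporary view.
df.createOrReplaceTempView("flights")
spark.sql("""SELECT dest, COUNT(*) AS flights
             FROM flights GROUP BY dest
             ORDER BY flights DESC LIMIT 5""").show()

spark.stop()
```

Both forms go through the same Catalyst optimizer, which is much of the point of the structured APIs.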
Unnecessarily long. Could have been written in 40 pages. You don't have to explain things again and again; we get it, it's easier than Hadoop.
Anyway, since it's so low level, it didn't help me much at all. I have to write code with annotations, so the low-level understanding wasn't useful; in a collaborative, structured project there's no need to hunt for optimizations in Spark. (Everyone uses HiveQL anyway, but shhhh! Don't let the buyers know that.)
An interesting book. It's not bad. But the main problem is that it tries to cover multiple complex topics: from machine learning, building ETLs, Spark packages, and productionizing applications, to streaming. That leaves it halfway between a cookbook and a theoretical book.
The overview in the first chapters is very interesting, although a bit outdated, judging by the specialized blogs I've been consulting and the Spark API itself.
I think it would be more useful to have two books: one dealing with the design of data-consuming applications from a practical angle, working as the Spark cookbook companion to "Designing Data-Intensive Applications"; and another covering the available machine learning libraries. I would drop the Streaming part (Part V) in order to go deeper into the Low-Level APIs (Part III).
If you're interested in learning about data processing with Spark:
I. Gentle Overview of Big Data and Spark
II. Structured APIs—DataFrames, SQL, and Datasets
IV. Production Applications
If you're interested in understanding how machine learning algorithms are implemented in Spark (it's not a book for learning ML from scratch, but for learning how to do it in Spark):
I. Gentle Overview of Big Data and Spark
VI. Advanced Analytics and Machine Learning
I've used this book (along with "Learning Spark") to pass my Databricks Spark certification exam. In Parts I, II, and IV it covers enough ground to get a good grip on the architecture and "how it works", but, as with anything code related, the best way to really learn the Dataset/DataFrame API is to just start using it.
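In that spirit, here is a tiny, self-contained sketch of the typed Dataset API to start from; the Taxi case class and the sample rows are invented:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical typed-Dataset example; Taxi and the sample rows are made up.
case class Taxi(borough: String, fare: Double)

object DatasetDemo extends App {
  val spark = SparkSession.builder().appName("ds-demo").master("local[*]").getOrCreate()
  import spark.implicits._

  val rides = Seq(Taxi("Queens", 12.5), Taxi("Brooklyn", 8.0), Taxi("Queens", 20.0)).toDS()

  // Typed transformations: the lambdas are checked at compile time,
  // unlike string column names in the untyped DataFrame API.
  rides.filter(_.fare > 10.0)
       .groupByKey(_.borough)
       .count()
       .show()

  spark.stop()
}
```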
There was a bit of repetition that perhaps wasn't needed though.
+1 for having Scala and Python examples!
What could definitely be better are some of the architecture drawings. Using random shapes and identifying them in the text as "executor" / "driver" is much less clear than just labeling them explicitly. Also, some of the figures had no legend. Stuff like that :-)
I'm going to mark this one as 'finished'. There are additional sections on streaming Spark and machine learning that I'm probably not going to work through in detail.
Excellent resource. This is my primary Spark reference. I really got a lot out of the examples on GitHub.
The only real drawback of this book is that it is way overdue for a new edition. It's a 2018 book that covers Spark 2.0. Spark 3.0 was released last year with a lot of important changes, and we are now on 3.1.
It's strange to see architecture diagrams of worse quality than "system design" whiteboard interviews: odd circles and squares in different colors (and the book is not printed in color). Also, a question for the O'Reilly editors sneaked into the main text. And overall, a lot of the focus is on the APIs, Streaming, and ML, with no end-to-end Big Data scenario, such as how Spark can work together with data lakes.
A solid introduction to Apache Spark, covering its architecture and the practical use of its APIs (with examples in both Python and Scala). Readers with some prior knowledge of distributed systems will go through some sections more easily. While the book does a good job explaining core concepts, it would have benefited from a deeper exploration of performance optimizations and troubleshooting techniques.
Comprehensive, with its own shortcomings: engineering best practices are too shallow; cluster management and Spark UI inspection (a daily chore for most) are touched on very lightly; it is repetitive and cross-references too much (while saying "due to space limitations, go read xxx"); and the architectural figures look arbitrarily drawn. Might be helpful for those who are newly exposed to Spark.
Lives up to the expectations set by the title. For all levels of readers, but familiarity with Scala or Python is needed. Examples are in both languages. Tuning and optimization should have been covered in more detail.
Although I did not cover all the chapters/parts (the Streaming and ML parts were mainly skipped), it was a decent book to follow and learn from. The DataFrame API material was nicely organized and well correlated, which was my main goal in studying this book.
Like any technical book, you don't have to read it from cover to cover. Some of its chapters stand on their own, helping to build intuition without getting bogged down in syntax details, which is useful. Plus, it's well written enough that it's worth giving it a try.
A nice book for a beginner. The authors cover a lot of ground, and I especially liked the Spark Streaming part at the end of the book, but they do not go very deep into each concept.
This has to be the most poorly edited book I've ever read. Some examples: there is a figure with boxes that represent two different kinds of components, where the only way to tell the components apart is by their shading; however, the shading for all the boxes in the figure is exactly the same. There are long-running code examples that could not possibly compile and that violate basic principles of Scala programming (e.g., case classes treated as mutable objects). And there are TODO-style notes still present in the text, something like "talk to the people at O'Reilly to get the names of some books to fill out this section."
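To make the case-class complaint concrete, here is a hypothetical sketch of my own (not an excerpt from the book): case class fields are immutable vals, so code that tries to reassign them simply won't compile, and the idiomatic fix is to derive a new instance with copy.

```scala
// Hypothetical illustration of "case classes treated as mutable objects".
case class Flight(origin: String, dest: String, count: Long)

val f = Flight("SEA", "JFK", 10)

// Treating the case class as mutable does not compile,
// because its fields are immutable vals:
// f.count = f.count + 1   // error: reassignment to val

// Idiomatic Scala: build a modified copy instead.
val updated = f.copy(count = f.count + 1)
println(updated) // Flight(SEA,JFK,11)
```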
Many sections of the book provide only surface-level treatment of a topic, or leave out discussion of topics entirely as "outside the scope of this book." Not that there is anything wrong with this in general, but it is definitely not what I expect from a book entitled "The Definitive Guide."
Setting aside these two flaws, this is an excellent book. I sincerely hope they do more proofing and editing for the next edition.
Really between 4 & 5 stars because of some discrepancies in examples, etc.
But it's a really good book about the current version of Spark (2.2, with some mentions of 2.3). The book mostly concentrates on DataFrames, in contrast with other Spark books that mostly talk about RDDs (a contrast sketched after this review).
A lot of useful information, including Structured Streaming, machine learning, and even a short description of GraphFrames.
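To sketch that DataFrame-versus-RDD contrast (the word-count data here is invented): the RDD version spells out each transformation by hand, while the DataFrame version is declarative and goes through the Catalyst optimizer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("contrast").master("local[*]").getOrCreate()
import spark.implicits._

val lines = Seq("spark is fast", "spark is fun")

// RDD style: explicit functional steps, opaque to the optimizer.
val rddCounts = spark.sparkContext
  .parallelize(lines)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
rddCounts.collect().foreach(println)

// DataFrame style: declarative operators planned by Catalyst.
lines.toDF("line")
  .select(explode(split(col("line"), " ")).as("word"))
  .groupBy("word")
  .count()
  .show()

spark.stop()
```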
If you can use additional data sets from the internet, then this makes for brilliant reading. The examples are just introductory, so working out different scenarios with additional data sets will really pay off.
It's fine, covers everything shallowly. The API changes so frequently that you probably need this book: 95% of the Google hits for a given Spark feature are now either wrong or suboptimal.
A must-have for understanding the root mechanisms of Spark, but take into account that all the major APIs are continuously changing, so always check which version you're on.