Summary
The Spark distributed data processing platform provides an easy-to-implement tool for ingesting, streaming, and processing data from any source. In Spark in Action, Second Edition, you’ll learn to take advantage of Spark’s core features and incredible processing speed, with applications including real-time computation, delayed evaluation, and machine learning. Spark skills are a hot commodity in enterprises worldwide, and with Spark’s powerful and flexible Java APIs, you can reap all the benefits without first learning Scala or Hadoop. Foreword by Rob Thomas.

About the technology
Analyzing enterprise data starts by reading, filtering, and merging files and streams from many sources. The Spark data processing engine handles this varied volume like a champ, delivering speeds 100 times faster than Hadoop systems. Thanks to SQL support, an intuitive interface, and a straightforward multilanguage API, you can use Spark without learning a complex new ecosystem.

About the book
Spark in Action, Second Edition, teaches you to create end-to-end analytics applications. In this entirely new book, you’ll learn from interesting Java-based examples, including a complete data pipeline for processing NASA satellite data. And you’ll discover Java, Python, and Scala code samples hosted on GitHub that you can explore and adapt, plus appendixes that give you a cheat sheet for installing tools and understanding Spark-specific terms.

What's inside
Writing Spark applications in Java
Spark application architecture
Ingestion through files, databases, streaming, and Elasticsearch
Querying distributed datasets with Spark SQL

About the reader
This book does not assume previous experience with Spark, Scala, or Hadoop.

About the author
Jean-Georges Perrin is an experienced data and software architect. He is France’s first IBM Champion and has been honored for 12 consecutive years.

Table of Contents

PART 1 - THE THEORY CRIPPLED BY AWESOME EXAMPLES
1 So, what is Spark, anyway?
2 Architecture and flow
3 The majestic role of the dataframe
4 Fundamentally lazy
5 Building a simple app for deployment
6 Deploying your simple app

PART 2 - INGESTION
7 Ingestion from files
8 Ingestion from databases
9 Advanced ingestion: finding data sources and building your own
10 Ingestion through structured streaming

PART 3 - TRANSFORMING YOUR DATA
11 Working with SQL
12 Transforming your data
13 Transforming entire documents
14 Extending transformations with user-defined functions
15 Aggregating your data

PART 4 - GOING FURTHER
16 Cache and checkpoint: Enhancing Spark’s performances
17 Exporting data and building full data pipelines
18 Exploring deployment
Examples in the GitHub repo are now in Java, Python, and Scala, unlike the first edition, which was Java only. The book itself still covers only Java, though, and the Maven POMs compile only Java. I was able to get the Maven build working for Scala without too much effort.
In the first few chapters, it is very easy to get the code examples working. It's like a cookbook. But then in chapter 6, the author pushes the reader off a cliff: you need to come up with a cluster to deploy your Spark application to, without much help from the author. He does describe his hardware hobby project of building his own cluster server, but that is not going to be useful for most readers. I feel like this might have been the end of the road for me if I had been working through this book as a total beginner back in 2017-18. I was able to get through it using the Bitnami Spark Docker image at https://hub.docker.com/r/bitnami/spark/.
Chapter 9 "Advanced Ingestion" was another hard chapter.
Chapters 11-15 on transformations are probably what is going to be of interest to most readers and arguably could have come earlier.
There is some pretty good reference material in the appendices, which saves this from being just a long-march tutorial.
The author keeps highlighting the value of functions throughout all his examples. I found this helpful; it's a topic I personally hadn't paid much attention to, but I realize now there is a lot of value there.
This might be the best relatively recent (2020) book about Spark. It covers Spark 3.