Data Analysis with Python and PySpark is a carefully engineered tutorial that helps you use PySpark to deliver your data-driven applications at any scale. This clear and hands-on guide shows you how to enlarge your processing capabilities across multiple machines with data from any source, ranging from Hadoop-based clusters to Excel worksheets. You’ll learn how to break down big analysis tasks into manageable chunks and how to choose and use the best PySpark data abstraction for your unique needs. By the time you’re done, you’ll be able to write and run incredibly fast PySpark programs that are scalable, efficient to operate, and easy to debug.
I was looking for a book dedicated specifically to PySpark (instead of Spark in general):
* Chapters 1-9 are quite informative and focused on developers' interests - you don't get too much info on the overall Spark architecture; instead you're learning how to solve (simple) problems with Spark.
* Unfortunately, the further you go, the less detailed the chapters get (quite counter-intuitively) - I kept feeling that the author was just about to get into the meaty, interesting stuff, and then - bam - the chapter is over.
* Chapters 10 and 11 were supposed (IMHO) to be the magnum opus of the book :) Unfortunately, they are too shallow and too rushed, and in the end I don't think any of the questions I initially had got answered :(
* TBH I skimmed through chapters 12-14 quickly: I wasn't too interested in that way of mixing Spark processing with ML.
My general opinion of this book: it's enough to get you started, but only at a tutorial level. Don't expect it to give you the level of readiness required for actual, real-life work.
A very good introduction to Spark and its components. It does not take anything for granted: the author explains how the APIs work and what Python and SparkSQL are; for instance, he explains how JOINs work regardless of whether you use PySpark or SQL. So if you already know these languages, the book may contain redundant, but still valuable, info.
NOTES Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. It was developed to provide fast and general-purpose data processing capabilities. Spark extends the MapReduce model to support more types of computations, including interactive queries and stream processing, making it a powerful engine for large-scale data analytics.
Key Features of Spark:
- Speed: Spark's in-memory processing capabilities allow it to be up to 100 times faster than Hadoop MapReduce for certain applications.
- Ease of Use: It provides simple APIs in Java, Scala, Python, and R, which makes it accessible for a wide range of users.
- Versatility: Spark supports various workloads, including batch processing, interactive querying, real-time analytics, machine learning, and graph processing.
- Advanced Analytics: It has built-in modules for SQL, streaming, machine learning (MLlib), and graph processing (GraphX).
PySpark is the Python API for Apache Spark, which lets developers write Spark applications in Python. It combines the simplicity and flexibility of Python with Spark's powerful distributed computing capabilities.
Key Features of PySpark:
- Python-Friendly: It enables Python developers to leverage Spark's power using familiar Python syntax.
- DataFrames: Provides a high-level DataFrame API, which is similar to pandas DataFrames, but distributed.
- Integration with Python Ecosystem: Allows seamless integration with Python libraries such as NumPy, pandas, and scikit-learn.
- Machine Learning: Through MLlib, PySpark supports a wide range of machine learning algorithms.
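To make the DataFrame point concrete, here is a minimal sketch (the column names, values, and local session setup are made up for illustration) of creating and filtering a small distributed DataFrame with pandas-like syntax:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Local session for illustration; on a cluster the builder options would differ
spark = SparkSession.builder.appName("pyspark-intro").getOrCreate()

# A tiny in-memory DataFrame; names and ages are invented example data
df = spark.createDataFrame(
    [("alice", 34), ("bob", 41), ("carol", 29)],
    ["name", "age"],
)

# Familiar, pandas-like operations, but lazily evaluated and distributed
df.filter(F.col("age") > 30).select("name", "age").show()
```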
SparkSQL is a module for structured data processing in Apache Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
Key Features of SparkSQL:
- DataFrames: A DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database or a data frame in R/pandas.
- SQL Queries: SparkSQL allows users to execute SQL queries on Spark data. It supports SQL and Hive Query Language (HQL) out of the box.
- Unified Data Access: It provides a unified interface for working with structured data from various sources, including Hive tables, Parquet files, JSON files, and JDBC databases.
- Optimizations: Uses the Catalyst optimizer for query optimization, ensuring efficient execution of queries.
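A short sketch of how the SQL side and the DataFrame side meet; the view name and data below are hypothetical, and the commented-out Parquet path is only a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-intro").getOrCreate()

# Hypothetical data, registered as a temporary view so SQL can reach it
people = spark.createDataFrame([("alice", 34), ("bob", 41)], ["name", "age"])
people.createOrReplaceTempView("people")

# SQL text and DataFrame calls go through the same Catalyst optimizer
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

# The same unified interface reads structured sources such as Parquet or JSON
# df = spark.read.parquet("path/to/data.parquet")  # placeholder path
```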
Key Components and Concepts

Spark Core:
- RDD (Resilient Distributed Dataset): The fundamental data structure of Spark, an immutable distributed collection of objects that can be processed in parallel.
- Transformations and Actions: Transformations create new RDDs from existing ones (e.g., map, filter), while actions trigger computations and return results (e.g., collect, count).
PySpark:
- RDDs and DataFrames: Similar to Spark Core but accessed using Python syntax.
- SparkContext: The entry point to any Spark functionality, responsible for coordinating Spark applications.
- SparkSession: An entry point to interact with DataFrames and the Spark SQL API.
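The transformation/action distinction is easiest to see with an RDD obtained through the SparkContext; the numbers and lambdas below are arbitrary, and this is only a sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext  # the lower-level entry point behind the SparkSession

rdd = sc.parallelize(range(10))

# Transformations are lazy: nothing is computed yet
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Actions trigger the actual distributed computation
print(evens_squared.count())    # 5
print(evens_squared.collect())  # [0, 4, 16, 36, 64]
```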
SparkSQL:
- DataFrame API: Provides a high-level abstraction for structured data.
- SparkSession: Central to SparkSQL, used to create DataFrames, execute SQL queries, and manage Spark configurations.
- SQL Queries: Enables running SQL queries using the sql method on a SparkSession.
- Catalog: Metadata repository that stores information about the structure of the data.
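As a small illustration of the Catalog (the temporary view registered here is a made-up example), metadata about databases, tables/views, and columns can be inspected through spark.catalog:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalog-demo").getOrCreate()

# Register a throwaway temporary view so there is something in the catalog
spark.createDataFrame([(1, "a")], ["id", "label"]).createOrReplaceTempView("demo")

# The Catalog exposes metadata about databases, tables/views, and columns
print(spark.catalog.listDatabases())
print(spark.catalog.listTables())
print(spark.catalog.listColumns("demo"))
```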
This book is really excellent. A great intro to Spark for those already familiar with Python and SQL. Some of the later exercises provide some good challenges too.
The only possible criticism is that this book occupies a fairly niche area of Spark use: data analysis and ML (in section 3). I didn't read the third section, as I won't be using much ML professionally in the future. Plenty of folks use Spark for ML, especially in Databricks, but the most popular use of Spark is still data engineering and ETL pipelines. This is fine, the book is about something else, but some things one might expect for data engineering use cases, such as orchestration with Spark and data quality control with pydeequ, are not present. Regardless of the use case, I would expect some more discussion of shuffling, its effects, and strategies to avoid shuffling when joining large DataFrames.
Since I gave the book Data Pipelines with Apache Airflow 4/5 stars, I couldn't help but give this one 3/5, as I didn't enjoy it as much.
Let's be frank... Spark really seems like a big system (libraries, setup, configuration...) with a lot of nuances. All in all, this is a good book.
At times I felt like the author was going too quickly through some content or not explaining it for dummies. I'm also not sure whether it would have been better to introduce the reader to Spark internals before starting with the first application, although I can understand the benefits of this approach.
I also found, after consulting the exercise solutions, that I had misunderstood some of the exercises.
This is a good book to get started with writing PySpark code and gaining some understanding of what is going on. Some of the explanations were a bit confusing for me and I needed to consult other sources.
This book won't make you a PySpark intermediate, but it's a good starting point for those interested in programming with PySpark.