It's almost 2018 and the hype about Spark, one of the hottest Apache projects, still hasn't died down. On the contrary: it has one of the most active communities out there, which means a steady stream of new features. We need documentation, examples, blogs and books. It is actually getting hard to find titles that are up to date, because by the time a book gets approved and printed, the technology it covers has already evolved. Spark for Python Developers is one of the few books available that cover PySpark, the combination of Spark and Python. Published at the end of 2015, it tries to get the reader started with PySpark through the step-by-step build of a data-intensive application.
Before diving into the details, a couple of words. First, the book is already two years old. For Spark, that's a lot. But its age has little to no effect on my overall evaluation, which is negative for a very simple reason: the book is out of context. It does not teach any PySpark. It does show an example of how to process batch and real-time data, but little more. Is this what the reader is looking for? Well, it is certainly not what I was looking for. Neither the preface nor any other part of the book states that the reader is supposed to already know Spark and its architecture. What is an RDD anyway? What happens internally when I read a file? What are the options? What's the typical workflow?
None of this is answered. It's all taken for granted. But if the reader is supposed to know Spark already, then why does the first chapter explain how to install it? Moving forward, the story is the same. There is little of what we really want. The second chapter, for example, shows how to get the tokens and registrations needed to fetch data through the APIs of social media such as GitHub, Twitter and MeetUp. Step-by-step guides with plenty of colorful images help the reader get the job done. You definitely can't get lost. But still, it's out of context, even though those social media feeds are needed to build the application. Later on, as the author starts building, we still find ourselves with guides on how to install even more software, such as Mongo.
Speaking of the application: there is no real application, just pieces of code copied and pasted from IPython.
While I find playing with Twitter data interesting overall, I am very disappointed with this text. From a beginners' book, I would expect the author to cover the basics. How is a PySpark program built? What are the objects I can create? How are they used? Of course, the official documentation and API reference come to the rescue.
Searching through piles of candidates' resumes and looking for familiar jargon from the job description is not enough to find the ideal fit. It is all about having in-depth knowledge of the technology stack and development tools needed in Python programming. It helps to understand how the various technologies relate to one another and how they are structured. It also helps a great deal to assess the complete picture of the interviewee's experience: how they grew their skills over time, what projects they have worked on, and what tools they are proficient with.
Pretty good book to start off with. In a few places, more description was needed to understand the concepts. But overall it gives a good overview and introduces all the things you might need in Spark.
It is a short but very useful book. I found the chapter on learning from data using Spark quite interesting. The book is well structured. However, a bit more depth on each topic would make it even better.