A practical guide to designing, testing, and implementing complex MapReduce applications in Scala.
This book is for developers who want to learn how to develop MapReduce applications effectively. Prior knowledge of Hadoop or Scala is not required; however, investing some time in those topics would certainly be beneficial. Programming MapReduce with Scalding is a practical guide to setting up a development environment and implementing simple and complex MapReduce transformations in Scalding, using a test-driven development methodology and other best practices. This book first introduces how the Cascading framework lets you reason about MapReduce applications at a higher level of abstraction, and then dives into how Scalding, a Scala DSL, enables us to develop elegant and testable applications. It then teaches you how to test Scalding jobs and how to write specifications and apply behavior-driven development (BDD) with Scalding. This book also demonstrates how to monitor and maintain cluster stability and efficiently access SQL, NoSQL, and search platforms. Programming MapReduce with Scalding provides hands-on information, starting from proof-of-concept applications and progressing to production-ready implementations.
Scalding is a small but very powerful and expressive Scala DSL built on top of Cascading, itself a Java API that exposes relational algebra constructs that expand to Map and Reduce operators in the backend. Scalding was developed at Twitter and open sourced - it has reasonably good documentation on GitHub and support is available on their Google Groups mailing list. There is also Paco Nathan's Cascading book where Scalding gets a chapter. However, this is the first book devoted completely to Scalding, and it does a great job making the Scalding API accessible to a broader audience.
The target audience for this book is someone who is somewhat familiar with Scala and Hadoop, though not necessarily an expert at either. The author describes the behavior of various Scalding operations using before and after diagrams on small datasets, which I thought was very helpful in understanding the API. The book covers the original fields-based API, the typed API, and finally the Matrix API, all through case examples that increase in complexity as more advanced features are explained. There is also some coverage of making Scalding work with various NoSQL databases using custom Taps rather than just files.
I am not an expert at Scalding, but I have used it in the past so I was quite familiar with some features of the DSL. But having read this book, I have a much better idea of Scalding's capabilities and how I can use them.
DISCLAIMER - I did not buy this book. I requested a copy from a Packt representative because I was interested in learning more about Scalding and thought this book might help (I was right), and I thought my perspective as someone somewhat familiar with Scalding would be useful for other readers.
From the title, we can see that this book is about Big Data technology. MapReduce is a new programming model for large-scale data processing. If you know Hadoop and its parts, you'll understand MapReduce. But this book doesn't talk about Hadoop much; it mostly talks about large-scale data processing with Scala. Scala is a relatively new programming language that looks a bit like Java.
The first two chapters of this book give you an introduction to MapReduce, a little bit of Hadoop and how to install it, and the Scala language (since Scalding is actually an API for Scala). But if you're new to MapReduce, I suggest you read Hadoop Beginner's Guide from PacktPub first. That book will tell you a lot about MapReduce, Hadoop, and the concepts behind them. And if you've read it, you can skip the first chapter of this Scalding book.
The second chapter is the most important for newbies, especially for anyone who doesn't know Scala yet. After that, you'll find the book gets more interesting. The 8th chapter was the most interesting for me, because it showed me how to bring external data (such as from SQL/NoSQL databases) into a Scalding program. The 9th chapter is also interesting: it combines some popular data mining techniques with Scalding for processing large data sets. It should be useful for anyone doing data mining.
I really like this book's focus on Scalding. Programming MapReduce with Scalding offers clear, well-illustrated, smoothly paced how-to steps, as well as easy-to-digest definitions and descriptions. It takes you from setting up and running a Hadoop mini-cluster and local development environment to applying Scalding to real use cases. It also shows how to develop good testing and test-driven development practices, run Scalding in production, use external data stores, and apply matrix calculations and machine learning.
The book is written for developers who have at least "a basic understanding" of Hadoop and MapReduce. But it is also intended for experienced Hadoop developers who may be "enlightened by this alternative methodology of developing MapReduce applications with Scalding."
It does help to be somewhat familiar with MapReduce, Scalding, Scala, Hadoop, Maven, Eclipse and the Linux environment. Still, Antonios Chalkiopoulos does a good job of keeping the examples accessible even when his readers are new to some of the packages.
Pros:
* it's the first book about Scalding, and it's great that it exists
* it succeeds in presenting the overall idea & concepts behind Scalding
* formatting is significantly better than in previous PacktPub tech books
* it *tries* to give you all the necessary information to start developing with Scalding
* the content in the book is very clear - if something is written, it's easily understandable, but ...
Cons:
* ... sometimes the author was just too sparing with examples - the best example is chapter 3, where he starts describing the basic pipe operations, but he covers them all on 1 page with pretty much useless 1-liner examples; the examples make much more sense in chapter 4 though (when things get a bit more complicated)
* chapter 5 (design patterns) is far too concise - the explanation of why these 3 techniques are considered patterns is too weak and unconvincing
* when I was reading about Scalding, my first questions were about the benefits & costs; let's assume I can imagine the benefits after reading this book, but I have absolutely no clue about the costs: the "tax" on generated map-reduce jobs, the limitations, the scenarios where it just won't work because it's inefficient, etc.
In general - it's nice as an introduction to Scalding, and I can't say I didn't learn anything, but without further investigation it's not even possible to tell whether Scalding is applicable in the scenarios I'm considering.