If you're an R developer looking to harness the power of big data analytics with Hadoop, this book tells you everything you need to integrate the two. You'll end up capable of building a data analytics engine with huge potential.

Big data analytics is the process of examining large amounts of data of a variety of types to uncover hidden patterns, unknown correlations, and other useful information. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue. New methods of working with big data, such as Hadoop and MapReduce, offer alternatives to traditional data warehousing.

Big Data Analytics with R and Hadoop focuses on techniques for integrating R and Hadoop using tools such as RHIPE and RHadoop, with which a powerful data analytics engine can be built to run analytics algorithms over large-scale datasets in a scalable manner. This is implemented through the data analytics operations of R, MapReduce, and Hadoop's HDFS. You will start with the installation and configuration of R and Hadoop, then work through various practical data analytics examples with R and Hadoop, and finally learn how to import data into R from various sources and export it back. The book also gives an easy understanding of the R and Hadoop connectors RHIPE, RHadoop, and Hadoop streaming.

Big Data Analytics with R and Hadoop is a tutorial-style book that focuses on all the powerful big data tasks that can be achieved by integrating R and Hadoop. It is ideal for R developers who are looking for a way to perform big data analytics with Hadoop, and it is also aimed at those who know Hadoop and want to build intelligent applications over big data with R packages. Basic knowledge of R is helpful.
A while back I received an e-copy of "Big Data Analytics with R and Hadoop" by Vignesh Prajapati (Packt Publishing, 2013) and was asked to review it. Here are my thoughts:
The first three chapters (Getting Ready to Use R and Hadoop, Writing Hadoop MapReduce Programs, and Integrating R and Hadoop) are really hard to read, almost unnatural, with lots of instructions on how to install packages and how to issue commands. They feel like little more than a list of help pages or FAQs. They also contain a lot of errors, some of which I have tried to collate at the end of this review.
The next chapter, Using Hadoop Streaming with R, sounded really interesting, as I already use Hadoop Streaming and was keen to see how it integrates with R. The chapter fell short of my expectations: it is yet another long list of available configuration options. Apart from a simple map/reduce script, which is meant to aggregate page views on a website by the location of the visitor, the rest of the chapter is an assorted collection of runtime and configuration options, with the look and feel of a long how-to page. Really disappointing...
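For readers who have not seen the pattern, here is roughly what such a streaming job looks like. This is my own minimal sketch rather than the book's code, and it assumes a tab-separated access log with the visitor's location in the third column (a hypothetical layout):

#!/usr/bin/env Rscript
# mapper.R - emit "location<TAB>1" for every page view in the log
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  if (length(fields) >= 3) cat(fields[3], "\t1\n", sep = "")
}
close(con)

#!/usr/bin/env Rscript
# reducer.R - sum the counts per location (Hadoop delivers the keys grouped and sorted)
con <- file("stdin", open = "r")
counts <- numeric(0)
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  kv  <- strsplit(line, "\t", fixed = TRUE)[[1]]
  cur <- counts[kv[1]]
  counts[kv[1]] <- ifelse(is.na(cur), 0, cur) + as.numeric(kv[2])
}
for (loc in names(counts)) cat(loc, "\t", counts[loc], "\n", sep = "")
close(con)

Both scripts read stdin and write stdout, so they can be handed to the streaming jar with something like hadoop jar hadoop-streaming.jar -mapper mapper.R -reducer reducer.R plus input and output paths.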
Chapter 5 presents three examples of performing analytics with R and Hadoop: categorizing web pages, computing the frequency of stock market changes, and predicting sale prices for the Blue Book for Bulldozers dataset (a Kaggle competition). In the first, two excerpts of map/reduce R scripts are given without any mention of their purpose or what they are trying to achieve. The second is presented only superficially. The third attempts to show how to build a predictive model with random forests, but it is full of errors and would need rewriting to make the material more accessible to the reader.
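For context, fitting a random-forest regression in plain R (outside Hadoop) takes only a few lines with the randomForest package. The sketch below is my own illustration using a hypothetical bulldozers data frame with a numeric SalePrice column, not the book's code:

library(randomForest)   # install.packages("randomForest") if it is missing

set.seed(42)            # make the forest reproducible
fit <- randomForest(SalePrice ~ ., data = bulldozers,  # bulldozers is a placeholder data frame
                    ntree = 200, importance = TRUE)
print(fit)              # out-of-bag error estimate
varImpPlot(fit)         # which predictors drive the predicted price
pred <- predict(fit, newdata = bulldozers)  # in practice, score a held-out set

Getting from here to a MapReduce version is exactly where the chapter needed a careful, error-free walkthrough.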
Chapter 6 is about using various machine learning techniques. The categorization of ML algorithms into three groups, namely supervised, unsupervised, and recommender systems, is pretty strange, because recommender systems are not a category in themselves; they are just an application of machine learning. The presentation of two supervised algorithms (linear and logistic regression) using the map/reduce paradigm is interesting and illustrates its power. The chapter carries on with k-means clustering as an example of an unsupervised technique, and concludes with collaborative filtering as an example of an ML algorithm used in recommender systems. Overall, this is the most interesting chapter in the book.
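To show why the map/reduce formulation works for linear regression: ordinary least squares only needs the sufficient statistics X'X and X'y, and those can be computed per data chunk and then summed. Here is a rough sketch of the idea in plain R, my own illustration rather than the book's RHadoop code, with chunks standing in for a hypothetical list of data frames holding columns y, x1, and x2:

# "Map": each chunk contributes its partial X'X and X'y.
map_chunk <- function(df) {
  X <- cbind(1, as.matrix(df[, c("x1", "x2")]))   # prepend an intercept column
  list(XtX = crossprod(X), Xty = crossprod(X, df$y))
}

# "Reduce": partial statistics are simply added together.
reduce_stats <- function(a, b) list(XtX = a$XtX + b$XtX, Xty = a$Xty + b$Xty)

parts <- lapply(chunks, map_chunk)      # map phase
total <- Reduce(reduce_stats, parts)    # reduce phase
beta  <- solve(total$XtX, total$Xty)    # same coefficients lm(y ~ x1 + x2) would give

Logistic regression follows the same shape, except the summed statistics feed an iterative update, so several MapReduce passes are needed.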
The last chapter presents various ways to import and export data from R, using Excel, MySQL, MongoDB, SQLite, Postgres, Hive, and HBase. Hive and HBase belong to the Hadoop ecosystem and are relevant. The description of Excel, though, with wording like "..Excel is a spreadsheet application developed by Microsoft to be run on Windows and Mac OS, which has a similar function to R for performing statistical computation, graphical visualization, and data modeling.", is really weird, to say the least...
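For reference, pulling a table from one of these databases into an R data frame is typically just a connect/query/disconnect pattern. A generic sketch with the DBI and RMySQL packages follows; the credentials and table are placeholders, not the book's example:

library(DBI)
library(RMySQL)

# Placeholder connection details: replace with your own server settings.
con <- dbConnect(MySQL(), host = "localhost", user = "analyst",
                 password = "secret", dbname = "sales")
orders <- dbGetQuery(con, "SELECT customer_id, amount FROM orders")  # returns a data frame
dbDisconnect(con)

The SQLite and Postgres examples follow the same DBI pattern with their own driver packages.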
All in all, the book is not well written and is particularly hard to read, while the presentation of the material is inadequate. Apart from Chapter 6, the rest of the book feels like a collection of badly written reference material. Unless someone is actually looking for such a book, I wouldn't recommend it.
PS: The following is a list of some of the errors I spotted in the book:
page 5, third paragraph: “Big data usually includes datasets with sizes”, should read “Big data usually includes datasets with big (or huge, etc.) sizes”.
page 8, second paragraph: “The options would be to run analysis on limited chunks also known as sampling or to correspond the analytical power of R with the storage and processing power of Hadoop and you arrive at an ideal solution” should probably read: “The options would be to run analysis on limited chunks also known as sampling or to complement the analytical power of R with the storage and processing power of Hadoop and you arrive at an ideal solution”.
page 13 (Chapter 1): This needs re-phrasing: “R will not load all data (Big Data) into machine memory. So, Hadoop can be chosen to load the data as Big Data.”
page 18: the section “Performing data modeling in R” is actually talking about “data mining in R”. Also, the definition given of data modeling broadly describes predictive data mining instead, such as classification or regression.
page 18: “This techniques highly focus on past user actions..”, should read: “These techniques highly focus on past user actions..”.
page 38: “Both the Map and Reduce functions maintain MapReduce workflow.”, should read: “Both Map and Reduce functions maintain the MapReduce workflow”.
page 39: the last paragraph at the bottom is mostly copied over from page 29, first paragraph at the top.
page 40: The first paragraph at the top doesn’t really make sense. It starts with the data input elements that cannot be updated (?), carries on to mention that if the key-value pairs are changed, then this change cannot be reflected in the input files (??), and ends with a mention of the parallel nature of MapReduce operations.
page 40, last paragraph, last sentence therein: “All the data splits will be processed by TaskTracker for the Map and Reduce tasks in a parallel manner.”, should read: “All the data splits will be processed by the TaskTracker for the Map and Reduce tasks in a parallel manner.”
page 41, at the top, in the Sqoop section: “Suppose your application has already been configured with the MySQL database and you want to use the same data for performing data analytics, Sqoop is recommended for importing datasets to HDFS.” There shouldn’t be a comma there.
page 42, in the Shuffling and sorting section, the first two sentences in the second paragraph don’t make sense:
“The Combiner is often the Reducer itself. So by compression, it's not Gzip or some similar compression but the Reducer on the node that the map is outputting the data on.”
page 42, same section: “The data returned by the Combiner is then shuffled and sent to the reduced nodes.”, should read: “The data returned by the Combiner is then shuffled and sent to the Reducer nodes.”
page 42, in the ‘Reducing phase execution’ section: “As soon as the Mapper output is available, TaskTracker in the Reducer node will….”, should read: “As soon as the Mapper output is available, the TaskTracker in the Reducer node will….”
page 47: right after the figure with the MapReduce data-flow, the whole paragraph that describes the various Hadoop versions (New/Old) is irrelevant for that section.
page 50, second bullet point: “The factory RecordWriter used by OutputFormatto to write the output data in the appropriate format” should read: “The factory RecordWriter is used by OutputFormatto write the output data in the appropriate format”.
page 138, at the top: "..datasets that comprises approximately 4,00,000 training data points", should read: "...datasets that comprise approximately 400,000 training data points" (fixing both the verb agreement and the digit grouping).
Book Review: Big Data Analytics with R and Hadoop
Disclaimer: I was provided with a free review copy of this book by the publisher.
R and Hadoop are the two big things in data science at the moment, and a book showing clearly how the two integrate should be an absolute must-read, right? Well, maybe so, but I'm afraid this book is not it.
The author says the book is both for R developers who want to do big data analytics and for Hadoop users who want to learn to integrate R. These are two very different audiences, and I'm afraid both will come away disappointed. For a start, the book is very light on actual R code; you need to get a third of the way through before you actually see any, and then much of it is just repetitive "Here is some code for installing package x from CRAN". The text is full of jargon and grammatical errors. Many of the key concepts (mapper and reducer functions, HDFS) are poorly explained, while the author goes out of his way to tell you that Google is "a web search engine for finding relevant pages relating to a particular topic" or that Twitter is a "social networking site for finding messages".
In contrast to the very lightweight R content of the book, other sections are taken up with long stretches of uncommented Java code, XML and shell configuration commands.
The description of setting up a Hadoop cluster seems fairly comprehensive, though it is offered as a simple "do this then do that" with very little rationale or insight. Then three methods of integrating R and Hadoop are introduced, but there is no discussion of which is better for which purpose, and users are merrily told to install external package after external package with no explanation of why they are needed.
The chapter on machine learning is almost unbelievably cursory. It gives a very high-level introduction to the topic but doesn't define important terms, and drops in code implementing advanced topics like linear solvers with no explanation of what they even are. This chapter also highlights possibly the greatest flaw in the book: that R may not actually be a very good choice for doing this kind of work. The great advantage of R is its massive range of libraries for doing just about any machine learning or statistics task. According to the book, to do machine learning tasks like linear or logistic regression (the idea that these are ML tasks in the first place is questionable) in R over Hadoop, you can't use these libraries; instead you need to roll your own basic and inefficient regression algorithms within the mappers and reducers.
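To make the contrast concrete, this is all the code those two models need when the data fits in memory and you can lean on base R (my own illustration, with train as a hypothetical data frame); it is exactly this convenience that disappears once everything has to be re-expressed as hand-written mappers and reducers:

lin  <- lm(price ~ x1 + x2, data = train)                        # linear regression
logi <- glm(clicked ~ x1 + x2, data = train, family = binomial)  # logistic regression
summary(lin)   # coefficients, standard errors and R-squared for free
summary(logi)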
The next section was a run-through of installing various database packages and their R bindings. I found this the most useful part, but I couldn't see much of a connection between these and big data analytics. For each technology, there was a jargony list of features (looking as though they were lifted verbatim from online help files) with no suggestion as to why each feature might be useful or even relevant.
This is the first book on R and Hadoop and as such should be applauded. Both are new technologies (Hadoop certainly; R in terms of big data analytics), and the book does give some insight into how to use them together; there is some good information there if you are prepared to work for it. However, the problems I have outlined above and the lack of worked examples mean that I can't honestly recommend this book over simply going online, googling for tutorials, and searching through help files.
I am a practitioner of big data analytics and value the contribution made by Vignesh Prajapati's Big Data Analytics with R and Hadoop. Hadoop has become an essential instrument for analyzing big data, and R already offers several packages for conducting data analysis. Combining Hadoop and R therefore gives us the advantage of conducting statistical analysis with the computing power of Hadoop, which is great. The greatest feature of Prajapati's book is that it presents the value of putting Hadoop and R together in an easy, well-written manner. The book is structured in a very friendly way. First, it presents the most basic concepts of Hadoop and R, which enables new practitioners to get introduced to the programming environment. Second, it includes the code and clear examples of big data analytics. For me, as a practitioner, that is the part I value most, because it provides solid and clear examples of conducting big data analytics. I think there are very few books like this, and I cannot recommend enough that prospective readers get acquainted with it.
Gives a nice overview of what can be done when combining R and Hadoop, with a lot of example code. At times the text is not polished, which makes it a little clumsy to read. I think there's serious potential for improvement.