
Hadoop Operations

If you've been asked to maintain large and complex Hadoop clusters, this book is a must. Demand for operations-specific material has skyrocketed now that Hadoop is becoming the de facto standard for truly large-scale data processing in the data center. Eric Sammer, Principal Solution Architect at Cloudera, shows you the particulars of running Hadoop in production, from planning, installing, and configuring the system to providing ongoing maintenance.

Rather than run through all possible scenarios, this pragmatic operations guide calls out what works, as demonstrated in critical deployments:

- Get a high-level overview of HDFS and MapReduce: why they exist and how they work
- Plan a Hadoop deployment, from hardware and OS selection to network requirements
- Learn setup and configuration details with a list of critical properties
- Manage resources by sharing a cluster across multiple groups
- Get a runbook of the most common cluster maintenance tasks
- Monitor Hadoop clusters, and learn troubleshooting with the help of real-world war stories
- Use basic tools and techniques to handle backup and catastrophic failure

ebook

First published January 1, 2012


About the author

Eric Sammer

4 books

Ratings & Reviews


Community Reviews

5 stars
20 (22%)
4 stars
46 (52%)
3 stars
19 (21%)
2 stars
2 (2%)
1 star
1 (1%)
Displaying 1 - 6 of 6 reviews
Todd N.
357 reviews · 256 followers
November 24, 2012
[[[Obligatory disclosures: I read several early drafts of this book because I work at Cloudera, which is also the employer of Mr. Sammer. Also, I'm pretty sure that he thinks I'm a complete idiot.]]]

I imagine that dealing with the Pythagoreans was a lot like dealing with Hadoop experts nowadays. There are many mystical and esoteric secrets that are not known unless you get invited to the secret rites.

We unwashed heathens may somehow stumble upon the fact that 10 is the perfect number according to Pythagoras, but we will never, ever comprehend its mystical meaning and the proper way to arrange a tetrad and how to properly deploy it at a customer site without overloading the network.

But now that this book has been published, maybe a better analogy would be that Bruce Lee movie where he provokes the ire of the old masters by teaching the "forbidden style" and then has to fight them to prove his honor.

Either way, this is the single best book to buy if you are planning on setting up a Hadoop cluster or if you have just inherited one. The main focus is on keeping a cluster running and integrating it with existing systems (like Kerberos, your network fabric, etc.).

And because making the right decisions up front will save a lot of teeth gnashing and garment rending down the road, there are great overviews of important topics like selecting hardware, filesystem formatting, sizing, and configuration variables. It stands up to multiple rereadings.

As an extra bonus, this book has the best description of the fair scheduler that I have read. Even I, a person clearly unfit to touch the hem of a cluster, came close to understanding it. Because most clusters will eventually be shared resources, it's important to know how the resource sharing works if you need certain jobs to get done within a certain time.

If you are going to be developing for a cluster, buy the third edition of Tom White's Hadoop: The Definitive Guide. If you are going to be keeping a cluster running, buy this book. If you are one of them dev/ops people, buy both. Sacrifice a goat and read its entrails while you are at it. It can't hurt.

If you know Mr. Sammer, you will note a surprising dearth of f-words in this book. I'm hoping they will be put back in for the audio version.
992 reviews · 24 followers
Read
September 23, 2015
Location: GG6 IRC
Accession No: DL027540
Location: GG7 IRC
Accession No: DL027541
Location: PTI IRC
Accession No: DL027542
Ritesh Chhajer
24 reviews · 1 follower
January 7, 2019
Traditional file systems like ext3 are implemented as kernel modules. HDFS instead is a user-space file system, meaning the file system code runs outside the kernel. Another difference is block size: general-purpose file systems use a 4 KB or 8 KB block size, but HDFS raises this to 64 MB or 128 MB. In HDFS there is no concept of a current working directory; it's more like a remote file system than a local OS file system, and copying files (put/get) is more like SCP or FTP.
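A bit of hypothetical arithmetic (not HDFS code) shows why the large block size matters: the file size and block sizes below are made up, but they illustrate how much less metadata the namenode must track per file.

```python
# Hypothetical arithmetic (not HDFS code): how many blocks a 10 GiB file
# occupies at a 4 KB local-filesystem block size versus a 128 MB HDFS
# block size. Fewer blocks means less metadata for the namenode to track.
def block_count(file_size: int, block_size: int) -> int:
    """Number of blocks needed to store file_size bytes (ceiling division)."""
    return -(-file_size // block_size)

GIB = 1024 ** 3
file_size = 10 * GIB
print(block_count(file_size, 4 * 1024))         # 2,621,440 blocks at 4 KB
print(block_count(file_size, 128 * 1024 ** 2))  # 80 blocks at 128 MB
```

Three orders of magnitude fewer blocks per file is what makes it feasible for a single namenode to hold all metadata in memory.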

Namenode stores its file system metadata in fsimage (note: block locations are not kept in fsimage) and edits (a change log). fsimage contains a complete snapshot of file system metadata, while edits contains only incremental modifications made to the metadata. Over time the edits file grows and might take a long time to replay in the event of a server failure, so it is periodically checkpointed (every hour, or when the edits file reaches 64 MB) with the changes applied to the fsimage file.
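The snapshot-plus-change-log design can be sketched in a few lines. This toy class is an illustration only, not NameNode code, and it checkpoints on an edit count for simplicity; the real triggers are elapsed time or edit-log size.

```python
# Toy illustration of the fsimage/edits design (a sketch, not NameNode
# code): a complete snapshot plus an append-only change log, checkpointed
# once the log passes a threshold.
class ToyNameNodeMeta:
    def __init__(self, checkpoint_threshold=64):
        self.fsimage = {}   # complete snapshot: path -> metadata
        self.edits = []     # incremental modifications since last checkpoint
        self.checkpoint_threshold = checkpoint_threshold

    def apply(self, path, meta):
        # Every change goes to the edit log first.
        self.edits.append((path, meta))
        if len(self.edits) >= self.checkpoint_threshold:
            self.checkpoint()

    def checkpoint(self):
        # Replay the edit log into the snapshot, then truncate the log;
        # the same idea as the periodic hourly / 64 MB checkpoint.
        for path, meta in self.edits:
            self.fsimage[path] = meta
        self.edits = []

    def recovered_state(self):
        # What a restart sees: the snapshot plus any un-checkpointed edits.
        state = dict(self.fsimage)
        state.update(dict(self.edits))
        return state
```

Recovery time is bounded because only the (small) un-checkpointed tail of edits must be replayed over the snapshot.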

MapReduce is relatively simple for developers in the sense that there is no need to worry about threading, socket programming, etc.; you simply operate on one record at a time. Map functions operate on these records and produce intermediate key-value pairs. The reduce function then operates on the intermediate key-value pairs, groups the keys together, and produces aggregated results. A MapReduce job is made up of four distinct stages: 1) client job submission, 2) map task execution, 3) shuffle & sort, 4) reduce task execution.
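The stages can be traced with a hypothetical in-memory word count; in a real job the map and reduce tasks run on separate nodes and the framework performs the shuffle, but the data flow is the same.

```python
from collections import defaultdict

# Hypothetical in-memory word count tracing a job's stages; real map and
# reduce tasks run on separate nodes with the framework doing the shuffle.
def map_fn(record):
    # Map: one record in, intermediate (key, value) pairs out.
    return [(word, 1) for word in record.split()]

def shuffle_sort(pairs):
    # Shuffle & sort: group intermediate values by key, ordered by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(key, values):
    # Reduce: aggregate each key's grouped values.
    return key, sum(values)

records = ["the quick brown fox", "the lazy dog"]       # stage 1: job input
intermediate = [p for r in records for p in map_fn(r)]  # stage 2: map
result = dict(reduce_fn(k, vs)
              for k, vs in shuffle_sort(intermediate))  # stages 3 and 4
print(result["the"])  # 2
```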

Limitations of vertical scaling through powerful machines: high cost with limited scalability. A common sizing rule is 1 GB of namenode heap for every 1 million blocks.
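That heap figure is just back-of-the-envelope arithmetic; the sketch below encodes the heuristic as stated, though real namenode sizing also depends on file and directory counts.

```python
# Back-of-the-envelope sketch of the heap heuristic quoted above; actual
# namenode sizing also depends on file and directory counts.
def namenode_heap_gb(num_blocks: int) -> float:
    return num_blocks / 1_000_000   # ~1 GB of heap per 1 million blocks

print(namenode_heap_gb(50_000_000))  # a cluster with 50M blocks -> ~50 GB heap
```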

MapReduce was the original framework for writing Hadoop applications; Hive and Pig are popular tools that use MapReduce to interact with Hadoop. Now Spark is the newer programming framework for writing Hadoop applications. It's faster than MapReduce and supports processing of streaming data.

YARN (MapReduce v2): the resource manager handles application masters' requests for resources, runs a scheduler for allocating resources, and tracks heartbeats from node managers. Node managers (worker nodes) communicate with the resource manager by sending heartbeats providing node status, and launch application masters on request from the resource manager.
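A toy model of that heartbeat tracking (not YARN code; real heartbeats also carry container status and scheduler responses) shows the basic liveness mechanism:

```python
# Toy model (not YARN code) of heartbeat-based liveness: the resource
# manager records each node manager's last heartbeat and treats a node
# as lost once the gap exceeds a timeout.
class ToyResourceManager:
    def __init__(self, timeout=30):
        self.timeout = timeout
        self.last_heartbeat = {}   # node manager id -> last heartbeat time

    def heartbeat(self, node_id, now):
        # Node managers report in periodically with their status.
        self.last_heartbeat[node_id] = now

    def live_nodes(self, now):
        return [n for n, t in self.last_heartbeat.items()
                if now - t <= self.timeout]

rm = ToyResourceManager(timeout=30)
rm.heartbeat("nm1", now=0)
rm.heartbeat("nm2", now=20)
print(rm.live_nodes(now=40))  # ['nm2']; nm1 missed its window
```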

Map tasks are almost always uniform in execution. If a single map task takes one minute to execute and consumes some amount of user and system CPU time, some amount of RAM, and some amount of I/O, 100 map tasks will simply take 100 times the resources. Reduce tasks, on the other hand, don't have this property.

Historically, virtualization has hurt I/O performance sensitive applications because guest OSes are unaware of one another as they perform I/O scheduling operations and as a result can cause excessive drive seek operations. For all the reasons you would not run a high performance relational database in a VM, you should not run Hadoop in a VM.

vm.swappiness: This kernel parameter (0-100) controls the kernel's tendency to swap application data from memory to disk. Set it to 0.
vm.overcommit_memory: Set it to 1. Valid values: 0 (heuristic overcommit: allow the allocation if the kernel estimates enough memory is available, otherwise return an error), 1 (always return success to an application's request for memory), 2 (never permit allocation in excess of swap plus the fraction of RAM defined by vm.overcommit_ratio).
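Both settings would typically be made persistent in /etc/sysctl.conf and applied with `sysctl -p`; an illustrative fragment (key names per the Linux kernel):

```
# /etc/sysctl.conf fragment (illustrative)
vm.swappiness = 0          # don't swap Hadoop daemon heaps to disk
vm.overcommit_memory = 1   # always allow allocations; helps fork-heavy task launch
```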
Maxim
33 reviews · 1 follower
June 14, 2020
The book screams for a new, updated edition (I've read the 2012 one): it's very outdated and not so practical. It mostly covers Hadoop 1.x. The author mentions YARN, but focuses on the deprecated JobTracker/TaskTracker model.

Nevertheless, there is a lot of valuable information, and it should be read by professionals for the sake of completeness.

3.5 / 5
James Estes
6 reviews · 2 followers
March 1, 2013
This book is fantastic. I absolutely recommend it to anyone doing anything with Hadoop, especially if you're setting up and maintaining a cluster, but even if you're just writing MapReduce jobs. When I first flipped through it I thought it would just be a regurgitation of what is online, and tables of configs and their definitions. This is not the case. I've been using Hadoop and HBase for 2 years now, and I learned a lot here. From hardware and operating system tuning all the way to monitoring, Sammer explains the ins and outs of a Hadoop cluster without putting you to sleep.
Michael Economy
197 reviews · 287 followers
July 31, 2013
I kinda gave up because our install got delayed, but I think this book is a great resource for getting Hadoop up and running in a serious prod environment.
