Saturday, March 10, 2012

Book Review: Hadoop In Practice MEAP update

Manning adds more content to the latest Hadoop book. Real-world users will benefit.

Manning offers books in "MEAP", which is a way for readers to peek at books as they are developed. As the book is written, Manning periodically takes what they have "so far" and makes it available electronically. They've recently released a few more chapters of "Hadoop in Practice". Here's what they contain.

There are some minor changes to earlier content, but the biggest change is new content. Here are the new chapters:
Chapter 2, "Moving Data in and out of Hadoop"
Chapter 4, "Applying MapReduce Patterns to Big Data"
Chapter 7, "Utilizing Data Structures and Algorithms"

Chapter 2 deals with moving data into and out of Hadoop. It provides techniques for working with flat files, databases, and HBase. You're introduced to some tools that can help you with these tasks and with ancillary needs like translating and aggregating the data. You're given ideas on how to push data from external sources to HDFS, and how to pull data from external sources directly into the MapReduce framework. You're also given an introduction to a scheduler that can help you repeat these tasks on a periodic basis, which is sure to be a production concern. All the examples contain instructions on how to obtain the helper components, how to build them if necessary, and how to configure and run them.

Chapter 4 provides suggestions to help optimize Big Data operations in MapReduce such as joining, sorting, and partitioning.
"Joins" are familiar to most programmers: they combine data from two different sources based on some specified criteria. You are given ideas on how best to handle inner and outer joins, on whether to do your joining on the Map side or the Reduce side, and on when each approach is appropriate. You are also given some expert insights into how Secondary Sorting works, and how the MapReduce framework interacts with your Map and Reduce functions at this point in the life cycle.
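
To make the Reduce-side idea a little more concrete, here is a small sketch of my own in plain Python (not code from the book, and not the Hadoop API). It assumes two made-up datasets, `users` and `orders`, keyed by a user id. Each "map" function tags a record with its source, a stand-in `shuffle` groups the tagged values by key the way the framework would between Map and Reduce, and the "reduce" step pairs up records from both sides to form an inner join. All function and variable names here are hypothetical.

```python
from collections import defaultdict


def map_users(record):
    # Tag each user record with its source so the reducer can tell sides apart.
    user_id, name = record
    return user_id, ("user", name)


def map_orders(record):
    # Tag each order record the same way, keyed by the same join key.
    user_id, item = record
    return user_id, ("order", item)


def shuffle(pairs):
    # Stand-in for the framework's shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_inner_join(key, values):
    # Split the tagged values by source, then emit every matching pair.
    # A key seen on only one side yields nothing, which makes this an inner join.
    names = [v for tag, v in values if tag == "user"]
    items = [v for tag, v in values if tag == "order"]
    return [(key, name, item) for name in names for item in items]


def reduce_side_join(users, orders):
    pairs = [map_users(u) for u in users] + [map_orders(o) for o in orders]
    result = []
    for key, values in sorted(shuffle(pairs).items()):
        result.extend(reduce_inner_join(key, values))
    return result


users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (1, "pen"), (2, "lamp"), (4, "desk")]
joined = reduce_side_join(users, orders)
print(joined)
```

User 3 has no orders and order key 4 has no user, so neither appears in the result. A Map-side join would instead load the smaller dataset into memory in each mapper and skip the shuffle entirely, which is roughly why dataset size drives the choice between the two.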

Chapter 7 presents some valuable patterns and algorithms you can apply to your big data problems. Graph theory is used to conceptualize problems like 'Shortest Distance', where you try to calculate the fastest way to traverse a graph of nodes. Other problems deal with things like determining which nodes are best associated with one another, the famous PageRank algorithm, and the use of Bloom filters.
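
If you haven't met a Bloom filter before, the following is a minimal sketch in plain Python, written for this post rather than taken from the book. It assumes a fixed-size bit array and derives several hash positions from SHA-256 with a seed prefix; the class name, method names, and default sizes are all my own choices. The key property is that a Bloom filter can report false positives but never false negatives, which is what makes it useful for cheaply discarding records before an expensive step such as a join.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; sizes and hashing scheme are illustrative assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for user_id in ["u1", "u2", "u3"]:
    seen.add(user_id)

print(seen.might_contain("u2"))
```

Anything that was added is always reported as possibly present. An item that was never added is usually reported as absent, but not always, and the false-positive rate depends on the number of bits, the number of hashes, and how many items have been inserted.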

In my opinion, these latest chapters add real value to the book, and I think it's shaping up nicely. The concepts presented reveal expert insight into real-world problems a Hadoop user would encounter. If you are a Hadoop user, you owe it to yourself to check this one out.

Happy Hadooping!



1 comment:

Ilias Tsagklis said...

Hi Rick,

Nice blog! Is there an email address I can contact you in private?