Wednesday, February 22, 2012

Book Review for MEAP "Hadoop in Practice"


Big Data is a hot topic, and a fast-moving one as well. In that space, Hadoop is a big player. This early access edition shows the book's sweet spot: the areas other books have missed.

Hadoop, mature enough to have been recognized in mainstream media, is still fast-moving. Like any framework, it constrains its users to fairly rigid usage patterns, but users are finding ways around these constraints. This book introduces you to some of those workarounds, opening up uses of Hadoop that would otherwise be out of bounds for you.

The book is a MEAP edition, which means it contains less than the full content. In this case, it means the book contains just a few chapters, good for 176 pages. The table of contents promises about a dozen more chapters and a few more appendices, so there is potential this could be a big book when it's done. But time will tell how those final chapters take shape; the ones that are present are rich enough that a little consolidation wouldn't be surprising.

The first chapter introduces the basics of Hadoop, and includes some excellent diagrams. Pictures can often bring clarity that words don't, and I really like to see plain and simple pictures to help me grasp the big picture. This book does well in this regard. Besides the overview, we get a quick glimpse of related and complementary technologies, the restrictions of using Hadoop, and alternatives to it. Versions of Hadoop are covered in two dimensions: the various distributions, and what's contained in forward-looking iterations. The chapter wraps up with a brief section on installing and configuring Hadoop for a first run.

The next chapter we're given is chapter 3, which covers data serialization tools and techniques. More good pictures are found in this chapter, as are explanations of how you can use Hadoop to process XML, JSON, Google's Protocol Buffers, and Facebook's Thrift. Each of these gets its own section, explaining how you might use it. There are plenty of references to Elephant Bird, an open source project maintained by Twitter. You also learn how to handle custom file formats if you need them.

The final chapter of this early access book is on HDFS tuning techniques. The author tells you why Hadoop is not well suited to processing loads of small files, and how you can get around this limitation. The chapter also covers choosing the best compression and codec for your particular needs. When working with large amounts of data, choosing the right tools for compression can make huge differences in performance, so the contents of this chapter should be of high interest to those who are heading for production environments.

So, what's the verdict? I found the book's contents to be of high value and reflective of real-world knowledge that Hadoop users will require. I don't think the book is suitable as a sole resource for new users of Hadoop. (If that's your case, I'd suggest buying two books: one to learn the basics, then this one for when you've gone past the newbie phase.) The book is fairly raw: the chapters seem a little thrown together in places, and the content is short of what the table of contents promises. But you'll get updates with your MEAP purchase, and you can have some valuable content now. All things considered, I'd recommend this book for Hadoop users who are beyond the initial learning phases.

The book can be found here.

Happy Hadooping!
