Wednesday, June 23, 2010

More Top 3 Take-Aways from Enzee Universe

Top 3 take-aways from Netezza Conference, day 2

Once again, here are the top 3 take-aways I got from sessions at "Enzee Universe" this year. If you're not sure what a "Netezza" is, please see yesterday's notes.

"Expanded Capacity, Improved Performance & More: Netezza Performance Software Release 6.0 Under the Hood"
1) The Netezza has a part called the "Field Programmable Gate Array" (FPGA). This is a chip that you can program, so you can write functions to augment SQL. Data is automatically compressed as it travels between the disks and the FPGAs: at a minimum 2x, but it can be up to 32x compression!
2) For users of ordered data (like dates), you can use "Zone Maps". This means you give Netezza a clue about your data and it will order it and keep metadata about the high and low values contained in each block on the disk. Then when a read operation comes along, Netezza checks the metadata to see whether the requested value falls between a block's high and low values. If it doesn't, Netezza just skips that block.
3) In 6.0, Netezza now allows "Alter Table" without actually re-creating the whole table behind the scenes.
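
To make the Zone Maps idea from point 2 a little more concrete, here is a minimal Python sketch of block skipping. This is only my illustration of the concept, not Netezza's actual implementation: the Block class, the block size, and the integer-encoded dates are all assumptions I made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    values: List[int]  # rows stored in this (hypothetical) disk block
    low: int           # zone-map metadata: smallest value in the block
    high: int          # zone-map metadata: largest value in the block


def build_blocks(sorted_values, block_size):
    """Split ordered data into blocks and record each block's low/high."""
    blocks = []
    for i in range(0, len(sorted_values), block_size):
        chunk = sorted_values[i:i + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan(blocks, target):
    """Return (matching rows, number of blocks actually read)."""
    hits, blocks_read = [], 0
    for block in blocks:
        # Consult the metadata first; skip blocks that cannot hold the target.
        if target < block.low or target > block.high:
            continue
        blocks_read += 1
        hits.extend(v for v in block.values if v == target)
    return hits, blocks_read


# Dates encoded as integers: 20100601 through 20100630
dates = sorted(20100601 + d for d in range(30))
blocks = build_blocks(dates, block_size=10)
hits, blocks_read = scan(blocks, 20100623)
print(hits, blocks_read, len(blocks))
```

In this toy run only one of the three blocks is read. The trick only pays off when the data is ordered, which is why the notes above mention dates.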

"Practical Applications and Challenges of Advanced Analytics" by Usama Fayyad
1) Think of segmentation as a way to reduce data. By grouping like items into segments, you can disregard (or discard) the mounds of source data that go into the segmentation algorithms.
2) Mr. Fayyad used to be the 'Director of Data' at Yahoo, where they added 25 TB to the corporate stack each day. He offered the opinion that dedicated grids for handling this data (he even singled out Yahoo's Hadoop grid, something I like the idea of) are very expensive. He suggested using rented grids (like Amazon's) wherever conditions allow it.
3) He had an interesting example of a retailer who once asked what kinds of queries users were making that did NOT result in hits on the retailer's web site. It turned out that people were viewing print ads and searching for accessories (belly-button rings, to be exact) that the models were wearing but that were not part of the clothing being advertised. Once given this knowledge, the retailer was able to make inventory stocking decisions that allowed sales of the belly-button rings, too.
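
The retailer example in point 3 boils down to counting the searches that came back empty. Here is a rough Python sketch of that analysis. The search_log data and the top_missed_queries helper are hypothetical; I have no idea what the retailer's real logs or tooling looked like.

```python
from collections import Counter

# Hypothetical search log: (query text, number of results returned)
search_log = [
    ("blue jeans", 42),
    ("belly button ring", 0),
    ("summer dress", 17),
    ("Belly Button Ring", 0),
    ("toe ring", 0),
    ("belly button ring ", 0),
]


def top_missed_queries(log, n=3):
    """Count the queries that returned zero results, most frequent first."""
    missed = Counter(
        query.strip().lower()          # normalize case and stray whitespace
        for query, result_count in log
        if result_count == 0
    )
    return missed.most_common(n)


top_missed = top_missed_queries(search_log)
print(top_missed)
```

The normalization step matters more than it looks: without it, the same missed product shows up as several different queries and never rises to the top of the list.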

"TwinFin Advanced Analytics Tutorial & Technical Underpinnings" by Brian Hess
(This was a really good session. There were many great points in this one, but here are the top 3)
1) Netezza now offers Hadoop Map/Reduce style programming right on the machine. With only minor modifications to your existing Hadoop classes, you can forklift the jobs right onto Netezza. Unfortunately, the larger Hadoop ecosystem (Zookeeper, for example) is not available, so you have to sequence your jobs yourself.
2) The R statistical language is now available right on the box, too. They tried to follow R's data constructs as closely as possible (for example, an R 'dataframe' is a Netezza 'table'), so it should be a snap for R users to work on the Netezza. For my uses, I imagined using R to handle data profiling duties.
3) On Netezza, the user can provide their own extensions to SQL (my terminology there) in the form of "User defined functions" that are used on the Field Programmable Gate Arrays (FPGA). Formerly you could only do this in C, but now you can use C, C++, Java, Python, or Fortran. We saw some example Java, and it looked very straightforward. I think this is a great feature.
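
For anyone who hasn't seen the Map/Reduce style mentioned in point 1, here is a small pure-Python sketch of two jobs sequenced by hand, which is roughly what "sequence your jobs yourself" means. None of this is Netezza's or Hadoop's API; run_job and the mapper and reducer functions are names I invented for illustration.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one map -> shuffle -> reduce pass over an in-memory list."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase emits (key, value)
            groups[key].append(value)       # shuffle: group values by key
    # Reduce phase: one call per key, in sorted key order
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: count how often each word appears
def word_mapper(line):
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(word, counts):
    return word, sum(counts)


# Job 2: group the words by their count
def count_mapper(pair):
    word, count = pair
    yield count, word


def list_reducer(count, words):
    return count, sorted(words)


lines = ["Netezza runs SQL", "Netezza runs R", "R runs"]

# Manual sequencing: the output of job 1 is fed to job 2 by hand.
word_counts = run_job(lines, word_mapper, sum_reducer)
words_by_count = run_job(word_counts, count_mapper, list_reducer)
print(word_counts)
print(words_by_count)
```

With no workflow tool around, the "sequencing" is just ordinary code that waits for one job to finish before starting the next.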

"Netezza Data Compliance"
1) Netezza now comes with 'Mantra', a data compliance tool meant to help with SOX, CISP, and other data regulatory needs.
2) Mantra will help you monitor who is accessing which data, and at which times they are accessing it. It will alert you to off-hours data access or other unusual access patterns.
3) Mantra isn't tied just to the Netezza-- it can help monitor other databases, flat files, mainframe resources, etc.
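
The off-hours alerting described in point 2 can be pictured with a few lines of Python. To be clear, this is not how Mantra works internally (I don't know that); the access_log records, the 8:00 to 18:00 business hours, and the is_off_hours rule are all assumptions for the sake of a sketch.

```python
from datetime import datetime, time

# Assumed business hours; a real policy would be configurable.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)

# Hypothetical audit records: (user, table, timestamp of access)
access_log = [
    ("alice", "customers", datetime(2010, 6, 22, 10, 15)),
    ("bob", "payroll", datetime(2010, 6, 23, 2, 40)),
    ("carol", "orders", datetime(2010, 6, 26, 11, 0)),  # a Saturday
]


def is_off_hours(ts):
    """True if the access fell on a weekend or outside business hours."""
    if ts.weekday() >= 5:  # Monday is 0, so 5 and 6 are Saturday and Sunday
        return True
    return not (BUSINESS_START <= ts.time() < BUSINESS_END)


alerts = [(user, table, ts) for user, table, ts in access_log if is_off_hours(ts)]
for user, table, ts in alerts:
    print(f"ALERT: {user} read {table} at {ts:%Y-%m-%d %H:%M}")
```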

"A View into the Future -- Netezza's Technology RoadMap" by Phil Francisco
1) Netezza is adding 2 new classes of machines. 'Cruiser' is a much bigger data storage device that can hold up to 10 Petabytes in one machine. 'Skimmer' is sort of an edge device, maybe meant to help take load off the primary Netezza machine similar to how mainframe controllers sometimes reduce load on the primary box. (That's my guess on Skimmer, though. I might be wrong on that one.)
2) Netezza's TwinFin has had improvements in concurrency, where it now handles 40-50 queries per second. In the future they'd like to get up to around 200 per second.
3) In the future, they'd like to support loads in the 10 TB/hour range.
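
To put the 10 TB/hour load target in perspective, here is a quick back-of-the-envelope conversion in Python, with Yahoo's 25 TB/day from the earlier session for comparison. I'm assuming decimal terabytes (10^12 bytes) and a perfectly steady rate, neither of which was stated in the sessions.

```python
TB = 10**12  # assumption: decimal terabytes


def bytes_per_second(terabytes, seconds):
    """Average sustained rate needed to move this much data in this time."""
    return terabytes * TB / seconds


load_rate = bytes_per_second(10, 60 * 60)        # 10 TB per hour
yahoo_rate = bytes_per_second(25, 24 * 60 * 60)  # 25 TB per day

print(f"10 TB/hour is about {load_rate / 1e9:.2f} GB/s")
print(f"25 TB/day is about {yahoo_rate / 1e9:.2f} GB/s")
```

Under those assumptions the load target works out to a sustained rate a little under 3 GB per second.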

That's it for day 2! If you're new to massive data warehousing, some of these notes may have been a little odd. But I hope you've found at least something of interest.

'Til next time,

Happy Coding!
