Wednesday, June 23, 2010

More Top 3 Take-Aways from Enzee Universe

Top 3 take-aways from Netezza Conference, day 2

Once again, here are the top 3 take-aways I got from sessions at "Enzee Universe" this year. If you're not sure what a "Netezza" is, please see yesterday's notes.

"Expanded Capacity, Improved Performance & More: Netezza Performance Software Release 6.0 Under the Hood"
1) The Netezza has a part called the "Field Programmable Gate Array" (FPGA). This is essentially a chip that you can program, so you can write functions to augment SQL. Data is automatically compressed as it travels between the disks and the FPGAs-- at a minimum 2x, but it can be up to 32x compression!
2) For users of ordered data (like dates), you can use "Zone Maps". This means you give Netezza a clue about your data and it will order it and make metadata about the high and low values contained in each block on the disk. Then when a read operation comes along, Netezza can just look to the metadata to see if that value is going to be between the high and low values-- if it isn't, Netezza just skips that block.
3) In 6.0, Netezza now allows "Alter Table" without actually re-creating the whole table behind the scenes.
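
The zone map idea in point 2 can be sketched in a few lines of Python. This is only a toy model of the concept, not Netezza's implementation: the `Block` class, the block size, and the integer-encoded dates are all assumptions made up for the illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A hypothetical disk block plus its zone-map metadata."""
    values: List[int]
    low: int   # lowest value stored in this block
    high: int  # highest value stored in this block


def build_blocks(ordered_values: List[int], block_size: int) -> List[Block]:
    """Cut ordered data into blocks and record each block's low/high values."""
    blocks = []
    for start in range(0, len(ordered_values), block_size):
        chunk = ordered_values[start:start + block_size]
        blocks.append(Block(values=chunk, low=min(chunk), high=max(chunk)))
    return blocks


def scan(blocks: List[Block], target: int) -> Tuple[List[int], int]:
    """Return (matching values, number of blocks actually read)."""
    matches = []
    blocks_read = 0
    for block in blocks:
        # Consult only the metadata first; skip blocks that cannot match.
        if target < block.low or target > block.high:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if v == target)
    return matches, blocks_read


# Thirty dates (June 1-30, 2010) encoded as integers, 10 per block.
dates = list(range(20100601, 20100631))
blocks = build_blocks(dates, block_size=10)
matches, blocks_read = scan(blocks, target=20100615)
print(matches, "found after reading", blocks_read, "of", len(blocks), "blocks")
```

Because the data is ordered, each block covers a narrow range, so a lookup for a single date touches one block and skips the rest.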

"Practical Applications and Challenges of Advanced Analytics" by Usama Fayyad
1) Think of segmentation as a way to reduce data. By grouping like items into segments, you can disregard (or discard) the mounds of source data that go into the segmentation algorithms.
2) Mr. Fayyad used to be the 'Director of Data' at Yahoo, where they added 25 TB to the corporate stack each day. He offered the opinion that dedicated grids for handling this data (he even singled out Yahoo's Hadoop grid, something I like the idea of) are very expensive. Wherever possible, he suggested using rented grids (like Amazon's) if conditions allow it.
3) He had an interesting example of a retailer who once asked what kinds of queries users were making that did NOT result in hits on the retailer's web site. It turned out that people were viewing print ads and selecting accessories (belly-button rings, to be exact) the models were wearing but were not part of the clothing being advertised. Once given this knowledge, the retailer was able to make inventory stocking decisions that allowed sales of the belly button rings, too.
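
To make point 1 concrete, here is a minimal sketch of segmentation as data reduction. The purchase records, the two segment names, and the 100-dollar threshold are all invented for the example; a real segmentation would more likely come from a clustering algorithm than from a fixed cutoff.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical source data: (customer, purchase amount in dollars).
purchases = [
    ("alice", 12.0),
    ("bob", 250.0),
    ("carol", 15.0),
    ("dave", 300.0),
    ("erin", 9.0),
]


def segment_of(amount):
    """Stand-in segmentation rule: a fixed threshold instead of a real algorithm."""
    return "high" if amount >= 100 else "low"


def summarize(records):
    """Collapse the detail rows into one small summary per segment."""
    groups = defaultdict(list)
    for _customer, amount in records:
        groups[segment_of(amount)].append(amount)
    return {
        name: {"count": len(amounts), "mean": mean(amounts)}
        for name, amounts in groups.items()
    }


summary = summarize(purchases)
print(summary)
```

Once the per-segment summaries exist, later analysis can work from two small records instead of the full pile of source rows.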

"TwinFin Advanced Analytics Tutorial & Technical Underpinnings" by Brian Hess
(This was a really good session. There were many great points in this one, but here are the top 3)
1) Netezza now offers Hadoop Map/Reduce style programming right on the machine. With only minor modifications to your existing Hadoop classes, you can forklift the jobs right onto Netezza. Unfortunately, the larger Hadoop ecosystem (Zookeeper, for example) is not available, so you have to sequence your jobs yourself.
2) The R statistical language is now available right on the box, too. They tried to follow R's data constructs as closely as possible (i.e. an R 'dataframe' is a Netezza 'table') so it should be a snap for R users to work on the Netezza. For my uses, I imagined using R to handle data profiling duties.
3) On Netezza, the user can provide their own extensions to SQL (my terminology there) in the form of "User defined functions" that are used on the Field Programmable Gate Arrays (FPGA). Formerly you could only do this in C, but now you can use C, C++, Java, Python, or Fortran. We saw some example Java, and it looked very straightforward. I think this is a great feature.
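
For readers who haven't seen Map/Reduce style code, here is a small plain-Python sketch of the pattern from point 1, including what "sequencing your jobs yourself" looks like. It does not use any Netezza or Hadoop API; `run_job` and the mapper and reducer functions are hypothetical names for the illustration.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one map/reduce pass: map each record, group by key, reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return [reducer(key, values) for key, values in sorted(grouped.items())]


# Job 1: count how often each word appears.
def word_mapper(line):
    for word in line.split():
        yield word.lower(), 1


def count_reducer(word, counts):
    return word, sum(counts)


# Job 2: group the words by how often they appeared.
def frequency_mapper(pair):
    word, count = pair
    yield count, word


def list_reducer(count, words):
    return count, sorted(words)


lines = ["fast scans", "fast loads", "fast queries"]

# With no scheduler available, the sequencing is simply explicit:
# the output of job 1 is handed to job 2 by hand.
word_counts = run_job(lines, word_mapper, count_reducer)
by_frequency = run_job(word_counts, frequency_mapper, list_reducer)
print(by_frequency)
```

The point of the second job is only to show the hand-off; on a real grid each job would run in parallel across many nodes.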

"Netezza Data Compliance"
1) Netezza now comes with 'Mantra', a data compliance tool meant to help with SOX, CISP, and other data regulatory needs.
2) Mantra will help you monitor who is accessing which data, and at which times they are accessing it. It will alert you to off-hours data access or other unusual access patterns.
3) Mantra isn't tied just to the Netezza-- it can help monitor other databases, flat files, mainframe resources, etc.

"A View into the Future -- Netezza's Technology RoadMap" by Phil Francisco
1) Netezza is adding 2 new classes of machines. 'Cruiser' is a much bigger data storage device that can hold up to 10 Petabytes in one machine. 'Skimmer' is sort of an edge device, maybe meant to help take load off the primary Netezza machine similar to how mainframe controllers sometimes reduce load on the primary box. (That's my guess on Skimmer, though. I might be wrong on that one.)
2) Netezza's TwinFin has had improvements in concurrency, where it now handles 40-50 queries per second. In the future they'd like to get up to around 200 per second.
3) In the future, they'd like to support loads in the 10 TB/hour range.

That's it for day 2! If you're new to massive data warehousing, some of these notes may have been a little odd. But I hope you've found at least something of interest.

'Till next time,

Happy Coding!

Top 3 take-aways from Enzee Universe

I've been at "Enzee Universe", the world-wide Netezza conference, and offer these top 3 take-aways from the sessions I saw on day 1.

What's a Netezza?
A Netezza is a database appliance. That means it's a refrigerator-sized box that has about 100 disk drives in it, with many CPUs and a high-speed network inside. It's optimized for data warehousing, so when you add data to a table the data automatically gets divided over the 100 disk drives. This dramatically speeds up database operations: each disk holds only about 1/100th of the table and all the disks are scanned at the same time, so a full scan can run up to roughly 100 times faster.
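
Here is a rough Python sketch of that divide-and-scan idea. It is a toy model only: the modulo on a made-up `id` column stands in for however the appliance really assigns rows to disks, and Python threads merely show the shape of the work rather than true hardware parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_DISKS = 100  # the "about 100 disk drives" from the description above


def distribute(rows, num_disks=NUM_DISKS):
    """Spread rows over the disks; modulo on 'id' is an assumed placement rule."""
    disks = [[] for _ in range(num_disks)]
    for row in rows:
        disks[row["id"] % num_disks].append(row)
    return disks


def scan_disk(disk_rows, predicate):
    """Scan a single disk's slice of the table."""
    return [row for row in disk_rows if predicate(row)]


def parallel_scan(disks, predicate):
    """Scan every slice at once and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(disks)) as pool:
        partials = list(pool.map(lambda d: scan_disk(d, predicate), disks))
    return [row for part in partials for row in part]


rows = [{"id": i, "amount": i * 10} for i in range(1000)]
disks = distribute(rows)
hits = parallel_scan(disks, lambda row: row["amount"] >= 9950)
print(len(disks[0]), "rows per disk;", len(hits), "rows matched")
```

Each worker only ever looks at 10 of the 1,000 rows, which is where the speed-up comes from.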

"Best Practices With Netezza" by David Birmingham (author of "Netezza Underground")
1) Make floats into ints. Ints compress on the Netezza and thus are more efficient.
2) When making performance changes, consider all 3 components of the ETL/Data Warehouse/BI stack. Don't optimize one at the expense of another.
3) Use CTAS (Create Table As Select) often. Intermediate steps (making 'work tables') are good practice.
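
If you haven't used CTAS before, the sketch below shows the pattern from point 3 (and the int-instead-of-float habit from point 1) using Python's built-in sqlite3 module as a stand-in database. The table and column names are invented for the example, and Netezza's own SQL dialect may differ in its details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Store money as integer cents rather than a float (point 1 above).
conn.execute("CREATE TABLE sales (region TEXT, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 1999), ("east", 501), ("west", 1250)],
)

# CTAS: materialize an intermediate 'work table' from a query (point 3 above).
conn.execute(
    """
    CREATE TABLE work_sales_by_region AS
    SELECT region, SUM(amount_cents) AS total_cents
    FROM sales
    GROUP BY region
    """
)

# Later steps read the small work table instead of re-scanning the source.
totals = dict(conn.execute("SELECT region, total_cents FROM work_sales_by_region"))
print(totals)
```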

"Netezza 101" by Ed Patterson
1) The Netezza has a 10GB internal network, and can currently hold up to 7 Petabytes of storage.
2) The SCSI drivers allow up to 110 MB/second of data transfer, but since data compression is transparent and automatic (right next to the disk), effective rates are really more like 440 MB/second per disk.
3) The Netezza can currently load data at a rate of about 1 TB an hour.
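
The arithmetic behind points 2 and 3 is easy to check. The sketch below uses only the figures quoted above; the helper function names and the 24 TB example are my own.

```python
RAW_MB_PER_SEC = 110        # raw per-disk transfer rate quoted in the session
EFFECTIVE_MB_PER_SEC = 440  # effective per-disk rate once compression is counted
LOAD_TB_PER_HOUR = 1.0      # quoted load rate

# The two rates together imply an average compression ratio.
implied_compression = EFFECTIVE_MB_PER_SEC / RAW_MB_PER_SEC


def effective_rate(raw_mb_per_sec, compression_ratio):
    """Effective throughput when each MB read expands by compression_ratio."""
    return raw_mb_per_sec * compression_ratio


def hours_to_load(terabytes, tb_per_hour=LOAD_TB_PER_HOUR):
    """Time to load a given volume at a steady load rate."""
    return terabytes / tb_per_hour


print(implied_compression)                  # ratio implied by 440 vs 110
print(effective_rate(RAW_MB_PER_SEC, 4))    # back to the quoted effective rate
print(hours_to_load(24))                    # hours for a hypothetical 24 TB load
```

In other words, the 440 MB/second figure assumes data that compresses about 4 to 1.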

Featured keynote by Donald Feinberg, Distinguished Analyst at Gartner
1) Memory is getting much bigger, now over 1 TB on a server. The future of data warehousing may be in-memory, not on SSD or Flash as some expect.
2) Memory is more expensive to buy than disk, but it requires only 1% of the electricity. This will become an important economic consideration.
3) There is a bright future in Predictive Analytics.

That's it! I hope you found something of interest in all that. If so, tune in again for another list from Day 2.

'Till then,

Happy Coding!

Sunday, June 13, 2010

Book Review for "Plone 3 Products Development Cookbook"

Let me make one thing clear to start: This is a book by developers, for developers. The authors clearly state in the Preface that the readers should have some knowledge of Python, Plone and Zope and I believe this is very true. There's little material here to get the newbie up to speed on basic concepts. But if you're already at that point, then this book reads like working notes from an expert Plone 3 consultant, and there is much worthwhile content here for you.

The first chapter is the only one that offers a newbie-level entry point. It covers Plone 3 installation, and gives some advice on establishing your initial Plone site. Once that chapter's done, hang on tight-- the coddling is over! The text then jumps straight to recommendations for tools used for development. The authors clearly have a good amount of experience in adding functionality to Plone, and they offer great advice on which tools will be useful. Given the base of those two chapters, the book then launches right into the proper way to develop products for Plone 3.

The goals of the book are really outlined back in the preface. The authors have selected 10 pieces of functionality that are not found in a base Plone 3 installation. I found all 10 to be requests you might easily find in the 'real world'. (Examples: Prepare the website for internationalization, allow multimedia content that should be both playable on the site and downloadable. The rest are just as reasonable.)

Once you know what the authors are setting out to accomplish, you are presented with answers to all these challenges. The answers are formatted in a pattern repeated for each action that adds functionality:

Getting Ready - outlines installation prerequisites, the things you'll need to gather.
How to Do It - step by step instructions on how to implement your changes.
How It Works - after you've configured things in the previous step, this step explains why things work.
There's More - an optional section where further reading can be found, or maybe extras like test procedures.

Along the way the authors provide tips and techniques for expert Plone development. These include debugging, documentation, testing, and packaging. I was especially pleased that the authors took time to provide text on performance considerations, something not always present in books of this type.

If I had one wish for this book, it would be for more illustrations and a little more remedial material for developers not already knee-deep in Plone development. Outside that, I'd recommend that anyone doing Plone 3 development should look at this book. There are so many expert level tips and tricks contained here, I imagine nearly everyone is going to learn something-- many will learn many things.

The book can be found here.

Happy Reading!

Saturday, June 12, 2010

Profession in Crisis -- The Moving Ladder problem

Let me be clear about this: I love my work. Programming has been very good to me, and I've loved the challenges it presents. But I wonder if industry is making short-sighted decisions that will eventually bring crisis to our profession.

I work for a fairly large company (6,000 employees, a little over $1B in revenue). For the past few years we've been following what I think is typical industry practice: currently employed programmers are largely domestic, while new hires are sourced almost entirely from offshore data centers. At first, we had new-hire data centers in India, then we started in Poland and most recently in China.

By the way, this is not a rant against the offshore programmers. Who can fault anyone for seeking employment and providing for their families? Today's column is about a business problem, not the offshore debate.

Anyone who's ever worked with consultants will recognize that effective programming consists of two primary pieces: knowledge of the toolset (language, frameworks, configurations, etc.) and knowledge of the business domain. High-end consultants typically start an assignment with a lengthy briefing in which business knowledge is described to them. This is necessary because it's impossible to make informed strategic decisions without a good understanding of the business context you're working in. The same is true of stay-in-place programmers: only entry-level coders bring value solely through toolkit knowledge. Mid-to-upper level coders have to understand the business, so they can make informed architectural level decisions. The higher up the ladder you go, the stronger the business/technical see-saw tilts towards business knowledge.

OK, here's the problem: With our moving-low-dollar-hiring model, we've effectively cut off the business domain accumulation pipeline. Today's entry level coders are tomorrow's application subject matter experts. What we have done is constantly moved the bottom rungs of the ladder, so nobody is being given the time to thoroughly understand how the complicated applications work. We've still got domestic experts on the top end of the ladder who learned the ropes while bottom-level jobs led directly to the top-level jobs. But no one is going to be able to climb to the top of the ladder in the future-- the bottom rungs keep moving.

What will happen when today's set of application architects retires or moves on? There won't be any mid-level experts in India-- those experts took other jobs when the coding work went to Poland. And there won't be any domain experts in Poland, either-- that feeder pipeline was shut down when the low-end work went to China. I'm sure China's surging economy will eventually price the Chinese coders out of the business too, and we will again move the bottom rungs of the ladder somewhere else.

I guess a pure capitalist would say that this is forcing efficiency somehow, that the business will adjust to this problem as it seeks to become more efficient. I just don't see that working out-- the people who know how to make effective strategic decisions are invariably those people who have had years, maybe decades, to understand how a company's applications power the enterprise. I think it's a short-sighted business anti-pattern, and short-term greed is driving it.

Maybe the end result will just be sub-optimal business performance in the long run. Like all engineers, I just hate to see inefficiency, especially if it's something that's shaping up to manifest itself years down the line.

But for today, I've got a couple of new books to read. One's on Plone, the other on DSLs in Groovy. So I'm going to go study a little just-for-fun low level coding, just for the joy of learning new things. That's something all programmers across the globe can enjoy!

Happy Coding!

Sunday, June 6, 2010

Your allies in IT: The Ops crew

I've always worked in shops where the 'Development' side of the house produced applications and the 'Ops' side of the house ran those applications.

If you've never heard the term 'Ops', that means the folks that manage data storage, migrate programs from environment to environment (dev to test to cert to prod, etc), and FTP files back and forth to clients. If your company deals with printed materials (statements, bills, etc.) it's the ops folks that produce these. They might also manage tape libraries, migrate applications from old servers to new ones, or manage your virtualized environment. In short, they take all the stuff developers produce and put it to use so the company can make money. They're the ones that RUN your application.

Today, I'd like to present five things every developer should know about Ops.

Ops is NOT boring and repetitive work
Ops is about efficiency and proper execution. To be good in ops, you have to understand the resources the applications use (disk, processor consumption, printers, etc.) and know how to continually improve usage of these finite resources. This means lots of analytical thinking, and lots of adrenaline on days (nights, probably) when changes are made to the all-important production ecosystem. Ops work is most definitely NOT boring unless your shop has had very good leadership for so long everything's already smoothed out and nothing ever changes.

Ops thinks Dev is a bunch of Prima Donnas
Somewhat true - Depending on the quality of your Dev shop, you may have a few unenlightened souls running around thinking they're superior to Ops in some way. Not true, Kemo Sabe-- and thinking that way is a good way to leave valuable bridges unbuilt. Many times I've had my skin saved by Ops personnel willing to work a little harder to restore a dataset or run some special fix-it job that will cover a mistake I'd made. (Sometimes they even concoct the scheme necessary to fix things.) It's easy for the Dev-Ops relationship to be somewhat adversarial, though, if there's not mutual respect. In that case the Ops guys can be like the fabled Chinese philosopher who just waited by the river for the body of his adversary to come floating by. In this case, a friendly Ops ally can fish you out and resuscitate you-- if you've invested in building the relationship. If you haven't, they just have to wait and, sooner or later, laugh at you when your application goes belly up and you need a special favor.

Ops is a dead-end job - NOT!
Definitely false. Ops gives you the perspective needed to understand your shop's core business. You have to know the way things work and the deadlines associated with application run times. Ops also has to have a handle on online (transactional) operations and the SLAs associated with them. Ops jobs can definitely lead to strong management positions (i.e. the CIO-type career path), if that's your desired career progression.

Ops knows who you are
They quickly pick up on who's who in the Dev side of the shop-- sharp Devs can be granted extra privileges at crunch-time while Dev Bozos will almost certainly be subjected to every stringent by-the-book mandate they have available. Ops will ultimately be given this power, because they feel a lot of the pain when things go wrong. So if a particular Dev/Architect type has caused them problems in the past, that Dev is going to have a rougher time getting things done. (Getting things done is what ultimately leads to Dev promotions and pay raises, so this is an important consideration.)

Smart Ops shops pay a little more
It is true that some Ops jobs don't require a CS degree or any particular certifications. Your company can hire people with a limited skill set-- but only if your work is largely one-dimensional and so simplistic it really doesn't require much thought. But if your work is heterogeneous and you have many time-sensitive deadlines and resources to balance, your shop will be money ahead to pay for good analytical thinkers. The very best shops I've seen have hired entry-level people to start, then paid them well enough to keep them around for many years. If they've hired smart (though inexperienced) people, those people will accumulate valuable knowledge as time goes by and will yield an incredible ROI over time as they keep your IT machine smoothly running. A single salvaged SLA can pay for a lot of marginally bigger paychecks!

So if your shop runs production applications, I'd urge every Developer and Architect to adopt the right mindset towards our friends on the Ops side. It'll pay off, big time!

'Till next time,

Happy Coding!