Sunday, July 17, 2011

ETL-- what is it, how it can be useful to you

What's ETL, and what can it do for me?

ETL is short for "Extract, Transform, Load", and it's a well established discipline of IT. ETL tools are usually fronted with graphical front-ends that facilitate putting together processes that work on streams of data. The data can be filtered, augmented, sliced into pieces, and put into new repositories. (That's E,T,E and L if you're a discriminating reader.)

So what can this do for you? In my day-to-day job, I used to work for a company that worked with gobs of demographic data, so we used ETL all the time. Mostly handling large files, splitting them into fields, validating values, adding values where none was there, and loading databases. But now that I'm doing something else, I still see plenty of uses for it. Got a log file you need to parse to look for trends? ETL can help. Do you have a list of Garbage Collection stats you'd like to work over? Another good use case. How about writing out some XML or JSON from a source in another form? Check, check and check. Would you like to automatically validate the format of some data, and send an email out if it needs some work? Another good case. Almost any time you have a reasonable amount of data and want to add to it, subtract it, split it into streams, validate it or place it somewhere else or in a different format, ETL should be one of the tools you consider.

ETL vs. ESB vs. Integration Framework

Is ETL better than an ESB (like Mule or JBoss ESB)? How about a lightweight integration framework (like Camel or SwitchYard)? ETL is often accompanied by a nifty user interface that lets you quickly drag-n-drop components onto a canvas, making for a very productive user model. ETL is almost always used to process large volumes of data, not single transactions. So while an ESB or an Integration Framework's usual use case might be onesy transactions, the ESB is going to go after a lot more than one record. (The records can come from a database, a flat file, an XML document, or other stores. The idea is that there is probably more than one of them.) This is not a concrete rule-- ESB and Integration Frameworks can process files, for instance, or an ETL tool can be made to operate on single records-- but it's almost always true. ETL is more about files (or groups of records) than it is about a single message.

Where can I get one of those?

ETL tools have been around a long time. Commercial versions exist that contain many nice features, but I really like an Open Source version called 'Kettle'. Like many successful Open Source products, Kettle is available in a commercial version and in the 'project' version that's freely available. I'd urge you to download Kettle and give it a shot if you haven't done so already.

OK, I got the free one. Now how do I get started?
The Kettle download comes with a 'Getting Started' guide that should have you running a sample flow in less than an hour. Better yet, you'll see all the cool drag-n-drop operations you can use and will soon be hacking together flows to munge data faster than you ever thought possible.

We interrupt this blog posting for a brief Book Review...

I've been inspired to write this blog post after reading a new book from Packt Publishing, "Pentaho Data Integration 4 Cookbook". Now that we have a baseline understanding of what kind of ETL we're talking about, let's talk a bit about this book.

The book does not teach Kettle basics. If you don't know how to start Kettle, you're going to have to read the user doc, because you won't find it here. What you will find is a sizable collection of best practices for accomplishing things with Kettle. (Many of the patterns will apply to other ETL products as well, though the implementation steps will be different.) Here are some of the things you might expect, and will find in this book:

- How to insert, update, and delete from a database
- How to alter a database at runtime
- How to read and write fixed flat files, CSV files, and Excel documents
- Reading and writing XML documents
- Shuffling files in the usual ways (copying, moving, remotely copying and moving), deleting
- Writing reports
- Sending emails

I'd expect any decent book on ETL to cover all those topics. These things are all basic uses of Kettle and would quickly be needed by many users. But wait! The book goes deeper, providing advanced use cases that will leave many advanced users puzzled. These include:

- Changing a database connection at runtime
- Parsing unstructured files (like a log4j log file)
- Validating XML against a schema or DTD
- Matching data based on 'fuzzy' matching algorithms
- Augmenting data with web services
- Splitting a flow into 'sub-Transformations' to facilitate management and re-use
- Effectively dealing with your data stream-- forking it into different streams for 'good' rows and 'bad' rows, and treating each differently

It's this second list of topics that really makes this book valuable. Honestly, Kettle is fairly intuitive to use, so a competent developer will probably figure out the basics without much assistance. The difference between a basic user and an expert will be the ability to figure out the more advanced use cases, and that's where this book provides a nice value-add.

The book can be found here.

I'm pretty excited about using a few of the tricks I've picked out of the book to help me with some recurring tasks. One of these tasks involves preparing data for R, the statistical package. (If you're not an 'R' user, please look back in this blog or elsewhere on the internet to see what all the hoopla is about. It really is nice!) Prior to this, I'd usually use Python and Bash scripts to massage and extract data from the sources I want to examine, but this actually can get to be a bit tedious. Using the drag-n-drop Kettle interface, though, I hope to apply my newfound knowledge to help me quickly stage data for advanced analysis. Wish me luck in this endeavor as I wish you good luck in finding uses for your problems as well. We're lucky-- never before have we had such a wealth of freely available tools to help us solve problems!

Happy Kettling!

Saturday, July 2, 2011

Book Review for "Oracle JRockit The Definitive Guide"

Have you ever worked with someone who knew a lot about the JVM? I did once, and it was an eye opening experience. Not only was this guy the right person to go to for performance and GC questions, he also was confident he'd written his programs for optimal performance. I was so impressed I decided this must be one of the characteristics of a superior programmer, and decided I'd try to improve myself in this regard.

I've just finished "Oracle JRocket The Definitive Guide" and I think this book has helped me on my journey to Java mastery. The early chapters are a little deep for an application programmer, but they still make for interesting reading. (Just how deep? How about discussions of the assembly code generated for your Java source? The very lowest levels of the JVM are discussed.)

After the lowest level workings of the JVM are covered, we encounter some material that's more likely to be of use to the average Java coder. We start learning about memory management, including in-depth coverage of garbage collection. (This is sure to be of use to any serious Java developer.) Java constructs of interest like Soft References, Phantom References, etc. are discussed. IMHO, these are topics that are beyond basic Java-- another tool in your Java toolkit!

There's a whole chapter on threading and synchronization. Once again, treading ground beyond that occupied by Java programmers at the basic level. Here you'll encounter topics like thin and fat locks, spin locks, and the innards of how threading is implemented on a JVM. If you're starting to get the feel this book is partly about computer science, not just programming, I'd think you may be right.

There's a lot of material related to understanding how the JVM is working. (This includes benchmarking, profiling, and understanding the running JVM.) It goes without saying this is all of high value to any Java programmer.

The final chapters of the book are devoted to the tools associated with JRockit. JRockit comes with a bunch of handy tools to help you monitor and control your JVM. These chapters show you all the slick GUI features and low-level switches you can use now that you understand what the JVM is up to.

Final impressions? This is an interesting book, because it's written hardest-materials-first. If you can hang with the early chapters you stand a lot to gain.

Who is this book good for? Any Java programmer will benefit from the early chapters. (They aren't specific to JRockit.) The latter chapters will be of use only to JRockit users, but since it's now freely available under a user friendly license, these tools are available to everyone.

It's a big book-- well over 500 pages, and none of it is fluff. I liked this book, it can be found here.

Happy 'Rockiting!