What's ETL, and what can it do for me?
ETL is short for "Extract, Transform, Load", and it's a well-established discipline of IT. ETL tools usually come with graphical front-ends that make it easy to put together processes that work on streams of data. The data can be filtered, augmented, sliced into pieces, and put into new repositories. (That's the T and the L, if you're a discriminating reader; the E is reading in the stream in the first place.)
So what can this do for you? I used to work for a company that handled gobs of demographic data, so we used ETL all the time: mostly handling large files, splitting them into fields, validating values, filling in values where none were there, and loading databases. Now that I'm doing something else, I still see plenty of uses for it. Got a log file you need to parse to look for trends? ETL can help. Have a list of Garbage Collection stats you'd like to work over? Another good use case. How about writing out some XML or JSON from a source in another form? Check, check, and check. Would you like to automatically validate the format of some data, and send an email out if it needs some work? Also covered. Almost any time you have a reasonable amount of data and want to add to it, subtract from it, split it into streams, validate it, or place it somewhere else or in a different format, ETL should be one of the tools you consider.
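To make the pattern concrete, here's a minimal sketch of the Extract, Transform, Load steps in plain Python. (The field names and the missing-value rule are made up for illustration; a real ETL tool does this with configurable drag-n-drop steps rather than hand-written code.)

```python
import csv
import io
import json

# Hypothetical input: a small CSV "file" of demographic-ish records.
# In a real flow this would come from a file on disk or a database.
raw = io.StringIO(
    "name,age,city\n"
    "Alice,34,Raleigh\n"
    "Bob,,Durham\n"
)

# Extract: split the stream into fields.
rows = list(csv.DictReader(raw))

# Transform: validate values, and fill in a value where none was there.
for row in rows:
    row["age"] = int(row["age"]) if row["age"] else -1  # -1 flags a missing age

# Load: write the result out as JSON (a new "repository").
output = json.dumps(rows)
```

Every step here has a direct analog in an ETL tool's palette: a file-input step, a calculator/validator step, and an output step.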
ETL vs. ESB vs. Integration Framework
Is ETL better than an ESB (like Mule or JBoss ESB)? How about a lightweight integration framework (like Camel or SwitchYard)? ETL is often accompanied by a nifty user interface that lets you quickly drag-n-drop components onto a canvas, making for a very productive user model. ETL is almost always used to process large volumes of data, not single transactions. So while an ESB or an Integration Framework's usual use case might be one-off transactions, the ETL tool is going to go after a lot more than one record. (The records can come from a database, a flat file, an XML document, or other stores. The idea is that there is probably more than one of them.) This is not a concrete rule-- ESBs and Integration Frameworks can process files, for instance, and an ETL tool can be made to operate on single records-- but it's almost always true. ETL is more about files (or groups of records) than it is about a single message.
Where can I get one of those?
ETL tools have been around a long time. Commercial versions exist that contain many nice features, but I really like an Open Source one called 'Kettle' (also known as Pentaho Data Integration). Like many successful Open Source products, Kettle is available both in a commercial version and in a freely available 'project' version. I'd urge you to download Kettle and give it a shot if you haven't done so already.
OK, I got the free one. Now how do I get started?
The Kettle download comes with a 'Getting Started' guide that should have you running a sample flow in less than an hour. Better yet, you'll see all the cool drag-n-drop operations you can use and will soon be hacking together flows to munge data faster than you ever thought possible.
We interrupt this blog posting for a brief Book Review...
I've been inspired to write this blog post after reading a new book from Packt Publishing, "Pentaho Data Integration 4 Cookbook". Now that we have a baseline understanding of what kind of ETL we're talking about, let's talk a bit about this book.
The book does not teach Kettle basics. If you don't know how to start Kettle, you're going to have to read the user doc, because you won't find it here. What you will find is a sizable collection of best practices for accomplishing things with Kettle. (Many of the patterns will apply to other ETL products as well, though the implementation steps will be different.) Here are some of the things you might expect, and will find in this book:
- How to insert, update, and delete from a database
- How to alter a database at runtime
- How to read and write fixed flat files, CSV files, and Excel documents
- Reading and writing XML documents
- Shuffling files in the usual ways: copying, moving, and deleting, locally or on remote systems
- Writing reports
- Sending emails
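Most of those basics boil down to the same read-then-load pattern. As a rough illustration (in Python rather than Kettle's drag-n-drop steps, with an in-memory SQLite database standing in for a real one, and made-up table and column names), loading a CSV file into a table might look like:

```python
import csv
import io
import sqlite3

# A tiny CSV source; stands in for a real flat file on disk.
src = io.StringIO("id,name\n1,Alice\n2,Bob\n")

# An in-memory database stands in for a real target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")

# The L of ETL: bulk-insert the parsed rows.
reader = csv.DictReader(src)
conn.executemany(
    "INSERT INTO people (id, name) VALUES (:id, :name)",
    list(reader),
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
```

In Kettle this whole sketch collapses into two connected steps on the canvas: a 'CSV file input' hop into a 'Table output'.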
I'd expect any decent book on ETL to cover all those topics. These things are all basic uses of Kettle and would quickly be needed by many users. But wait! The book goes deeper, providing advanced use cases that would puzzle even experienced users. These include:
- Changing a database connection at runtime
- Parsing unstructured files (like a log4j log file)
- Validating XML against a schema or DTD
- Matching data based on 'fuzzy' matching algorithms
- Augmenting data with web services
- Splitting a flow into 'sub-Transformations' to facilitate management and re-use
- Effectively dealing with your data stream-- forking it into different streams for 'good' rows and 'bad' rows, and treating each differently
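To give a feel for two of those recipes outside of Kettle itself-- parsing semi-structured log lines and forking rows into 'good' and 'bad' streams-- here's a rough Python sketch. (The log layout, regex, and field names are my own assumptions for illustration, not the book's or Kettle's actual parser.)

```python
import re

# Hypothetical log4j-style lines; the last one is deliberately malformed.
lines = [
    "INFO  2011-06-01 12:00:01 Server started",
    "ERROR 2011-06-01 12:00:05 Connection refused",
    "this line does not match the expected layout",
]

# Assumed layout: LEVEL, a date and time, then a free-text message.
pattern = re.compile(r"^(INFO|WARN|ERROR)\s+(\S+ \S+)\s+(.*)$")

good_rows, bad_rows = [], []
for line in lines:
    m = pattern.match(line)
    if m:
        level, timestamp, message = m.groups()
        good_rows.append({"level": level, "ts": timestamp, "msg": message})
    else:
        bad_rows.append(line)  # route non-matching rows to their own stream
```

The same idea in Kettle is a regex-based parsing step whose error hop feeds the 'bad' rows to a separate branch, where you might log them or email someone.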
It's this second list of topics that really makes this book valuable. Honestly, Kettle is fairly intuitive to use, so a competent developer will probably figure out the basics without much assistance. The difference between a basic user and an expert will be the ability to figure out the more advanced use cases, and that's where this book provides a nice value-add.
The book can be found here.
I'm pretty excited about using a few of the tricks I've picked out of the book to help me with some recurring tasks. One of these tasks involves preparing data for R, the statistical package. (If you're not an 'R' user, please look back in this blog or elsewhere on the internet to see what all the hoopla is about. It really is nice!) Until now, I'd usually use Python and Bash scripts to massage and extract data from the sources I want to examine, but that can get a bit tedious. Using the drag-n-drop Kettle interface, though, I hope to apply my newfound knowledge to quickly stage data for advanced analysis. Wish me luck in this endeavor, as I wish you good luck in finding uses for these tools on your own problems. We're lucky-- never before have we had such a wealth of freely available tools to help us solve problems!