Sunday, February 17, 2008

The Rules and Data shortcut-- doing things the ELT way

Lately I've been working on a very data-intensive project, one that needs to run hundreds of business rules against millions (maybe billions) of rows of data kept in a database. There's a constraint issue here-- our rule engine likes all the facts (data) in 'Working Memory' before it gives you the results you want, so we can't run the rules against *all* the data all at once. This means either serializing the process (which would take much too long, and would leave us vulnerable to changes to the database as time goes by) or distributing the task (which leaves us to work out recovery and restart schemes and looking for ways to manage timing issues, as we'll probably do this asynchronously.)

An idea has emerged, though, that might help us in many ways. If we resort to doing 'ELT' processing (Extract, Load, Transform-- *not* ETL), we can use the database to do our heavy lifting for us. A really great side-benefit is that we can do this with a small number of SQL statements instead of a whole bunch of data access classes and a bunch of business rules. We also get benefit of having the database state much less vulnerable to partially successful operations, as the SQL can easily be made transactional.

Here's a quick 'method 1 vs. method 2' comparison:

Method 1:
Determine what kind of data the rules require.
Pull the data (using Hibernate generated DAOs), put it in Working Memory.
Fire the Rules (which have to know the particulars of the Entity objects).
Put the rule results back in the database.
*Note: The data is 'pulled out' of the db, then the results are 'put back in'.

Method 2:
Using temporary tables and SQL, do as much of the rule work as possible in the database.
Do simple calculations, putting results in temporary tables that are created and destroyed as often as needed. Make small jumps in state from 'step' to 'step'. This will make more SQL, but simpler SQL.
After an 'End state' temp table is populated with the results of the rule evaluation, put the rule results back in the database. (In this case, an update OR insert, depending if we already had a result for some of the Entities under evaluation.) Some databases (including MySQL) give you convenience functions to handle inserts (if no row with this key) or update (if key exists).
*Note: The data never 'leaves' the database. The database doesn't necessarily work any harder (because it's doing fewer compute-sensitive large joins, selects, etc.) instead of a whole lot more run-of-the-mill DAO operations (gets).

Since the db is doing all the work, all we need is a single client to make the SQL requests instead of an army of distributed clients in Scenario #1. This makes for much simpler recovery and restart in the event something goes wrong (and fewer machines and instructions working, so fewer opportunities for things to go wrong.)

I'm much more of an application developer than a DBA, so this logic-in-SQL idea is not my first choice, but I think I like this ELT stuff. I'll just have to be careful to keep the SQL as simple as possible, as this seems to be the strongest negative that I can find in this situation.

No comments: