Warning: ob_start(): non-static method wpGoogleAnalytics::get_links() should not be called statically in /home/scolin/blog.scolin.com/wp-content/plugins/wp-google-analytics/wp-google-analytics.php on line 259
Audit Your ETL With CHECKSUM « from the gut

Loading your target incrementally offers a huge performance benefit over running full truncate/reloads, but there is a danger of missing inserts or updates in your source system. It can take hours or days to track down the source of these problems (if it can be done at all!) and problems are generally not found until the customer picks up on it.

So anything you can do to verify the target data with the source is good for your business. You can do simple rowcounts, but that’s an unreliable test. For example, if you’re missing one inserted row and one deleted row, your rowcount test will still pass.

A good checksum technique is a lightweight alternative that can be run asyncronously or run after the load is complete. In this example, I am querying a SQL Server target and using ODBC to connect to an Oracle source to get a list of keys that are not syncronized.

SELECT T.key, S.key, T.checksum, T.num_rows, S.checksum, S.num_rows
FROM (SELECT key, count(*) num_rows,
sum(len(char_col)+len(num_col)+num_col+...) checksum
FROM target_tablegroup by key) T
SELECT key, count(*) num_rows,
sum(length(char_col)+length(num_col)+num_col+...) checksum
FROM source_table
GROUP BY key') S ON T.key = S.key
WHERE S.num_rows <> T.num_rows
OR S.checksum <> T.checksum

Tags: , ,

Leave a Reply

You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>