Shane on June 29th, 2007

My latest debacle loading this massive table has to do with removing duplicates before I try adding a primary key to the table. I had to restart the load process a couple times before I finally settled into a method that worked well. (Running multilpe instances of DTExec on different subsets of the source data, all bulk-loading the same heap.)

I figured that rather than waiting to clean out dups when I created them, that I could simply write some scripts to clean them out afterwards? Right?

That works fine if you’re loading a few million rows, but when you’re talking HUNDREDS OF MILLIONS, you start to hit some walls. My usual method of removing duplicates is to (1) stage a set of distinct rows into a temp table, (2) delete rows from the original table using the temp table for keys, and (3) re-inserting rows from the temp table. Well, I have 50M unique rows with duplicates, and the SELECT DISTINCT trys to sort this set in tempdb, which quickly uses up all my disk space.

I need a way to do in-place deletes. In ORACLE, you can use ROWID to self-join the table and delete the unwanted rows. In SQL Server, I can use a CTE with row_number() but I’m really not sure if this will hit the same wall with tempdb.

The solution I found is this: DELETE TOP(n) FROM tablename

I set-up a cursor to loop through the table and DELETE TOP(n-1) rows for all cases with duplicates. It doesn’t take a lot of tempdb space. We’re in simple recovery, so it won’t generate a lot of logging. I think this is the best solution.

Here are the scripts:

  1. We need some sample data.


    select *
    into #dups
    from (
    select 1 col, 'blah' dat
    union all
    select 1 col, 'blah2' dat
    union all
    select 1 col, 'blah3' dat
    union all
    select 2 col, 'blah' dat
    union all
    select 2 col, 'blah2' dat
    union all
    select 2 col, 'blah3' dat
    union all
    select 2 col, 'blah4' dat
    union all
    select 3 col, 'blah' dat
    union all
    select 3 col, 'blah2' dat
    union all
    select 4 col, 'blah' dat
    ) a

  2. Here is the meat. We create a cursor with the unique values and counts, and use DELETE TOP (n-1) to run our row-by-row DELETE.


    SET NOCOUNT ON

    DECLARE @col int, @count int

    DECLARE cur CURSOR FORWARD_ONLY FAST_FORWARD READ_ONLY FOR
    SELECT col, count(*) row_count
    FROM #dups
    GROUP BY col
    HAVING count(*) > 1

    OPEN cur
    FETCH NEXT FROM cur INTO @col, @count

    WHILE @@FETCH_STATUS=0
    BEGIN
    DELETE TOP (@count-1) FROM #dups
    WHERE col = @col
    FETCH NEXT FROM cur INTO @col, @count
    END

    CLOSE cur
    DEALLOCATE cur

  3. And finally, our output data.
    1	blah3
    2	blah4
    3	blah2
    4	blah
    

    In my case, all the values for the non-key columns are the same, but I made them different in this case to show how DELETE TOP works.

The thing that bothers me about this solution is that it doesn’t batch up the deletes. I don’t want to run one delete because I’ll run out of log space, but it would be nice to batch it up somehow. Any ideas?

Update: I added “SET NOCOUNT ON” to avoid spooling the delete log to the client. Also, I found that for my massive update, it was quite helpful to run the process in parallel.

Tags: ,

Shane on June 28th, 2007

Dear Shane,

Next time trust me. SELECT INTO is always faster than UPDATE when you’re talking about updating the entire table. Even if the table has three NVARCHAR(MAX) fields and you’re only updating one 4 byte INT column.

Rather than run an update for 1hr 6min before you decide to cancel said query and trigger a 2hr 51min rollback of the transaction. Rather than writing a cursor to perform an in-place update of said heap using the only index you have on it. Just SELECT INTO a new table. This will update 168M rows in 48 minutes.

Listen to me next time. And save yourself some time.

Sincerely,
Your Instincts

Tags: ,

Shane on June 28th, 2007

When I was working with Oracle, I got hooked on using MINUS and INTERSECT. Well, it turns out that Microsoft implemented these functions in a slightly different way in SQL 2005.

Jeff Smith lays out a nice example here.

Tags: ,

Shane on April 19th, 2007

My friend Gary is fulfilling his life-long dream to take a trip around the world. This being the internet age, we get to follow along from the comfort of our living rooms without having to get the shots. I encourage you to check it out. Gary is one of the funniest and smartest people I know, so it will be entertaining, if nothing else.

He posted his first real update today from the big Island of Hawaii.

Tags: ,

Shane on April 2nd, 2007

Here’s a little BI humor for anyone from one of my favorite BI writers: Donald Farmer.

Tags: ,

Shane on February 12th, 2007

Also on the subject of checksum testing, this query will generate a checksum query for a given table. Use this query to generate the checksum for each table, then customize the sum() clauses if you did any transformations on your data.


select case a.xtype
when 167 then 'SUM(LEN('+a.name+')) as sum_len_'+a.name+','
when 175 then 'SUM(LEN('+a.name+')) as sum_len_'+a.name+','
when 231 then 'SUM(LEN('+a.name+')) as sum_len_'+a.name+','
when 106 then 'SUM('+a.name+') as sum_'+a.name+','
when 56 then 'SUM('+a.name+') as sum_'+a.name+','
when 104 then 'SUM(CONVERT(int,'+a.name+')) as sum_'+a.name+','
when 61 then 'SUM(MONTH('+a.name+')+DAY('+a.name+')+YEAR('+a.name+')) as sum_'+a.name+',
SUM(LEN(MONTH('+a.name+'))+LEN(DAY('+a.name+'))+LEN(YEAR('+a.name+'))) as sum_len_'+a.name+',' -- datetime
else a.name+' '+convert(varchar,a.xtype)
end col_type
from syscolumns a, sysobjects b
where a.id=b.id and b.xtype='U'
and upper(b.name) = 'MYTABLE'
order by b.name, a.name, a.colid

Tags: , ,

Shane on January 31st, 2007


select upper(b.name) as table_name, upper(a.name) as column_name,
a.colid, a.length, a.xprec, a.xscale, a.xtype,(CASE WHEN a.xtype=167 then 'varchar'
WHEN a.xtype=175 then 'char'
WHEN a.xtype=61 then 'datetime'
WHEN a.xtype=106 then 'decimal'
END) as col_type
from syscolumns a,
sysobjects b
where a.id=b.id
and b.xtype='U'
and upper(b.name) <> 'DTPROPERTIES'
order by b.name, a.name, a.colid

Tags: , ,