My latest debacle loading this massive table has to do with removing duplicates before I try adding a primary key to the table. I had to restart the load process a couple times before I finally settled into a method that worked well. (Running multilpe instances of DTExec on different subsets of the source data, all bulk-loading the same heap.)
I figured that rather than waiting to clean out dups when I created them, that I could simply write some scripts to clean them out afterwards? Right?
That works fine if you’re loading a few million rows, but when you’re talking HUNDREDS OF MILLIONS, you start to hit some walls. My usual method of removing duplicates is to (1) stage a set of distinct rows into a temp table, (2) delete rows from the original table using the temp table for keys, and (3) re-inserting rows from the temp table. Well, I have 50M unique rows with duplicates, and the SELECT DISTINCT trys to sort this set in tempdb, which quickly uses up all my disk space.
I need a way to do in-place deletes. In ORACLE, you can use ROWID to self-join the table and delete the unwanted rows. In SQL Server, I can use a CTE with row_number() but I’m really not sure if this will hit the same wall with tempdb.
The solution I found is this: DELETE TOP(n) FROM tablename
I set-up a cursor to loop through the table and DELETE TOP(n-1) rows for all cases with duplicates. It doesn’t take a lot of tempdb space. We’re in simple recovery, so it won’t generate a lot of logging. I think this is the best solution.
Here are the scripts:
- We need some sample data.
select *
into #dups
from (
select 1 col, 'blah' dat
union all
select 1 col, 'blah2' dat
union all
select 1 col, 'blah3' dat
union all
select 2 col, 'blah' dat
union all
select 2 col, 'blah2' dat
union all
select 2 col, 'blah3' dat
union all
select 2 col, 'blah4' dat
union all
select 3 col, 'blah' dat
union all
select 3 col, 'blah2' dat
union all
select 4 col, 'blah' dat
) a
- Here is the meat. We create a cursor with the unique values and counts, and use DELETE TOP (n-1) to run our row-by-row DELETE.
SET NOCOUNT ON
DECLARE @col int, @count int
DECLARE cur CURSOR FORWARD_ONLY FAST_FORWARD READ_ONLY FOR
SELECT col, count(*) row_count
FROM #dups
GROUP BY col
HAVING count(*) > 1
OPEN cur
FETCH NEXT FROM cur INTO @col, @count
WHILE @@FETCH_STATUS=0
BEGIN
DELETE TOP (@count-1) FROM #dups
WHERE col = @col
FETCH NEXT FROM cur INTO @col, @count
END
CLOSE cur
DEALLOCATE cur
- And finally, our output data.
1 blah3
2 blah4
3 blah2
4 blah
In my case, all the values for the non-key columns are the same, but I made them different in this case to show how DELETE TOP works.
The thing that bothers me about this solution is that it doesn’t batch up the deletes. I don’t want to run one delete because I’ll run out of log space, but it would be nice to batch it up somehow. Any ideas?
Update: I added “SET NOCOUNT ON” to avoid spooling the delete log to the client. Also, I found that for my massive update, it was quite helpful to run the process in parallel.