My latest debacle loading this massive table has to do with removing duplicates before I try adding a primary key to the table. I had to restart the load process a couple of times before I finally settled on a method that worked well. (Running multiple instances of DTExec on different subsets of the source data, all bulk-loading the same heap.)
I figured that rather than weeding out dups as I loaded, I could simply write some scripts to clean them out afterwards. Right?
That works fine if you’re loading a few million rows, but when you’re talking HUNDREDS OF MILLIONS, you start to hit some walls. My usual method of removing duplicates is to (1) stage a set of distinct rows into a temp table, (2) delete rows from the original table using the temp table for keys, and (3) re-insert the rows from the temp table. Well, I have 50M unique rows with duplicates, and the SELECT DISTINCT tries to sort this set in tempdb, which quickly uses up all my disk space.
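For concreteness, here’s that usual method as a sketch (table and column names are hypothetical); this is the version that blows out tempdb at this scale:

-- (1) stage one copy of each row whose key is duplicated
SELECT DISTINCT col, dat
INTO #stage
FROM dbo.BigTable
WHERE col IN (SELECT col FROM dbo.BigTable GROUP BY col HAVING COUNT(*) > 1)

-- (2) delete every row for those keys from the original table
DELETE t
FROM dbo.BigTable t
WHERE EXISTS (SELECT 1 FROM #stage s WHERE s.col = t.col)

-- (3) re-insert the single copies
INSERT INTO dbo.BigTable (col, dat)
SELECT col, dat
FROM #stage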
I need a way to do in-place deletes. In Oracle, you can use ROWID to self-join the table and delete the unwanted rows. In SQL Server, I can use a CTE with ROW_NUMBER(), but I’m really not sure whether that will hit the same wall in tempdb.
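For reference, the ROW_NUMBER() version would look something like this (again, hypothetical names; I haven’t verified its tempdb behavior at this scale):

-- number the rows within each duplicate group; the ORDER BY is
-- arbitrary since I don't care which copy survives
;WITH numbered AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY col ORDER BY (SELECT 0)) AS rn
    FROM dbo.BigTable
)
DELETE FROM numbered
WHERE rn > 1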
The solution I found is this: DELETE TOP(n) FROM tablename
I set up a cursor to loop through the table and DELETE TOP (n-1) rows for each key that has duplicates. It doesn’t take a lot of tempdb space, and since we’re in simple recovery, it won’t generate a lot of logging. I think this is the best solution.
Here are the scripts:
-- We need some sample data.
select *
into #dups
from (
select 1 col, 'blah' dat
union all
select 1 col, 'blah2' dat
union all
select 1 col, 'blah3' dat
union all
select 2 col, 'blah' dat
union all
select 2 col, 'blah2' dat
union all
select 2 col, 'blah3' dat
union all
select 2 col, 'blah4' dat
union all
select 3 col, 'blah' dat
union all
select 3 col, 'blah2' dat
union all
select 4 col, 'blah' dat
) a
-- Here is the meat: create a cursor over the duplicated key values and their counts, then use DELETE TOP (n-1) to run the deletes group by group.
SET NOCOUNT ON
DECLARE @col int, @count int
DECLARE cur CURSOR FORWARD_ONLY FAST_FORWARD READ_ONLY FOR
SELECT col, count(*) row_count
FROM #dups
GROUP BY col
HAVING count(*) > 1
OPEN cur
FETCH NEXT FROM cur INTO @col, @count
WHILE @@FETCH_STATUS=0
BEGIN
DELETE TOP (@count-1) FROM #dups
WHERE col = @col
FETCH NEXT FROM cur INTO @col, @count
END
CLOSE cur
DEALLOCATE cur
-- And finally, our output data:
col  dat
---  -----
1    blah3
2    blah4
3    blah2
4    blah
In my real table, the non-key values are identical within each set of duplicates, but I made them different here so you can see what DELETE TOP leaves behind.
The thing that bothers me about this solution is that it doesn’t batch up the deletes. I don’t want to run one giant delete because I’ll run out of log space, but it would be nice to batch it up somehow. Any ideas?
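One idea, as an untested sketch: cap each DELETE inside the cursor loop at a fixed batch size, so that no single statement needs much log space.

-- Same cursor loop as above, but each group's delete is issued in
-- chunks of at most @batch rows.
DECLARE @batch int
SET @batch = 10000
-- ...declare, open, and fetch the cursor exactly as above...
WHILE @@FETCH_STATUS=0
BEGIN
    WHILE @count > 1
    BEGIN
        DELETE TOP (CASE WHEN @count - 1 < @batch
                         THEN @count - 1 ELSE @batch END)
        FROM #dups
        WHERE col = @col
        -- track how many copies of this key remain
        SET @count = @count - @@ROWCOUNT
    END
    FETCH NEXT FROM cur INTO @col, @count
END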
Update: I added “SET NOCOUNT ON” to keep the “(n row(s) affected)” message for every DELETE from being sent to the client. Also, I found that for my massive delete, it was quite helpful to run the process in parallel.
Tags: SQL Server, SQL Tricks
Dear Shane,
Next time trust me. SELECT INTO is always faster than UPDATE when you’re talking about updating the entire table. Even if the table has three NVARCHAR(MAX) fields and you’re only updating one 4-byte INT column.
Rather than running an update for 1hr 6min before deciding to cancel said query and triggering a 2hr 51min rollback of the transaction. Rather than writing a cursor to perform an in-place update of said heap using the only index you have on it. Just SELECT INTO a new table. This will update 168M rows in 48 minutes.
Listen to me next time. And save yourself some time.
Sincerely,
Your Instincts
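P.S. The pattern is roughly this (all object names hypothetical). Under simple recovery the SELECT INTO is minimally logged, which is most of the win:

-- rebuild the table with the new value instead of updating in place
SELECT col1, col2, col3,   -- the columns that don't change
       0 AS int_col        -- whatever expression computes the new INT value
INTO dbo.MyTable_new
FROM dbo.MyTable

-- then swap the new table in for the old one
DROP TABLE dbo.MyTable
EXEC sp_rename 'dbo.MyTable_new', 'MyTable'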
Tags: SQL Server, SQL Tricks
When I was working with Oracle, I got hooked on using MINUS and INTERSECT. Well, it turns out that Microsoft implemented these operators in a slightly different way in SQL 2005: INTERSECT is still INTERSECT, but MINUS is spelled EXCEPT.
Jeff Smith lays out a nice example here.
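The short version, against a couple of hypothetical tables:

-- Oracle's MINUS is spelled EXCEPT in SQL Server 2005
SELECT col FROM table_a
EXCEPT        -- rows in table_a that aren't in table_b
SELECT col FROM table_b

-- INTERSECT works the same in both products
SELECT col FROM table_a
INTERSECT     -- rows that appear in both tables
SELECT col FROM table_b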
Tags: Other, SQL Server
My friend Gary is fulfilling his life-long dream to take a trip around the world. This being the internet age, we get to follow along from the comfort of our living rooms without having to get the shots. I encourage you to check it out. Gary is one of the funniest and smartest people I know, so it will be entertaining, if nothing else.
He posted his first real update today from the Big Island of Hawaii.
Here’s a little BI humor, from one of my favorite BI writers: Donald Farmer.
Tags: Data Mining, Other
Also on the subject of checksum testing: this query generates the SELECT list for a checksum query against a given table. Run it for each table, paste the generated lines into a SELECT, and customize the SUM() clauses if you did any transformations on your data.
select case a.xtype
    when 167 then 'SUM(LEN('+a.name+')) as sum_len_'+a.name+','       -- varchar
    when 175 then 'SUM(LEN('+a.name+')) as sum_len_'+a.name+','       -- char
    when 231 then 'SUM(LEN('+a.name+')) as sum_len_'+a.name+','       -- nvarchar
    when 106 then 'SUM('+a.name+') as sum_'+a.name+','                -- decimal
    when 56 then 'SUM('+a.name+') as sum_'+a.name+','                 -- int
    when 104 then 'SUM(CONVERT(int,'+a.name+')) as sum_'+a.name+','   -- bit
    when 61 then 'SUM(MONTH('+a.name+')+DAY('+a.name+')+YEAR('+a.name+')) as sum_'+a.name+',
SUM(LEN(MONTH('+a.name+'))+LEN(DAY('+a.name+'))+LEN(YEAR('+a.name+'))) as sum_len_'+a.name+',' -- datetime: checksum both the date parts and their lengths
    else a.name+' '+convert(varchar,a.xtype)                          -- unhandled type: emit name and type id for manual review
end col_type
from syscolumns a, sysobjects b
where a.id=b.id and b.xtype='U'
and upper(b.name) = 'MYTABLE'
order by b.name, a.name, a.colid
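For a hypothetical table with an int id and a varchar name, the generated lines assemble into something like this, which you then run on both source and target and compare:

select SUM(id) as sum_id,
    SUM(LEN(name)) as sum_len_name
from MYTABLE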
Tags: Other, SQL Server, SQL Tricks
-- Dump column metadata for every user table.
select upper(b.name) as table_name, upper(a.name) as column_name,
    a.colid, a.length, a.xprec, a.xscale, a.xtype,
    (CASE WHEN a.xtype=167 then 'varchar'
        WHEN a.xtype=175 then 'char'
        WHEN a.xtype=61 then 'datetime'
        WHEN a.xtype=106 then 'decimal'
    END) as col_type
from syscolumns a,
sysobjects b
where a.id=b.id
and b.xtype='U'
and upper(b.name) <> 'DTPROPERTIES'
order by b.name, a.name, a.colid
Tags: Other, SQL Server, SQL Tricks