Jun

Massive In-Place Delete

Shane on June 29th, 2007

My latest debacle loading this massive table has to do with removing duplicates before I try adding a primary key to the table. I had to restart the load process a couple times before I finally settled into a method that worked well. (Running multilpe instances of DTExec on different subsets of the source data, all bulk-loading the same heap.)

I figured that rather than waiting to clean out dups when I created them, that I could simply write some scripts to clean them out afterwards? Right?

That works fine if you’re loading a few million rows, but when you’re talking HUNDREDS OF MILLIONS, you start to hit some walls. My usual method of removing duplicates is to (1) stage a set of distinct rows into a temp table, (2) delete rows from the original table using the temp table for keys, and (3) re-inserting rows from the temp table. Well, I have 50M unique rows with duplicates, and the SELECT DISTINCT trys to sort this set in tempdb, which quickly uses up all my disk space.

I need a way to do in-place deletes. In ORACLE, you can use ROWID to self-join the table and delete the unwanted rows. In SQL Server, I can use a CTE with row_number() but I’m really not sure if this will hit the same wall with tempdb.

The solution I found is this: DELETE TOP(n) FROM tablename

I set-up a cursor to loop through the table and DELETE TOP(n-1) rows for all cases with duplicates. It doesn’t take a lot of tempdb space. We’re in simple recovery, so it won’t generate a lot of logging. I think this is the best solution.

Here are the scripts:

We need some sample data.
select * into #dups from ( select 1 col, 'blah' dat union all select 1 col, 'blah2' dat union all select 1 col, 'blah3' dat union all select 2 col, 'blah' dat union all select 2 col, 'blah2' dat union all select 2 col, 'blah3' dat union all select 2 col, 'blah4' dat union all select 3 col, 'blah' dat union all select 3 col, 'blah2' dat union all select 4 col, 'blah' dat ) a
Here is the meat. We create a cursor with the unique values and counts, and use DELETE TOP (n-1) to run our row-by-row DELETE.
SET NOCOUNT ON
DECLARE @col int, @count int DECLARE cur CURSOR FORWARD_ONLY FAST_FORWARD READ_ONLY FOR SELECT col, count(*) row_count FROM #dups GROUP BY col HAVING count(*) > 1 OPEN cur FETCH NEXT FROM cur INTO @col, @count WHILE @@FETCH_STATUS=0 BEGIN DELETE TOP (@count-1) FROM #dups WHERE col = @col FETCH NEXT FROM cur INTO @col, @count END
CLOSE cur DEALLOCATE cur
And finally, our output data.
```
1	blah3
2	blah4
3	blah2
4	blah
```
In my case, all the values for the non-key columns are the same, but I made them different in this case to show how DELETE TOP works.

The thing that bothers me about this solution is that it doesn’t batch up the deletes. I don’t want to run one delete because I’ll run out of log space, but it would be nice to batch it up somehow. Any ideas?

Update: I added “SET NOCOUNT ON” to avoid spooling the delete log to the client. Also, I found that for my massive update, it was quite helpful to run the process in parallel.

Tags: SQL Server, SQL Tricks

SQL Server, SQL Tricks No comments »

Jun

BULK UPDATE

Shane on June 28th, 2007

Dear Shane,

Next time trust me. SELECT INTO is always faster than UPDATE when you’re talking about updating the entire table. Even if the table has three NVARCHAR(MAX) fields and you’re only updating one 4 byte INT column.

Rather than run an update for 1hr 6min before you decide to cancel said query and trigger a 2hr 51min rollback of the transaction. Rather than writing a cursor to perform an in-place update of said heap using the only index you have on it. Just SELECT INTO a new table. This will update 168M rows in 48 minutes.

Listen to me next time. And save yourself some time.

Sincerely,
Your Instincts

Tags: SQL Server, SQL Tricks

SQL Server, SQL Tricks No comments »

Jun

Favorite New SQL 2005 Syntax

Shane on June 28th, 2007

When I was working with Oracle, I got hooked on using MINUS and INTERSECT. Well, it turns out that Microsoft implemented these functions in a slightly different way in SQL 2005.

Jeff Smith lays out a nice example here.

Tags: Other, SQL Server

Other, SQL Server No comments »

Apr

See Everything, Everywhere

Shane on April 19th, 2007

My friend Gary is fulfilling his life-long dream to take a trip around the world. This being the internet age, we get to follow along from the comfort of our living rooms without having to get the shots. I encourage you to check it out. Gary is one of the funniest and smartest people I know, so it will be entertaining, if nothing else.

He posted his first real update today from the big Island of Hawaii.

Tags: Life, Other

Other 1 Comment »

Apr

BI Humor

Shane on April 2nd, 2007

Here’s a little BI humor for anyone from one of my favorite BI writers: Donald Farmer.

Tags: Data Mining, Other

Other No comments »

Feb

Checksum Code Generator

Shane on February 12th, 2007

Also on the subject of checksum testing, this query will generate a checksum query for a given table. Use this query to generate the checksum for each table, then customize the sum() clauses if you did any transformations on your data.

select case a.xtype when 167 then 'SUM(LEN('+a.name+')) as sum_len_'+a.name+',' when 175 then 'SUM(LEN('+a.name+')) as sum_len_'+a.name+',' when 231 then 'SUM(LEN('+a.name+')) as sum_len_'+a.name+',' when 106 then 'SUM('+a.name+') as sum_'+a.name+',' when 56 then 'SUM('+a.name+') as sum_'+a.name+',' when 104 then 'SUM(CONVERT(int,'+a.name+')) as sum_'+a.name+',' when 61 then 'SUM(MONTH('+a.name+')+DAY('+a.name+')+YEAR('+a.name+')) as sum_'+a.name+', SUM(LEN(MONTH('+a.name+'))+LEN(DAY('+a.name+'))+LEN(YEAR('+a.name+'))) as sum_len_'+a.name+',' -- datetime else a.name+' '+convert(varchar,a.xtype) end col_type from syscolumns a, sysobjects b where a.id=b.id and b.xtype='U' and upper(b.name) = 'MYTABLE' order by b.name, a.name, a.colid

Tags: Other, SQL Server, SQL Tricks

Other No comments »

Jan

Extracting Metadata From SQL Server

Shane on January 31st, 2007

select upper(b.name) as table_name, upper(a.name) as column_name, a.colid, a.length, a.xprec, a.xscale, a.xtype,(CASE WHEN a.xtype=167 then 'varchar' WHEN a.xtype=175 then 'char' WHEN a.xtype=61 then 'datetime' WHEN a.xtype=106 then 'decimal' END) as col_type from syscolumns a, sysobjects b where a.id=b.id and b.xtype='U' and upper(b.name) <> 'DTPROPERTIES' order by b.name, a.name, a.colid