SQL Tricks


My latest debacle loading this massive table has to do with removing duplicates before I try adding a primary key to the table. I had to restart the load process a couple times before I finally settled into a method that worked well. (Running multilpe instances of DTExec on different subsets of the source data, all bulk-loading the same heap.)

I figured that rather than waiting to clean out dups when I created them, that I could simply write some scripts to clean them out afterwards? Right?

That works fine if you’re loading a few million rows, but when you’re talking HUNDREDS OF MILLIONS, you start to hit some walls. My usual method of removing duplicates is to (1) stage a set of distinct rows into a temp table, (2) delete rows from the original table using the temp table for keys, and (3) re-inserting rows from the temp table. Well, I have 50M unique rows with duplicates, and the SELECT DISTINCT trys to sort this set in tempdb, which quickly uses up all my disk space.

I need a way to do in-place deletes. In ORACLE, you can use ROWID to self-join the table and delete the unwanted rows. In SQL Server, I can use a CTE with row_number() but I’m really not sure if this will hit the same wall with tempdb.

The solution I found is this: DELETE TOP(n) FROM tablename

I set-up a cursor to loop through the table and DELETE TOP(n-1) rows for all cases with duplicates. It doesn’t take a lot of tempdb space. We’re in simple recovery, so it won’t generate a lot of logging. I think this is the best solution.

Here are the scripts:

  1. We need some sample data.


    select *
    into #dups
    from (
    select 1 col, 'blah' dat
    union all
    select 1 col, 'blah2' dat
    union all
    select 1 col, 'blah3' dat
    union all
    select 2 col, 'blah' dat
    union all
    select 2 col, 'blah2' dat
    union all
    select 2 col, 'blah3' dat
    union all
    select 2 col, 'blah4' dat
    union all
    select 3 col, 'blah' dat
    union all
    select 3 col, 'blah2' dat
    union all
    select 4 col, 'blah' dat
    ) a

  2. Here is the meat. We create a cursor with the unique values and counts, and use DELETE TOP (n-1) to run our row-by-row DELETE.


    SET NOCOUNT ON

    DECLARE @col int, @count int

    DECLARE cur CURSOR FORWARD_ONLY FAST_FORWARD READ_ONLY FOR
    SELECT col, count(*) row_count
    FROM #dups
    GROUP BY col
    HAVING count(*) > 1

    OPEN cur
    FETCH NEXT FROM cur INTO @col, @count

    WHILE @@FETCH_STATUS=0
    BEGIN
    DELETE TOP (@count-1) FROM #dups
    WHERE col = @col
    FETCH NEXT FROM cur INTO @col, @count
    END

    CLOSE cur
    DEALLOCATE cur

  3. And finally, our output data.
    1	blah3
    2	blah4
    3	blah2
    4	blah
    

    In my case, all the values for the non-key columns are the same, but I made them different in this case to show how DELETE TOP works.

The thing that bothers me about this solution is that it doesn’t batch up the deletes. I don’t want to run one delete because I’ll run out of log space, but it would be nice to batch it up somehow. Any ideas?

Update: I added “SET NOCOUNT ON” to avoid spooling the delete log to the client. Also, I found that for my massive update, it was quite helpful to run the process in parallel.

Dear Shane,

Next time trust me. SELECT INTO is always faster than UPDATE when you’re talking about updating the entire table. Even if the table has three NVARCHAR(MAX) fields and you’re only updating one 4 byte INT column.

Rather than run an update for 1hr 6min before you decide to cancel said query and trigger a 2hr 51min rollback of the transaction. Rather than writing a cursor to perform an in-place update of said heap using the only index you have on it. Just SELECT INTO a new table. This will update 168M rows in 48 minutes.

Listen to me next time. And save yourself some time.

Sincerely,
Your Instincts

So I need to import data from Excel into SQL Server. I whip up a new table, knock out a quick package, run the package, and look at the data. Looks fine, okay. Email the customer so they can look at the table. Two days later I get the email from the customer saying, “How come these rows have null values in the somekeyidentifier column? The data is in the Excel source.”

And thats when I realize that #1, I need to enforce some constraints when I’m whipping up new tables. And #2, Excel is not very good at guessing what datatypes I have. You see, an intelligent data profiling tool samples data when determining datatypes. Excel just looks at the top X rows and just ignores every row that doesn’t match that datatype.

The ever helpful Allan Mitchell at SQLDTS.com has some obscure tips for dealing with this.

I always seem to get this wrong, so please excuse this note to myself.

SELECT gpa,
DENSE_RANK() OVER (PARTITION BY strm, class_nbr ORDER BY gpa DESC) class_rank

Since the PeopleSoft table naming convention is to use as many underscores as possible, I’ve run into many snafus since the underscore is a special character in SQL. Clark (via Linda) showed us some ways around this.


select count(*)
from dba_tables
where table_name like '%\_UM' escape '\'
COUNT(*)
----------
113
1 row selected

This query, below, would pick up table names like PS_EXT_ACAD_SUM.

select count(*)
from dba_tables
where table_name like '%_UM'
COUNT(*)
----------
132
1 row selected

Some food for thought when working with NULL values.


select count(*)
from ps_dwsa_prog_dtl
COUNT(*)
----------
657662
1 row selected

select count(*)
from ps_dwsa_prog_dtl
where acad_sub_plan ='13'
COUNT(*)
----------
13
1 row selected

select count(*)
from ps_dwsa_prog_dtl
where acad_sub_plan <>‘13′
COUNT(*)
———-
148559
1 row selected

select count(*)
from ps_dwsa_prog_dtl
where acad_sub_plan is null
COUNT(*)
———-
509090
1 row selected

If you work with SQL — especially with Oracle — and you don’t read Ask Tom, you are missing a goldmine of tips and tuning advice. In the words of my former colleague Andy, “That Tom guy’s the shit.”

The best part of it is that Oracle pays him just to put his tips on the web, so we developers benefit. No login or payola required. Way to put developers first Oracle! Maybe Informatica could learn a few lessons from this.

Sometimes the best performance tips are those you don’t even think about. When to use UNION or UNION ALL. Most people will just use UNION unless they really need all the rows returned. You need to change this thinking. Use UNION ALL unless you know you need the duplicates filtered. Tom lays it out.