SQL Server


Wondering which SQL Job runs which subscription? Here’s how to find out.


select a.name, d.name, d.path, c.description, c.laststatus, c.lastruntime
from msdb.dbo.sysjobs a
join reportserver..reportschedule b on a.name=cast(b.scheduleid as nvarchar(255))
join reportserver..subscriptions c on b.subscriptionid=c.subscriptionid
join reportserver..catalog d on c.report_oid=d.itemid
order by c.lastruntime desc

Now that our datawarehouse database is over 250GB, using the built-in SQL Server tools are not an option. I have experience with Veritas BackupExec and NetBackup, and here is my typical experience: backups are scheduled overnight and usually take the entire night. Restores are no faster.

Here’s a product that’s worth looking in to: Idera SQLSafe. This products creates compressed backup files on the local drives. They advertise up to 95% compression and they’re not joking. We were able to achieve 80% compression and the backup took a third the time. Our 270GB datawarehouse backs up to a 55GB file in 1:20.

We have yet to perform a mission-critical restore, but if they are as fast as the backups, then we’re talking about a pretty small outage for having to restore the entire database from scratch.

And the best part is that, at < $1K per instance list price, it's much cheaper than the so-called "enterprise" solutions.

[Update]: If the price is still too steep, there is a pretty well-enabled freeware option. I need to do some checking, but based on the feature list, I wonder if we aren’t using the freeware.

My latest debacle loading this massive table has to do with removing duplicates before I try adding a primary key to the table. I had to restart the load process a couple times before I finally settled into a method that worked well. (Running multilpe instances of DTExec on different subsets of the source data, all bulk-loading the same heap.)

I figured that rather than waiting to clean out dups when I created them, that I could simply write some scripts to clean them out afterwards? Right?

That works fine if you’re loading a few million rows, but when you’re talking HUNDREDS OF MILLIONS, you start to hit some walls. My usual method of removing duplicates is to (1) stage a set of distinct rows into a temp table, (2) delete rows from the original table using the temp table for keys, and (3) re-inserting rows from the temp table. Well, I have 50M unique rows with duplicates, and the SELECT DISTINCT trys to sort this set in tempdb, which quickly uses up all my disk space.

I need a way to do in-place deletes. In ORACLE, you can use ROWID to self-join the table and delete the unwanted rows. In SQL Server, I can use a CTE with row_number() but I’m really not sure if this will hit the same wall with tempdb.

The solution I found is this: DELETE TOP(n) FROM tablename

I set-up a cursor to loop through the table and DELETE TOP(n-1) rows for all cases with duplicates. It doesn’t take a lot of tempdb space. We’re in simple recovery, so it won’t generate a lot of logging. I think this is the best solution.

Here are the scripts:

  1. We need some sample data.


    select *
    into #dups
    from (
    select 1 col, 'blah' dat
    union all
    select 1 col, 'blah2' dat
    union all
    select 1 col, 'blah3' dat
    union all
    select 2 col, 'blah' dat
    union all
    select 2 col, 'blah2' dat
    union all
    select 2 col, 'blah3' dat
    union all
    select 2 col, 'blah4' dat
    union all
    select 3 col, 'blah' dat
    union all
    select 3 col, 'blah2' dat
    union all
    select 4 col, 'blah' dat
    ) a

  2. Here is the meat. We create a cursor with the unique values and counts, and use DELETE TOP (n-1) to run our row-by-row DELETE.


    SET NOCOUNT ON

    DECLARE @col int, @count int

    DECLARE cur CURSOR FORWARD_ONLY FAST_FORWARD READ_ONLY FOR
    SELECT col, count(*) row_count
    FROM #dups
    GROUP BY col
    HAVING count(*) > 1

    OPEN cur
    FETCH NEXT FROM cur INTO @col, @count

    WHILE @@FETCH_STATUS=0
    BEGIN
    DELETE TOP (@count-1) FROM #dups
    WHERE col = @col
    FETCH NEXT FROM cur INTO @col, @count
    END

    CLOSE cur
    DEALLOCATE cur

  3. And finally, our output data.
    1	blah3
    2	blah4
    3	blah2
    4	blah
    

    In my case, all the values for the non-key columns are the same, but I made them different in this case to show how DELETE TOP works.

The thing that bothers me about this solution is that it doesn’t batch up the deletes. I don’t want to run one delete because I’ll run out of log space, but it would be nice to batch it up somehow. Any ideas?

Update: I added “SET NOCOUNT ON” to avoid spooling the delete log to the client. Also, I found that for my massive update, it was quite helpful to run the process in parallel.

Dear Shane,

Next time trust me. SELECT INTO is always faster than UPDATE when you’re talking about updating the entire table. Even if the table has three NVARCHAR(MAX) fields and you’re only updating one 4 byte INT column.

Rather than run an update for 1hr 6min before you decide to cancel said query and trigger a 2hr 51min rollback of the transaction. Rather than writing a cursor to perform an in-place update of said heap using the only index you have on it. Just SELECT INTO a new table. This will update 168M rows in 48 minutes.

Listen to me next time. And save yourself some time.

Sincerely,
Your Instincts

When I was working with Oracle, I got hooked on using MINUS and INTERSECT. Well, it turns out that Microsoft implemented these functions in a slightly different way in SQL 2005.

Jeff Smith lays out a nice example here.

The holy grail of a former .NET web developer is how to combine the techniques of agile and test-driven development with ETL tools and data volumes. Mark Rittman shows that he is a brother-in-arms with my quest.

This excellent piece in DMReview deliberates on some criteria for selecting (or justifying) an ETL tool. We are struggling with this right now. Our shop uses a combination of Informatica and SQL Server DTS for our ETL jobs. With Oracle Warehouse Builder 10gR2 and SQL Server Integration Services both released as viable alternatives, we need to justify our continued six-figure annual investment in Informatica. Is it worth it?

Next Page »