Oracle


I didn’t get a chance to catch Michael Ault’s presentation at the recent TCOUG, but I heard it was a good talk. Thank you Michael for putting the slides online.

The holy grail of a former .NET web developer is how to combine the techniques of agile and test-driven development with ETL tools and data volumes. Mark Rittman shows that he is a brother-in-arms with my quest.

This excellent piece in DMReview deliberates on some criteria for selecting (or justifying) an ETL tool. We are struggling with this right now. Our shop uses a combination of Informatica and SQL Server DTS for our ETL jobs. With Oracle Warehouse Builder 10gR2 and SQL Server Integration Services both released as viable alternatives, we need to justify our continued six-figure annual investment in Informatica. Is it worth it?

I went to a seminar on Oracle partitioning and data warehousing earlier this week call “Scaling to Infinity” by Tim Gorman. It gave me some new ideas, and reinforced some others, for where we should take our architecture at the U.

Here are some notes that I will hopefully talk more about as I research them.

  1. Functions and query optimization
    One slide he had was a demo where
    TRUNC(date_posted) = '10/12/06'
    would force a table scan where
    date_posted >= '10/12/06'
    AND date_posted < '10/13/05'

    would use an index or use partition-pruning to execute. It seems to me that if you have good statistics, then the CBO would be smart enough to partition-prune even with the first method. The instructor recommended using a function-based index. There must be an easier way.
  2. Why do we get “corrupt index” errors when people are running queries during partition exchanges? According to the presenter, partition exchanges do not take effect until all the index rebuilds, etc. have completed. PE commits like a regular transaction. I’ll need to replicate this error to figure out what’s going on.
  3. “Range of ROWIDs stored in a direct-path log made visible by view ALL_SUMDELTA” Not very useful, but interesting nonetheless.
  4. DBMS_MVIEW.REFRESH Fast Refresh does not work if your MVIEW has unions, subqueries, etc.
  5. You can use SQL*Loader to generate external table syntax? I’ll need to check this out. Generating external table syntax, especially for wide flat files, is a major PITA.
  6. DBMS_XPLAN is a way to look at explain plans.
  7. Table compression. Test table compression (and exchange partitioning) for a large transactions table. Possibly run the loads in parallel in development with the regular production table for a while to see if there are any gotchas. Test ETL performance, maintenance “costs”, space usage, and query performance.
  8. DBMS_REDEFINITION is a package for doing online table reorgs. It would have been nice to have this during our tablespace move.
  9. Does Oracle Warehouse Builder (OWB) do exchange partitioning without a lot of manual coding?
  10. SYS.COL_USAGE$ keeps usage stats on columns. It tells you how often columns are used in WHERE clauses and how they are used? Equi-joins, range predicates, etc.
  11. Download exchpart.sql to automate partition exchange for ETL.
  12. Bitmap join indexes. Very cool technology. You basically create a bitmap index of your dimensions. Then if you are quering multiple dimensions versus the fact table, it can just join the BJI to the bitmap indexes on the fact.
  13. Transportable tablespaces for DWPROD. Could we create PCOM3 and then attach that tablespace to DWPROD to simply our ETL and eliminate those pesky database links.
  14. ALTER SESSION SET CURRENT_SCHEMA
  15. PARALLEL good for lots of processing that return small result sets or CREATE TABLE AS SELECT. If you return large result set, your bottleneck will be the query coordinator.
  16. Loading exchange partitions when your partition key is not the load date. Tim had a neat technique for dealing with the dribble-over between partitions.
  17. Data Vault. Looks promising for creating point-in-time representations of your data.

I got a question about PowerExchange throughput the other day. When we were first evaluating PowerExchange as a solution, this was a primary concern of ours. We devoted hundreds of hours designing, implementing, and refining a good test of our throughput. In the end, we over-tested. The throughputs that we dealt with in the test were so much more than our production systems handled, even during peak load, that the actual system handles its load with great ease. Of course, sometimes that’s the point of load testing: to determine theoretical and not actual limits.

Our physical environment
We have a source (PeopleSoft) database running on Oracle 9i/Solaris. A target (ODS) database running on Oracle 9i/Solaris. And we have PowerExchange running on the same Windows 2003 box as PowerCenter. In mid-July, we’ll be upgrading both source and target databases to 10g.

Our data environment
Our current change data capture mappings captures 10 source tables and writes to 30 targets. 3 for each source table. 6 of these tables are considered “wide” tables. These are also the tables with the most DML activity in our PeopleSoft system. One has 23 columns totalling 2.2 kb/row. Another has 148 columns totaling 1.9 kb/row. You get the picture. Lots of data per row. We are capturing every byte in our ODS, plus some extra LogMiner columns like cap_tm.

We don’t have a high, sustained throughput in our system. Our high throughput periods are when batch cycles complete. This is usually nights/weekends when source/ODS latency is not as much of a concern.

Our PowerExchange settings
After discussing our environment with Informatica, we’ve come up with a couple tuning parameters for our system. These are tuned for lower latency, not higher throughput.

  1. All of our sources/targets in the same mapping.
    The fewer open sessions in LogMiner, the faster your reads are going to be. This one was non-negotiable since our primary directive (and reason for using PowerExchange instead of using Oracle triggers) was not to impact the source database.
  2. UOW COUNT
    This parameter in the source connection is set to 1. When PowerExchange reads one changed row from the source, it sends it directly to PowerCenter. While I don’t have any specific measurements, I seem to remember dramatically higher throughput processing batch commits when we had this set higher.
  3. Real-Time Flush Latency
    This is actually set to 10000ms, which I seem to remember was the default. No significance here, other then that when we did load testing, we used the flush latency and UOW count to balance high throughput with tolerable latency. If you have a generous service level agreement, like “Capture source data within five minutes of the change.” You can up your real-time flush latency and your UOW Count and have PowerCenter process mini-batches of rows. This would really improve throughput.

So, how much throughput?
This varies by table, of course, because the wider tables have lower throughput than the narrower ones. We have bursts of throughput of up to 4700 rows per second. But that number doesn’t really tell you much. 4700 rows of a 23 byte table isn’t much data. We’re not sure if that number is actually updates of our 23 byte table. Actually, I should look into that.

The clearer measure of throughput is data volume or bytes/sec. Not rows/sec.

The most frequent batch job we see happens to update our widest source table. We regularly see throughputs of up to 1200 rows per second in the mornings when this job commits. That’s 2.6 MB/s. Not bad. Our “standard” throughput after batch windows is between 500-1000 rows per second. Which is still between 1-2MB/s.

In summary, our biggest concern in using PowerExchange turned out to work very well. We were able to get by tuning very few variables (just the source connection in the Workflow Manager) without having to look into source/target tuning.

I invite your comments via email scolin@pobox.com or here.

I scan a handful of financial blogs pretty regularly. I also scan a handful of technical blogs relating to SQL Server and Oracle. It isn’t very often that a blog from one of these worlds links to a blog in the other. But here’s JLP from AllFinancialMatters linking to Don Burleson of oracle-dba.com fame. I guess we’re all just geeks.

Since the PeopleSoft table naming convention is to use as many underscores as possible, I’ve run into many snafus since the underscore is a special character in SQL. Clark (via Linda) showed us some ways around this.


select count(*)
from dba_tables
where table_name like '%\_UM' escape '\'
COUNT(*)
----------
113
1 row selected

This query, below, would pick up table names like PS_EXT_ACAD_SUM.

select count(*)
from dba_tables
where table_name like '%_UM'
COUNT(*)
----------
132
1 row selected

Next Page »