Informatica


The holy grail of a former .NET web developer is how to combine the techniques of agile and test-driven development with ETL tools and data volumes. Mark Rittman shows that he is a brother-in-arms with my quest.

This excellent piece in DMReview deliberates on some criteria for selecting (or justifying) an ETL tool. We are struggling with this right now. Our shop uses a combination of Informatica and SQL Server DTS for our ETL jobs. With Oracle Warehouse Builder 10gR2 and SQL Server Integration Services both released as viable alternatives, we need to justify our continued six-figure annual investment in Informatica. Is it worth it?

I got a question about PowerExchange throughput the other day. When we were first evaluating PowerExchange as a solution, this was a primary concern of ours. We devoted hundreds of hours designing, implementing, and refining a good test of our throughput. In the end, we over-tested. The throughputs that we dealt with in the test were so much more than our production systems handled, even during peak load, that the actual system handles its load with great ease. Of course, sometimes that’s the point of load testing: to determine theoretical and not actual limits.

Our physical environment
We have a source (PeopleSoft) database running on Oracle 9i/Solaris. A target (ODS) database running on Oracle 9i/Solaris. And we have PowerExchange running on the same Windows 2003 box as PowerCenter. In mid-July, we’ll be upgrading both source and target databases to 10g.

Our data environment
Our current change data capture mappings captures 10 source tables and writes to 30 targets. 3 for each source table. 6 of these tables are considered “wide” tables. These are also the tables with the most DML activity in our PeopleSoft system. One has 23 columns totalling 2.2 kb/row. Another has 148 columns totaling 1.9 kb/row. You get the picture. Lots of data per row. We are capturing every byte in our ODS, plus some extra LogMiner columns like cap_tm.

We don’t have a high, sustained throughput in our system. Our high throughput periods are when batch cycles complete. This is usually nights/weekends when source/ODS latency is not as much of a concern.

Our PowerExchange settings
After discussing our environment with Informatica, we’ve come up with a couple tuning parameters for our system. These are tuned for lower latency, not higher throughput.

  1. All of our sources/targets in the same mapping.
    The fewer open sessions in LogMiner, the faster your reads are going to be. This one was non-negotiable since our primary directive (and reason for using PowerExchange instead of using Oracle triggers) was not to impact the source database.
  2. UOW COUNT
    This parameter in the source connection is set to 1. When PowerExchange reads one changed row from the source, it sends it directly to PowerCenter. While I don’t have any specific measurements, I seem to remember dramatically higher throughput processing batch commits when we had this set higher.
  3. Real-Time Flush Latency
    This is actually set to 10000ms, which I seem to remember was the default. No significance here, other then that when we did load testing, we used the flush latency and UOW count to balance high throughput with tolerable latency. If you have a generous service level agreement, like “Capture source data within five minutes of the change.” You can up your real-time flush latency and your UOW Count and have PowerCenter process mini-batches of rows. This would really improve throughput.

So, how much throughput?
This varies by table, of course, because the wider tables have lower throughput than the narrower ones. We have bursts of throughput of up to 4700 rows per second. But that number doesn’t really tell you much. 4700 rows of a 23 byte table isn’t much data. We’re not sure if that number is actually updates of our 23 byte table. Actually, I should look into that.

The clearer measure of throughput is data volume or bytes/sec. Not rows/sec.

The most frequent batch job we see happens to update our widest source table. We regularly see throughputs of up to 1200 rows per second in the mornings when this job commits. That’s 2.6 MB/s. Not bad. Our “standard” throughput after batch windows is between 500-1000 rows per second. Which is still between 1-2MB/s.

In summary, our biggest concern in using PowerExchange turned out to work very well. We were able to get by tuning very few variables (just the source connection in the Workflow Manager) without having to look into source/target tuning.

I invite your comments via email scolin@pobox.com or here.

Actually, I’m not at the office, but I am doing what I would consider “work” today. I’m at the TDWI World Conference in Orlando.

Today I attended Stephen Brobst’s class on “Designing a High Performance Data Warehouse”. While the course came highly recommended, I was somewhat skeptical going in that, being software agnostic, it would be too theoretical to really put to use. But I was astounded by how much content was in this course.

We reviewed:

  1. Join Strategies
  2. Indexing Strategies
  3. Database Parallelism
  4. Partitioning Strategies
  5. OLAP & Aggregation

There was a whole lot of stuff in between. I’m going to need to bone up on employing partitions and parallelism both in the databases (Oracle and SQL Server) and in our ETL tools (Informatica and eventually SSIS). There is a lot that can be done architecturally that could help our processes scale. He had a particularly good illustration Amdahl’s Law, demonstrating the benefits of parallelism, that I will need to draw out and put up here. The point being that many processes need to be run sequentially, but you get a huge boost out of finding ones that can run in parallel and doing so.

I will hopefully have more details later, but I need to run to a session that Informatica is putting on. I will probably be coding tonight. Put some of this creative energy to work.