Just thought I'd share this tid-bit of knowledge with everyone. There are no shortages of smart people here, and certainly many different ways to write a SQL query against "current records", or against hubs+Satellites, Links+Satellites, etc.. So here are a few tips for performance tuning & testing your SQL along the way:
- Know that your SQL will behave differently on different platforms, AND on different versions of those platforms. Optimizers can and do change even across versions.
- Know that if your statistics are out of date (or never been calculated), your optimizers, your plans, your queries can appear slow.
- That if you update or create statistics in Development, that those stats, may never have been calculated or updated in production or even test/QA environments
- That you should concern yourself with Logical Reads NEVER Physical reads. Why? Because logical reads tell you: IF the DBMS had to go to disk, how many times it would have gone to disk. If you can "reduce" the logical read count, you can almost always improve performance.
- Scan counts are important as well
- Elapsed Time (NOT CPU time) is important. CPU time will measure how long the CPU was active executing tasks, which is nice, but it adds those timings together for multi-core, which can give you "longer than elapsed" execution times, leaving you to scratch your head (its because the CPU executes in parallel, but the "time measurement" is additive as if the CPU executed in Sequential order). Elapsed time is more critical
- Remember: to "truly" test what's in production for performance, the rule of thumb is: 25% of the size of the production system, BUT to replicate EXACTLY all the partitioning, parallel settings, indexing, and disk layout. This rarely (if ever) happens properly in environments I've seen. 50% of the size of production should be volume tested in QA / Test environments (for accurate extrapolations), and if the "curve of performance" doesn't fit - from Development to Production sizing, then either your math is wrong, your environment doesn't match, or your setup is incorrect.
When testing in Development SQL against A DV for performance remember the following:
- Please test with a minimum of 5 million records in EACH Satellite, and a minimum of 3 million hub keys. I personally would recommend 100 million in each Satellite, followed by 25 million hub keys (generated data sets are OK for performance testing). I would suggest 75 million minimum in each link you wish to join to.
- Make sure there are at least 2 deltas / changed records assigned to 80% of hub keys.
- Run your query at least 7 times consecutively, and measure the average over 7 runs.
- NEVER EVER simply run a "select count(*)" for performance tuning, this is possibly the worst you can do, because the data blocks might never be hit. Instead, run a full select of all the columns you hope to access (a Select *) across all tables will give you "worst case scenario" - which can only lead to better performance
- Don't assume that tuning a single join to a single Satellite will give you the performance you are expecting once you combine that query with others against other Satellites.
- Try to write a single view against all raw Data Vault objects you want to access. "Stacking" views in Oracle and SQLServer will have unintended and sometimes undesired negative performance impacts when run against volume.
- TRY to learn & use WITH statements (CTE's / OLAP functions - instead of nested sub queries), because most optimizers will write better plans this way. Again, this depends on the platform you implement on.
- NEVER EVER EVER CLUSTER YOUR HASH KEYS!!! This will lead to many many problems with load performance, as well as performance lock ups to real-time streaming feeds. Clustering (which is ON BY DEFAULT on SQLServer - and needs to be explicitly SHUT OFF), will force every single inserted row to "move physical data from one disk block to another" as data is inserted (bulk or not).
- SET YOUR LOGGING TO BULK LOGGING for EVERY database (stage, raw DV, Business DV, Mart layers) (for SQLServer) to avoid index locking, and unnecessary I/O (logging I/O) during mass insert cycles. In Oracle set your logging to minimal as possible.
- DO NOT under any circumstances SWITCH the order of your Satellite Primary Keys - you may think that putting the load-date first would help performance. It doesn't., in fact its the opposite. It kills the hash merge joins that COULD happen and replaces them with Nested Loops.
- Work your SQL Plans from the driving table (smallest table) to the Largest table, by adding joins to your query as you build it. WATCH your Cartesian products, try to reduce the data sets appropriately before joining additional data.
- YOU MAY: need to add extra Unique NON CLUSTERED indexes to your Satellites to include any attributes commonly referenced in queries (that build PITS / Bridges) downstream.
- YOU MAY: need to partition your Satellites by time (most common).
There are about 100+ more tips, this is just the tip of the iceberg... You could actually pay attention to all of these tips, OR you could simply move to SnowflakeDB and forget about all of this (because you don't have to worry about it).