We have built a Data Vault on Snowflake which is yet to go into production. We were not getting the desired performance on the raw data vault load or the IM build to achieve our near-real-time goal while also trying to keep compute cost low. So we worked with a Snowflake expert to figure out why.
The results of the testing confirmed a few things: joins on the hash key almost always trigger a full partition scan, whereas joins on the business keys do not. Snowflake's natural clustering works better with business keys than with hash keys (SHA-1). MD5 was slightly better than SHA-1, but the business keys still perform far better.
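To make what we observed concrete, here is a toy Python sketch (key names, row counts, and partition sizes are all made up) of why a hash key defeats min/max partition pruning while a naturally ordered business key does not:

```python
import hashlib

# Toy illustration: business keys arrive in roughly sorted insertion
# order, so load-order "micro-partitions" get narrow, disjoint min/max
# ranges. Hashing the same keys scatters values across the key space,
# so every partition's range spans nearly everything.
ROWS_PER_PARTITION = 100

business_keys = [f"CUST{i:06d}" for i in range(1000)]
hash_keys = [hashlib.sha1(bk.encode()).hexdigest() for bk in business_keys]

def partition_ranges(keys, size=ROWS_PER_PARTITION):
    """Split keys into load-order chunks and keep each chunk's min/max --
    the kind of metadata Snowflake prunes against."""
    return [(min(keys[i:i + size]), max(keys[i:i + size]))
            for i in range(0, len(keys), size)]

def partitions_scanned(ranges, probe):
    """Count partitions whose [min, max] range could contain the probe."""
    return sum(1 for lo, hi in ranges if lo <= probe <= hi)

bk_scanned = partitions_scanned(partition_ranges(business_keys), "CUST000500")
hk_probe = hashlib.sha1(b"CUST000500").hexdigest()
hk_scanned = partitions_scanned(partition_ranges(hash_keys), hk_probe)
print(bk_scanned, hk_scanned)  # business-key lookup prunes to 1 partition;
                               # hash-key lookup touches nearly all of them
```

This is only a model of the pruning metadata, not Snowflake itself, but it matches the behaviour we saw in the query profiles.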
To achieve the desired performance and keep compute cost down, we would have to carry business keys in all raw data vault objects - hubs/links/sats. Are we breaking any rule by doing so? Please advise.
Also, since we are trying to achieve near-real-time loads in and out of the vault, we are now considering inserting any changed data that comes through from the source without checking whether it already exists. Our change data capture from the source is pretty solid. We do anticipate duplicates in links, and rarely dups in hubs and sats too. But in that scenario we have our effectivity sats to get the most recent row from the link (the same applies to sats), so our thinking is that it really doesn't matter if there are duplicate rows.
Our links are mostly driving-key links, so I don't see any issue with duplicate rows in a link, as we can still retrieve the most recent row from the effectivity sat using the driving key. What do you think about this approach?
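In case it helps to picture the insert-only idea, here is a tiny Python sketch (column names and values are hypothetical) of how we would resolve the current row per driving key on the way out, mimicking a QUALIFY ROW_NUMBER() OVER (PARTITION BY driving_key ORDER BY load_ts DESC) = 1 over the link plus effectivity sat:

```python
from datetime import datetime

# Hypothetical link rows loaded insert-only: duplicates are tolerated,
# and the current relationship per driving key is resolved at read time.
link_rows = [
    # (driving_key, secondary_key, load_ts)
    ("CUST001", "ADDR01", datetime(2024, 1, 1)),
    ("CUST001", "ADDR01", datetime(2024, 1, 1)),   # duplicate insert
    ("CUST001", "ADDR02", datetime(2024, 2, 1)),   # relationship change
    ("CUST002", "ADDR09", datetime(2024, 1, 15)),
]

def current_by_driving_key(rows):
    """Keep only the latest row per driving key: duplicates collapse and
    superseded relationships drop out, like the window-function filter."""
    latest = {}
    for dk, sk, ts in rows:
        if dk not in latest or ts > latest[dk][2]:
            latest[dk] = (dk, sk, ts)
    return sorted(latest.values())

print(current_by_driving_key(link_rows))
# CUST001 resolves to ADDR02 (latest), CUST002 to ADDR09
```

The point of the sketch is that the duplicate and superseded rows never surface downstream, which is why we think the duplicates are harmless in our setup.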
Scanning the entire table every time we insert a row into the data vault is not a scalable architecture, right? Performance will degrade as the raw data vault gets bigger and bigger, and achieving near real time with it is nearly impossible. Please let me know what you have done in your environment to achieve near real time, and what you think about what we are trying to do.