When reading through The Book regarding how to materialise the information mart layer, it looks all very neat and tidy. For example, for a dimension source the data from a source hub/satellite (keeping the example simple here: No PITs). The hashkey maps to one source hashkey in the vault and so you can simply rebuild a Type 1 dimension. It's more in depth for a Type 2, but the approach seems to be that it is driven by the whole satellite record.
Our reality is that for the most part a target dimension is not sourced by all of the attributes of the source satellite. Often, just a subset. For our type 2 dimensions, what we then see is essentially a load of 'false' type 2 records getting created - seemingly with the same values as the previous version. This is because the entire source satellite record is being used to determine if a change has occurred, which sees a change regularly, whereas the attributes subset the dimension is based on changed infrequently.
We have some satellites that are pretty wide - the business could not determine if there was any real rate of change difference between the attributes - and as the record is sourced from the same source system extract - there is no easy way to split. As such, our target dimension which contains only a subset of the source satellite record attributes (say around 10% in some cases) gets a new record when the remaining 90% of the satellite records get modified. There is an argument that the attributes not needed immediately in the mart should not be loaded into the satellite - but the business are keen to get all data in - and already are building new reporting using cases to bring more attributes into the mart from the source satellite versus what was initially defined.
I am wondering if there is a standard pattern to deal with this? I don't want the tail (the information mart) to wag the dog (the DV layer) by splitting the satellite up into arbitrary smaller satellites based on some downstream dimension requirement.
To prevent lots of redundant data being created for type 2 dimensions, my thought process was to add a custom hashdiff to the target type 2 dimension, which is a hash of just the attributes it needs. During the load you can compare this pre-calculated hashdiff with the on-the-fly calculated hash of the subset of source satellite attributes being extracted for load into the dimension - better that than compare each attribute individually (the very reason why a hashdiff for a satellite table exists) or just rely on some load date attribute which is acting like a sieve and letting too many records be loaded again.
The question is whether it is worth materialising the computed hash of the subset of attributes required for the satellite when the DV layer is loaded to improve performance of the Type 2 dim load process. If so, I would presume the place to do this is in a helper computed satellite table in the business vault - consisting of the hub hashkey, the load date and the computed hashdiff to assist in the dimension load.
Or is there a better way?
Perhaps, materialising the hashdiff of the attributes required for the target dim in the BV is a step too far and won't offer much of a performance benefit.
The hashdiff approach seems neat and tidy - perhaps this could be part of the guidance for materialising the information mart layer?
[yes, I know there are more complex use cases such as dealing with flattened dimension hierarchies which source data from several hubs and satellites - which makes the design and load pattern of the dimension more tricky - but that is a topic for another day...]