I have a question regarding the implementation of HASH_DIFF attributes to handle CDC (as mentioned by @kentgraziano in this forum thread https://datavaultalliance.com/discussions/members-tech-discussions/do-i-need-hash-keys/#post-177 ). How should we handle NULL values and the addition of other attributes. To illustrate my point I will use the example of a customer contact satellite with the attributes work_tel_number and home_tel_number.
If the record has the following values:
- work_tel_number: NULL
- home_tel_number: '555-7896'
and this then changes to:
- work_tel_number: '555-7896'
- home_tel_number: NULL
If we concatenate the two attributes together and then use MD5_BINARY to calculate the HASH value (when implementing on Snowflake), using an empty string for the NULL value will produce the same concatenated string, i.e. '555-7896'. To ensure that we can recognise the change in the record, we could use the word 'NULL' to represent the null value. The concatenated string would then be different for each of the combinations; i.e. 'NULL555-7896' versus '555-7896NULL'.
Is this the correct approach to take? A problem arises if we add another attribute to the satellite such as mobile_number, the HASH_DIFF will be different for all records after the change, even if the mobile_number is null. We will then be adding a new record where the HASH_DIFF is different but the field values are the same. Should we rather update the HASH_DIFF for the satellite records if a new field is added?
Any advice would be greatly appreciated.