Hash Collisions  


Posts: 141
(@patrickcuba)
Estimable Member
Joined: 6 months ago

A requirement for loading Data Vault 2.0 is that you have a hash collision strategy. Hash Keys and HashDiffs use hashing but the more likely of the two to have a collision is the hash key because there will be a lot more variation to the business keys than the contextual information about the business entity or relationship. 

A BKey collision could occur:

  • within the same source system (hopefully not! if it does then toss the source system or reconsider how you have modeled your raw vault!) 
  • between source systems, if so then consider adding salt. This is applicable if the same business key will load to the same hub but represent different business entities.
  • between tenants of the data vault, consider adding more salt. Should tenant 1 be able to see tenant 2's data?

Example:

image2018-9-24_11-12-59.png 

Add salt:

image2018-9-24_11-14-47.png

Add more salt:

 

Hash-Key treatments:

  1. apply zero-key to null keys
  2. cast business keys as text
  3. apply hard rule (trim, left align)
  4. apply casing (lower or upper depending on your preference)
  5. add salt
    1. add business key collision code (use a default value if none is needed)
    2. add multi-tenancy id (set a default if single tenancy data vault)
  6. use sanding (delimiter like the double pipe ||) between business keys
  7. hash using an algorithm like SHA2 (min.)

Data Type: binary (they use half the storage as varchar), use hash keys if they outperform natural keys. But remember that you should not have a blended data vault of half natural keys half hash-keys, use one method for all otherwise your data vault will become an untidy mess.

Your selection of hashing is also a part of your collision strategy, MD5 is no good https://www.mscs.dal.ca/~selinger/md5collision/

But consider that the larger the hash the longer it takes to calculate so you need to strike a balance (we use SHA2).

Collision checks

There are two kinds that must be performed before every load

a. Horizontal check (src vs tgt)

image2018-9-24_11-31-40.png 

b. Vertical check (within src)

image2018-9-24_11-37-36.png 

Both applied after the data has been hashed, a. will I get a clash with the target hub? and b. do I already have a clash in the staged file?

And don't forget to check the LINK hash key as well!

image2018-10-18_16-56-35.png 

Now onto HashDiff treatments:

  1. substitute NULL for blank for each attribute where NULL occurs (hashing nulls does not work)
  2. use sanding (delimiter like the double pipe ||) between business keys
  3. hash using an algorithm like SHA2 (same as hash-key)

Less treatments but also far less likely to have collisions because the attributes being hashed are compared to the target satellite's hashdiff by key. Using HashDiffs perform great because they are hashed to a consistent size and stored as binary. But as the training identifies a system like Teradata hashes the columns implicitly when doing column by column comparison between source and target.

Conclusion

If after applying the treatments we have here and you find a hash collision, write a paper and get it published!

 

Reply
2 Replies

Please Login or Register