This is a short tip for everyone out there implementing Data Vaults.
We’ve heard throughout the years about people trying to CLUSTER the Hash Keys in Data Vault on SQLServer and other databases that support clustering. This is a bad idea and should be avoided at all costs.
What is a Clustered Key?
Generally a Clustered Key means: KEEP THE DATA IN PHYSICAL SORTED ORDER when inserting by Clustered Key. This, co-locates data sets together that have similar keys. HASH KEYS ARE VARIANT, and are essentially a hexadecimal representation of a mathematical outcome. They are computed to be distributed over the length of bits available for the length of the hash. This, makes them a very bad choice for clustering in SQLServer or any other database. Click the links below to learn more about clustered keys.
When are Clustered Keys a GOOD IDEA?
Clustered Keys are a good idea WHEN you have true business values that you join on all the time, or where clause fields that you issue values for all the time, AND you need the data to be held together on disk to minimize I/O operations. Clustered Keys are a good idea when you have increasing sequence values, then the optimizers realize this and most can run mathematical algorithms for min/max, range, and set operations.
Otherwise: STAY AWAY FROM CLUSTERED KEYS – and never, ever put a cluster on a HASH KEY or HASH DIFFERENCE.