This is an implementation discussion but found as a topic at several clients as to why we cast business keys.
One such example is when I asked a developer why in HUB_ACCOUNT he had data typed the account number as a type integer, his answer was "it's a number."
I asked, "right now we get 0012345678 and it is a number but
- hypothetically isn't a B12345678 a number too? Another source with account numbers may have alphas in the mix.
- What about those leading zeros? On conversion those values are lost and they could very well mean something.
- And in the world of dimensional modeling a numeric data type is usually categorized as a measure and if I add to to the value wouldn't it make the value a different account number?"
The rule is: always cast business keys as text/varchar.
The business key is a alphanumeric value that must remain constant for the life of the business entity, in other words it never changes and all context relating to that business entity is dependent on that key. Should the business key change and in the absence of a mapping that the business key has changed (from and to) then the context to that key is lost.
A business key by definition cannot be a measure, if we were to allow aggregations upon a business key it is not longer the same business entity. As a hard rule all business keys must be CAST to text so in the first instance we do not lose the integrity of the key. Here are a few scenarios why we CAST.
- Scientific notation
A flat file provided text for the business key but implicit conversion recognized the field as a number and thus representing the field in scientific notation loses the original grain of the business key
Business key arrives as: 5555555555555555555555555555
Conversion back to text: 5555555555555550000000000000
Business key is lost!
- Exceeding BIGINT
The same business key above actually exceeds BIGINT too!
- Alphanumeric Key
Data profiled may have only contained a sample of business keys and never revealed to contained characters that are not compliant to numeric data type, or we might have a situation that the data source itself has introduced alphas into the business key because the source system has exhausted all the available numeric values available to it. Whatever the reason may have been loading to an integer data type is setting up the hub for failure.
Attempting to load this business key will cause the load to fail
- Smart Key #1: Gaps and Position
An extreme example but let's say the above keys arrived as two different keys.
Key 1: 5555 5555 5555 5555
Key 2: 55 555555 555555 55
Upon conversion they will return as the same key: 5555555555555555
Key 3: 5555 9955 5555 5555, conversion = 5555995555555555 but a smart key business rule states:
6th to 7th character is country code
Lookup says 99=Australia and 95=Uzbekistan; the key arrived as from Australia but the interpretation is now that the key came from Uzbekistan.
- Smart Key #2: Leading zeros
00055555555555, conversion: 55555555555 and the leading zeros are lost!
- Multiple source
Key 1: 5555555555
Key 2: 55555AA5555
In conclusion casting all business keys to string/text resolves complexities that may be felt later in the life of the data vault. Should the text field be too short changing the text width to a longer field does not break the model in the contrast is a minor amendment to the column defined in the DDL. No conversion would be needed in the consumption layer either and having the business keys cast as text guarantees that what you see is what you get.
The only treatments placed on the business keys are: trim and uppercase/lowercase.