Working with source system that makes heavy use of identity columns and a common practice of creating new records instead of updating existing ones. This means the business key I selected exists multiple times in the source will not uniquely identify a single row. Take the following example data from a source table called "AppointmentType" (there are many other attributes of this record but I've only listed the ones that are important to this discussion).
AppointmentTypeID Code Name TypeClass DeletedDateTime
1 CCM Chronic Care NULL 2015-01-28
2 CCM Chronic Care Nursing Visit NULL
This data pattern in pervasive in the source. I chose "Code" as the business key instead of using the identity column (these two records are truly the same appointment type to the business). This manifests as a single business key in the hub and two satellite records in the satellite. I honestly don't see an issue with that as it tells the story of the business key within the table (it was deleted and then recreated as another record).
It does, however, cause issues with loads in WhereScape that is causing me to modify the templates which makes me wonder if I'm making a mistake by selecting "Code" as my business key (deviating from the standard???). I believe this might be more of a WhereScape issue than a data vault issue, but that's why I'm bringing this to the group.
Issue one: We essentially have a historical load of the business key
When this data is loaded into the satellite, dss_start_date is the same for both records so I must modify the dss_start_date. Essentially, there is a "history" of the business key I selected in the source table which means it has characteristics of an historical load. I realize start date isn't something that DV specifically states is needed, but I do see the value of this column in helping to identify the current row and it's certainly important within WhereScape code generation (especially when performing historical loads).
By default, the value of this column is the same as the dss_load_date. In order to use "Code" as the business key, I must override the dss_start_date by either mapping it to a create date from the source system (if it exists) or by applying a transform which essentially adds a second for each version of that business key that is staged up. This approach works well and allows me to load data historically (dss_start_date has to be different due to the unique index placed on the satellite).
Issue two: Every load must be incremental even if it's a small table of a few records
Because WhereScape always compares the latest version of the satellite record against the staged data, it will always add one record in the example I given based on the same two rows I provided above if a full load is performed each day. The latest record will not match the "deleted" record and will insert it. The second time around, the latest record (which is now the deleted one) will not match the real current record (and so on and so on).
This is quite easily resolved by either making all loads incremental to avoid seeing data we have already processed or changing the template to filter out records that it has already seen but again, this leads me to ask if I've made a mistake in selecting "Code" as my business key.
If I did make a mistake selecting this as the business key, then I'm left with using identity columns. Due to the pervasive nature of this pattern in the source system, I will essentially end up creating a source system DV. This is a terrible prospect for all the reasons that Dan mentions in his blog post about avoiding source system DVs and will end up causing me more work to present these entities in the business vault by deriving a BV hub (using code).