The guidance for staging data sources (as I understand it) suggests calculating, where possible, all hashkeys within the staging area, including those that will ultimately source the link hashkeys.
One of my data sources is a relational database that uses meaningless surrogate keys as the primary key of each source table. To add to the complexity, what uniquely identifies a record from a business perspective does not always sit inside one table - sometimes there is business information held in a parent table that has to cascade down.
The primary key of each table is a single surrogate key, but from a business key perspective the BK attributes of the child are often insufficient to uniquely identify the record: in some cases you must also join to the parent and retrieve the associated business key attributes from there. Together, the attributes from parent and child identify the child record from a business perspective (sometimes a grandparent is also involved).
For example, the CHILD table:

Parent_ID (PK, FK)
some_child_name (part of BK)

The business key is a combination of some_parent_name (held on the PARENT table) and some_child_name.
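To illustrate, a composite hashkey for such a cascading business key might be derived like this (a sketch only; the MD5 algorithm, '||' delimiter, and trim/upper-case normalisation are assumptions - use whatever your vault standard prescribes):

```python
import hashlib

def bk_hashkey(*bk_parts, delimiter="||"):
    """Hash a composite business key: trim/upper-case each part, then join
    with a delimiter so ('AB', 'C') and ('A', 'BC') cannot collide."""
    normalised = [str(p).strip().upper() if p is not None else "" for p in bk_parts]
    return hashlib.md5(delimiter.join(normalised).encode("utf-8")).hexdigest().upper()

# The child's hub hashkey combines the parent's BK attribute with its own:
child_hk = bk_hashkey("ACME Ltd", "Widget-01")   # some_parent_name + some_child_name
parent_hk = bk_hashkey("ACME Ltd")               # some_parent_name alone
```

The delimiter matters: without it, two different combinations of BK parts could concatenate to the same string and produce the same hashkey.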
As you can imagine, over several tables this can become quite complex as keys cascade down. A similar issue exists for standard foreign key relationships: resolving an FK surrogate key to business key attributes held across several tables.
I am presuming the most efficient way to handle this pattern is to resolve the relationship in the extract of the data, even if that adds some processing to the extract - better that than increasing the complexity of the vault load processes.
STG_PARENT is simple, as its extract queries just the PARENT table. STG_CHILD, however, joins across CHILD and PARENT in the extract to retrieve some_parent_name, which we store in the STG_CHILD table. This way we can populate the BK and hash columns fully in the staging table, without having to perform any form of lookup to the parent in staging or during the link table load (i.e., holding the surrogate Parent_Id FK in the SAT_PARENT table and looking up its hashkey as part of the link load).
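A minimal sketch of that STG_CHILD extract, using sqlite3 purely to keep it runnable (the table and column names follow the example above; the hashing convention is the same assumed MD5/'||' scheme):

```python
import hashlib
import sqlite3

def hk(*parts):
    # Assumed hub hashing convention: MD5 over trimmed, upper-cased,
    # '||'-delimited BK parts.
    norm = [str(p).strip().upper() if p is not None else "" for p in parts]
    return hashlib.md5("||".join(norm).encode("utf-8")).hexdigest().upper()

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE PARENT (Parent_ID INTEGER PRIMARY KEY, some_parent_name TEXT);
    CREATE TABLE CHILD  (Child_ID  INTEGER PRIMARY KEY, Parent_ID INTEGER,
                         some_child_name TEXT);
    INSERT INTO PARENT VALUES (1, 'ACME Ltd');
    INSERT INTO CHILD  VALUES (10, 1, 'Widget-01');
""")

# Resolve the surrogate FK in the extract itself: join CHILD to PARENT so the
# staged row carries the full business key, then derive all hashkeys in staging.
stg_child = [
    {
        "some_parent_name": parent_name,
        "some_child_name": child_name,
        "HK_PARENT": hk(parent_name),             # hub hashkey for the parent
        "HK_CHILD": hk(parent_name, child_name),  # hub hashkey for the child
    }
    for parent_name, child_name in con.execute("""
        SELECT p.some_parent_name, c.some_child_name
        FROM CHILD c
        JOIN PARENT p ON p.Parent_ID = c.Parent_ID
    """)
]
```

With the hashkeys already on the staged row, the hub and link loads become straight inserts with no lookups back to the source surrogate keys.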
Presuming the above is sensible, I do have a further query: what to do about FK values that don't exist in the hub?
Completely missing mandatory and optional values can be dealt with easily: the NULL values are translated to the appropriate "null" and "unknown" hub hashkey values.
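That NULL translation could be sketched as follows (the "-1"/"-2" ghost-record business keys for unknown vs. not-applicable are assumptions - conventions for these zero keys vary between implementations):

```python
import hashlib

UNKNOWN_BK = "-1"         # assumed ghost record: mandatory value missing
NOT_APPLICABLE_BK = "-2"  # assumed ghost record: optional value legitimately absent

def hk(part):
    return hashlib.md5(str(part).strip().upper().encode("utf-8")).hexdigest().upper()

def parent_hashkey(some_parent_name, mandatory=True):
    """Translate a NULL business key to the appropriate ghost-record hashkey,
    so every staged row always carries a resolvable hub hashkey."""
    if some_parent_name is None:
        return hk(UNKNOWN_BK if mandatory else NOT_APPLICABLE_BK)
    return hk(some_parent_name)
```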
What I am not sure about is what to do with values that are supplied but don't exist in the hub (i.e., I extract the parent value and derive a hashkey for it, but for whatever reason the separate parent staging and hub load process didn't load it).
Should I rely on the referential integrity of the source, include no validation of the derived parent hashkey, and presume that the staged some_parent_name value in STG_CHILD, if supplied, is accurate (and will already have been loaded into the HUB_PARENT table by a previous hub load process)?
Or do we need to validate the derived parent BK/hashkey against the HUB_PARENT table before attempting to load the link table, so that it can be translated to the "unknown" hub hashkey value if necessary?
The additional validation step seems a belt-and-braces approach, but it may be overkill and will negatively affect load performance.
What are your thoughts? Pre-validate the derived PARENT hashkey/BK first, or rely on referential integrity doing its job and presume the data will be valid?
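For reference, the pre-validation option is essentially a left (anti-)join from staging to the hub, translating orphans to the unknown key rather than failing the load. A runnable sketch using sqlite3 (table names follow the example; the UNKNOWN_HK literal is an assumed ghost-record hashkey):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE HUB_PARENT (HK_PARENT TEXT PRIMARY KEY);
    CREATE TABLE STG_CHILD  (HK_CHILD TEXT, HK_PARENT TEXT);
    INSERT INTO HUB_PARENT VALUES ('AAA');
    INSERT INTO STG_CHILD  VALUES ('C1', 'AAA'), ('C2', 'BBB');  -- 'BBB' is an orphan
""")

UNKNOWN_HK = "UNKNOWN-HK"  # assumed ghost-record hashkey for the unknown hub row

# Before loading the link, left-join staging to the hub and translate any
# staged parent hashkey with no matching hub row into the unknown hashkey.
rows = con.execute("""
    SELECT s.HK_CHILD,
           CASE WHEN h.HK_PARENT IS NULL THEN ? ELSE s.HK_PARENT END AS HK_PARENT
    FROM STG_CHILD s
    LEFT JOIN HUB_PARENT h ON h.HK_PARENT = s.HK_PARENT
""", (UNKNOWN_HK,)).fetchall()
```

The cost is one extra join per link load; whether that is worth paying presumably comes down to how much you trust the source's referential integrity and the ordering guarantees between the hub and link loads.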