We have a legacy DV1.0 data warehouse that we want to convert to a DV2.0. I have a series of design / conversion questions I’d like some input on, please.
It is currently sourced from two source systems with no current overlap in hubs / links. Neither system has CDC set up. We pull an incremental data set daily (based on a timestamp column in one source system and creation / modification dates in the other) and also do a full pull of the keys for each table in order to track deletions (records missing from the full pull of the keys = deleted record from source).
When we initially designed our data warehouse , our data discovery determined that there were very few source tables with unique values to be used as business keys, so the decision was made to use the source system primary keys for these tables (the majority of the warehouse based on this system). Should we go back and revisit this? There was (and still will be) no appetite from the business to go back and correct the data to make one or more fields in each table useable as a business key.
Would it be possible to create a Same As link in the Business Vault to apply logic and match records in the RV that are potential duplicates? (I am thinking of Person Records in this scenario, as for one of our existing systems there can be multiple name records for any given person and complex logic to choose the “main name” record. The other source has only partially complete names in it as it is a Emergency Dispatch system and the call takers don’t always have the opportunity to get complete details on the caller or the person involved in the incident being reported).
Would you ever consider naming your HUB / Link Business Keys as <HUBName>_Business_Key? For example, we have TBL_HUB_Person with a BK of Person_Id, would you change it to be less source system specific and called it Person_Business_Key?
The satellites are not named as source specific tables – as currently there is only one source for each of the entities (Hub & their Satellites or Links & their Satellites). Should the names of the Satellites be changed to reflect the source system it is sourced from?
All our data columns have been renamed from the source names to something more user friendly, I have read recently that this isn’t a good idea and that the column names in the Raw Vault should reflect the column names of the source system. Is this true? Should I rename all the detail column names in the satellites to reflect the Source column names? (This would add to the argument to change the satellite names to be more source specific).
All our Hubs and Links have Effective satellites, I am planning on replacing these with Status Tracking satellites. Should I just replace the HUB Effective satellites with Status Tracking Satellites and keep the Effective satellites on the Link tables?
All non effectivity satellites have five processes to load them (New, Insert, Update, Delete and Placeholder (with would be the equivalent of Ghost record in DV2, I assume).
All our satellites have ETL_Load_End_Date fields which get populated when the Update process is run and finds newly inserted records for a given key so the previous un-end dated sat record for the key is end dated. Or is end dated when the record is no longer found in the source. With the DV2 insert-only paradigm I am struggling how to denote when records are no longer found in the source. I can use the Status Tracking satellite to keep track of the overall HUB / LINK existence but some of the satellites are populated from child tables (for example, the Person Hub is populated from source Person table however Person Names come from a PersonName table and there can be multiple active records in this table for any existing person and these records can be deleted but the HUB record still exists at source). Is there any way to denote the deletion of a satellite record at the satellite level?
Below are diagrams of the old warehouse and the proposed DV2.
Do the DV2 structures look correct?
Should the tables be named differently?
Would you ever differentiate between HUB satellites and LINK satellites in the naming of the satellites?