Assume the following scenario: the source system generates daily, CDC-driven log files. That is, for every change to a record in the source system, you receive a line in the log containing the complete new record, together with a timestamp of the change and some additional metadata (such as who made the change). There is no indication of what actually changed.
These log files can arrive out of sequence and sometimes contain data for periods that were already processed, for example after a manual extract spanning both processed and unprocessed days.
One line can (and will) be the source for multiple hubs, links and their satellites.
Loading hubs and links is quite straightforward and follows the known standards.
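For the hub side, the standard pattern is: derive a hash key from the business key and insert only keys not yet present. A minimal Python sketch (the dict standing in for the hub table, and names like `business_key` and `record_source`, are illustrative assumptions):

```python
import hashlib


def hub_hash_key(business_key):
    # Common Data Vault convention: hash the trimmed, upper-cased business key.
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()


def load_hub(hub, log_rows, load_dts, record_source):
    """Insert only business keys not yet present in the hub.

    `hub` is a dict keyed by hash key, standing in for the hub table.
    """
    for row in log_rows:
        hk = hub_hash_key(row["business_key"])
        if hk not in hub:
            hub[hk] = {
                "business_key": row["business_key"],
                "load_dts": load_dts,
                "record_source": record_source,
            }
    return hub
```

Because the insert is idempotent on the hash key, re-processing the same log file (or an out-of-sequence file repeating known keys) adds nothing to the hub.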
However, processing satellites into the data vault raises several challenges:
- a new line does not necessarily mean a new record is added to all of its dependent satellites: some will actually change while others will not. Hash differences computed per target satellite will do the trick.
- the assigned LoadDateTime is the same for the complete log file, as the file was presented as a whole to be uploaded. When the descriptive data of a hub or link changes from A to B and back to A again, this produces duplicates: the satellite's PK is defined as the hub key combined with the LoadDateTime, and the latter is identical for all three records. To overcome this, we've added 'micro-increments' to the LoadDateTime to make the keys unique again.
- as data can arrive out of sequence, you need to start your delta checking against the last record in the satellite prior to the period you're processing.
- for all changes to one hub/link record, you need to delta-check each line against the previous line in the log to prevent loading duplicates and causing a data explosion; otherwise every satellite loaded from the same log will contain the same number of rows.
- all of the above assumes checking against the change date (period). The easiest option would be to replace the LoadDateTime with this change date, but that goes against the standards (no?) and removes the traceability of when the record was actually added.
Only then can the resulting set be processed into the satellite following the normal procedures.
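The delta-checking step that produces that resulting set might look like the sketch below: seed the comparison with the last satellite record before the period being processed, then keep only the log lines whose hash differs from the previously kept line. Field names (`change_dts`) and the in-memory representation are assumptions for illustration:

```python
import hashlib


def hash_diff(record, columns):
    # Hash over the descriptive columns of one target satellite.
    payload = "|".join(str(record.get(c, "")) for c in columns)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def delta_rows(log_rows, seed_row, sat_columns):
    """Return only the true deltas for one hub key and one target satellite.

    `seed_row` is the last satellite record PRIOR to the period being
    processed (None if the satellite has no earlier record), which handles
    out-of-sequence files. Rows are ordered by change date, and each row is
    compared against the previously kept row, so repeats collapse and the
    satellites do not all explode to the full log row count.
    """
    prev = hash_diff(seed_row, sat_columns) if seed_row is not None else None
    kept = []
    for row in sorted(log_rows, key=lambda r: r["change_dts"]):
        h = hash_diff(row, sat_columns)
        if h != prev:
            kept.append(row)
            prev = h
    return kept
```

Run per target satellite with that satellite's column list, this gives each satellite only its own changes, while the ordering by change date (rather than LoadDateTime) handles files that arrive out of sequence.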
You might be tempted to also check the last line in your change set against the first satellite record following the data set you're processing, to determine whether that line represents a deletion. There are some reasons not to do so:
- in general we never delete data from the data vault
- other log files may still arrive that fall between this file and the first record already loaded in the satellite after this data set; in that case additional change records could be added in between.
So: a lot of challenges, and a not-so-simple loading template as a result.
I'm curious as to your solution to this challenge! Do you see other challenges? Have you found easier/other solutions?