After research and discussions with Data Vault practitioners, I have found some potential patterns one can consider when modeling on top of CDC sources.
Our platform uses a replication tool to replicate CDC-enabled sources to our data lake. The data lake contains all changes (inserts, updates, and deletes) and carries a CDC date; that is, every table is keyed on this date in addition to its primary key, so all updates and deletes on that primary key are tracked.
We have concluded that there are several methods to load this data into the Raw Vault from this historic data store (as it is called on our platform) in our data lake:
- Always load only the "current view" (i.e. read only the record that is active as at now).
- + Standard loading pattern
- - Loses all intra-batch changes
- Always load from the historic data store and keep the CDC date as a subsequence number.
- + Standard loading pattern
- + History is in sync; all intra-batch changes are captured in the Vault
- - Every table carries a subsequence number, which has a knock-on effect on the effort required to consume the data
- Develop a mechanism of "looping" or "replaying" off the CDC source, i.e. incrementally load all changes captured since the previous load. Our team is leaning towards this option.
- + Standard loading pattern (once the mechanism is in place)
- + History is in sync, and false changes in the CDC process are caught automatically by the Vault loading steps
- - An additional mechanism to implement
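The "replay" mechanism our team is leaning towards can be sketched as a watermark-driven incremental load: remember the highest CDC date loaded so far, and on each run pull only the change rows after it. This is a minimal illustration with hypothetical row and column names (`id`, `name`, `cdc_date`), not our actual implementation.

```python
from datetime import datetime

# Hypothetical change rows as landed in the historic data store:
# source primary key ("id"), a descriptive attribute, and the CDC date.
changes = [
    {"id": 1, "name": "Alice",  "cdc_date": datetime(2023, 1, 1, 9, 0)},
    {"id": 1, "name": "Alicia", "cdc_date": datetime(2023, 1, 1, 9, 30)},
    {"id": 2, "name": "Bob",    "cdc_date": datetime(2023, 1, 2, 8, 0)},
]

def replay_since(changes, watermark):
    """Return every change captured after the previous load's watermark,
    ordered by key and CDC date so the standard loading pattern applies."""
    batch = [c for c in changes if c["cdc_date"] > watermark]
    batch.sort(key=lambda c: (c["id"], c["cdc_date"]))
    return batch

# First run: nothing loaded yet, so everything replays.
watermark = datetime.min
batch = replay_since(changes, watermark)

# After a successful load, advance the watermark to the latest CDC date
# seen; the next run then picks up only the new changes.
if batch:
    watermark = max(c["cdc_date"] for c in batch)
```

Because each batch still flows through the standard satellite load, intra-batch changes (the two rows for `id` 1 above) are preserved rather than collapsed to the current view.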
However, since then, I have read on some blogs that there is the potential to create virtual satellites on top of the CDC source (or the historic data store, as it is called on our platform). I know there are some things to keep in mind:
- The CDC date becomes the equivalent of the load date, and it must be immutable
- CDC can and does produce false updates; this causes subsequent records to be "duplicated" in satellites (i.e. same hash key and hashdiff but a different load date/CDC date)
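The false-update problem above is usually handled by comparing each row's hashdiff to the previous row for the same key and dropping rows where nothing actually changed. A minimal sketch, with hypothetical column names and MD5 as the hash (your platform may use a different function or delimiter):

```python
import hashlib

def hashdiff(row, descriptive_cols):
    """Hashdiff over the descriptive attributes; equal hashdiffs mean
    the payload did not really change."""
    payload = "||".join(str(row[c]) for c in descriptive_cols)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def drop_false_updates(rows, key_col, descriptive_cols):
    """Keep a row only when its hashdiff differs from the previous kept
    row for the same key (rows assumed ordered by key, then CDC date)."""
    kept, last_hd = [], {}
    for row in rows:
        hd = hashdiff(row, descriptive_cols)
        if last_hd.get(row[key_col]) != hd:
            kept.append(row)
            last_hd[row[key_col]] = hd
    return kept

rows = [
    {"id": 1, "name": "Alice",  "cdc_date": "2023-01-01 09:00"},
    {"id": 1, "name": "Alice",  "cdc_date": "2023-01-01 09:30"},  # false update
    {"id": 1, "name": "Alicia", "cdc_date": "2023-01-01 10:00"},  # real change
]
deduped = drop_false_updates(rows, "id", ["name"])
```

In a persisted satellite this filter runs once at load time; in a virtual satellite it would have to run on every query, which feeds into the scaling concern below.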
I understand that the major reason one would not want to build virtual satellites is that they do not scale with large volumes. The additional Vault fields, and particularly the hash keys, are calculated at consumption time, and as volumes become large, performance will start to deteriorate.
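To make the cost concrete: a virtual satellite recomputes something like the following per row on every query, whereas a persisted satellite pays it once at load time. This is a generic Data-Vault-style sketch (MD5 over a normalized business key); the names `hash_key`, `hk_customer`, and `customer_id` are illustrative, not from any specific tool.

```python
import hashlib

def hash_key(business_key):
    """Data-Vault-style hash key: MD5 of the normalized business key."""
    normalized = str(business_key).strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# A virtual satellite effectively does this for every row on every
# query; with billions of rows the repeated hashing dominates.
rows = [{"customer_id": i} for i in range(1000)]
virtual_view = [
    dict(row, hk_customer=hash_key(row["customer_id"])) for row in rows
]
```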
Any comments or guidance that can be shared on this matter would be appreciated.