Notifications
Clear all

How to Detect Multiple Satellite Changes for the same Business Key?


Posts: 157
Topic starter
(@hook)
Estimable Member
Joined: 2 years ago

One of the topics covered in the CDVP2 course was the loading of satellites and how to ensure that only new data is loaded. On the face of it this logic seems very simple:

  • Get the latest satellite record (by load date) for each business key or keys
  • If the values in the columns of the incoming data do not match this latest record then load it, otherwise ignore the record.

However, consider this scenario. Say we have some existing data in our satellite showing three values for a business key on consecutive days. Also assume that the Last Update DTS is a sub sequence column so we only want to detect changes for First Name and Last Name.


Hash Key Load DTS Last Update DTS First Name Last Name
======== ================ ================= ========== ==========
0x1234   1-Jan-2020 02:00 31-Dec-2019 13:29 John Doe
0x1234 2-Jan-2020 02:02 01-Jan-2020 09:45 Fred Bloggs
0x1234 3-Jan-2020 02:01 02-Jan-2020 17:32 John Doe

Now consider that there are three changes to the source record since the last load and that we are running this load on the 4th Jan.


Hash Key Load DTS Last Update DTS First Name Last Name
======== ================ ================= ========== ==========
0x1234 4-Jan-2020 02:03 03-Jan-2020 10:05 John Doe
0x1234 4-Jan-2020 02:03 03-Jan-2020 11:20 Joe Bloggs
0x1234 4-Jan-2020 02:03 03-Jan-2020 17:32 John Doe

Which of these records should be loaded? Applying the load logic described above we would first identify the latest record in the satellite, in this case the recorded loaded on the 3rd Jan 2020 (John Doe). If we exclude records that have the same column values, then we’d only load the second record (Joe Bloggs) as first and third records (both John Doe) have the same values as the last record on the satellite.

But this isn’t the desired result, we want to load the second AND third records.

Suddenly the load pattern doesn’t look quite so simple. The only way I can think to solve this problem without resorting to single record loads is to apply a Lag windowing function over the union of the incoming data and the latest record in the target satellite as follows:

To simplify the explanation, I’ve added a Hash Diff column. The first row is the last satellite record, rows 2-4 are the incoming data to be loaded:

Hash Key Load DTS         Hash Diff Last Updated DTS   First Name Last Name
======== ================ ========= ================== ========== =============
0x1234 3-Jan-2020 02:01 0xABCD 02-Jan-2020 17:32 John Doe
0x1234 4-Jan-2020 02:03 0xABCD 03-Jan-2020 10:05 John Doe
0x1234 4-Jan-2020 02:03 0xCDEF 03-Jan-2020 11:20 Joe Bloggs
0x1234 4-Jan-2020 02:03 0xABCD 03-Jan-2020 17:32 John Doe

Now apply the Lag function on the Hash Diff column ordering on the sub sequence column(s) so that we can see if there are any differences between the current any previous versions. We’d use code something like this:


[Last HashDiff] = LAG([Hash Diff], 1, NULL) OVER (PARTITION BY [HashKey]
ORDER BY [Last Update DTS])

Resulting in the following staging data:


Hash Key Load DTS Hash Diff Last Hash Diff Last Updated DTS First Name Last Name
======== ================ ========= ============== ================= ========== =========
0x1234 3-Jan-2020 02:01 0xABCD NULL 02-Jan-2020 17:32 John Doe
0x1234 4-Jan-2020 02:03 0xABCD 0xABCD 03-Jan-2020 10:05 John Doe
0x1234 4-Jan-2020 02:03 0xCDEF 0xABCD 03-Jan-2020 11:20 Joe Bloggs
0x1234 4-Jan-2020 02:03 0xABCD 0xCDEF 03-Jan-2020 17:32 John Doe

 

We then load those records where the Hash Diff and Last Hash Diff are different and ensuring we don’t load the record with the Last Hash Diff value of NULL (as this is the is the existing satellite record). This logic ensures that only the third and fourth records are loaded as required.

To be frank this all seems a bit messy to me. Does anybody have a simpler solution?

Reply
3 Replies