Clear all

Handling deleted records and getting the data out

Posts: 5
Topic starter
Active Member
Joined: 4 months ago


At work our source system sometimes gets its records deleted. So, for example, we have situations like this:

On date 2021-02-01 account L123 is in the source table:

AccountId	Balance
L123		100
L456		200

On date 2021-02-02 account L123 is deleted from the source table:

AccountId	Balance
L456		250

On date 2021-02-03 account L123 is back in the source table:

AccountId	Balance
L123		180
L456		280

Our satellite in Data Vault looks like this. We use a view to calculate the records end date.

AccountId	Balance		LoadTS		AppliedTs	
0000		0		1000-01-01	1000-01-01		(ghost record)
L123		100		2021-02-01	2021-02-01	
L123		180		2021-02-03	2021-02-03	
L456		200		2021-02-01	2021-02-01	
L456		250		2021-02-02	2021-02-02	
L456		280		2021-02-03	2021-02-03	

AccountId	Balance		LoadTS		AppliedTs	AppliedEndTs
0000		0		1000-01-01	1000-01-01	9999-12-31		(ghost record)
L123		100		2021-02-01	2021-02-01	2021-02-03
L123		180		2021-02-03	2021-02-03	9999-12-31
L456		200		2021-02-01	2021-02-01	2021-02-02
L456		250		2021-02-02	2021-02-02	2021-02-03
L456		280		2021-02-03	2021-02-03	9999-12-31

If I query V_SAT_ACCOUNT to check for account L123 balance on 2021-02-02 I get 100. But we need to report that, on 2021-02-02 account L123 balance is 0 because that day there was no record in the source system for that account. So we need to handle deleted records. We don't have any CDC tool but we do have the full table data to process. So, we have come up with these alternatives after reading the forum and some books:


Alternative #1
Create a PIT like table for each satellite that daily tracks the current record and its status (deleted or not). If there is no record for an account, the PIT record points to the ghost record in the satellite.

SnapshotDate	AccountId	LoadTS		Deleted?	
2021-02-01	L123		2021-02-01	N			
2021-02-02	L123		1000-01-01	Y			
2021-02-03	L123		2021-02-03	N			

For every account, everyday, a new record is inserted. Querying the balance of L123 on 2021-02-02 is quite straightforward. But we are concerned about the table growth as this approach tracks daily status, not changes.


Alternative #2
Create a Status Tracking Satellite that tracks when the records get inserted and deleted in the source system.

AccountId	LoadTS		Status	
L123		2021-02-01  	I			
L123		2021-02-02  	D			
L123		2021-02-03  	I	

We like this approach, but in order to show a report of the account balances on 2021-02-02, we would need to create a V_SAT_ST_ACCOUNT to calculate the record's end date. And then we would need to j-o-i-n V_SAT_ACCOUNT and V_SAT_ST_ACCOUNT and filter by date 2021-02-02, and use a c-a-s-e w-h-e-n SAT_ST.Status = 'D' then SAT.Balance = 0. Or we would need to create a view to u-n-i-o-n deleted occurrences from SAT_ST_ACCOUNT with SAT_ACCOUNT records and calculate the records end date. For something as simple as knowing the balance at any given date, getting the data out feels complicated.


Alternative #3

Based on the Status Tracking Satellite (alternative #2), create a business satellite SAT_BV_ACCOUNT using SAT_ACCOUNT and SAT_ST_ACCOUNT as inputs and i-n-s-e-r-t all records from SAT_ACCOUNT and add the deleted occurrences from SAT_ST_ACCOUNT as changes.

AccountId	Balance		Deleted?	LoadTS		AppliedTs	
0000		0		N		1000-01-01	1000-01-01		(ghost record)
L123		100		N		2021-02-01	2021-02-01	
L123		0		Y		2021-02-02	2021-02-02	
L123		180		N		2021-02-03	2021-02-03	
L456		200		N		2021-02-01	2021-02-01	
L456		250		N		2021-02-02	2021-02-02	
L456		280		N		2021-02-03	2021-02-03	

Queries would be simple and this would work nice for a few tables, but probably we need to apply this to all our satellite tables (+400), so it kind of looks like we are duplicating data in raw and business satellites, and I know that Business Vault should only contain what is not in the Raw Vault.


Alternative #4
Based on the Status Tracking Satellite (alternative #2), i-n-s-e-r-t deleted occurrences as changes in the Raw Vault satellite SAT_ACCOUNT, putting null/zero values and a "STS" RecSource value. I think this approach is called "tombstone" record. We can also add a "Deleted?" flag column.

AccountId	Balance		LoadTS		AppliedTs	Deleted?	RecSource
0000		0		1000-01-01	1000-01-01	N				(ghost record)
L123		100		2021-02-01	2021-02-01	N
L123		NULL or 0	2021-02-02	2021-02-02	Y		SAT_ST_ACCOUNT	(tombstone record)
L123		180		2021-02-03	2021-02-03	N
L456		200		2021-02-01	2021-02-01	N
L456		250		2021-02-02	2021-02-02	N
L456		280		2021-02-03	2021-02-03	N

This approach is easy for quering, but the load satellite process gets complicated, and we are inserting records that are not in the source system (the deleted ocurrences).


So, what are your thoughts about these alternatives? From what I read, alternative #2 is the recommended DV approach, but is there an easier way to get the data out? I'd really appreciate your insights. Also, sorry for the long post.


Edit: Looks like your Web Firewall doesn't like words like i-n-s-e-r-t.

Topic Tags
4 Replies