Solving the Time Crime of Back-dated Data in Your Data Vault

Analytics assumes that data movement will follow a linear path into a historical repository to represent the correct sequence of events. That is, if today is Thursday and we have loaded data for Wednesday, we then do not expect data from Tuesday to arrive today, because all the data up until today should have already been loaded. Whether it be a missing batch file or a missing data record – a Tuesday file may have arrived and loaded but could have been provided without a complete set of records.  There are a myriad of reasons why these scenarios might occur (a locked record due to an update in the database may cause a push file produced from the source system to be incomplete). These are real scenarios in loading a data warehouse and it can skew what we know of a business entity (e.g. customer) and may derive erroneous analytics for that business entity or worse, the wrong facts are reported to a regulatory body or to the customers themselves.

The late record is (in data parlance) known as “out-of-sequence” data and to us data professionals this is a time crime. The repercussions could have legal ramifications, reputation risk, and loss of market share (etc.) if the analytics is influenced by something that might be considered as trivial as data sequencing issues.

 

The crime scene: “out of sequence” data in Data Vault 2.0.

Let’s set the crime scene; we are loading a customer hub and a customer address satellite.

  1. For hash key ‘159’ (hashed business key) the data is loaded on Monday with a ‘City Name’=’Sydney’. We have never seen this business entity before, and the address satellite table does not have an entry for this hash key; therefore, we insert the record into the satellite with its record hash (hashdiff) of ‘6d’ (hashed ‘Sydney’= ‘6d’). On Tuesday no file was pushed to the data warehouse from the source system; therefore, there is no update in the address satellite for hash key ‘159’.
  2. Wednesday’s file arrived and details of hash key ‘159’ is present in that file; ‘City Name’=’Sydney’. For hash key ‘159’ Monday’s ‘City Name’ is the same as Wednesday’s ‘City Name’ (staged hashdiff ‘6d’ = satellite’s hashdiff ‘6d’) and therefore there is no change and no insert is performed to the address satellite for that hash key.
  3. On Thursday, Tuesday’s push file arrives with data for hash key ‘159’; ‘City Name’=’Brisbane’ with a hashdiff = ‘6e’. A comparison is performed between the staged file and the ‘older’ record in the address satellite (Monday) and the ‘City Name’ differs (‘Sydney’ <> ‘Brisbane’ | ‘6d’ <> ‘6e’); therefore, the late arriving record is inserted to show the change. On Friday when we query the address satellite to retrieve the current address for hash key ‘159’ the ‘City Name’ value will return ‘Brisbane’.

This is incorrect. According to the staged data provided from the source system on Wednesday, the staged ‘City Name’ is ‘Sydney’ for that hash key; but the latest ‘City Name’ in the satellite for that hash key is ‘Brisbane’.

A brief walk-through on the crime scene below 

On the left is the staged data (deltas), the middle is the data vault domain and on the right is the timeline represented in the satellite for the hash key. Each day (1-Monday, 2-Wednesday, 3-Thursday) is horizontally represented. (3) contains the staged late arriving record; the day the time crime is committed. Below each scene we describe the event chronology.

 

  1. Monday is an INSERT because the record by hash key is new and there is no hashdiff to run a comparison for this hash key.
  2. Wednesday has no INSERT because the staged hashdiff and the satellite hashdiff for the hash key matches; Monday’s hashdiff of ‘6d’ = Wednesday’s hashdiff of ‘6d’.
  3. Late arriving is an INSERT because the hashdiff is different when compared to the older record (Monday hashdiff=’6d’ and Tuesday hashdiff=’6e’); it now means the timeline recorded for this hash key is wrong as depicted by the timeline on the right.

How do we dynamically deal with this crime? Here we propose a data-driven approach.

 

A few things you need to know about Data Vault 2.0

Data Vault 2.0 is an INSERT-ONLY paradigm. Data on Big Data platforms is immutable and update operations are performed by persisting the data to a new persistent state (DELETE and INSERT occurs in the background). This makes UPDATEs expensive and therefore avoided especially in environments managing millions of records. For Data Vault 2.0 this means that records in a satellite table have a business start date but no end date. Instead end dates are simulated using the LEAD() windowing function when querying the data. For example:

 

,coalesce(lead(START-DATE)over (partitionbyHASH-KEY orderbySTART-DATE),cast(’31-DEC-9999 23:59:59.998’asdate))END-DATE

 

This feature in Data Vault 2.0 is critical because it means that the end-date is dynamic and will change dependent on records inserted after the current record by the same hash-key.

 

The crime fighter: Record Tracking Satellite eXtended.

Record Tracking Satellite (RTS) in Data Vault is used to record every instance a business key or a relationship is sent from the source system. This can be incredibly handy in detecting if the source system has deleted a business entity or relationship. It can even be useful in detecting if a record affected by GDPR’s right to erasure (“right to be forgotten”) is still being sent by the source. Data aging rules can be applied to those keys using RTS to identify what may be deleted business entities in the source system.

This blog discusses extending the usage of RTS to record every instance a record was sent from the source system. Let’s revisit RTS table structure and introduce the column needed to extend RTS for time crime fighting.

Column

What it does

Hash-Key

Hashed representation of business keys or relationship stored as binary data type for optimal join performance.

 

Load-Timestamp

The timestamp marked on each record recording the precise date & time the record has entered the data warehouse; the processing time.

 

Applied date timestamp

A date timestamp value that is chosen for the applied time-line, particularly for Satellites – when Load Date should not be altered.  An example of this applied date (in real-time) may be transaction date, where the transaction date or message creation date  would be different than the load-date in the satellite.  Another example of this (in batch) may be: source extract date, or source record update date – where the date itself is “older” or back-dated, compared to the load-date of the Satellite.  A final example of this is: a business selected date/time – where the business has full control over the value, and may choose to “back-date” the data, again, something different than load-date in the Satellite

 

Hash Difference

(Amended) The record digest of all attributes on a record except for the business keys, load-timestamp and record-timestamp and other metadata tagged onto each record in the Data Vault.

 

Record Source

 

File/Table name of the source system that provided the record that may include file location or schema name.

Record Target

(New) The satellite table associated with the record.

 

*Data Vault 2.0 has a few more metadata tags but these listed above are needed for illustration.

How RTS-X fights time crime

 

Without the need to manually deal with time crime that we saw in the crime scene introduced, RTS-X will continuously enforce time crime prevention as and when they occur. They are the good Samaritans that react when needed and you don’t need to call anyone or stop anything while the time crime is being policed.

We will talk about the various crime scenes RTS-X is designed to deal with by exploring the customer address scenario we introduced earlier. Think of RTS-X as those citizens on patrol going about their business of recording record hashes, assisting in detecting changes for satellites and immediately act when a time crime is detected.

Petty crime: Late arriving records that do not need special treatment in Data Vault 2.0. Let’s see how RTS-X will correct these timelines. We have added RTS-X to each crime scene in the Data Vault column and a blue timeline for RTS-X under the event chronology column. 

Crime Scene 1: Late-arriving record is the same as the older record and the newer records.

  1. New record arrives; an INSERT is performed.
  2. Wednesday’s staged record is the same as the previous record, no INSERT is performed because the value is the same as what is in the satellite; ‘Sydney’.
  3. The late arriving record from ‘Tuesday’ is compared to the older record and it is the same and therefore no INSERT is performed.

Notice that the event chronology shows that we have an INSERT for every record received in the RTS-X timeline but only the changed record in the satellite timeline.

Crime Scene 2: Late arriving record is different to the older record and the newer record.

  1. New record arrives; an INSERT is performed.
  2. Wednesday’s record arrives and it is different to Monday’s record (‘Brisbane’ and ‘Sydney’). We INSERT the record for ‘Brisbane’ because the hashdiff generated for ‘Brisbane’ differs to what was recorded on Monday; the hashdiff for ‘Sydney’.
  3. The late arriving record is compared to Monday’s recorded hashdiff and it differs too (‘Melbourne’ vs ‘Sydney’). An INSERT is performed.

Notice that every day’s hashdiff differed and so the timelines for RTS-X and the satellite are identical. Remember that the satellite will contain the details that make up the hashdiff and RTS-X only records the delta hashdiffs. Although the example only has ‘City Name’ as an attribute in reality a satellite’s number of attributes can be as wide as the platform supports. The generated hashdiff will always be of the same consistent size regardless of the number of attributes that are hashed together.

 

Crime Scene 3: Late arriving record is the same as the older record but different to the newer record

 

  1. New record arrives; an INSERT is performed.
  2. Wednesday’s hashdiff differs to that of Monday’s hashdiff and therefore an INSERT operation is performed to the satellite.
  3. The late arriving record is the same as the previous record (‘Sydney’ and ‘Sydney’); this means that no INSERT is performed because no actual change has occurred between Monday and Tuesday for this hash key.

The chronology shows that only RTS-X has recorded that there was a delta record staged from the source system, the satellite shows that there was no change between Monday and Tuesday.

Felonies: felonies are more serious and need an additional code snippet to solve the time crime.

 

Crime Scene 4: The older and newer records are the same, but the late arriving record is different to both.

This is the example we described earlier.

  1. New record arrives; an INSERT is performed.
  2. Like crime scene 1 the record is the same as Monday’s record (‘Sydney’ and ‘Sydney’) and therefore no INSERT is performed.
  3. The late arriving record (‘Tuesday’) differs from Monday’s record, BUT Monday’s hashdiff and Wednesday’s hashdiff are the same. We know this because this fact is recorded in RTS-X (Monday ‘6d’ = Wednesday ‘6d’). We use this comparison to COPY Monday’s record from the satellite and INSERT it as Wednesday as if we had received the records in the correct order.

The event chronology will now show that the timeline was corrected.

 

Crime Scene 5: Late arriving record is the same as the newer record but different to the older record.

  1. New record arrives; an INSERT is performed.
  2. Wednesday’s record differs to that of Monday (‘Brisbane’ vs ‘Sydney’) and thus an INSERT is performed.
  3. The staged late arriving record carries the value ‘City Name’ = ‘Brisbane’. This differs to Monday’s record too and in fact shows that the change to the hash key occurred earlier than Wednesday. An INSERT operation must be performed to show this fact but consequently the satellite will contain a duplicate.

The event chronology now shows that a duplicate record now exists in the satellite as Tuesday and Wednesday’s hashdiff is the same. This duplicate is intentional but unlike dimensional modelling the impact is not exaggerated and in fact can be discarded with a well-constructed view and/or an update to an adjacent point in time (PIT) table.

 

Keeping Data Vault 2.0 free of Time Crime

Now that you see the value of RTS-X let’s discuss time crime prevention. RTS-X is law enforcement and to ensure that all satellites are protected against time crime we will have an RTS-X posted at every hub and link apart from the non-historized link. The solution proposed in this blog is completely data-driven but with a minor caveat; implementing this data-driven self-healing capability does introduce a bi-table dependency.

A dirty read from RTS-X is ok, a clean read is ok too, but a read that is too dirty is not ok (pun intended).

It’s about the past and the future but not the present records in RTS-X. RTS-X has been deployed at a client to stop time crime from occurring in satellites, multi-active satellites, satellites with dependent-child keys, business vault satellites, status-tracking satellites, record tracking satellites (the regular ones) but not effectivity satellites. Effectivity satellites derive start and end dates that make it too difficult to implement using RTS-X. The time correction in effectivity satellites must occur in the start and end date derivation already and therefore the additional code snippet used with RTS-X will never be utilized.

RTS-X can be used conditionally, meaning if your ingestion pipeline sees a late-arriving record then RTS-X can be utilized instead of the regular ingestion patterns for the noble satellites. But then again if you do that you do have two sets of code to maintain. Code using RTS-X to update satellites and code that does not. We have seen RTS-X outperform the regular loading patterns for staged files as big as 20 million records but degrades thereafter (on SQL Server). That is where things like partition strategies can be used to keep the train rolling at the pace you want.

 

Wrapping up the scene

 

This is a data-driven approach and not without caveats; as you have seen this makes the satellite loading dependent on how up to date RTS-X is.

Bi-table dependency– if your operational processes ensure that time crime is prevented by checking that the next file in a sequence is the next file expected then RTS-X will not be needed; but it does mean you have to wait for the file to arrive before you can continue processing subsequent files.

Storage– compression is required and RTS-X is wafer thin and the more history it has the more effective it can be. In a historical back-loading scenario you may consider loading to a new satellite instead and therefore not need RTS-X to ensure adequate historical blending occurs during the back-load.

Number of tables– yes, an additional table to each hub and link is proposed however in the era of commodity hardware and scalability made available through storage mediums such as AWS S3, is that really a worry?

Availability of data– if your platform is constrained and inserts to RTS-X hold up the availability of the satellite RTS-X is associated with then consider enforcing that the satellite is updated before the RTS-X, ala dirty reads.

Table locking– a central RTS-X to each hub or link will have the same table contention as hubs or links and therefore a form of pooling is required that ensures not more than one task is updating RTS-X. Alternatively you could deploy an RTS-X to each satellite around a hub and link and this would negate the need to record which satellite a hashdiff in RTS-X is associated with.

Reports, dashboards and views– updates to the past may cause a time-ripple. These time ripples may have to be explained; do not allow this approach to allow automated operational procedures to suddenly become lacking because “RTS-X will fix it”.

Snapshots need updating– depending on how far back your logarithmic point-in-time tables (PITs) go you may need to swap out and swap in an updated set of keys and dates.

There is still a cost to performing back-in-time corrections to the data warehouse.

RTS-X is merely an unsung hero in the fight against time crime.

Not for streaming data– non-historized links are not a candidate to have an RTS-X post to. These links contain descriptive attributes within the link itself and never expect the number of attributes in the stream to change. The data loaded to non-historized links are immutable, transaction-based; data ordering is expected to be managed before the data reaches the Data Vault.

This extension to Data Vault 2.0, which I co-authored, was presented by Certus Solutions at WWDVC 2019.

Patrick Cuba

Senior Consultant
Patrick has nearly 20 years working on data inspired problems utilizing his experience and he has embraced Data Vault 2.0. He works by understanding the business before innovating the technology needed to ensure that that his data driven delivery is agile and automated. He is Data Vault 2.0 certified and regularly contributes to Data Vault Alliance.

Other articles on Data Vault Alliance:

Bring Out Your Dead Data


Facebook


Twitter


Google-plus

Leave a Comment