We've heard the story... we know the outcome... but what if we really don't have to pay the Pied Piper?
Pied Piper of Hamelin
As the story goes the Pan Piper was hired to lure the rats away from Hamelin using his magic flute and because the town refused to pay the Pan Piper after he completed his task, he used his magical flute to lure away the town's children.
In Data Vault our discussion relates to extracting data out of the vault. Common feedback is that:
1) the queries to get the data out are big and difficult to understand,
2) extracting data is too slow
Both are dealt with by employing point-in-time tables (PITs) that enable EQUI-JOINS between the PIT and the satellites whilst the satellites themselves will each contain a GHOST record to complete the joins.
BUT... instead of removing the problem with LEFT OUTER JOINS to get data out of the vault; we have moved the problem to the process of creating PIT tables themselves.
We either have to pay the Pied Piper in PIT creation or in Data Vault extraction.
How do we construct PIT tables?
PIT tables provide the EQUI-JOIN necessary for retrieving the data out of Data Vault by utilising the index structures in Data Vault tables and the index in PIT tables themselves. Using a PIT to extract the data makes use of an index-on-index join; and that is why PIT tables are so valuable!
Populating a PIT needs to employ a LEFT OUTER JOIN between the HUB and the adjacent SATELLITE and substitute the absence of a satellite record (for a business entity or relationship) with a GHOST record until such time as the business entity has a record to speak of for that particular point in time (from source-raw vault- from derived output- business vault).
Querying from Data Vault without a PIT table or building the PIT table itself requires a LEFT OUTER JOIN. This is the penalty. This is where we have to pay the Pied Piper.
When does a PIT change?
We query the data that a PIT relates to in order to determine the next slice of data to add to a PIT. Coalesce if there are no matches with the static hash key value and load timestamp for a GHOST record. A new point in time and a new delta of applicable adjacent satellite hash-keys+load timestamps are loaded.
The only way we know that there is a delta is in comparison between the staged data and what is in the data vault's satellites. The PIT table code is ignorant of the deltas but simply pulls what are the applicable hash key and load timestamps for the current point in time.
If you have identified what that delta is then would it not be prudent to pass that delta hash-key+load timestamp to the PIT while the target satellite is being updated?
The breadcrumb approach.
A PIT changes when the adjacent satellite changes. That could occur at the same time. And under Data Vault 2.0 this is possible.
How do we know the satellite has changed?
The staged HashDiff differs from the Satellite HashDiff and that determines what will be inserted into the satellite. Include a HashDiff in the PIT per Satellite to achieve the same thing.
- The HashDiff in a Satellite helps to determine if a new record should be inserted per hash key
- The HashDiff in a PIT helps to determine if the insert in a point in time should be the previous point-in-time timestamp or the new staged timestamp (this rule covers both delta and snapshot loads to data vault).
With orchestrated tasks you determine the rules for your PIT updates; ex.
- all satellites referenced by a PIT must be updated per batch run before the update to the PIT is committed;
- staggered load for intra-day inserts to a PIT.
For the former this will require that the satellite's breadcrumbs be left in a temp store until such time as all the required breadcrumbs (all the satellite hash key, load timestamp and hashdiffs) are ready to be compared and loaded to the PIT.
The latter approach does not need breadcrumbs.
BUT-- can you rely on technology to determine that?
Depending on the technology you could rely on a Cassandra-style compaction.
Compaction updates the current record by cluster-key, define that key to be the hub hash key + load timestamp and all other attributes in the PIT can be updated in place without writing a comparison update transformation. The newer record by update timestamp is available for querying while the older record is given a tombstone. When the in-memory table is persisted to a solid state the tombstone records are discarded. This is done naturally when the MEMTable fills to its capacity and the contents are flushed to disk. While the table is in memory compaction means that a PIT table HashDiff is not necessary.
As for fault tolerance the updates to the in-memory table are recorded in a commit-log for easy fault recovery.
The decision on the frequency of persisting a point-in-time becomes really important because a PIT does not necessarily record every change to a business entity; it is built to retrieve the applicable change at a point in time. If the update frequency is daily and the data vault satellites are updated multiple times a day your daily PIT will point to the latest record up to that reporting PIT date on top of what was already loaded to the PIT at the previous run.
A major difference between data vault 1.0 and data vault 2.0 is the use of hash as a surrogate key which allows for parallel and independent loads to the data vault artefacts. DV1.0 must employ staggered loads because in order to load a satellite adjacent to a hub the satellite load pattern must look-up the sequence key to use in the satellite. A link table must wait for all adjacent hubs to complete their loads before the link table itself can lookup the hub's sequence keys to utilise in order to represent the relationship in the staged source file. The link-satellite must wait for the link to complete its load before the link-satellite can lookup what link sequence surrogate key it needs to use.
With DV2.0 all the surrogate keys being hash-keys every data vault artefact can be loaded in parallel and independently. PIT and Bridge updates need to query the underlying data vault structures in order to determine the next delta. BUT if I can include the HashDiff as an attribute in the PIT then I do not need to pay the Pied Piper. I can take the same delta loaded to the adjacent satellite to load that portion of the PIT that needs changing. The delta load to PIT is oblivious to the load to the adjacent satellite.
Of course some consideration for fault tolerance and reconciliation is needed (I should not have keys and load timestamps in the PIT that do not exist in the satellites) but upon satellite load completion the two should be eventually consistent.
When I persist the data in the PIT the update does not need to be determined by a PIT load date, it can in turn be determined by a report date - i.e. the date of the state of the data I need.
You do need to consider what a PIT is there for; fast access to the underlying satellites round a hub or a link and therefore the same performance paradigm applies.
- Keep them thin
- Keep them short
- Keep them indexed
This is an idea I had but not implemented; to make the updates of PITs faster and eliminate the need for staggered loads. My objective is to load every data vault and query assistance table in parallel where I can. The idea assumes that each piece of technology mentioned would be available- even indexes.
Feel free to find gaps in the proposed approach.