A major difference between Data Vault 1.0 and Data Vault 2.0 is the introduction of hashes as surrogate keys in place of sequence numbers. This is a paradigm shift in how a data vault is loaded and implemented, and it opens the way for removing database-enforced referential integrity (RI). If RI is enforced on a data vault, you must declare foreign keys between a satellite and its hub or link, and between a link and its hubs.
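A minimal sketch of how such a hash key might be derived (assuming MD5 over trimmed, upper-cased business keys joined with a `||` delimiter, a common Data Vault 2.0 convention; the function and column names here are illustrative):

```python
import hashlib

def hash_key(*business_keys: str, delimiter: str = "||") -> str:
    """Derive a deterministic surrogate hash key from one or more business keys."""
    # Trim and upper-case so the same business key hashes identically,
    # regardless of how the source system formatted it.
    normalized = delimiter.join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

customer_hk = hash_key("  cust-001 ")           # hub hash key
order_link_hk = hash_key("CUST-001", "ORD-42")  # link hash key spans both business keys
```

Because the key is a pure function of the business key(s), it can be computed on the staged file itself; no sequence lookup against the hub is needed at load time.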
This imposes sequential loading rules onto the data vault:
- The hub must be updated before related satellites and links
- The link must be updated before related satellites
On the longest path, a link-satellite waits for its link to complete its updates, and the link in turn waits for ALL of its related hubs to be updated. The problem escalates rapidly; consider a link with ten or more hubs!
With a data vault based on sequence keys, RI enforced at the database level is a requirement, because the keys needed in a child satellite or link table must be looked up from the parent hub (or link and its hubs).
With hash keys, RI is not enforced at the database level but is instead coded for in a reconciliation framework.
Now picture this: a file is modeled and mapped to load to a
- hub & satellite(s) - in the case of the staged file containing only one business key; or
- link+hubs & satellite(s) - if the staged file contains multiple business keys.
With hash keys calculated on the staged file it no longer matters which data vault artefact gets updated first; all loads from the staged content are directly loaded into hubs, links and satellites independently.
If the satellite completes its update before the hub, the hub content will eventually catch up; the same applies to links and their satellites, and to links and their hubs. This is part one of eventual consistency in data vault (File-level-EC).
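The independence described above can be sketched with in-memory stand-ins for the target tables (hypothetical names; real loads would write to hub and satellite tables):

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-ins for a hub and its satellite.
hub, satellite = {}, {}

staged = [
    {"hash_key": "a1b2", "business_key": "CUST-001", "name": "Acme"},
    {"hash_key": "c3d4", "business_key": "CUST-002", "name": "Zenith"},
]

def load_hub(rows):
    for r in rows:
        hub.setdefault(r["hash_key"], r["business_key"])

def load_satellite(rows):
    for r in rows:
        satellite.setdefault(r["hash_key"], {"name": r["name"]})

# The hash key is already on the staged rows, so neither load needs a
# lookup against the other: they can run concurrently, in any order.
with ThreadPoolExecutor() as pool:
    pool.submit(load_satellite, staged)  # the satellite may well finish first
    pool.submit(load_hub, staged)
```

Whichever finishes first, both targets end up consistent once all loads from the staged file complete.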
Independent reconciliation tests are executed to check the integrity between the staged file and the target data vault artefacts, and dependency tests are executed once related data vault artefacts have completed their updates.
- independent test - e.g. are all updates from the staged file loaded to the target satellite? Executed after the satellite has updated.
- dependency test - e.g. do all my satellite hash keys have an entry in the adjacent hub table? Executed once the related data vault artefacts are updated.
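Both test styles reduce to set comparisons over hash keys; a minimal sketch (key sets here are illustrative stand-ins for what would be read from the staging and target tables):

```python
# Hypothetical key sets as read from staging and target tables.
staged_keys    = {"a1b2", "c3d4"}
satellite_keys = {"a1b2", "c3d4"}
hub_keys       = {"a1b2"}  # the hub load has not caught up yet

def independent_test(staged, satellite):
    """Did every staged hash key land in the target satellite?"""
    return staged <= satellite

def dependency_test(satellite, hub):
    """Coded RI: satellite hash keys with no adjacent hub entry (orphans)."""
    return satellite - hub

passed  = independent_test(staged_keys, satellite_keys)  # True once the satellite is loaded
orphans = dependency_test(satellite_keys, hub_keys)      # non-empty until the hub catches up
```

An empty orphan set is the coded equivalent of a satisfied foreign-key constraint; a non-empty one simply means the related artefact's load has not yet completed.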
If all reconciliation tests pass, move on to the next file to load. This is part two of eventual consistency in data vault (Enterprise-level-EC): every portion of the data vault can be updated independently, at any time of day. Rather than relying on an overnight batch run for ALL files, a new file with updates can be dropped into a landing-zone hotplate, and the ingestion framework picks up that file and loads it to the appropriate data vault artefacts as needed. Loads to common data vault target tables (for example, shared hubs) are managed by orchestration pooling or database-level semaphores.
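One way to picture the semaphore over a shared hub (a threading sketch with illustrative names; a real framework would use orchestration locks or the database's own concurrency controls):

```python
import threading

hub_semaphore = threading.Semaphore(1)  # one writer to the shared hub at a time
shared_hub = {}

def load_to_shared_hub(rows):
    with hub_semaphore:  # serializes concurrent file loads targeting the same hub
        for r in rows:
            shared_hub.setdefault(r["hash_key"], r["business_key"])

# Two independently arriving files that both reference the same hub.
file_a = [{"hash_key": "a1b2", "business_key": "CUST-001"}]
file_b = [{"hash_key": "a1b2", "business_key": "CUST-001"},
          {"hash_key": "c3d4", "business_key": "CUST-002"}]

threads = [threading.Thread(target=load_to_shared_hub, args=(f,)) for f in (file_a, file_b)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because hub loads are idempotent inserts keyed on the hash, the order in which the two files acquire the semaphore does not matter.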
This means that the enterprise data vault is in a state of continual update and data naturally "flows" into the data vault. This is a key reason why sticking to the standards developed for Data Vault 2.0 is so important! Taking "cheap" shortcuts, such as attempting to solve application gaps in the business vault, creates a point of maintenance in the EDW and an additional "hop" in the time to value for analytics. If the data vault is to naturally flex and grow with the business, the majority of the technical debt must be pushed back to the source application.

The second major difference from Data Vault 1.0 is the move to an insert-only loading paradigm. Downstream processes no longer need to wait for updates to close off records in satellites, and in the case of Big Data, data is immutable and cannot be updated anyway (at least not cheaply)!
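The insert-only paradigm can be illustrated with a small sketch: no UPDATE ever closes off an old satellite record; new versions are appended, and the current record is derived at read time (data and column names here are illustrative):

```python
sat_rows = [
    {"hash_key": "a1b2", "load_dts": "2024-01-01", "name": "Acme"},
    {"hash_key": "a1b2", "load_dts": "2024-03-05", "name": "Acme Corp"},  # appended, never updated
]

# Current view: the latest load per hash key, computed at query time
# instead of maintaining an end-date column via UPDATEs.
current = {}
for r in sorted(sat_rows, key=lambda r: r["load_dts"]):
    current[r["hash_key"]] = r
```

In SQL terms this is the familiar "latest row per key" window query; the point is that the satellite itself is append-only.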
Having this flexibility and continual data flow is ideal and functions very well in a micro-batch scenario.
For point-in-time data and information, consider utilizing PIT and Bridge tables to extract the data out of the data vault for the business reporting requirement, whatever that may be. For everything else, VIEWs are created directly over the data vault to immediately serve dashboards and other business intelligence tools. The content can also be loaded directly to join-on-write platforms like Cassandra and Riak.
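A PIT table essentially records, per hub key and snapshot date, which satellite record was in effect; the lookup it materializes can be sketched as follows (illustrative data and names):

```python
def pit_lookup(sat_rows, as_of):
    """Latest satellite record per hash key loaded on or before as_of."""
    pit = {}
    for r in sorted(sat_rows, key=lambda r: r["load_dts"]):
        if r["load_dts"] <= as_of:
            pit[r["hash_key"]] = r
    return pit

sat_rows = [
    {"hash_key": "a1b2", "load_dts": "2024-01-01", "name": "Acme"},
    {"hash_key": "a1b2", "load_dts": "2024-03-05", "name": "Acme Corp"},
]

snapshot = pit_lookup(sat_rows, as_of="2024-02-01")  # the record in effect on that date
```

Materializing these results per snapshot date is what lets reporting queries hit an equi-join instead of repeating the point-in-time logic per query.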
A platform like Snowflake easily allows for schema evolution through its micro-partitioning technology, and up to 90 days of time travel is a handy feature on an EDW. Together with Snowflake's native support for parsing semi-structured data and its functions for extracting structured data from the VARIANT column type, this makes Snowflake a natural fit for data vault, and data vault a natural fit for Snowflake!
Final thought for a Friday afternoon: data vault offers by far the most flexible and manageable option when it comes to data retention and privacy compliance for your enterprise data. Data vault is naturally non-destructive to change, while also offering the flexibility and implementation governance necessary for compliance.
"A model should not be built to last it should be built to change – only then can it truly last."