Every so often I get this question: Can or should the Raw Data Vault be virtualized in it's entirety? ie: 100% view driven off of a persistent staging area.
Here are my thoughts around this:
- I've always held the belief that one-day, DV modeling will be conceptually based (as it should be) and that the physical will depend entirely on the volume of data, and the abilities of the platform to perform, along with any particular special optimizations which the platform may provide.
- I've always taught the DV as conceptual. We have (yes, we do) in class, take to the physical level to ensure that it is built properly, and because most platforms today do not have the power or capability to deal with 100TB+ data in a "virtual fashion". However, that changes with Snowflake - and to some degree, Teradata, and possibly with other engines today...
At the end of the day, if virtualization is going to take place over 100% of your raw data then the following must be adhered to:
- technology must be sufficiently powerful to handle the volume of data
- technology must be fully capable of executing multiple levels of stacked views in parallel, and optimizing the views to the flattest possible mathematical result (ie: set logic folks)
- Technology must be fully capable of *even more joins* at the PSA level to "union together disparate data sets" like unique lists of business keys, and execute this in parallel over large volume data.
- delta data must live in the staging area - in other words, you cannot "operate" a data warehouse (not even a SCD type 2 from Kimball) without "delta's" already existing. There are a series of requirements for the PSA to follow before it can have reliable and consistent data virtualized in a view - foregoing the Raw DV table structures entirely.
- Security must not be an issue. I don't know if you are aware of this, but securing data in a view, over Billion + record tables underneath is not an easy (or recommended) task.
- Joins on the PSA must be optimized.
- Hashes might not be used - neither would sequence numbers. joins would have to occur on true business keys in the PSA in order to take advantage of co-location and micro-partitioning, etc... The only time a hash might be added, is when you have a hybrid Raw DV (part virtual, part physical).
Requirements for the PSA in order to even begin building a virtual Raw DV:
- PSA must house delta driven records
- PSA must be appropriately partitioned
- PSA must have defined metadata
- PSA must have some form of defined structure (including XML or JSON or Row/Column)
- For VOLUME: PSA must be partitioned in some form (either through design or by the database engine)
It's not as easy as you might think.... You also need to consider building physical status tables containing record dates and times stamped by the system, and record deleted date tracking needs to be accomodated for. None of these things can happen in real-time in view contexts UNLESS you are joining directly to a stage table that houses a FULL and COMPLETE audit trail for every record in the PSA.
By saying that - I am telling you: you *must* have a full and complete audit trail for each and every PSA table to power the views and their decisions on which records to include or leave out. Even if you or your team create the audit trail (because none exists on the source).
Is pure virtualization the future?
In one word: YES.
I have a theory that goes like this: at one point in the future, data will be "landed in one place" and will become "too big to move or restructure"
It is at that point, that technology vendors must be ready & the world must be ready, to move to views on their data. It *may or may not* be an actual virtualization engine, most likely not - most likely it will have to be views directly on top of the data set, and processing occurring right where the data lives. We saw this argument before, with the "rise of NoSQL", just that NoSQL when it first was produced was far too slow to handle massive volumes with a reasonable response time.
See my good friend Roelant Vos' posts:
- Virtualizing your Hubs And Links
- Virtualizing your Satellites
- Virtualizing your Driving Key Satellites
Some conclusion based thoughts you might consider:
- Data Vault Modeling is about the business concepts - always has been, always will be
- Data Vault Modeling is conceptually oriented (as a result of #1 above)
- Data Vault Modeling does not have any documented statements that say : it must be physical
- Data Vault Modeling will vary in the physical form in order to optimize for the engine it is built on, otherwise you couldn't build Data Vault Models successfully on MongoDB - and it has been done and proven successful.
- Physical optimizations for accessing and loading data will vary based on technology
- A virtual data vault does not break any standards, as long as the conceptual data vault follows the standards put forth, and the data sets meet the standards as well (ie: delta driven, business key managed, etc...)
What's the dream?
The dream is 100% virtual Raw Data Vault going forward, to have technology and a platform powerful enough to execute and perform against massive data sets (hundreds of billions of records per PSA table) and join them together quickly and efficiently. This will be the optimal, most efficient and most agile strategy going forward.
If you can build one today, do so, however my recommendation?
START SMALL - Don't boil the ocean. Virtualize the "smaller data vault tables" in your organization (for example: reference DV) Grow with your platform - understand your platforms' capabilities and performance metrics, and understand your data sets and growth patterns, and don't forget the list of PSA requirements above.
Would LOVE your feedback!! Please *add* requirements, thoughts, comments to this post - this is not set in stone folks.