A summary of some of the sins I have witnessed...
Please add below if you've seen stranger data models and implementations of DV2.0! I have seen data vaut models called Data Vault 2.0 but don't follow the standards as defined by Dan himself!
- "Business Vault is the layer exposed to the business" - the implication is that Raw Vault is not exposed to anyone and Business Vault is the "conformed" business layer with column names in the form digestible by the business. This is an incorrect implementation of Business Vault. Business Vault is nothing more than the archived outcome of derived business rules based on Raw Vault. It is not a copy of Raw Vault, Business Vault is sparsely populated that serves as the implementation to resolve business process gaps in the source application. For example, a 3rd party source application may define that a home loan account has its own facility and an offset account may have its own facility too but the business only sees one facility that is an umbrella business entity for all related accounts (loan and offset accounts). The raw vault link will contain how the 3rd party source defines this relationship but the business vault link will have the business view of that relationship. It is not a replication of the source but fulfilling the business view of the business process. Additionally, like raw vault, the business vault is auditable and reuses the same loading patterns. Another use (instead of just plugging source-system gaps / defining a different view of the source) may be to centralize and create an audit trail of derived content that is automated and closely tied to data governance practices. It may also serve as temporary store of derived content until the raw source can supply that content instead. BV is based on RV, RV+BV=DV.
- "Automatically decomposing Units of Work into bi-relationship link table although the UoW had more than just the two business entity composition". Decomposing the original Unit of Work into two-part link tables break the UoW and in many instances may make it impossible to recreate the source if needed, a fundamental requirement for data vault, auditability. Besides why break the UoW only to have to recreate it downstream? Not only does this add latency to time-to-value but it also becomes costly in terms of query execution to break it up and reassemble but also the number of tables to join makes the solution perform poorly.
- Creating Weak Hubs. Hubs that do not track unique list of business entities or another definition is using too generic hub table names. The former occurs when we identify keys and assume that these should be loaded into their respective hubs to create a link to them when in fact such keys would be better defined as Dependent-Child keys in a link or satellite table. The so called "weak" key doesn't describe anything on its own and needs a "strong" key in order for it to have any meaning to anyone using that key. This is exactly what a dependent-child key is! The latter, using too generic hub definitions relates to overloaded content landed and needs to be modelled. We have seen this, a single business key column that has multiple business entities within it and a decision is made to load a hub_entity. The name itself doesn't convey any meaning and if you take a step back and think about what you have just created a is dependent-child hub! Without something to better define that hub key you need some code or category to define it! The overloaded business key column must be split either by (in order of preference) a) the source to provide the content split as independent files/tables to be staged or b) we have to split the content before staging. The former is preferred because it solves the integration debt upfront and the task of keeping this up to date and managed is with the source application. The latter means you must be additional code to split the content, and build in error traps to catch the moment a change is detected in rule to split that column. An overloaded column can also mean a need to implement row-level business key collision codes, more maintenance for in your data warehouse!
- Source system vault, a key understanding of data vault is the need to integrate multiple business processes by their natural keys which may be served by multiple source applications. If source applications share natural keys then the passive integration naturally occurs and the entire business process is tracked in harmony in the data warehouse. A great example is by treating transaction-style MDM as a source because its harmonized (match-merged) key is used to integrate multiple applications in realtime. No collision of business keys can occur (a single key value that could represent multiple business entities within the same hub). BUT as diverse as the business decisions are to purchase specific software to automate various portions of the business service, integrating the business processes into a single data warehouse will likely come across where such collisions to a single hub table could occur, only then do we use a business key collision code (BKCC) to ensure that loaded business entities remain unique within that hub. To use a collision code based on the source-system that provided the key, without considering that two or more source applications could be using the same business key value to represent the same business entity leads to a source-system vault, an anti-pattern. You will inevitably create more entities than you need to in the hub, more joins then required to bring the content together and more tables required in business vault to solve the integration debt you would have introduced if you followed this source-aligned loading pattern. Also remember that the appropriately applied business key treatments is performed (left+trim+upper) before hashing
- Staggered loads. Surrogate sequence key data vault needed to follow a similar loading paradigm as dimensional tables, first all the dimensions must be loaded before the fact table can be loaded. For Data Vault before hash-keys, the same was true, you needed to load hub tables before loading hub-satellites and link tables, you needed to wait for the link table to be loaded before you can load a link-satellite as each step needed to lookup the parent key in order to maintain referential integrity (foreign key constraint). This led to loading paradigms where loads were dependent on each other and could not continue into until its parent table had completed its load. Some implementations have gone as far as loading all hubs first, followed by all links, followed by all satellites. So much time is wasted (muda) waiting for table loads to complete that you could only really get away with this loading paradigm in over night batches! Surrogate hash keys meant that after the content has been staged that all data vault target tables can be loaded independently and still guarantee referential integrity either as post load automated orphan checks or as deferred RI. Each portion of the enterprise data vault is eventually consistent ensuring that the data does flow and does not need to wait for any other portion of the data vault model to complete its load before its own load can continue.
- Not recognizing Link-Satellites. As an extension of point 3 above a data vault model without link-satellites lacks the descriptive depth to describe a relationship. Instead in these "data vaults" a hub table is created simply to support a hub-satellite that describes the relationship. That hub table is what we want to avoid because it is not by (Oxford) definition a hub and it just adds more tables to include in the join, thus affecting performance to get the data out. In recognizing link-satellites also recognize that even if source data is supplied with multiple keys it does not mean that all the content must by default go to a link satellite. If all or part of the content is describing only one business entity then that content must be mapped to a hub-satellite (or split between multiple hub-satellites and a link-satellite), thus loading content to what it is describing through satellite splitting. Not performing satellite splitting forces the consumption of the data vault into queries that have to "re-envisage" descriptive-content change in-flight everytime that content is sought. In other words to get a business entity level-grain out of a link satellite you have to select distinct by the business key of the relationship in order to get to that grain everytime. Instead satellite splitting ensures that content is already appropriately propagated at the grain you need and that a change record is a true change. On the subject of satellite content, an old DV motto is "all the data all the time" and still rings true. If the source supplied 600 columns do the due diligence and split the content appropriately so that you can recreate the source and remain 100% auditable. For performance reasons split the satellite by PII, rate of change and by clustering together critical data elements as well as the other satellite splitting reasons listed above.
- Using anything other than string/varchar for recording business keys in hubs. Creating hub tables with integer/numeric data types simply because the business key has the word "number" in it does not mean the column is a measure. Measures are additive/semi-additive fields that are appropriately data typed to suite performing algorithms on those said columns. Business keys are immutable, remain consistent for the life of the business entity and are not values that arithmetic should be applied to. And even if you do you have now created a new key. On top of that ensuring that the hub business key column is of type string/character/varchar ensures that the hub table will least likely ever need to be refactored and that you can load any other data source to that column without breaking the hub table. I go to great lengths explaining why we cast business keys to string in this post Casting Business Keys and it is not only applicable to data vault it is applicable to data modelling in general. On the subject of data types, surrogate hash keys must always be delivered as raw/binary data types and all joins are performed in-database and not in the BI tool, especially those tools that are not built to handle complex joins and binary data types!
- Loading DV after PITs and Bridges
- Using a business date as a load date
- End dates in Satellite tables