One service a DV2.0 consulting house may provide is evaluating an existing data vault and how well it conforms to the DV2.0 standard.
What are the things you would look out for?
1) Table structures
Standard metadata columns (tags) for the following:
- Hubs (unique business entities): hash key, record source, business key collision code, business key(s), applied date timestamp, load date timestamp, multi-tenancy id, task id
- Links (relationships between business entities): hash keys, record source, applied date timestamp, load date timestamp, multi-tenancy id, task id
- Satellites: hash key (of the parent hub or link), record source, applied date timestamp, load date timestamp, multi-tenancy id, task id, hashdiff, all source attributes captured
Do not use a business date as the applied date; doing so obscures the very nature of capturing history.
Imagine a call center where a customer requests a policy change that only takes effect in the future. That effective date should be captured as an attribute in the satellite (and included in the hashdiff), and nothing more.
Exception: a satellite with no hashdiff is acceptable only when the ETL tool performs a column-by-column comparison or the source is a CDC feed.
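To make the standard columns concrete, here is a minimal Python sketch of deriving a hub hash key and a satellite hashdiff. The column values, the pipe delimiter and the upper-casing rule are illustrative assumptions, not DV2.0 mandates; the point is that the future-dated effective date is just another attribute feeding the hashdiff.

```python
import hashlib

def sha2(parts):
    """SHA-256 over pipe-delimited, trimmed, upper-cased parts."""
    payload = "|".join(p.strip().upper() for p in parts)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hub hash key: bkcc + multi-tenancy id + business key(s) (example values)
hub_customer_hk = sha2(["default", "tenant1", "CUST-0042"])

# Satellite hashdiff: all source attributes, including the future-dated
# "policy change effective date" captured as a plain attribute
hashdiff = sha2(["Jane Doe", "Gold", "2030-01-01"])
```

Using a delimiter between parts matters: without it, the key pairs ("AB", "C") and ("A", "BC") would hash identically.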
2) Business entities
This point only relates to Hubs, as Hubs set the domain of the business. Planned hubs should reflect what the business interprets as entities of value. Where a single source depicts a relationship between a single domain and that same domain, a same-as link or hierarchy link is built. Ex:
- HUB_ACCOUNT reflects the financial relationship with a customer. A customer may have a home loan account or a credit card account; both may be represented by a single hub.
- HUB_CUSTOMER reflects the customer. It can be multi-party, covering organizations and persons, or these could be separate hubs, depending on how the enterprise views the customer.
The rules are strict: hubs hold business entities that represent some value to the business. Modelling nothing more than what the source system has defined as primary keys yields an operational data vault.
The examples above are potentially not business entities in your enterprise; the idea is that the vault reflects the domain it was built for: unique business entities.
3) Source vs Vault
Did we capture the source for what we need? Did we super-type business entities? (Hopefully not.)
Capture the source in the Raw Vault at the same grain as represented by the source, no different. Do not enforce conformity on the raw vault, because a conformed raw vault will not flex as the business matures. A requirement for a data vault is that it is agile, not only in practice but in model. If the underlying application source model changes then (if needed) the raw vault changes with it, be it through schema evolution or by modelling new tables. The Raw Vault survives and flexes as the application landscape matures.
4) Decoupled architecture
A true data vault is agnostic to the vendor implementing it. No matter the tech behind it (SSIS, Informatica, DataStage, Talend, Fivetran, WhereScape et al.), it populates the same target structures. Should the customer migrate to a different tool, would the model survive?
Marts are built on top of the data vault, aided if needed by PITs and Bridges. PITs and Bridges provide query assistance; they exist only to enhance query performance and are therefore never drawn as part of the business vault. BV derives new business value; PITs and Bridges do not. Does the model pass this test?
5) Loading patterns
Do the loading patterns meet the following criteria?
- Hubs: only new records are loaded, identified by new mltid, bkcc, hash key and business key
- Links: only new records are loaded, identified by new hash keys and mltid
- Satellites: only new records are loaded, identified by new hash keys and/or hashdiffs, and optionally mltid.
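The insert-only delta pattern above can be sketched in Python, with a set standing in for the target hub and tuples for staged rows; a real implementation would be set-based SQL, so this is an illustration of the logic only.

```python
def load_hub(target: set, staged: list) -> list:
    """Insert-only hub load: only staged rows whose hash key is not
    already in the target are inserted. Rows are (hash_key,
    business_key, mltid) tuples (illustrative layout)."""
    new_rows = [row for row in staged if row[0] not in target]
    target.update(row[0] for row in new_rows)
    return new_rows

hub = {"hk1"}  # target already holds hash key "hk1"
staged = [("hk1", "CUST-1", "t1"), ("hk2", "CUST-2", "t1")]
delta = load_hub(hub, staged)  # only the "hk2" row is new
```

Note the pattern is idempotent: replaying the same staged file inserts nothing, which is exactly what makes restartable, parallel loads safe.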
6) Business Vault
Implemented as physical data structures, recycling the loading patterns utilized for the Raw Vault.
The Raw Vault captures the application (OTC) business rules; the Business Vault captures the derived business rules. Together they form the data vault.
Is this the implementation? If the customer implements the BV as views instead, do they understand the implications of doing that? (Hint: no audit trail.)
7) Key strategy
SHA-2 is the current standard. Has the customer chosen
a. Hash keys
b. Natural keys
c. Sequence keys (retained/persistent keys)
and do they understand the implications of each?
Are the keys standardized? Meaning, are they universally lower-cased or upper-cased (depending on your preference) before hashing?
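A short Python sketch of why key standardization matters: without a casing rule, the same business key arriving with different casing from two sources would generate two different hash keys and split one entity in two. The function name and the upper-casing choice are assumptions for illustration.

```python
import hashlib

def hk(business_key: str, standardize: bool = True) -> str:
    """Hash key from a business key, optionally standardized
    (trimmed and upper-cased) before hashing."""
    key = business_key.strip().upper() if standardize else business_key
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# Without standardization, one customer gets two hash keys:
assert hk("cust-42", standardize=False) != hk("CUST-42", standardize=False)
# With standardization, they collapse to a single hash key:
assert hk("cust-42") == hk("CUST-42")
```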
8) Do they have a hash collision strategy?
If hash keys are utilized then a pre-load check is required:
- Vertical check: is there a hash-key collision within the staged file, i.e. two different business keys generating the same hash key? (If you ever find one, let us know and write a white paper. Eureka!)
- Horizontal check: will loading the business key + hash key into the target data vault create a clash where different business keys generate the same hash key?
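Both checks can be sketched in Python over (hash_key, business_key) pairs; the data shapes are assumptions, and in practice these would run as set-based queries against the staged file and the target hub.

```python
from collections import defaultdict

def vertical_check(staged):
    """Within one staged file: report hash keys shared by two or
    more different business keys."""
    seen = defaultdict(set)
    for hash_key, business_key in staged:
        seen[hash_key].add(business_key)
    return {h: bks for h, bks in seen.items() if len(bks) > 1}

def horizontal_check(staged, target):
    """Against the target hub (hash_key -> business_key): report
    staged rows whose hash key already maps to a different key."""
    return [(h, bk) for h, bk in staged
            if h in target and target[h] != bk]
```

For example, `vertical_check([("h1", "A"), ("h1", "B")])` flags `h1`, while `horizontal_check([("h1", "A")], {"h1": "Z"})` flags the staged row that would clash with the target.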
9) As of DV2.0, are the data loads in parallel?
A major advantage over DV1.0 is being able to load everything in parallel; do the architecture and model meet that standard? Any part of the model should be loadable at any time.
10) Has the model matured enough to provide the business intelligence required about the source systems?
- Do they have Status Tracking Satellites?
- Do they have Record Tracking Satellites?
- Do they have Effectivity Satellites?
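As a simplified illustration of one of these, here is a Python sketch of what a record tracking satellite captures: one row per hash key per load, so a key's absence from a later load (a possible source-side delete) is detectable. The structure and function name are assumptions; real DV2.0 implementations vary.

```python
def track_records(rts: list, staged_keys: set, load_date: str) -> list:
    """Record tracking satellite: append (hash_key, load_date) for
    every hash key seen in this load."""
    rts.extend((h, load_date) for h in sorted(staged_keys))
    return rts

rts = []
track_records(rts, {"hk1", "hk2"}, "2024-01-01")
track_records(rts, {"hk1"}, "2024-01-02")  # hk2 absent: deleted at source?
```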
11) Have they built PITs for their intended purpose?
Lean and mean: these table structures take advantage of index-on-index joins, so best practice is to keep them small enough to be useful and reserved for their intended purpose. Anything more than that and they become data marts.
12) Do they maintain AUDIT?
No details needed, except: if data is deleted for any reason other than regulatory compliance (e.g. GDPR), then it is not a data VAULT.
13) Is the data vault consistently reconciled to the data loaded into it?
This requires an automation framework that executes quick and easy recon scripts after every load, and it maintains trust with the business. If the recon never fails, how can anyone argue that the data vault is wrong?
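One quick-and-easy recon of the kind described above is a row-count balance check after each load: every staged row must be accounted for as either newly inserted or already present in the target. This is a minimal sketch under that assumption; real frameworks add per-table and checksum-level checks.

```python
def reconcile(source_count: int, staged_count: int,
              inserted: int, already_present: int) -> bool:
    """Post-load recon: staged rows must equal source rows, and every
    staged row must be either newly inserted or already present."""
    return (source_count == staged_count
            and staged_count == inserted + already_present)

# 100 source rows, 100 staged, 40 inserted, 60 duplicates: balanced
ok = reconcile(100, 100, 40, 60)
```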
14) What criteria would you add to or change in the list above? What makes a real data vault?