Hi Everyone!
We want to encourage you to engage in helping evolve the Data Vault Standards. While this is not a democratic process (as we don’t believe in voting for standards), this must be an evolutionary process. In that light, we want community suggestions. However, the proper rigor must be applied to the new standards or change suggestions that are put forward.
In this post, we share some of the requirements for suggesting new standards to Data Vault. Please remember they can be standards for: Architecture, Methodology, Modeling, and Implementation.
Please remember that there are 30,000 test cases in place for the basic Data Vault standards that are offered to the market. Only standards that stand the test of time, are NON-CONDITIONAL (work regardless of condition), scale in both batch and real-time, and are repeatable will actually work for the community.
For now, if you want to read a bit more, you can do so here: DanLinstedt.com (my blog)
For Architecture Suggestions:
- Must incorporate a Hybrid architecture of NoSQL and Relational platforms
- Must have clean splits as to the function of each component
- Each component must be well defined, succinctly defined, and easily understood
- Components can be: Source Systems, Staging Area, Landing Zone, Data Warehouse, Business Warehouse, Information Mart, Data Science Area, Operational Applications, Real-Time Message Queues, Master Data Management Solutions, and ODS (although ODS is on its way out and very rarely used these days)
- Must allow data to flow bidirectionally across multiple pieces.
For Methodology Suggestions:
- Must work well with people, process, and technology
- Can be suggested for either people, process, OR technology
- Must have succinct, clear, non-redundant definitions.
- Must be easy to repeat
- Must be pattern based
- Must be optimized at CMMI Level 5
- Must be measured (have Six Sigma KPI results) to prove that it is optimized
- Must be business focused (don’t forget: IT is a Business!!)
- Must be well-defined and succinctly defined; only non-redundant definitions are accepted.
- Must be a methodology based standard, and NOT a framework standard.
- Must have TQM KPAs identified, and TQM KPIs for measuring success
- Must PROVE (through KPIs) that Cycle Time for build / use / application is reduced.
- Must work with split parallel teams, or global teams
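One way to express the cycle-time requirement as a number is a simple before/after comparison. The Python sketch below is only an illustration: the function name and the sample figures are hypothetical, and a real Six Sigma or TQM measurement would involve far more than comparing two means.

```python
from statistics import mean


def cycle_time_reduction(baseline_days, candidate_days):
    """Return the fractional reduction in mean cycle time (0.25 means 25% faster)."""
    baseline = mean(baseline_days)
    candidate = mean(candidate_days)
    if baseline <= 0:
        raise ValueError("baseline cycle time must be positive")
    return (baseline - candidate) / baseline


# Hypothetical sample figures, for illustration only.
before = [10, 12, 14]  # days per deliverable under the current standard
after = [6, 9, 9]      # days per deliverable under the suggested standard

reduction = cycle_time_reduction(before, after)
print(f"Cycle time reduced by {reduction:.0%}")
```

A suggestion that cannot show a positive, repeatable reduction on measurements like these would not meet the "PROVE" requirement above.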
For Modeling Suggestions: (Standards applying only to DV modeling)
- Must be simple, measured with complexity ratings (KPIs for maintenance)
- Must work in batch mode for loading
- Must work in Real Time mode for loading (without changes to the structure), up to 400k transactions per second. This requirement excludes deadlock contention on inserts, which is a function of the database platform and not the modeling paradigm.
- Must work with sequences, hash keys, and natural business key architectural designs
- Must not change the base definition of a Hub, a Link, or a Satellite (for a new object type, a new definition must be applied). I highly recommend offering an “extended object” or changing the application of the object rather than changing the actual definition of the base objects available. For example, Hierarchical Link is one such extended definition, as is Transactional or Non-Historized Link.
- Must work in BIG Data sets (>300TB of data)
- Must be placed in either Raw DV or Business DV
- Must be repeatable
- Must be pattern based
- Must enable easy back-up and restore
- Must support Change Data Capture
- Must not break the foundational definitions of each of the core Data Vault objects.
- Must work in a LOGICAL and CONCEPTUAL manner. Please note: implementing DV in a physical model on some platforms, like MongoDB, requires changes / denormalization at the physical level, so the end “collections” don’t really resemble pure DV objects.
- Must fit within a hierarchy (ontology / taxonomy model)
- Must NOT have conditional design rules. What works for ONE case must work for all.
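To make the "sequences, hash keys, and natural business keys" and "no conditional design rules" points a little more concrete, here is a minimal Python sketch in which one Hub-loading pattern runs unchanged with all three key styles. Everything in it is an assumption for illustration: the function names are hypothetical, an in-memory dict stands in for a Hub table, and the trim/upper-case normalization and the choice of MD5 are placeholders rather than anything this post prescribes.

```python
import hashlib
from itertools import count


def natural_key(business_key: str) -> str:
    # Assumed normalization: trim whitespace and upper-case the business key.
    return business_key.strip().upper()


def hash_key(business_key: str) -> str:
    # Assumed hash function (MD5) over the normalized business key.
    return hashlib.md5(natural_key(business_key).encode("utf-8")).hexdigest()


def make_sequence_key():
    # Returns a function that hands out one surrogate number per business key.
    assigned = {}
    counter = count(1)

    def sequence_key(business_key: str) -> int:
        bk = natural_key(business_key)
        if bk not in assigned:
            assigned[bk] = next(counter)
        return assigned[bk]

    return sequence_key


def load_hub(hub: dict, business_keys, key_fn) -> int:
    """Insert each business key once; the logic is identical for every key style."""
    inserted = 0
    for bk in business_keys:
        key = key_fn(bk)
        if key not in hub:
            hub[key] = {"business_key": natural_key(bk)}
            inserted += 1
    return inserted


# The same loader, three key strategies, no conditional logic in load_hub.
for name, key_fn in [("natural", natural_key),
                     ("hash", hash_key),
                     ("sequence", make_sequence_key())]:
    hub = {}
    print(name, load_hub(hub, ["cust-001", " CUST-001 ", "cust-002"], key_fn))
```

The point of the sketch is only that the key strategy is swapped in from outside while the loading pattern itself stays the same.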
For Implementation Suggestions:
(this would be the processing standards)
- Must work with small and large volumes of data (1 TB to 500 TB)
- Must have metrics results proving viability
- Must have defined test cases (KPAs) and results (KPIs) of those test cases presented.
- Must work for Real-Time and Batch loading without conditional changes to the processing streams, or process design.
- Must be repeatable
- Must be optimized
- Must be fault-tolerant
- Must be restartable (WITHOUT CHANGING INCOMING DATA SETS!!)
- Must work with Change Data Capture
- Must be tested against multiple platforms, and with multiple tools. For example: what works in Teradata must also work in Oracle and in SQL Server WITHOUT changes to the standard!! NOTE: If it’s a new NoSQL platform like Neo4j, and if it’s a physical implementation standard, it may be directed solely at a single platform. However, please note: the more a standard is focused on a single platform, the less it is a standard and the more it is simply a “best practice” for that particular platform. There IS a DISTINCTION between best practices and standards.
- If it works in C# it must work in Perl, Ruby, Python, Java, JavaScript, SQL, etc…
- Must be tested for backup and restore
- Must meet CMMI Level 5 Optimizations
The whole point of Data Vault standards for implementation is to be platform agnostic, technology agnostic, repeatable, fault-tolerant, and scalable. Again, if it’s platform specific, then most likely it is a best practice for a specific platform and NOT a standard.
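As a rough illustration of "restartable WITHOUT changing incoming data sets", the Python sketch below loads a Satellite-like structure so that re-running the exact same batch after a failure simply picks up where it left off. This is a sketch under stated assumptions, not a reference implementation: the function names, the in-memory list standing in for a table, the SHA-256 change hash, and the simulated crash are all hypothetical.

```python
import hashlib
import json


def hash_diff(attributes: dict) -> str:
    # Assumed change-detection hash over the descriptive attributes.
    payload = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_satellite(satellite: list, batch: list, fail_after=None) -> int:
    """Append a row only when the attributes differ from the latest stored row."""
    latest = {row["key"]: row["hash_diff"] for row in satellite}
    inserted = 0
    for i, record in enumerate(batch):
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("simulated crash")  # stands in for a real failure
        diff = hash_diff(record["attributes"])
        if latest.get(record["key"]) != diff:
            satellite.append({
                "key": record["key"],
                "load_date": record["load_date"],
                "hash_diff": diff,
                "attributes": record["attributes"],
            })
            latest[record["key"]] = diff
            inserted += 1
    return inserted


batch = [
    {"key": "C1", "load_date": "2024-01-01", "attributes": {"name": "Ann"}},
    {"key": "C2", "load_date": "2024-01-01", "attributes": {"name": "Bob"}},
    {"key": "C3", "load_date": "2024-01-01", "attributes": {"name": "Cy"}},
]

satellite = []
try:
    load_satellite(satellite, batch, fail_after=2)  # crashes part-way through
except RuntimeError:
    pass

# Restart with the SAME, untouched batch: only the missing row is added.
print(load_satellite(satellite, batch), len(satellite))
```

Because the loader decides what to insert by comparing against what is already stored, the restart needs no edits to the incoming data and no special "recovery" branch in the process design.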
Questions to ASK yourself about your Idea:
Here are some questions I typically ask of the “suggested change / new standard”:
- Does it negatively impact the agility or productivity of the team?
- Can it be automated for 98% or better of all cases put forward?
- Is it repeatable?
- Is it consistent?
- Is it restartable without massive impact? (when it comes to workflow processes)
- Is it cross-platform? Does it work regardless of platform implementation?
- Can it be defined ONCE and used many times? (goes back to repeatability)
- Is it easy to understand and document? (if not, it will never be maintainable, repeatable, or even automatable)
- Does it scale without re-engineering? (for example: can the same pattern work for 10 records, as well as 100 billion records without change?)
- Does it handle alterations / iterations with little to no re-engineering?
- Can this “model” be found in nature? (Here “model” might be a process, data, design, method, or otherwise; “nature” means reality, beyond the digital realm.)
- Is it partitionable? Shardable?
- Does it adhere to MPP mathematics and data distribution?
- Does it adhere to Set Logic Mathematics?
- Can it be measured by KPIs?
- Is the process / data / method auditable? If not, what’s required to make it auditable?
- Does it promote / provide a basis for parallel independent teams?
- Can it be deployed globally?
- Can it work on hybrid platforms seamlessly?
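A couple of these questions (scaling without re-engineering, partitionable / shardable, data distribution) can be checked with a very small experiment. The Python sketch below is one hypothetical way to do that: it assigns keys to shards with a deterministic hash and applies the same function to 10 keys and to 100,000 keys. The function names, the SHA-256 hash, and the modulo placement are assumptions for illustration; real MPP platforms use their own distribution mechanisms.

```python
import hashlib
from collections import Counter


def shard_for(business_key: str, shard_count: int) -> int:
    # Deterministic placement: the same key always lands on the same shard.
    digest = hashlib.sha256(business_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def distribution(keys, shard_count: int) -> Counter:
    # Count how many keys land on each shard.
    return Counter(shard_for(k, shard_count) for k in keys)


# The same pattern, unchanged, for a tiny and a much larger key set.
small = distribution((f"CUST-{i}" for i in range(10)), shard_count=4)
large = distribution((f"CUST-{i}" for i in range(100_000)), shard_count=4)

print(sorted(small.items()))
print(sorted(large.items()))
```

If a suggested pattern needs a different placement rule once the volume grows, that is a sign it does not scale without re-engineering.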