Recently I heard that the CDVP2 hands-on materials were not representing normalization and denormalization in the right light. I wanted to take a minute to refresh our understanding of what these terms truly mean and how they are defined. We do use and apply the terms properly, and oftentimes students get them mixed up or backwards.
Start with some definitions:
Database normalization is a process that reduces data to its canonical form.
Scott Ambler says:
Data normalization is a process in which data attributes within a data model are organized to increase the cohesion of entity types. In other words, the goal of data normalization is to reduce and even eliminate data redundancy, an important consideration for application developers because it is incredibly difficult to store objects in a relational database that maintains the same information in several places. (Source: http://agiledata.org/essays/dataNormalization.html)
A textbook definition, for your reference:
Normalization is the branch of relational theory that provides design insights. It is the process of determining how much redundancy exists in a table. The goals of normalization are to:
- Be able to characterize the level of redundancy in a relational schema
- Provide mechanisms for transforming schemas in order to remove redundancy
Normalization theory draws heavily on the theory of functional dependencies. Normalization theory defines six normal forms (NF). Each normal form involves a set of dependency properties that a schema must satisfy and each normal form gives guarantees about the presence and/or absence of update anomalies. This means that higher normal forms have less redundancy, and as a result, fewer update problems.
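The redundancy removal described above can be sketched in a few lines. This is a minimal, hypothetical example (the table and column names are mine, not from any text): a functional dependency customer_id -> customer_name means the name repeats on every order row, and normalizing splits the schema so each fact is stored exactly once.

```python
# Denormalized rows: customer_name repeats for every order, because
# of the functional dependency customer_id -> customer_name.
orders_denorm = [
    {"order_id": 1, "customer_id": 10, "customer_name": "Acme",   "total": 50},
    {"order_id": 2, "customer_id": 10, "customer_name": "Acme",   "total": 75},
    {"order_id": 3, "customer_id": 11, "customer_name": "Globex", "total": 20},
]

# Normalizing splits the schema: each customer name is stored once,
# and the orders table keeps only the key that points to it.
customers = {r["customer_id"]: r["customer_name"] for r in orders_denorm}
orders = [{k: r[k] for k in ("order_id", "customer_id", "total")}
          for r in orders_denorm]

print(customers)
print(orders)
```

Updating "Acme" now touches one row instead of many, which is exactly the update anomaly the higher normal forms guard against.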
The definitions above give a solid grounding in what normalization and denormalization are and how they work.
Where does Data Vault Reside?
Data Vault modeling resides approximately between 3NF and 4NF. That said, a Hub with just a list of business keys would reside in sixth normal form (6NF), with no repeating fields. When we add a hash primary key or a sequence primary key, the Hub is still in 6NF, with no repeating values or columns. When we add a load-date timestamp and record source, the Hub drops to 3NF. That begs the question: does the presence of a repeating system-field value dictate the normalization level of a structure? I would argue no. In the purest sense, however, the answer is yes, because the definitions make no distinction between system-driven fields and other data. So, to be precise, the Hub sits at 3NF.
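To make the Hub discussion concrete, here is a minimal sketch of a Hub row; the entity and column names are illustrative, not prescribed. Note how the business key is the only real data; the hash key, load date, and record source are the system-driven fields discussed above.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime

# Hypothetical Customer Hub row: one business key plus system fields.
@dataclass(frozen=True)
class HubCustomer:
    hub_customer_hash_key: str   # hash of the business key (primary key)
    customer_number: str         # the business key itself
    load_date: datetime          # system-driven field
    record_source: str           # system-driven field

bk = "CUST-1001"
row = HubCustomer(
    hub_customer_hash_key=hashlib.md5(bk.encode()).hexdigest(),
    customer_number=bk,
    load_date=datetime(2024, 1, 1),
    record_source="CRM",
)
print(row)
```

Strip the last two fields and the structure has no repeating values at all, which is the 6NF reading; with them in place it reads as 3NF.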
Links are closer to 2NF (second normal form). Why? They have more repeating values, especially around parent-child relationships. They are not 1NF because they still rely on joins to additional parent table structures.
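A Link row, sketched the same way (names again illustrative), shows why joins to parent structures are unavoidable: the Link carries only the parent Hubs' hash keys, never the business keys themselves.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical Link row relating a Customer Hub and an Order Hub.
# Resolving the actual business keys requires joining back to the
# parent Hub tables, which is the dependency noted above.
@dataclass(frozen=True)
class LinkCustomerOrder:
    link_hash_key: str            # hash over the combined parent keys
    hub_customer_hash_key: str    # reference to the Customer Hub
    hub_order_hash_key: str       # reference to the Order Hub
    load_date: datetime
    record_source: str

link = LinkCustomerOrder(
    link_hash_key="9f2c...",      # illustrative placeholder hash
    hub_customer_hash_key="ab01...",
    hub_order_hash_key="cd02...",
    load_date=datetime(2024, 1, 1),
    record_source="CRM",
)
print(link)
```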
Satellites, well, they start at 3NF if they are designed to be separated by grain or rate of change, and if they are designed with a singular record structure (what we call scalar values in each column). If a Satellite is designed to house JSON or other semi-structured data sets (XML or otherwise), then it may be classified as 2NF. Why? Because it contains additional levels of detail that denormalize joins to other tables; in reality it contains multiple granularities where repeating values are many, especially if you have nested arrays in the JSON or XML structure within a single Satellite.
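The two Satellite cases can be contrasted side by side. This is a hypothetical sketch: the first Satellite holds only scalar columns at a single grain (the 3NF case), while the second stores a semi-structured payload whose nested array carries a second granularity inside one row (the 2NF case).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any

# 3NF case: scalar descriptive attributes, one grain per row.
@dataclass
class SatCustomerDetails:
    hub_customer_hash_key: str
    load_date: datetime
    record_source: str
    name: str
    city: str

# 2NF case: a semi-structured payload with nested repeating groups.
@dataclass
class SatCustomerDocument:
    hub_customer_hash_key: str
    load_date: datetime
    record_source: str
    payload: dict[str, Any]   # e.g. parsed JSON with nested arrays

doc = SatCustomerDocument(
    hub_customer_hash_key="ab01...",
    load_date=datetime(2024, 1, 1),
    record_source="WEB",
    # The nested "orders" array is a second grain inside one row.
    payload={"name": "Acme", "orders": [{"id": 1}, {"id": 2}]},
)
print(doc.payload)
```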
Point-in-Time and Bridge structures sit at 2NF when first designed. They can be changed to sit at 1NF (so that they contain all the data a query needs without joins); this would be an example of a physical modeling adaptation to meet platform-specific needs, such as those of Google BigQuery.
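The two PIT designs above can be sketched as follows (structure and field names are my assumptions for illustration): as first designed, the PIT holds only keys and load dates pointing at Satellite rows, so joins are still required; the flattened 1NF variant copies the Satellite attributes in so a query needs no joins at all.

```python
from dataclasses import dataclass
from datetime import datetime

# 2NF case: a snapshot of keys and load-date pointers into Satellites.
# Queries still join to the Satellite to fetch attributes.
@dataclass
class PitCustomer:
    hub_customer_hash_key: str
    snapshot_date: datetime
    sat_details_load_date: datetime   # pointer into the details Satellite

# 1NF case: the Satellite attributes are copied into the row itself,
# trading redundancy for join-free queries on the target platform.
@dataclass
class PitCustomerFlat(PitCustomer):
    name: str = ""
    city: str = ""

flat = PitCustomerFlat(
    hub_customer_hash_key="ab01...",
    snapshot_date=datetime(2024, 2, 1),
    sat_details_load_date=datetime(2024, 1, 1),
    name="Acme",
    city="Denver",
)
print(flat)
```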
We hope you’ve enjoyed this journey down normalization lane, and we hope you find it useful for your understanding of Data Vault.