Clear all

Can we have (a little) temporality in the LINKs ?

Posts: 2
Topic starter
New Member
Joined: 1 month ago

Let me run one concept by you.

Use case:
We are trying to build a raw Data Vault and recently we had a discussion in the team around LINKs. We realized that in ~90% of the cases, our users don't care about the validity of the relationship in the particular time period, all they care is which relationships are valid/active right now. At the same time developers using our warehouse (for views/reports development) are already a bit upset with using so many joins. They know it's a Data Vault thing and there are many benefits coming with that, so it's a fair price to pay, nevertheless we were thinking whether we can simplify things in this particular case, and maybe get rid of the effectivity satellites (and additional joins with them), or at least leave an option for either using the effectivity satellite if needed, or using something simpler instead.

The solution we came up with is: using the effectivity satellites for all of the use cases, in which we can establish, it is (or potentially will be in the future) necessary, useful, etc. to have the "from/to" dates available for a given relationship
and on top of that we will add to all LINKs (or some of the LINKs) the 'is_Active' flag, which will mark the record as active only if it can be found (and is active) in at least one of the data sources (source systems) for a given link.

How it may be implemented:
We separate the delete (marking record as inactive) operation from insert operation and preform them a bit differently.
Delete - we filter records by the source system when comparing src records to trg records (and looking for records that are no longer active/found in given source system) and thanks to that we deactivate only records related to this particular source system in a given link.
Insert - next we do the insert operation alone, this time not filtering the LINK by the source system, but only by the 'is_Active' flag, and we're checking whether given record (relationship) is already found in the LINK (doesn't matter from which source system) and we're inserting the new records as we'd normally do.

- If we apply 'is_Active' flag instead of effectivity satellites, we end up with less objects in the model and therefore a simpler queries and maintenance
- If we create the effectivity satellites for every link and add 'is_Active' on top of that, we enable the warehouse users to choose which option is better for they particular use case

- Potentially more complex data processing
- If we don't implement effectivity satellites everywhere, the business may change and will start to require the "from/to" information, then we're missing the history
- If data is coming from 2 or more systems, with the 'is_Active' flag alone, we only know that relationship is active somewhere, we cannot derive whether is active in both systems at the same time, or rather just in one of them
- In many databases 'is_Active' will break the primary key (we are using Snowflake which does not enforce the Unique keys, so we need to cover the uniqueness with other methods whether we use flag or not)


I'm really interested in your opinion on this. Obviously I'm aware that any temporality in LINKs is discouraged both in the 2.0 book and in the standards, nevertheless we did discuss it and we didn't find a major major issue with this approach, at least the one that would completely break the warehouse.
Are we missing something important here ? I do admit, in the data vault modeling, the HUBs and SATs seems fairly easy and straightforward (which I really like - I believe simplicity brings a great value in IT systems), but the LINKs feels a bit tough.


6 Replies