Call it back-dated data, call it real-time feeds, call it old data arriving late, call it what you will. The problem has persisted since day one of systems creating, feeding, and sharing data. But the problem is deeper than just systems: it lives on wherever business users can, and must, back-date data.
What does this really mean?
Back-dated data arrives out of sequence in the data warehouse or integration system. It’s known by many names:
- Back-Dated Data
- Late-Arriving Data
- Out of sequence data
- Delayed data
- Previously missed data
- Data that got held up in an operational process somewhere
You get the idea. The point is: the data arrives in the warehouse at one date-time, but was generated, produced, or modified on a different date-time line.
Why does this matter?
This matters because these are business-driven date-time lines that are important to queries, and to how BI / Analytics deciphers and uses the information. Late arrival can change the values of aggregates and the detection of when a delta has occurred in a data set.
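A minimal sketch of the point above (column names here are illustrative, not from any standard): when a back-dated row arrives, an aggregate computed over the business (applied) timeline changes, even though nothing changed on the insert (load) timeline for that business day.

```python
from datetime import date

# Each row carries two timelines: load_date (when it was inserted into the
# warehouse) and applied_date (when it happened per the business).
rows = [
    {"load_date": date(2024, 1, 2), "applied_date": date(2024, 1, 1), "amount": 100},
    {"load_date": date(2024, 1, 2), "applied_date": date(2024, 1, 2), "amount": 50},
    # A back-dated row: inserted on Jan 5, but it belongs to Jan 1's business timeline.
    {"load_date": date(2024, 1, 5), "applied_date": date(2024, 1, 1), "amount": 25},
]

def daily_total(rows, business_day):
    """Aggregate by the business (applied) timeline, not the insert timeline."""
    return sum(r["amount"] for r in rows if r["applied_date"] == business_day)

# Before the late row arrived, Jan 1's total was 100; after it arrives, 125.
print(daily_total(rows, date(2024, 1, 1)))  # 125
```

Any downstream report that cached "Jan 1 = 100" is now stale, which is exactly why late-arriving data matters to aggregation.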
What are others in the market saying?
Some have argued that changing the load date is the right way to address this problem. Others have written articles about “new table structures” that claim to solve it, but those articles only address what happens in a batch environment; unfortunately, they completely miss what happens in real time. They also ignore concerns such as scalability, restartability, and duplicate detection during restarts. The suggestions offered in those articles do not adhere to the DV2 standards and will cause your system to fail or break down in a number of ways going forward.
How does this relate to Data Vault 2.0?
Changing the meaning or definition of LOAD DATE is neither wise nor acceptable. The load date is, was, and always will be the “date and time of insert of that piece of data into the target system.” Re-ordering the data into batch sequence on arrival is not an option either, as it does not scale – especially in real time, with “out of date” data arriving to the tune of hundreds of millions of transactions per second.
What is the answer then?
The answer is simple: add another temporal field. We call this field the applied-date. We have been teaching this concept and discussing it in CDVP2 since the beginning. It is the proper method, and the only method, to set up the correct sequencing of data in accordance with the business, or in accordance with the late-arriving data.
What is APPLIED-DATE?
Applied-Date is a concept: an attribute that houses a timeline different from that of the load-date itself. You are not limited to a single applied-date. Feel free to add as many applied-dates as necessary (just name them functionally and appropriately) so that BI queries, or the information mart builders, can execute logic against them properly. In the case of a late-arriving data set for deltas, this may mean your Satellites in DV2 do in fact store repeated deltas of changing values. That is the fundamental job of the Satellite, and it is working as designed. To change this behavior for any reason is to depart from the standards.
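The Satellite behavior described above can be sketched as follows (a simplified, hypothetical insert-only satellite; real DV2 satellites also carry hash diffs, record sources, and so on). Note the load never reorders or updates rows; the late-arriving row simply produces a repeated value when read along the applied-date timeline.

```python
# Hypothetical satellite rows: keyed by hub hash key + load_date, with an
# extra applied_date column carrying the business timeline.
satellite = []

def insert_delta(hub_hk, load_date, applied_date, attrs):
    """Insert-only satellite load: never update, never reorder on arrival.
    A late-arriving delta may repeat an earlier value -- that is by design."""
    satellite.append({
        "hub_hk": hub_hk,
        "load_date": load_date,
        "applied_date": applied_date,
        **attrs,
    })

insert_delta("abc", "2024-01-02T00:00", "2024-01-02T00:00", {"status": "open"})
insert_delta("abc", "2024-01-03T00:00", "2024-01-03T00:00", {"status": "closed"})
# Late row: arrives Jan 5 but applies to Jan 1. Along the applied-date
# timeline, "open" now appears twice -- the satellite keeps both rows.
insert_delta("abc", "2024-01-05T00:00", "2024-01-01T00:00", {"status": "open"})

timeline = sorted(satellite, key=lambda r: r["applied_date"])
print([r["status"] for r in timeline])  # ['open', 'open', 'closed']
```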
In other words: leverage the one or more applied-date timelines in your outbound queries (when loading PITs and Bridges, or views downstream) to re-assemble your data set in the order the business expects. There is also value in understanding the patterns of late-arriving data (how frequently it happens, what percentage of the data it affects, where it comes from), because it may be a broken business process, a busted operational machine, a problem with the way the business works, or it may just be normal operation. But you cannot deduce which it is unless you are properly capturing load-date sequencing.
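Profiling those late-arrival patterns only works because both timelines are captured. A minimal sketch (row shape is hypothetical): the lag between load-date and applied-date tells you how often, and by how much, data arrives late.

```python
from datetime import date

# Compare the two timelines per row to measure lateness.
rows = [
    {"load_date": date(2024, 1, 2), "applied_date": date(2024, 1, 2)},
    {"load_date": date(2024, 1, 3), "applied_date": date(2024, 1, 3)},
    {"load_date": date(2024, 1, 5), "applied_date": date(2024, 1, 1)},  # 4 days late
]

lags = [(r["load_date"] - r["applied_date"]).days for r in rows]
pct_late = 100.0 * sum(1 for d in lags if d > 0) / len(rows)
max_lag = max(lags)
print(round(pct_late, 1), max_lag)  # 33.3 4
```

Whether a given percentage signals a broken process or normal operation is a business judgment; the point is that without both dates, you cannot even compute the lag.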
Finalizing our thoughts…
LOAD-DATE is LOAD-DATE. For data arriving in real time, perhaps the applied-date is the “date/time stamp of the transaction.” No extra table structures are needed, no funky processing is needed, no change to the standards is needed, no band-aids are needed. Just simple SQL with an ORDER BY on the applied-date, filtered to the window of time of interest to the query being executed.
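That “simple SQL” might look like the following, shown here against an in-memory SQLite database (the table and column names are illustrative): the downstream query filters to a window and orders by applied-date, and the back-dated row falls naturally into its business position.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE sat_order (load_date TEXT, applied_date TEXT, status TEXT)"
)
con.executemany(
    "INSERT INTO sat_order VALUES (?, ?, ?)",
    [
        ("2024-01-02", "2024-01-02", "open"),
        ("2024-01-03", "2024-01-03", "closed"),
        ("2024-01-05", "2024-01-01", "created"),  # back-dated transaction
    ],
)

# Re-assemble the business timeline within the window of interest.
result = con.execute(
    """
    SELECT applied_date, status
    FROM sat_order
    WHERE applied_date BETWEEN '2024-01-01' AND '2024-01-31'
    ORDER BY applied_date
    """
).fetchall()
print(result)
# [('2024-01-01', 'created'), ('2024-01-02', 'open'), ('2024-01-03', 'closed')]
```

No extra tables, no re-sequencing on load: the ordering happens at query time, which is what keeps the loading pattern scalable and restartable.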
Doesn’t this affect downstream aggregation?
Yes – it always has, and it always will. We have a technique to manage that called incremental aggregation; these are all concepts we teach in the CDVP2 (Certified #datavault 2.0 Practitioner) class. More will be coming on these thoughts going forward.
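The article names incremental aggregation without detailing it, so the following is only a generic sketch of the idea, not the CDVP2 technique itself: rather than recomputing an aggregate from scratch, apply each newly loaded delta (including a back-dated one) to the bucket its applied-date falls in.

```python
from collections import defaultdict

# Running totals keyed by business day (applied-date).
daily_totals = defaultdict(float)

def apply_increment(applied_date, amount):
    """Adjust only the affected business-day bucket, whenever the row loads."""
    daily_totals[applied_date] += amount

apply_increment("2024-01-01", 100.0)
apply_increment("2024-01-02", 50.0)
apply_increment("2024-01-01", 25.0)   # late-arriving row corrects Jan 1 only

print(daily_totals["2024-01-01"])  # 125.0
```

The design choice is that a late row costs one bucket update instead of a full re-aggregation, which is what makes the approach viable at high volume.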