I may be opening myself up for criticism here but having worked on one or two DV projects, one of the complaints I often hear is that querying data from the vault can be tedious because of “all those joins”. I get that, and I too have spent many hours coding up satellite-to-link-to-hub joins wishing it was easier.
Logically separating key, relationship and contextual data makes real sense, but do those concepts have to be separated physically. Why not collapse the structures down to minimise the number of joins required? I have recently asked this question of Kent Graziano (sorry to name drop) when he visited our office and, surprisingly, he didn’t say we shouldn’t (equally he didn’t say we should). I also asked the instructor on my certification course (Nols Ebersohn), and he suggested that “it wouldn’t scale”. I didn’t think to follow up and ask in what sense it wouldn’t scale. Are we talking in terms of database performance, or the model itself? Here is what I’m thinking.
Say we ingest some orders from a sales system, and we have hubs for Customer, Order and Product each with a hub satellite; and then a link across these concepts representing a Sale with its own satellite for order line items; eight tables in all. It might look something like this.
Here we would need a minimum of 4 joins to reconstruct the sales item data (more if we wish to include customer/order/product satellite data). How would this perform? Now consider what this structure might look like if we were to flatten it by collapsing the hub and link information into their respective satellites:
The hubs and links have gone, but the information they represent remains. Business keys are still hashed allowing efficient joins across tables, but the equivalent query to retrieve the sales item data no longer requires any joins, as all the data we need is in the satellite. Even if we wanted to include customer/order/product data we’d only need 3 joins (compared to 7 in the traditional approach).
Note: The model has also recorded the natural business keys in the structure to explicitly document how the business keys are formed and hash keys calculated. This is entirely optional, and depending upon the database platform, we might choose to utilise hash keys, natural keys or, in this case, both.
The only downside I can see is that it isn’t easy to answer the question, “How many distinct customers are there?” Of course, this can be derived, perhaps not easily, but how often do we need to answer this specific question?
So does this conflict with the Datavault 2.0 standard?
The method remains the same. The flattened structure is an implementation option, nothing more than a switch set in the automation engine (assuming it is supported). We either choose to flatten or we do not.
Again, there is no change. Modelling would be approached in precisely the same way. We still need to identify business keys and integrate around them. The difference is that this is a logical model which can be physicalised as hubs, links and satellites or as flattened satellites. It’s an implementation choice only.
Here there are some differences, but not to the detriment of the standard. For example, loading patterns suddenly become much simpler. In fact, as there are no physical relationships, no referential integrity needs to be enabled on the host database – meaning it works in exactly the same way on NoSql platforms. This means that all flattened satellites can be loaded simultaneously without any dependency on hub and link loads. Orchestration then becomes much simpler.
So, it’s easier to load the raw vault, but how will this approach perform getting the data back out? The grain of the satellite tables is the same so the number of rows is identical, however, the width of each row will be increased with the additional key columns. Will the reduction in joins outweigh this widening of the satellites? I can’t say but I suspect it might.
Is this approach immune to re-engineering? Well yes. In fact, it could be argued that it is easier to re-engineer. Consider the scenario where there is a need to change the hashing algorithm, maybe going from an MD5 to a SHA2. With a flattened structure this is almost trivial. Either by copying each table to a new version and recalculating the hashes on the way through; or by altering the hash columns (if they need widening) and recalculating the hashes in place. Would this be as straightforward when hashes need to be synchronised across hubs links and satellites?
I’d be surprised if this flattened structure approach hasn’t been considered before, and perhaps some of you have had experiences you’d like to share. I guess I’m looking for definitive reasons not to take this approach, because for me the advantages seem compelling.