
If you're using the Base64 function to help perform table joins in DV 2.0, you’re doing #datavault wrong!


Topic starter
(@patrick-cuba)

I thought this point needed to be highlighted. It is something I have encountered onsite and even in a Snowflake-hosted Data Vault webinar: the use of the Base64 function in PowerBI, or for that matter in any tool used to produce analytics from the data landscape.

What does it mean to be using Base64...?

Best practice for surrogate hash keys is to store them as binary / raw. Why? A binary hash key uses half the byte width of its hexadecimal varchar equivalent (an MD5 hash is 16 bytes as binary versus 32 characters as hex), and therefore performs better in table joins than a varchar. You also do not need to worry about collation and extended character sets with a binary data type.
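
As a concrete illustration, here is a minimal Snowflake-style sketch of that practice; the table and column names (stg_customer, customer_id) are hypothetical and the key is assumed to be MD5-based:

    -- Binary vs varchar hash keys: same key, half the byte width when stored as binary.
    select
        md5_binary(customer_id) as hk_customer,      -- BINARY(16)
        md5(customer_id)        as hk_customer_hex   -- VARCHAR(32) hex, double the width
    from stg_customer;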

However, tools like PowerBI and Talend do not have a binary data type, so in order to perform those joins the binary hash keys must be converted to varchar using the Base64 function.
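
In Snowflake terms that conversion looks something like the sketch below; this is the anti-pattern itself, not a recommendation, and the object names are made up:

    -- Anti-pattern: widening the binary hash key to a Base64 varchar so the BI tool can join on it.
    select
        base64_encode(s.hk_customer) as hk_customer_b64,  -- binary key exposed as a varchar
        s.customer_name,
        s.load_dts
    from sat_customer_details s;
    -- PowerBI then pulls this (and the other side of the join) over the network and joins locally.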

Ah, but this highlights an even bigger problem, and it is not a data model problem; it is architectural.

-- NO Business Intelligence tool should be used to perform Data Vault joins!

-- Secondly, performing Data Vault joins in PowerBI by converting binary columns to varchar means that all of the data participating in the join is being pulled into PowerBI and joined there. Imagine the network carnage and the cost of moving that data across, let alone the latency to analytic value!

Using Base64 is a symptom of a flawed Data Architecture. It is the proverbial band-aid, treating the symptom rather than the cause.

What should we do instead?

-- For tools like Talend, the work should be done "in-database" (predicate pushdown, in other parlance); nothing should be done in Talend itself.

-- For tools like PowerBI, the work of curating, conforming and eventually producing information marts for reporting should be done well before BI tools are even considered. For PowerBI specifically, if the workload is SQL-based the work should be done in-database. If, however, there are DAX functions that cannot be done in-db, then limit the amount of data that needs to be transferred to the PowerBI server; even at this level no hash keys should be needed. At this stage the data being pulled are information marts with the appropriate filters applied (see the sketch after this list).
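
As a sketch of the alternative (assuming hypothetical hub_customer and sat_customer_details tables), the Data Vault join is resolved in-database and the information mart exposes only business keys and attributes, so no hash keys and no Base64 ever reach the BI tool:

    -- Information mart view built in-database; the binary-to-binary join never leaves the platform.
    create or replace view im_customer_current as
    select
        h.customer_id,        -- business key; no hash key exposed to reporting
        s.customer_name,
        s.customer_segment,
        s.load_dts
    from hub_customer h
    join sat_customer_details s
      on s.hk_customer = h.hk_customer
    qualify row_number() over (partition by s.hk_customer order by s.load_dts desc) = 1;

PowerBI (or Talend) then reads only from mart views like this, with filters applied, instead of performing Data Vault joins itself.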

Let's be frank

A Data Vault has many joins; these (let alone the hash keys) should not be exposed to the BI tools.

It's partly why DV2.0 prescribes the need to understand architecture, along with agility and modelling. It's not enough to know that a Data Vault is made of hubs, links and satellites; it's important to know how to properly use and apply them.
