I'm a software engineer new to DV (and BI in general) and I'm excited to be here to learn and contribute!
I'm currently putting together a proof of concept for a data platform for a VC firm and I'm halfway through the DV2 book.
The central entity in investing is a Company (the company is the customer). We have internal systems (Salesforce mainly) and a couple of purchased data sources used to enrich the Company information further.
We don't have control over the external/purchased data sources but we rely on them heavily in order to identify and assess the best opportunities.
I'm trying to identify the BK for the companies in the Raw Vault. In the business and in most source systems (including purchased data), we generally refer to a company by its first-level domain (e.g. google.com), but there are a few problems:
A startup can change its website over its lifetime (rebranding).
A domain can represent multiple startups over multiple years (startups tend to die).
There are other combinations that can make the identity closer to unique (such as Year Founded or Founding CEO) but these fields are incomplete (maybe 80-90% coverage for year founded and less for CEO).
How do I deal with this?
Is it sufficient to use the most complete BK combination as a master record and use Same-As Links to point all the other ones to the master through some business logic in the BV layer?
Would it still be fine if the domain changes or if, due to incomplete data, another company's info gets associated with the wrong hub record?
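To make the idea concrete, here is a minimal Python sketch of the approach I have in mind. Everything in it is an illustrative assumption rather than something from our actual systems: the function names, the `||` delimiter, the use of MD5 for hash keys, and the example companies are all hypothetical.

```python
import hashlib
from typing import Dict, List, Optional


def business_key(domain: str, year_founded: Optional[int] = None) -> str:
    """Build a candidate BK: normalized domain, plus year founded when it is known."""
    parts = [domain.strip().lower()]
    if year_founded is not None:
        parts.append(str(year_founded))
    # "||" is an assumed delimiter; any value that cannot occur in a domain works.
    return "||".join(parts)


def hash_key(bk: str) -> str:
    """Derive a hub hash key from the BK (MD5 is an assumption for this sketch)."""
    return hashlib.md5(bk.encode("utf-8")).hexdigest()


def same_as_rows(master_bk: str, other_bks: List[str], record_source: str) -> List[Dict[str, str]]:
    """Produce Same-As Link rows pointing each non-master BK at the master BK."""
    master_hk = hash_key(master_bk)
    return [
        {
            "master_hk": master_hk,
            "duplicate_hk": hash_key(bk),
            "record_source": record_source,
        }
        for bk in other_bks
        if bk != master_bk  # a key never needs a link to itself
    ]


# Hypothetical example: the most complete key is the master; a key without
# year founded and a key under a rebranded domain both point to it.
master = business_key("Acme.io", 2015)
others = [business_key("acme.io"), business_key("acmelabs.com", 2015)]
links = same_as_rows(master, others, record_source="BV.company_matching")
```

In this sketch the hub rows themselves are never rewritten. A wrong match would be corrected by changing the Same-As rows produced by the matching logic, which is the part I'm unsure is acceptable.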
Any guidance would be much appreciated!