We have person records that do not have enough information to uniquely identify individuals. Ideally, I imagine, it would be best to uniquely identify individuals based on Surname, Given Name 1, Given Name 2, Date of Birth & Gender but often we only have nicknames or incomplete full names (missing gender, dob, or a given name), sometimes all we have is the gender (“unknown male” for instance) making it hard to uniquely identify individuals. Adding to the complexity of using the name fields to identify individual people is the fact that people can change their name and our system stored aliases, and maiden names and nicknames of known individuals. We have two systems that collect name information (one more complete than the other) so that may further complicate the determining of business keys. In one of our systems we have a “name” field that contains all manner of data; “male suspect seen fleeing”, “Sid did it”, etc.
I’m struggling to determine what we should use as a business key because cleaning up the data isn’t an option. In the world of “the good, the bad & the ugly” this is truly ugly.
Our one system has the structure shown in the Source A figure.
The Id field is the PK and a system generated key that is overloaded (the first 3 – 5 characters are the entity type, then there is a section for the server and then the final ten or so digits are the “sequence number” part). This Id is obviously used throughout the database to link between entities (TBL_GOccIvGPerson is a link between Person and Occurrence where the Lid is the Occurrence Id and the Rid is the Person Id).
The Person table caches the “most recent primary name information” from the Person Name table, but each person can have many names (of varying types – alias, nickname, maiden name, etc.).
Our second source (Source B) has even less useful data in it. The Person table contains any manner of data in many of the data fields. The nme field can contain the Surname and first name as (Surname, Firstname; Firstname Surname, just surname, just first name, etc.) or can contain other less useful information (like “German Shepard” or “Man with knife”) and the DOB and gender fields may or may not be filled in (and sometimes contain data that isn’t a dob or gender). For this table person records are only in the context of a given event, which lends itself to the creation of duplicate individual records.
Is it possible to model both of these sources into one hub? Or how would you suggest modelling these?