I received the following series of questions from one of my former students recently, and thought that someone else may have the same situation or questions. I've added numbers to the questions originally posed so that my replies will align to the questions posed.
1) We have this existing Data Vault EDW and information mart based on SAP data. As you may know SAP has really wide tables (or entities) and lot of columns in them. At the moment certain sets of column have been selected out of these source entities. Columns have been renamed using more human readable form in the EDW. The naming might be one reason why the extra columns have been left out of the EDW because it takes so much effort to take everything from the source (especially when using custom accelerator). Small tables (<100k rows) are loaded every day as full load and compared to the existing EDW. For the bigger tables there are deltas based on created / modified date fields (hard deletes come from SAP audit tables)
In this certain case, what might be the best practice to add new columns into those existing source entities? This is not really a new source in that case. I suppose you would always create a new satellite. The source is the same entity but we're just adding new column(s) to it. If we end up modifying the existing satellite, we would have to drop the constraints from these new fields because the preceding load date rows won’t have any values for the new column(s). Tables where a full load is executed, the hashdiff values will change for every row resulting in a full reload of all of the active rows again in the EDW. Also, when timeline gaps are created, whether by creating new satellites or altering the existing satellites, it becomes nearly impossible to align the data in the Information Mart. You have to know the specific date when the data for the new columns was loaded.
2) Another best practice question is about the naming. I have always considered the EDW as black box and therefore the naming convention doe not require that these columns be human readable. Technically, only developers/IT operate in the Raw Data Vault. The naming is, in my mind, something that could be done later in views, Business Data Vault, or in the Information Mart itself. Am I wrong?
3) What is the best practice for those hard deletes? It would easily generate unwanted updates in tables if you use some kind of end date. Is it better just duplicate the last row and mark it as ‘deleted’=1 and give it new load date. When you are generating those load end dates using a lead function, you could check for the deleted value and use the load date as load end date, or is there another approach?