This week I bring you a few definitions, and some interesting questions and answers that I’ve had over the years. We teach these concepts in our CDVP2 class, but probably not to this level of detail.
Effectivity Satellite = is a Satellite structure containing “begin and end dates” or multiple instances of begin and end dates where these dates indicate the “effective timelines” of the parent LINK record. These dates should be populated by business data arriving from the source stream, however if this is not available, the dates may then be populated by a system-driven process based on logic defined by IT (IT Soft Business Rule). There may be multiple effectivity satellites attached to a single link structure to accommodate multiple instances of date/time lines.
Best Practice: split Effectivity Satellites by business definition or line of business or source system, to keep a single temporal window per Effectivity Satellite. However, there is nothing in the standard that prevents the Effectivity Satellite from being denormalized and representing multiple time-lines in a single structure.
SECOND LEVEL STAGE
Second Level Stage = is a specific staging table or file that houses the system-driven fields such as: load-date, hash difference, hash keys, record source and so on, that are computed on load.
Best practice: create a 2nd level staging table or file of JUST the system fields and the primary unique identifier (assuming a record can be uniquely identified). This then achieves the following goals: move only the data necessary to accomplish the join in the next set of parallel processing steps, achieves maximum possible parallelization downstream, allows compute of the values to be executed once and physically stored for repeated parallel use downstream.
Notes on Second Level Staging:
Q) Are you required to setup Second Level Staging?
a) No. If you are using an ETL tool or a real-time messaging stream tool and can “compute” these values, and “stage the data once” with these values computed, then this is the recommended practice.
Q) Where should these values be truly computed?
a) at the POINT OF ORIGIN of the data (ie: the true source system), before the data is exported, before the data is sent to the messaging stream, before the data is replicated. This is the true correct place to “add these system fields”. Why? they enrich the metadata and become attached to all data streams around the company. But: if you do not have control over the source system, and cannot request these fields to be added – then 2nd level stage is your next best option.
Q) What is the driving factor for these design changes?
a) Future Proofing – avoiding more re-engineering due to MASSIVE volumes of data being pushed around the organization. Eventually “data will become too big to move (at least more than once)” . Note: I’m talking about delta driven data identified as a change record in the source system, or transactional history (ie: machine generated data that never changes, just issued as “new transactions / new events”)
Q) Where can I find out more about all of this?
a) today: CDVP2 class talks about some of these concepts. We are getting ready to launch: Advanced Virtual Data Vault concepts that discuss extreme levels of data and further optimizations, so watch our site for more exciting news!