Here are some steps that I recommend every team take when running a Data Vault project / sprint / iteration:
NOTE: FROM step 5 to step 12 should take you about 6 to 8 hours the very first time you do it AND without an automation tool. If it takes you longer, then you've either included too much in your scope, or you've built something incorrectly and need to brush up on your standards (reapply the standards taught in the book: Building a scalable data warehouse with Data Vault 2.0)
Please make sure to read all through the conclusion. Leave a comment? tell me what I missed? what you like? did this work for you? was it helpful?
- Gather a basic use case first from the business, some simple statement like: "we need HR data"
- If the business will refine the use case and give you specifics to start with, then that's wonderful - start with that. If the business doesn't know what they don't know (or can't figure out what to ask for), then the next steps are for you to decide.
- SCOPE SCOPE SCOPE - reduce the scope. HR data generally means: Employees (for example). So, in your system, choose the most important component of the source application - find out where the data lives, and source that (just those tables). Limit the original sprint scope to 4 or 5 source tables and relationships (at a maximum). Note: If you want a 2 hour iteration cycle, then limit it to 2 source tables at a minimum.
- Setup a full separate and complete independent sprint cycle / iteration for profiling. Do not include profiling as a part of your "development effort". All profiling takes time, all documentation of metadata, meaning, and understanding takes collaboration and participation from the business. These should always be separate sprints. Some profiling activities (unfortunately) can last a few weeks - when they should in reality only last a few hours.
- Your biggest outcome from profiling should be: establishing correct or proper business keys, then, establishing the relationships, and I'm not talking about defaulting to source system surrogate sequence identifiers. Sure, sometimes those are utilized - but only as a last resort (learn more about this in our CDVP2 courses)
- Now: RESET the scope (narrow it down) to 2 or 3 business keys / hubs, 1 link (at most), and 2 satellites (perhaps 3).
- Build the DV target model with just those elements
- Then: Load the staging tables - get the staging area bulk loaded (if it isn't done already)
- Create views against the staging tables that produce / generate the hash keys, and hash differences (if not already assigned) FOR JUST the hubs, links and satellite targets you identified in step 6
- Load the Data Vault target model from the views
- Check your work: run the loads a 2nd time - they should load ZERO rows if they follow the standards properly, and match the process design indicated in the book: Building a Scalable Data Warehouse with Data Vault 2.0
- If they don't load ZERO rows for a 2nd run, go back and refactor - you missed something in the pattern, FIX IT before propagating it elsewhere.
You have just completed your first iteration of Data Vault build process & load cycles.
NOW SWITCH GEARS:
you need to now put on your BI hat / information delivery hat / role... This is critical - your next iteration (should take about 4 to 6 hours the first time you do it / learn how - without an automation tool)...
- Design, Build one PIT table structure (should take you about 30 minutes max - if you already understand what a PIT table needs to look like)
- Build a single view (to start with) that left-outer joins ONE HUB to ONE Satellite and retrieves the current row only.
- Test it, it should not produce any cartesian products.
- CHOOSE A SINGLE BUSINESS KEY IN THE HUB FOR MOVING FORWARD - this is absolutely critical for unit testing, and to remain agile in your build process of building the load for the PIT table.
- Add the single business key to the WHERE clause to restrict it - your view should now return only 1 row.
- If it returns a single row, you are good to move to the next step. If it returns multiple rows from this join, go back and re-test, fix the view - you probably left out the "get current record / latest record" subquery in your left outer join.
- Add another Satellite (LEFT outer Join), and grab the current record. Again, the view should return only 1 (one) row in total. If it doesn't then, you've made another mistake, and introduced a Cartesian product that should not be there. Most-likely the same error as stated in step 6. Go back and fix it before moving ahead.
- Now, write the INSERT INTO for the PIT table, rememeber: START with just the key structures, and the business key. Never ever replicate "data from satellites" or never add computed fields unless they provide a measurable performance improvement to the BI query.
- Run the INSERT INTO with a single row that you've been working with, check your results, fix it if necessary.
GET READY: WRITE DIMENSION, FACT, OR FLAT WIDE VIEWS (next steps)
- Next: WRITE A DIMENSION OR FACT OR FLAT-WIDE VIEW on top of the PIT.
- Use the PIT table as the DRIVING table for your query.
- INNER JOIN to the Satellites by primary key from the PIT table. Start with ONE satellite join, and run the query - if the query works, and returns 1 row (the 1 row you loaded), then you've got the pattern correct.
- If the query returns more than one row, then you've either added a left outer join back to the satellites (which is wrong to do), OR you've not fully qualified the join with the entire primary key in the PIT table.
- Once the query is returning a single row properly, add the other joins to other Satellites and re-test.
- As long as the query continues returning EXACTLY one row (that's all that we loaded to the PIT table for now), then your query is correct.
- Once all the joins have been created, you can add Group by, having, sums, computed fields, to the view output.
- Add these aggregations (if necessary - but at this point I recommend you leave all of that to the BI Tooling - for the most part). and iterate.
- Open up your BI tool, and "source the view" - make sure you can see data from the view.
FINAL STEPS: Open up the PIT Load Process
- Go back to your PIT load process, and remove or comment out the Where Clause on the single business key
- truncate the PIT, and re-load it completely for all keys you wish to have.
- Then, go in to the BI tool, and test the data set, test the performance of the view and the queries underneath. Make sure your raw data is being delivered accurately to the BI tool.
This entire set of processes is the iteration pattern. Once you are familiar with this pattern (this is a WORK PROCESS PATTERN) then you should be able to repeat it rapidly (within 2 to 4 hours max of each iteration), and if you leverage an automation tool: WhereScape, erwin, VaultSpeed, Varigence BIMLFlex, etc... then that time frame should shrink even further.
If you struggle to achieve these types of results within 4 to 6 hours (even as you learn), then something isn't right. You've either changed / broken the standards, or you've not understood the objects you are using, or you've over-scoped (most common), or you've gotten caught in analysis paralysis (easy to do).
These are the most common mistakes people make on their way to learning how to make Data Vault work for their organizations.
REMEMBER THE FOLLOWING:
even without hard & fast BI requirements you should still be able to produce raw data results all the way to the BI tool within a single day. When you get good at this, it should take you less than 2 hours to ingest a new source - AS LONG AS - the new source is known / profiled / defined before you start.
The point is: once you show business users output - especially raw data - they will finally be able to tell you how they need it formatted, how they want it aggregated, etc... give you the BI business use cases in exacting language.