By Cynthia Meyersohn
Last August I was asked to participate in an exercise to assess whether or not the Data Vault 2.0 System of Business Intelligence (DV2) still held value. Specifically, I was asked the following:
- Does the size of the data now being delivered to the business make a compelling argument that the data should not be moved around as much?
- Do we truly need a physical data warehouse layer with the advent of a giant PSA and Schema on Read? Can’t a data warehouse persist on a PSA with Schema-on-Read and some tagging, and still provide the business the value it deserves?
- Is the only drawback to a giant landing zone versus a PSA the lack of delta processing?
- Can’t we leave the pattern recognition for repeatable structure and process to be discovered by mining engines/neural nets?
The following article was referenced by my colleague as an example of what can be built using a PSA and Schema-On-Read with an implementation similar to a Kimball Bus Architecture.
“Build a Schema-On-Read Analytics Pipeline Using Amazon Athena”, by Ujjwal Ratan, Sep. 29, 2017, on the AWS Big Data Blog, https://aws.amazon.com/blogs/big-data/build-a-schema-on-read-analytics-pipeline-using-amazon-athena/ . (Ratan, 2017)
The article is a bit dated (September 2017), particularly given the rate of change in technology today; but since this is the article that was referenced, I decided to go with it. Why? Primarily because this article, and articles like it, have proliferated all over the net from a myriad of vendors, technologists, programmers, business intelligence analysts, data scientists, and so on.
So why am I adding to the noise? Mainly because IT and non-technical managers, as well as C-level executives, are constantly bombarded with demands from their staffs to purchase the latest and greatest shiny objects (software and solutions); many of those staff members read articles similar to Mr. Ratan’s and decide to replace their current solutions with what appears to be the newest “silver bullet”.
I’d like to “open the kimono” (so to speak) and give a different perspective on these technologies that isn’t necessarily obvious from the articles penned by their proponents. My hope is to provide some good ol’ boring rationale that IT managers and C-level executives may, in time, come to appreciate and perhaps even consider before jumping into the Data Lake, also known as the Data Swamp or Data Dump.
What follows is an attempt to answer the first two questions that were posed to me through a critical review of the article given as a reference. It is not my intention to insult Mr. Ratan, so if anything I’ve written is offensive, I offer you, Mr. Ratan, my apology right up front, right now. I do tend to get rather sarcastic at times, and I have certainly done so in this article.
You see, the speed of technology can be so overwhelming that it causes many businesses to select solutions without understanding the ramifications of those technical choices. I’m hoping to take some of the “shiny” off of these objects.
Build a Schema-On-Read … To Solve All Your Problems …
Sounds familiar, right … just dump all the data into Hadoop or to the Cloud (Azure, AWS S3 buckets, Google, … take your pick). Problem solved?!? Not quite …
First consideration, who wrote the article? Oh yeah, someone who’s trying to convince a potential buyer that AWS Athena is “all that”.
Second consideration, do you agree with the author that the set of criteria he has defined are the criteria against which an architect must design an analytical platform? I don’t. Where in the list of criteria do you see data governance (compliance, security, oversight), auditability, data integration, Master Data Management, or operational continuity, to name just a few? What about storing the data one time for multiple use cases?
Third consideration, think about where the data used in his example was sourced. Where did that data come from? If we trace back to the beginning of the data set, it is clear that the Centers for Disease Control (CDC) certainly didn’t hand-jam the data into those .csv files … not at 1.4+ million records it didn’t. No, the data came from a relational database that processed the phone survey responses into a system that was based on business keys. The output of that processing was joined on the business keys to produce those .csv files. In essence, the data used was “dumped” to the AWS S3 bucket from a sophisticated database system and “dumbed down” into a series of .csv files – 247 to be exact. While you may be chuckling about this, how many companies are actually doing this today at the direction of their architects, data scientists, and other well-meaning “data wranglers”? Brilliant …? I think not … easy, yes; enduring, no.
Fourth consideration, the author’s description of the data integration challenge starts with ETL (extract-transform-load). Is that really what we should be doing? No. We extract-load and then transform (ELT) data. While this may seem like a subtle play on words, it most definitely is not, and it carries tremendous implications for your data warehousing efforts.
In a more robust solution we work to store raw data one time so that it is reusable based on the business’ perspective. Our goal is to remove the performance-draining impact of applying complex business rules during the data ingest to the data warehouse and shift this complexity downstream to the information mart (see Diagram 5 – DV2 Architecture in Part 4 of this series). What the author is talking about and illustrates in Diagram 6 (see Part 4 of this series) is a Kimball Bus Architecture approach to data warehousing … it’s not information warehousing.
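The distinction can be sketched in a few lines of Python. This is a minimal illustration, not DV2 itself: the record layout, function names, and the "smoker" business rule are all hypothetical, chosen only to show where the transformation step sits in ELT.

```python
# Illustrative ELT sketch (hypothetical data and names): source rows are
# loaded untouched into the raw layer; soft business rules are applied
# downstream, at information-mart time, not during ingest.

RAW_VAULT = []  # stands in for the raw data warehouse layer


def extract_load(rows):
    """EL: persist source rows as-is, adding only load metadata."""
    for row in rows:
        RAW_VAULT.append({**row, "load_ts": "2018-08-01T00:00:00Z"})


def transform_for_mart(raw_rows):
    """T: apply a soft business rule downstream, in the information mart."""
    return [
        {"respondent_id": r["respondent_id"], "risk": "smoker"}
        for r in raw_rows
        if r.get("smokes") == "Y"  # the business rule lives here, not in ingest
    ]


source_rows = [
    {"respondent_id": 1, "smokes": "Y"},
    {"respondent_id": 2, "smokes": "N"},
]
extract_load(source_rows)
mart = transform_for_mart(RAW_VAULT)
```

Because the raw layer keeps the data untouched, a change to the business rule means re-running only the downstream transform, never re-ingesting the source.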
Fifth consideration, the whole schema-on-read issue … so, has the technology truly caught up and solved the issues surrounding volume, business rule complexity, conformity, and adaptation to change in such a way as to eliminate the need for a physical model? With the exception of Snowflake DB, no, it has not. Schema-on-read is slow. We know it’s slow. The author admits it’s slow. Schema-on-read, in most of the articles I’ve read, is often associated with some form of “data discovery” or “experimentation” or the like.
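Part of why schema-on-read is slow is visible even in a toy sketch: the raw bytes carry no enforced structure, so column names and types must be imposed at read time, on every single query. The file content and function below are hypothetical, purely to illustrate the mechanism.

```python
import csv
import io

# Schema-on-read in miniature (hypothetical data): the stored text has no
# enforced schema; names and types are applied at read time, every query.

raw_file = "respondent_id,bmi\n1,31.2\n2,24.0\n"


def query_obese(raw_text):
    # The "schema" (column names, numeric types) is imposed here, at read
    # time -- so every query pays the parsing and casting cost again.
    reader = csv.DictReader(io.StringIO(raw_text))
    return [int(r["respondent_id"]) for r in reader if float(r["bmi"]) >= 30.0]
```

A schema-on-write store pays that parsing and typing cost once, at load; a schema-on-read store pays it on each scan, which is exactly the cost the author concedes.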
The author’s customer use cases sound like simple self-service BI, which lends itself to a data free-for-all. Where is the data governance in the author’s customer use cases? Supposedly, the data governance exists in the AWS Glue Data Catalog in some form or another. Does the business really receive any value in having a free-for-all with the enterprise data? Will the business receive a reliable, consistent answer to the same business question in such an environment? I wager that the answer to both of these questions is “no”: the business will not receive a reliable, consistent answer, and therefore will not receive the value it deserves.
My point is this: Are we building a playground, or are we building an integrated data warehouse to solve and answer the problems and questions that the business has in a consistent, governed, adult-like – i.e., disciplined – approach? Aren’t we, as data engineers, data warehouse designers, architects, implementors, and BI analysts, attempting to give the business a degree of certainty around the answers provided from the integrated data or delivered information? I believe that we are.
There is certainly a place and a need for data discovery and experimentation, but are those the bases for sound business decisions? No, the experiments have to be tested and proved before their worth can be assessed, AND, the results of those experiments must be auditable.
Let’s break down the AWS schema-on-read analytics pipeline with an eye on data movement and replication to begin to answer the first question, “Does the size of the data now being delivered to the business make a compelling argument that the data should not be moved around as much?” If it is true that size matters, then one of our goals should be to reduce the number of times volumes of data are moved around or replicated.
As you’ll find in Part 4 of this series, I’ve introduced a viable solution to the replication and movement of data across the AWS environment, one that could be implemented in AWS to provide sustainability, auditability, security, and scalability, to name just a few benefits.
Step 1 – Dump the data files to the S3 bucket
In the author’s example, the data arrives in the S3 bucket after already having been separated into different .csv files based on specific behavioral risk factors (smoking, drinking, obesity, and high-blood pressure). How did these specific .csv files get generated into these categories? Oh yeah, they were output from a database using business keys and risk factor business logic to create the various subsets of data output to the .csv files. (Hey, that sounds a lot like DV2 methodology and architecture to me – business keys, soft business rules (business logic), and set logic math!)
It also sounds as though the data is being moved again.
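The kind of keyed, rule-driven split described above can be sketched briefly. Everything here is hypothetical (the record layout, the `risk_factor` column, the function name); it only illustrates that producing per-category .csv extracts requires business keys and business logic on the database side first.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sketch: keyed survey records are split into one .csv
# extract per risk factor. This is illustrative, not the CDC's process.

records = [
    {"respondent_id": 1, "risk_factor": "smoking", "value": "Y"},
    {"respondent_id": 2, "risk_factor": "obesity", "value": "Y"},
    {"respondent_id": 3, "risk_factor": "smoking", "value": "N"},
]


def split_by_risk_factor(rows):
    buffers = defaultdict(io.StringIO)  # one in-memory "file" per factor
    writers = {}
    for row in rows:
        factor = row["risk_factor"]  # business logic drives the split
        if factor not in writers:
            writers[factor] = csv.DictWriter(
                buffers[factor], fieldnames=list(row.keys())
            )
            writers[factor].writeheader()
        writers[factor].writerow(row)
    return {factor: buf.getvalue() for factor, buf in buffers.items()}


extracts = split_by_risk_factor(records)
```

Note that every extract is a lossy, pre-joined projection of the source: the downstream consumer inherits whatever keys and logic were applied upstream, which is the “dumbing down” described above.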
For fun I decided to start counting the number of times the data had to move or had to be replicated throughout the example used in the article.
If you’re counting, the first move was from the digital survey input devices to the Centers for Disease Control (CDC) database.
The second move was from the CDC database to the .csv files.
Now, consider this. If the .csv files were written directly from the CDC database to the S3 bucket, then that’s two movements of the same data. However, if the files were output from the CDC database to a non-S3 landing zone or location and then copied from the output landing zone or location to the S3 bucket, the data would actually have been moved three times to get to the first S3 bucket.
Based on the diagram the author provided (reproduced below as Diagram 1), it appears that the generated .csv files are considered the source – Source 1 through Source N specifically – which implies that the CDC database, the actual source system, is generating the .csv files to a landing zone or shared location to be picked up and copied to the S3 bucket. That means that at the outset this example starts with three data moves to land the data in the S3 bucket.
Diagram 1 (Ratan, 2017)
Once the data arrives in the S3 bucket, it is partitioned for performance improvement.
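For readers unfamiliar with how that partitioning works, Athena prunes data using Hive-style `key=value` prefixes in the object keys, so a query touching one year or one risk factor need not scan every object. The bucket, table, and column names below are illustrative assumptions, not taken from the article.

```python
# Sketch of Hive-style partition layout as Athena reads it: objects live
# under key=value prefixes, letting the engine skip whole partitions.
# Bucket name, table name, and partition columns are hypothetical.


def partition_key(bucket, table, row):
    """Build the S3 object key a row would land under, given its partition values."""
    return (
        f"s3://{bucket}/{table}/"
        f"year={row['year']}/risk_factor={row['risk_factor']}/data.csv"
    )


key = partition_key("example-bucket", "brfss", {"year": 2015, "risk_factor": "smoking"})
```

The performance gain is real, but note it comes from physically rewriting the data into this layout, which is yet another copy of the same records.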
# of Data Movements into AWS: 3
In the next article, (Part 2) Does a Pattern-Based DW Architecture Still Hold Value Today?, I will address the next two steps outlined in Mr. Ratan’s data process flow. I will continue counting the number of times data is being replicated within the S3 buckets as I move toward answering the question, “Does the size of the data now being delivered to the business make a compelling argument that the data should not be moved around as much?”
References used throughout this series –
AWS. (2018, July 20). What is AWS? Retrieved from https://aws.amazon.com/what-is-aws/
Chaffetz, J., et al. (2016). The OPM Data Breach: How the Government Jeopardized Our National Security for More than a Generation. Washington, D.C.: Committee on Oversight and Government Reform, U.S. House of Representatives, 114th Congress.
Cole, D. (2014). We Kill People Based on Metadata. The New York Review of Books. Retrieved from http://www.nybooks.com/daily/2014/05/10/we-kill-people-based-metadata/
European Commission. (n.d.). What Is Personal Data? Retrieved from https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-personal-data_en
Inmon, W. (2016). Data Lake Architecture: Designing the Data Lake and Avoiding the Garbage Dump. (R. A. Peters, Ed.). Basking Ridge, NJ: Technics Publications.
Linstedt, D., & Olschimke, M. (2016). Building a Scalable Data Warehouse with Data Vault 2.0. Waltham, MA: Morgan Kaufmann.
Linstedt, D. (2018, May). Data Vault 2.0 Architecture Diagram. Data Vault 2.0 Boot Camp & Private Certification.
Ratan, U. (2017, Sep. 29). Build a Schema-On-Read Analytics Pipeline Using Amazon Athena. AWS Big Data Blog. Retrieved from https://aws.amazon.com/blogs/big-data/build-a-schema-on-read-analytics-pipeline-using-amazon-athena/