By Cynthia Meyersohn
Continuing from Part 1 of this series, this article is following the breakdown of the AWS Schema-on-Read analytics pipeline with a focus on data movement and replication. You may recall we are tracing through the data processes outlined in “Build a Schema-On-Read Analytics Pipeline Using Amazon Athena”, by Ujjwal Ratan, Sep. 29, 2017, on the AWS Big Data Blog, https://aws.amazon.com/blogs/big-data/build-a-schema-on-read-analytics-pipeline-using-amazon-athena/ (Ratan, 2017)
My goal is to begin to answer the question, “Does the size of the data now being delivered to the business make a compelling argument that the data should not be moved around as much?” If it is true that size matters, then one of our goals should be to reduce the number of times volumes of data are moved around or replicated.
Continued from Step 1 – Dump the data files to the S3 bucket …
Step 2 – Filter and data transformation
From the first S3 bucket, the dataset schema details are stored in the AWS Glue Data Catalog. Once in the AWS Glue Data Catalog they are further filtered and transformed into a subset of data prepared specifically for reporting. This subset is based on the schema details that were stored in the AWS Glue Data Catalog. During this step, the data’s schema-on-read is defined in a table structure that is created and managed in Athena. The Athena structure spans the entire data set (247 .csv files) and is used for profiling.
When we compare the process outlined above conceptually to a Data Vault 2.0 (DV2) solution (as an example), we find that ingesting data from the source to a non-persistent staging area would have taken only two moves. In a DV2 solution the data is ingested from the source into a DV2 target staging area or landing zone. The target staging area could be a relational database system, non-relational database system, new relational database system, or simply a file system sitting on your platform of choice – AWS S3, Hortonworks, Apache Spark, Cloudera, Linux, Windows, etc., depending upon the database application system, i.e., Oracle, SQL Server, Teradata, MySQL, Hive, Casandra, etc.
If the target staging area is a relational database application system, then the data would be written to the database as a Schema-On-Write. By its very nature a Schema-On-Write is already prepared for querying and data profiling. There are a couple of important differences however in using a Schema-On-Write approach.
The first difference is that moving data from the source to the DV2 target staging area would have taken two hops or moves – not three.
The second difference is that the metadata from the source system would have been preserved.
Whenever data is moved from a relational database system to a flat file (like a .csv), the metadata is lost … period. Metadata is critical to data engineering methods such as referential and attribution integrity, auditability, accuracy, etc. If you don’t believe this, then consider the following.
Before the infamous data breach reported by the Office of Personnel Management (OPM) in 2015, OPM had experienced a prior system breach that was not reported in 2013. Why wasn’t the 2013 breach reported? Well, because no “personnel data” actually left the system during the initial breach and therefore, the federal government was not obligated to report it.
However, what was extracted during the first detected breach were manuals and sensitive information that were useful for targeting OPM’s background information data systems (Ho. Jason Chaffetz, 2016), i.e., OPM’s system metadata. These incidents occurred in November and December of 2013.
Why is this significant? Because what was stolen first was the metadata.
Metadata, or “data about data,” is collected and recorded to describe data, identify trends, administer algorithmic solutions, and model potential scenarios. It is categorized as descriptive (identification details), structural (combination and container details) and administrative (creation, technical, and access details). (Cole, 2014) According to the Congressional report, the information exfiltrated during the first incident likely aided the second hacker “in navigating the OPM environment.” (Ho. Jason Chaffetz, 2016).
The OPM breaches have had both an immediate and far-reaching detrimental impact to our national security that will be felt for over a generation.
The lessen to be learned here? If anyone on your staff comes to you and tells you that they can just dump data from any kind of database to a flat file (.csv, .txt., etc.) into a Data Lake in order to provide reporting capability and that the attributes surrounding the original data don’t matter, in other words, the metadata isn’t important … fire them … immediately … because they clearly do not understand the importance and critical nature of metadata.
Doesn’t your business have not only the right but also the obligation and responsibility to ensure that the data ingested into the Enterprise Data Warehouse is intact and preserved as part of the corporate memory before it is manipulated? I would think so.
You may be saying to yourself, “But what if we don’t have a relational database application system to write to? What if we only have a non-relational database application on S3? What if we don’t have a database application system at all and want to just report on the data we dump to the S3 bucket? What if our data is in a different format like XML or JSON or in Excel spreadsheets?”
My response is this, the metadata is still critical.
All data has structure of some kind … all data. We aren’t processing bits and bytes that are floating freely in the air around us, are we? No, we are processing data that is contained in some kind of file structure or container – whether it is in a table, document, image, audio, video, JSON, XML, or spreadsheet (as examples).
The author erroneously states that the process in AWS of –
(1) moving the data to the S3 bucket,
(2) having AWS Glue Data Catalog store the schema details (metadata),
(3) building and loading the Athena structure spanning all of the 247 .csv files, and then
(4) profiling the data in the Athena table structure for reporting
is “much simpler compared to a schema-on-write approach where the analysis of the source dataset is much more difficult.” (Ratan, 2017)
Seriously? How is a schema-on-write more difficult than a schema-on-read, really? How is this process simpler? In the author’s example, the data is being moved multiple times just to get to the S3 bucket.
Here’s what is actually happening with the data.
(1) It is extracted from the CDC database which most likely has a built-in query engine,
(2) stripped of its metadata (dumbed down),
(3) written out to a bunch of flat files,
(4) copied to an S3 bucket,
(5) re-ingested to properly structure the Athena table definition, and
(6) profiled by means of reapplying the same schema and metadata that was stripped from the data when it left the source …
Or did I miss something?
As if the multiple processes outlined above are not enough, now the data has been written out to 247 files, any one of which could have been corrupted on their journey to the S3 bucket.
The author continues by saying that “… source data analysis is the first step to design a good analytical platform …” (Ratan, 2017)
The author continues …
”and this approach (i.e., outputting to a series of .csv files) allows customers to do that (i.e., perform source data analysis) much earlier in the project lifecycle.” (Ratan, 2017) (emphasis mine)
I don’t know what life-cycle the author is following but viewing or querying the data in a staging table for source data analysis, i.e., data profiling, seems equally, if not more, efficient.
Not only do I question the life-cycle being followed, but also I question what the author means by “design” for a good analytical platform. It seems to me that what the author has laid out thus far is reliant on the tooling technologies – for example, AWS Glue Data Catalog and Athena. Since when did tooling technologies equate to or replace sound architectural design based on source data analysis? They haven’t.
Tooling technologies come and go; but a sound data architectural design will endure and can embrace new tools and technologies as they emerge. This is precisely what the Data Vault 2.0 solution brings to the table – an enduring methodology, architecture, and model (whether implemented physically or logically) capable of absorbing changes to the underlying platforms and embracing emerging technologies.
Continuing on with the author’s proposed approach, we must add the fact that data integrity in a .csv file cannot be guaranteed because the data attributes were stripped from the data when generating the .csv files during the data output from the source system. Now, consider and contrast the author’s approach to a DV2 methodology and architecture.
In a DV2 methodology and architecture the metadata would have been preserved and applied to ensure that the data and its attributions are stored as intended, meaning a date would have been stored as a date, a number would have been stored as a number, and a string would have been stored as a string in accordance to the source system’s metadata. The data integrity would have been preserved.
Furthermore, processing the data from a Data Vault 2.0 staging area (data movement #2) directly into the Raw Data Vault EDW (data movement #3) as a hub and a satellite and then building a view (raw data mart) on top of the joined hub and satellite doesn’t seem like a great deal of complexity to me … in fact, it seems much easier.
First, by implementing a DV2 methodology and architecture the data is now stored in the Enterprise Data Warehouse as is and in only three moves (if you’re still counting). Second, the raw data now is captured and ready for any type of profiling or transformation.
The bottom line is that –
or put another way –
Capturing the raw data with its attributes (metadata) becomes an imperative for data integrity, auditing, and data governance (to name but a few). You simply cannot get around these factors. This is why I do believe that a pattern-driven data warehousing architecture still holds value.
But I digress …
Step 3 – Reporting table
After profiling the data through Athena, the next step is to extract the required data and transform it to create a reporting database. Gee, sounds like a DV2 Business Vault and/or Information Mart to me.
According to the author, Athena saves each query result in a separate .csv file in S3 for every execution of the query and organizes the file names by date. In the example given, there are three different queries used to filter out and transform the subset of data used for reporting. At this point the data has been moved or replicated 6 times at a minimum because each query is saved with the subset results into one or more .csv files in the S3 bucket.
Once the subset query results are generated, the “next step is to copy these results to a new (S3) bucket for creating your reporting table.” (Ratan, 2017) This is the ninth time the data has been moved or replicated (see Diagram 2). If our objective is to reduce the number of times volumes of data are moved then it seems reasonable to begin questioning whether or not this process is exacerbating a recognized problem in the “Big Data” realm.
# of Data Movements – 9 (minimum!)
In the next article, (Part 3) Does a Pattern-Based DW Architecture Still Hold Value Today?, I will address the final two steps outlined in Mr. Ratan’s process flow and continue counting the number of times data is being replicated within the S3 buckets. The answer to the original question, “Does the size of the data now being delivered to the business make a compelling argument that the data should not be moved around as much?”, is becoming somewhat self-evident.
Consider for a moment what happens when the size of the dataset or datasets reach terabytes, petabytes, or larger? Can the business afford the replication, storage, and processing time and associated costs required to prepare the data so the business can report against it?
In Part 4 of my article I’ll lay out what I believe to be a solid technical approach to address the issues surrounding how to get the most out of your Data Lake implementation.
References used throughout this series –
AWS. (2018, July 20). index. Retrieved from Amazon Web Services: https://aws.amazon.com/what-is-aws/
Cole, D. (2014). http://www.nybooks.com/daily/2014/05/10/we-kill-people-based-metadata/. Retrieved from http://www.nybooks.com: http://www.nybooks.com/daily/2014/05/10/we-kill-people-based-metadata/
Commission, E. (n.d.). https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-personal-data_en. Retrieved from https://ec.europa.eu/info/index_en: https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-personal-data_en
Ho. Jason Chaffetz, H. M. (2016). OPM Data Breach: How the Government Jeopardized Our National Security for More than a Generation. Washington D.C.: Committee on Oversight and Government Reform – U.S. House of Representatives – 114th Congress.
Inmon, W. (2016). Data Lake Architecture Designing the Data Lake and Avoiding the Garbage Dump. (R. A. Peters, Ed.) Basking Ridge, NJ, USA: Technics Publications.
Linstedt, D. (2016). Building a Scalable Data Warehouse with Data Vault 2.0. Waltham: Todd Green.
Linstedt, D. (2018, May). Data Vault 2.0 Architecture Diagram. Data Vault 2.0 Boot Camp & Private Certification.
Ratan, U. (2017, Sep 29). AWS Big Data Blog. Retrieved from https://aws.amazon.com: https://aws.amazon.com/blogs/big-data/build-a-schema-on-read-analytics-pipeline-using-amazon-athena/