Defining the Data Lake
For too many years now, the definition of Data Lake has been muddy at best. The term has no clear definition in the industry. While many have tried, they have not (in my opinion) succeeded. I would judge success as being able to implement a Data Lake at an enterprise level that makes sense of the solution and provides value to the enterprise at the right levels.
In this short 10-minute video we will explore Data Lakes, Data Warehouses, and what Data Lakes need in order to be successful. I’ve included an article about this subject for further reference and review.
The industry has been struggling for a long time with how to properly define a data lake. I have seen hundreds of different definitions around the world, and none of them seem to provide an organization with the foundations they need to build a successful data lake.
This post, along with this short 10-minute video, is meant to assist you in defining your data lake.
I need to take a minute to give special thanks to a few people. I am constantly reading what the industry says, but I am also constantly vetting these ideas and challenging my friends and colleagues to discuss WHY and HOW these ideas shape, change, and apply to the industry. Here is a list of the contributors to my ideas around Data Lakes:
- Tamara Dull (Unofficially blessed my definitions)
- Kent Graziano
- Sanjay Pande
- Bruce McCartney
- Scott Ambler
- Cindi Meyersohn
These folks were instrumental in helping me formulate the ideas in this video and this blog post. I challenged them to assess the value of the statements made by the industry, and to separate what was a platform / vendor position from what was a conceptual notion.
The original definition of Data Lake written by Tamara Dull in 2015 really doesn’t do justice to what a Data Lake is or should be. In fact, the original chart is a better comparison of physical platform properties (comparing Traditional RDBMS platforms to NewSQL / NoSQL Platforms). In order to properly define a Data Lake, we need to clarify what it is not.
In this case, we need to define the terms that are most often confused with a Data Lake: Data Dump, Data Junkyard, Data Swamp, and Landing Zone.
Each one is unique in its own right. Too many people in the industry use the term Data Lake when they really mean Data Dump. The I.T. teams call it a Landing Zone to be politically correct, as business users grimace when they hear “Data Dump” or “Data Junkyard”.
Moving from Data Dump to Data Junkyard to Data Swamp to Landing Zone is a maturity curve. Each component requires specific definition, as well as specific features and processes to be applied to the data.
Data Lakes are a solution for Business Intelligence / analytics – they are not a platform, they are not a tool, they are not a file store in the cloud! Data Lakes are not Data Hubs, and this entry is not about defining a Data Hub – we will do that in another blog post soon.
Landing Zones are not Data Lakes!
Data Warehouses are part of a Data Lake solution, but should not be compared to a Data Lake.
Keys to Success
There are several keys to success in building a Data Lake. Often businesses forget that good governance, security, and methodology are part of a data lake strategy. Typically, data dumps are built and labeled as “data lakes”; then, when the business can’t use what’s built, they deem it unsuccessful (a failure) and shut it down.
Some of the reasons they can’t use it are: it’s missing a solid definition, it lacks business use cases, the data isn’t properly defined / profiled / documented in metadata, or there was a security breach at the file level.
A few of the top keys to success include:
- Profiling / Structure Assignment
- Metadata & Lineage
- Methodology & Approach
- Design and Architecture
- Flow (In and Out)
- Collaboration (I.T., Data Science groups and Business Analysts)
- Use Case / Purpose for the Data
- Leverage / Incorporation of a Data Warehouse
- Productionalization Processes
- Managed Self-Service BI
It is our opinion that these (and more) elements are necessary for success in building, deploying, and managing a Data Lake.
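To make the first two keys (Profiling / Structure Assignment and Metadata & Lineage) a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not a prescribed implementation: the names `ColumnProfile`, `DatasetMetadata`, `profile_records`, and `describe_dataset` are hypothetical, and the sample records are made up. The idea is simply that data landing in the lake gets profiled, assigned a structure, and documented with its source and purpose before anyone is asked to use it.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ColumnProfile:
    """Profiling result for one column (hypothetical structure)."""
    name: str
    inferred_type: str   # "int", "str", "mixed", or "unknown"
    null_count: int
    distinct_count: int


@dataclass
class DatasetMetadata:
    """Minimal metadata entry: structure, lineage, and purpose."""
    name: str
    source: str          # lineage: where the data came from
    use_case: str        # the business purpose for keeping the data
    columns: list[ColumnProfile] = field(default_factory=list)


def profile_records(records: list[dict[str, Any]]) -> list[ColumnProfile]:
    """Infer a simple structure from a list of dict-shaped records."""
    # Collect column names in first-seen order.
    names: list[str] = []
    for record in records:
        for key in record:
            if key not in names:
                names.append(key)

    profiles: list[ColumnProfile] = []
    for name in names:
        # A key missing from a record is treated the same as a null value.
        values = [record.get(name) for record in records]
        present = [v for v in values if v is not None]
        type_names = {type(v).__name__ for v in present}
        if not type_names:
            inferred = "unknown"
        elif len(type_names) == 1:
            inferred = next(iter(type_names))
        else:
            inferred = "mixed"
        profiles.append(
            ColumnProfile(
                name=name,
                inferred_type=inferred,
                null_count=len(values) - len(present),
                # repr() keeps this working for unhashable values too.
                distinct_count=len({repr(v) for v in present}),
            )
        )
    return profiles


def describe_dataset(name: str, source: str, use_case: str,
                     records: list[dict[str, Any]]) -> DatasetMetadata:
    """Bundle the profile with lineage and purpose into one metadata entry."""
    return DatasetMetadata(name, source, use_case, profile_records(records))


# Made-up sample data for illustration only.
sample = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": None},
    {"id": 3, "city": "Oslo", "amount": 9.5},
]
entry = describe_dataset("customers", "crm_export", "churn analysis", sample)
for column in entry.columns:
    print(column)
```

A data dump is what you have when this step is skipped: files exist, but nobody can say what structure they hold, where they came from, or why they are there.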