What is a Data Lake?

By Daniel Linstedt 2019-08-26 2026-01-23

Introduction

The industry has struggled for a long time to define a data lake. We are taking the plunge: let’s properly define one. I have seen hundreds of different definitions around the world, and none of them seems to give an organization the foundations it needs to build a successful data lake.

This post, along with this short 10-minute video, is meant to assist you in defining your data lake.

Special Thanks

I need to take a minute to give special thanks to a few people. I am constantly reading what the industry says, but I am also constantly vetting these ideas and challenging my friends and colleagues to discuss WHY and HOW these ideas shape, change, and apply to the industry. Here is a list of the contributors to my ideas around Data Lakes:

Tamara Dull (Unofficially blessed my definitions)
Kent Graziano
Sanjay Pande
Bruce McCartney
Scott Ambler
Cindi Meyersohn

These folks were instrumental in helping me formulate the ideas in this video, and this blog post. I challenged them to assess the value of the statements made by the industry, along with what was a platform / vendor position vs a conceptual notion.

Defining A Data Lake – Definitions

The original definition of Data Lake written by Tamara Dull in 2015 really doesn’t do justice to what a Data Lake is or should be. In fact, the original chart is a better comparison of physical platform properties (comparing Traditional RDBMS platforms to NewSQL / NoSQL Platforms). In order to properly define a Data Lake, we need to clarify what it is not.

In this case, we need to define:

Data Dump
Data Junkyard
Data Swamp
Landing Zone

Each one is unique in its own right. Too many people in the industry use the term Data Lake when they really mean Data Dump. I.T. teams call it a Landing Zone to be politically correct, as business users grimace when they hear “Data Dump” or “Data Junkyard”.

Moving from Data Dump to Data Junkyard to Data Swamp to Landing Zone is a maturity curve. Each component requires specific definition, as well as specific features and processes to be applied to the data.
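The ordering on this maturity curve can be written down as an ordered enumeration. This is only an illustrative sketch: the `Stage` enum and the `is_more_mature` and `next_stage` helpers are hypothetical names, and the integer values encode nothing more than the order given above.

```python
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    """Stages named in this post, ordered from least to most mature.

    The integer values are an assumption used only to express order.
    """
    DATA_DUMP = 1
    DATA_JUNKYARD = 2
    DATA_SWAMP = 3
    LANDING_ZONE = 4


def is_more_mature(a: Stage, b: Stage) -> bool:
    """Return True if stage `a` sits further along the curve than stage `b`."""
    return a > b


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the next stage on the curve, or None at the last stage."""
    members = list(Stage)
    idx = members.index(stage)
    return members[idx + 1] if idx + 1 < len(members) else None
```

Note that none of these stages is itself a Data Lake; the enumeration only orders the things a Data Lake is often confused with.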

Data Lakes are a solution for Business Intelligence / analytics – they are not a platform, they are not a tool, they are not a file store in the cloud! Data Lakes are not Data Hubs, and this entry is not about defining a Data Hub – we will do that in another blog post soon.

Landing Zones are not Data Lakes!

Data Warehouses are part of a Data Lake solution, but should not be compared to a Data Lake.

Keys to Success

There are several keys to success in building a Data Lake. Often businesses forget that good governance, security, and methodology are part of a data lake strategy. Typically, data dumps are built and labeled as “data lakes”; then, when the business can’t use what’s built, they deem it unsuccessful (a failure) and shut it down.

Some of the reasons they can’t use it are: it’s missing a solid definition, it lacks business use cases, the data isn’t properly defined / profiled / documented in metadata, or there was a security breach at the file level.

A few of the top keys to success include:

Governance
Security
Profiling / Structure Assignment
Metadata & Lineage
Methodology & Approach
Design and Architecture
Flow (In and Out)
Collaboration (I.T., Data Science groups and Business Analysts)
Use Case / Purpose for the Data
Leverage / Incorporation of a Data Warehouse
Productionalization Processes
Managed Self-Service BI
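One way to put the list above to work is as a simple checklist. The sketch below is an assumption-laden illustration, not a prescribed method: `KEYS_TO_SUCCESS` copies the list (with the Collaboration entry shortened), and `missing_keys` is a hypothetical helper that reports which keys a project has not yet addressed.

```python
# The keys listed in this post, in the order given.
KEYS_TO_SUCCESS = (
    "Governance",
    "Security",
    "Profiling / Structure Assignment",
    "Metadata & Lineage",
    "Methodology & Approach",
    "Design and Architecture",
    "Flow (In and Out)",
    "Collaboration",
    "Use Case / Purpose for the Data",
    "Leverage / Incorporation of a Data Warehouse",
    "Productionalization Processes",
    "Managed Self-Service BI",
)


def missing_keys(addressed):
    """Return the keys not yet addressed, preserving the list order.

    Raises ValueError if `addressed` contains a name that is not in
    KEYS_TO_SUCCESS, so typos do not silently count as progress.
    """
    addressed = set(addressed)
    unknown = addressed - set(KEYS_TO_SUCCESS)
    if unknown:
        raise ValueError(f"Unknown keys: {sorted(unknown)}")
    return [key for key in KEYS_TO_SUCCESS if key not in addressed]
```

For example, `missing_keys(["Governance", "Security"])` returns the remaining ten keys, starting with "Profiling / Structure Assignment".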

It is our opinion that these (and more) elements are necessary for success in building, deploying, and managing a Data Lake. Defining a data lake or data lake architecture is challenging with buzzwords running amok. We hope our continued pursuit of defining a data lake will help you and your organization. Thank you for reading.
