TL;DR: Data warehouses provide performance and governance, data lakes provide flexibility and scale, and lakehouses attempt to combine both using open table formats like Delta Lake and Iceberg. If your data is small or your team is SQL-only, a warehouse is often enough. Lakehouses shine when scale, ML, and multi-engine access matter.

Introduction

The data storage landscape has evolved dramatically over the past three decades. From the structured rigidity of data warehouses to the chaotic flexibility of data lakes, and now to the promising middle ground of lakehouses, each paradigm shift has been driven by changing business needs, technological capabilities, and lessons learned from previous approaches.

This article traces this evolution, explains the core concepts, and provides practical examples to help you understand when to use each approach.

This article assumes basic familiarity with SQL and data pipelines. PySpark examples are included for illustration, not as a requirement.

Part 1: The Data Warehouse Era (1990s-2000s)

The Genesis

Data warehouses emerged in the late 1980s and early 1990s as organizations needed to analyze historical data for business intelligence. Pioneers like Bill Inmon and Ralph Kimball established foundational architectures that are still influential today.

Core Principles

Schema-on-Write: Data is transformed and validated before loading into the warehouse. This ensures data quality and consistency but requires upfront schema design.

Structured Storage: Data warehouses are optimized for structured, tabular data. Everything follows a predefined schema with strong typing and constraints.

Star and Snowflake Schemas: Dimensional modeling techniques organize data into fact tables (events, transactions) and dimension tables (descriptive attributes).

Example: Traditional Data Warehouse Design

Let’s consider an e-commerce scenario. Here’s how you might model this in a traditional data warehouse:

-- Dimension: Products
CREATE TABLE dim_products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(200) NOT NULL,
    category VARCHAR(100),
    brand VARCHAR(100),
    unit_price DECIMAL(10,2),
    -- Slowly Changing Dimension Type 2 fields
    valid_from DATE,
    valid_to DATE,
    is_current BOOLEAN
);
 
-- Dimension: Customers
CREATE TABLE dim_customers (
    customer_id INT PRIMARY KEY,
    customer_name VARCHAR(200),
    email VARCHAR(200),
    country VARCHAR(100),
    segment VARCHAR(50),
    valid_from DATE,
    valid_to DATE,
    is_current BOOLEAN
);
 
-- Dimension: Time
CREATE TABLE dim_date (
    date_id INT PRIMARY KEY,
    full_date DATE,
    day_of_week INT,
    month INT,
    quarter INT,
    year INT,
    is_holiday BOOLEAN
);
 
-- Fact: Sales
CREATE TABLE fact_sales (
    sale_id BIGINT PRIMARY KEY,
    date_id INT REFERENCES dim_date(date_id),
    product_id INT REFERENCES dim_products(product_id),
    customer_id INT REFERENCES dim_customers(customer_id),
    quantity INT,
    unit_price DECIMAL(10,2),
    discount_amount DECIMAL(10,2),
    tax_amount DECIMAL(10,2),
    total_amount DECIMAL(10,2)
);
 
-- Typical analytical query
SELECT 
    d.year,
    d.quarter,
    p.category,
    SUM(f.total_amount) as total_revenue,
    COUNT(DISTINCT f.customer_id) as unique_customers
FROM fact_sales f
JOIN dim_date d ON f.date_id = d.date_id
JOIN dim_products p ON f.product_id = p.product_id
WHERE d.year = 2024
GROUP BY d.year, d.quarter, p.category
ORDER BY d.quarter, total_revenue DESC;

The Limitations

As businesses evolved, several limitations became apparent:

  1. High Storage Costs: Enterprise data warehouses like Teradata and Oracle were expensive, charging premium prices for storage.

  2. Schema Rigidity: Adding new data sources required extensive ETL development and schema changes.

  3. Poor Support for Semi-Structured Data: JSON, XML, logs, and other non-tabular formats were difficult to integrate.

  4. Limited Scalability: Vertical scaling (bigger machines) hit physical and economic limits.

  5. Slow Adaptation: Business requirements changed faster than data warehouse schemas could evolve.

Part 2: The Data Lake Revolution (2010s)

The Hadoop Wave

The explosion of data volume, velocity, and variety in the 2010s coincided with the emergence of Hadoop and cheap cloud storage. This gave birth to the data lake concept.

Core Principles

Schema-on-Read: Store data in its raw form and apply schema only when reading. This provides maximum flexibility but shifts the responsibility of data quality and type validation from the system to the developer. Every reader must handle schema interpretation, which can lead to inconsistencies.

Multi-Format Support: Store structured (CSV, Parquet), semi-structured (JSON, Avro), and unstructured data (images, videos, logs) in the same repository.

Separation of Storage and Compute: Leverage cheap object storage (S3, Azure Blob, GCS) and scale compute independently.

ELT over ETL: Extract and Load first, Transform later as needed.

Example: Data Lake Architecture

Here’s a typical data lake structure on AWS S3, following the Medallion Architecture pattern (also known as Bronze-Silver-Gold):

s3://company-data-lake/
├── raw/                          # BRONZE: Landing zone for raw ingested data
│   ├── sales/                    # Exact copy of source, no transformations
│   │   ├── year=2024/
│   │   │   ├── month=01/
│   │   │   │   └── sales_20240115.json
│   │   │   └── month=02/
│   │   │       └── sales_20240201.json
│   ├── clickstream/
│   │   └── date=2024-01-15/
│   │       └── events.json.gz
│   └── iot_sensors/
│       └── sensor_readings_2024.parquet
├── processed/                    # SILVER: Cleaned and validated data
│   ├── sales_cleaned/            # Deduplicated, type-validated
│   │   └── year=2024/
│   │       └── month=01/
│   │           └── part-00000.parquet
│   └── customer_profiles/
│       └── profiles.parquet
└── curated/                      # GOLD: Business-ready aggregated datasets
    └── sales_analytics/          # Optimized for consumption
        └── monthly_summary.parquet

The Medallion Architecture became the de facto standard for organizing data lakes and later lakehouses, providing a clear progression from raw ingestion to business-ready analytics.

Processing Data Lake Files with PySpark

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, count, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType
 
# Initialize Spark
spark = SparkSession.builder \
    .appName("DataLakeProcessing") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()
 
# Define schema for JSON data (schema-on-read)
sales_schema = StructType([
    StructField("transaction_id", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("customer_id", StringType(), True),
    StructField("product_id", StringType(), True),
    StructField("quantity", DoubleType(), True),
    StructField("amount", DoubleType(), True),
    StructField("metadata", StringType(), True)  # Nested JSON
])
 
# Read raw JSON from data lake
raw_sales = spark.read \
    .schema(sales_schema) \
    .json("s3://company-data-lake/raw/sales/year=2024/month=01/")
 
# Transform and clean
cleaned_sales = raw_sales \
    .filter(col("amount") > 0) \
    .filter(col("customer_id").isNotNull()) \
    .dropDuplicates(["transaction_id"]) \
    .withColumn("date", col("timestamp").cast("date"))
 
# Write to processed zone as Parquet (columnar, compressed)
cleaned_sales.write \
    .mode("overwrite") \
    .partitionBy("date") \
    .parquet("s3://company-data-lake/processed/sales_cleaned/year=2024/month=01/")
 
# Aggregate for analytics
monthly_summary = cleaned_sales \
    .groupBy("date", "product_id") \
    .agg(
        sum("amount").alias("total_revenue"),
        sum("quantity").alias("total_quantity"),
        count("transaction_id").alias("transaction_count")
    )
 
# Write to curated zone
monthly_summary.write \
    .mode("overwrite") \
    .parquet("s3://company-data-lake/curated/sales_analytics/monthly_summary/")

The Data Swamp Problem

While data lakes promised flexibility and cost savings, many organizations discovered they had created "data swamps" instead:

  1. No Data Quality Guarantees: Without schema enforcement, bad data proliferated.

  2. Lack of ACID Transactions: No atomicity, consistency, isolation, or durability guarantees.

  3. No Schema Evolution: Changing data formats broke downstream consumers.

  4. Poor Performance: Reading raw JSON or CSV files is slow compared to optimized warehouse systems.

  5. Difficult Governance: Tracking lineage, access control, and compliance became nightmares.

  6. No Time Travel: Difficult to query historical states or rollback mistakes.

A Concrete Example of the Problem

# Attempting to read evolving schema from data lake
# Day 1: Original schema
df_day1 = spark.read.json("s3://data-lake/events/2024-01-01/")
# Schema: {user_id, event_type, timestamp}
 
# Day 2: New field added
df_day2 = spark.read.json("s3://data-lake/events/2024-01-02/")
# Schema: {user_id, event_type, timestamp, device_type}
 
# Day 3: Field renamed (breaking change!)
df_day3 = spark.read.json("s3://data-lake/events/2024-01-03/")
# Schema: {user_id, event_name, timestamp, device_type}  # event_type -> event_name
 
# Reading all together causes issues
df_all = spark.read.json("s3://data-lake/events/2024-01-*/")
# Result: Messy schema with nullable fields, missing columns, type conflicts
 
# No easy way to:
# - Enforce schema validation
# - Track who changed what and when
# - Rollback to previous version
# - Maintain data quality
 
# The Real Impact:
# When the `event_type` field was renamed to `event_name` on Day 3, every dashboard 
# and ML model relying on that field instantly failed. Because there was no transaction 
# log or schema enforcement, identifying exactly when the break happened required manual 
# log diving through application logs and git history, often taking hours or days to debug.

Part 3: The Lakehouse Paradigm (2020+)

The Best of Both Worlds

Lakehouses emerged as a response to the limitations of both warehouses and lakes. They aim to provide:

  • The flexibility and cost-effectiveness of data lakes
  • The reliability and performance of data warehouses

Key Technologies

Three main open-source table formats have emerged:

  1. Delta Lake (Databricks, 2019): Built on Parquet, adds transaction log
  2. Apache Iceberg (Netflix, 2018): Designed for scale, strong schema evolution
  3. Apache Hudi (Uber, 2016): Optimized for upserts and incremental processing
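Delta Lake and Apache Iceberg are demonstrated in the sections that follow; Hudi is not revisited, so here is a rough sketch of what an upsert-style Hudi write looks like (the options are standard Hudi write options, but the table name, key fields, and path are illustrative assumptions):

# Rough sketch of a Hudi upsert write; assumes a `df` DataFrame like the sales examples later on
hudi_options = {
    "hoodie.table.name": "sales",                                  # target table name
    "hoodie.datasource.write.recordkey.field": "transaction_id",   # record key used for upserts
    "hoodie.datasource.write.precombine.field": "timestamp",       # latest value wins on key collisions
    "hoodie.datasource.write.operation": "upsert",
}

df.write \
    .format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("s3://company-lakehouse/sales_hudi/")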

Core Features of Lakehouses

ACID Transactions: Ensure data consistency even with concurrent writes.

Schema Evolution: Add, remove, or rename columns without breaking existing readers.

Time Travel: Query data as it existed at any point in history.

Upserts and Deletes: Efficiently update and delete rows (not just append).

Unified Batch and Streaming: Same table format for both processing modes.

Better Performance: Indexing, caching, and query optimization.
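Most of these features are demonstrated with Delta Lake in Part 4. Unified batch and streaming is the one that is not revisited there, so here is a minimal sketch of it using Delta Lake's streaming source and sink (the paths and checkpoint location are illustrative assumptions):

# Batch read of a Delta table
batch_df = spark.read.format("delta").load("s3://company-lakehouse/sales/")

# Streaming read of the same table: new commits are picked up incrementally
stream_df = spark.readStream.format("delta").load("s3://company-lakehouse/sales/")

# Continuously append incoming rows to a downstream Delta table
query = stream_df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "s3://company-lakehouse/_checkpoints/sales_mirror/") \
    .outputMode("append") \
    .start("s3://company-lakehouse/sales_mirror/")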

Part 4: Practical Examples with Delta Lake

Let’s rebuild our e-commerce example using Delta Lake to see the differences.

Setting Up Delta Lake

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
 
# Initialize Spark with Delta Lake
builder = SparkSession.builder \
    .appName("LakehouseExample") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
 
spark = configure_spark_with_delta_pip(builder).getOrCreate()

Creating Tables with ACID Guarantees

from datetime import datetime

from pyspark.sql.functions import col, current_timestamp
from pyspark.sql.types import *
 
# Define schema with validation
sales_schema = StructType([
    StructField("transaction_id", StringType(), nullable=False),
    StructField("timestamp", TimestampType(), nullable=False),
    StructField("customer_id", StringType(), nullable=False),
    StructField("product_id", StringType(), nullable=False),
    StructField("quantity", IntegerType(), nullable=False),
    StructField("amount", DoubleType(), nullable=False)
])
 
# Sample data (timestamps as datetime objects so they match TimestampType)
data = [
    ("TXN001", datetime(2024, 1, 15, 10, 30), "CUST001", "PROD001", 2, 49.99),
    ("TXN002", datetime(2024, 1, 15, 11, 45), "CUST002", "PROD002", 1, 129.99),
    ("TXN003", datetime(2024, 1, 15, 14, 20), "CUST001", "PROD003", 3, 29.99)
]
 
df = spark.createDataFrame(data, schema=sales_schema)
 
# Write as Delta table with partitioning
df.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("timestamp") \
    .save("s3://company-lakehouse/sales/")
 
# Create table metadata for SQL access
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales
    USING DELTA
    LOCATION 's3://company-lakehouse/sales/'
""")

Note: in production, partitioning by a raw timestamp column is discouraged; its near-unique values create one tiny partition per value. Partition on a derived date column instead, as sketched below.
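A minimal sketch of that alternative, reusing the df defined above (the sale_date column name and the target path are illustrative choices):

from pyspark.sql.functions import to_date

# Derive a low-cardinality date column and partition on it instead of the raw timestamp
df.withColumn("sale_date", to_date(col("timestamp"))) \
    .write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("sale_date") \
    .save("s3://company-lakehouse/sales_by_date/")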

Schema Evolution in Action

# Initial write
initial_data = spark.createDataFrame([
    ("TXN004", datetime(2024, 1, 16, 9, 0), "CUST003", "PROD001", 1, 49.99)
], schema=sales_schema)

initial_data.write.format("delta").mode("append").save("s3://company-lakehouse/sales/")

# Later: Add a new column without breaking existing data
# (build a new StructType rather than mutating sales_schema in place)
extended_schema = StructType(
    sales_schema.fields + [StructField("discount_code", StringType(), nullable=True)]
)

new_data = spark.createDataFrame([
    ("TXN005", datetime(2024, 1, 16, 10, 30), "CUST004", "PROD002", 2, 259.98, "WINTER20")
], schema=extended_schema)
 
# Schema evolution happens automatically
new_data.write \
    .format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("s3://company-lakehouse/sales/")
 
# Read back - old records have null for discount_code
full_data = spark.read.format("delta").load("s3://company-lakehouse/sales/")
full_data.show()

Time Travel: Query Historical Data

from delta.tables import DeltaTable
 
# Read current version
current = spark.read.format("delta").load("s3://company-lakehouse/sales/")
print(f"Current record count: {current.count()}")
 
# Read as of specific timestamp
historical = spark.read \
    .format("delta") \
    .option("timestampAsOf", "2024-01-15 12:00:00") \
    .load("s3://company-lakehouse/sales/")
print(f"Historical record count: {historical.count()}")
 
# Read specific version number
version_2 = spark.read \
    .format("delta") \
    .option("versionAsOf", 2) \
    .load("s3://company-lakehouse/sales/")
 
# View history
deltaTable = DeltaTable.forPath(spark, "s3://company-lakehouse/sales/")
history = deltaTable.history()
history.select("version", "timestamp", "operation", "operationMetrics").show(truncate=False)

UPSERT (Merge) Operations

One of the most powerful features: updating existing records efficiently.

from datetime import datetime

from delta.tables import DeltaTable
from pyspark.sql.functions import col
 
# Load existing Delta table
salesTable = DeltaTable.forPath(spark, "s3://company-lakehouse/sales/")
 
# New data with updates and inserts
updates = spark.createDataFrame([
    ("TXN001", "2024-01-15 10:30:00", "CUST001", "PROD001", 2, 44.99, None),  # Price correction
    ("TXN006", "2024-01-17 08:00:00", "CUST005", "PROD003", 1, 29.99, None)   # New transaction
], schema=extended_schema)
 
# Perform UPSERT (merge)
salesTable.alias("target").merge(
    updates.alias("source"),
    "target.transaction_id = source.transaction_id"
).whenMatchedUpdate(
    set = {
        "amount": "source.amount",
        "quantity": "source.quantity"
    }
).whenNotMatchedInsert(
    values = {
        "transaction_id": "source.transaction_id",
        "timestamp": "source.timestamp",
        "customer_id": "source.customer_id",
        "product_id": "source.product_id",
        "quantity": "source.quantity",
        "amount": "source.amount",
        "discount_code": "source.discount_code"
    }
).execute()

DELETE Operations with Conditions

# Delete fraudulent transactions
salesTable = DeltaTable.forPath(spark, "s3://company-lakehouse/sales/")
 
salesTable.delete("amount < 0 OR amount > 10000")
 
# Can rollback if mistake was made
# spark.sql("RESTORE TABLE sales TO VERSION AS OF 5")

Optimizing Performance

# Optimize file layout (compaction)
spark.sql("OPTIMIZE sales")
 
# Z-ordering for better data skipping (multi-dimensional clustering)
spark.sql("OPTIMIZE sales ZORDER BY (customer_id, product_id)")
 
# Vacuum old files (clean up after 7 days retention)
spark.sql("VACUUM sales RETAIN 168 HOURS")

Comparison: Same Query Across Architectures

Let’s compare how the same analytical query performs across different architectures:

from pyspark.sql.functions import col, count, date_trunc, sum

# ===== Data Warehouse (SQL) =====
# Pre-aggregated, optimized for this exact query
"""
SELECT 
    DATE_TRUNC('month', timestamp) as month,
    COUNT(*) as transaction_count,
    SUM(amount) as total_revenue
FROM sales
WHERE timestamp >= '2024-01-01'
GROUP BY month
ORDER BY month;
"""
# Performance: ~100ms (indexed, columnar storage, optimized)
 
# ===== Data Lake (Raw JSON/Parquet) =====
# Full table scan, no optimization
raw_lake_df = spark.read.parquet("s3://data-lake/raw/sales/")
result = raw_lake_df \
    .filter(col("timestamp") >= "2024-01-01") \
    .groupBy(date_trunc("month", "timestamp").alias("month")) \
    .agg(
        count("*").alias("transaction_count"),
        sum("amount").alias("total_revenue")
    ) \
    .orderBy("month")
# Performance: ~5-10s (full scan, no indexes)
 
# ===== Lakehouse (Delta Lake) =====
# ACID, partitioned, with stats and optimization
delta_df = spark.read.format("delta").load("s3://lakehouse/sales/")
result = delta_df \
    .filter(col("timestamp") >= "2024-01-01") \
    .groupBy(date_trunc("month", "timestamp").alias("month")) \
    .agg(
        count("*").alias("transaction_count"),
        sum("amount").alias("total_revenue")
    ) \
    .orderBy("month")
# Performance: ~500ms-2s (data skipping, stats, partitioning)
# After OPTIMIZE: ~200-500ms

Why the Performance Difference?

The lakehouse sits between the warehouse and raw lake for good reasons:

Why Warehouses are Fastest:

  • Purpose-built query engines with decades of optimization
  • Automatic statistics collection and cost-based optimization
  • Materialized views and pre-aggregations
  • Tight coupling between storage and compute

Why Lakehouses are Faster than Raw Lakes:

  • Data Skipping: Delta Lake maintains min/max statistics in the transaction log, allowing the engine to skip entire files that don’t match filter predicates
  • Z-Ordering: Multi-dimensional clustering co-locates related data, improving cache hit rates
  • Metadata Caching: The transaction log provides instant access to partition information without listing all files in object storage
  • Compaction: OPTIMIZE commands merge small files into larger, more efficient ones for scanning
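A quick way to observe some of this machinery from Spark, reusing the `sales` Delta table registered earlier (a sketch, not a benchmark):

from pyspark.sql.functions import col

# File-level layout that data skipping and compaction operate on
spark.sql("DESCRIBE DETAIL sales").select("numFiles", "sizeInBytes", "partitionColumns").show()

# A selective predicate lets the engine prune files whose min/max statistics cannot match;
# the pushed-down filter shows up in the scan node of the physical plan
spark.read.format("delta") \
    .load("s3://company-lakehouse/sales/") \
    .filter(col("customer_id") == "CUST001") \
    .explain()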

Why Lakehouses are Slower than Warehouses:

  • Built on object storage (S3, ADLS) which has higher latency than proprietary warehouse storage
  • Transaction log overhead for ACID guarantees
  • Less mature query optimization compared to 20+ year old warehouse engines
  • General-purpose compute (Spark) vs specialized query engines

Part 5: Apache Iceberg - An Alternative Approach

While Delta Lake is popular, Apache Iceberg offers some distinct advantages, particularly for multi-engine environments.

Key Differences

  • Hidden Partitioning: Users don’t need to know partition structure
  • Partition Evolution: Change partitioning scheme without rewriting data
  • Multi-Engine Support: Works seamlessly with Spark, Trino, Flink, Dremio, etc.
  • Snapshot Isolation: Better handling of concurrent writes

The Rise of "Iceberg Everywhere"

A significant trend in 2024-2025 has been the convergence of traditional data warehouses and lakehouses through Iceberg support:

  • Snowflake announced native Iceberg table support, allowing users to query Iceberg tables in their data lake directly from Snowflake
  • BigQuery added support for Iceberg tables stored in Google Cloud Storage
  • AWS integrated Iceberg deeply into the Glue Data Catalog and Athena

This convergence reinforces the conclusion that the industry is moving toward open table formats as the foundation layer, with different engines optimized for different workloads sitting on top. The "warehouse vs lakehouse" distinction is blurring as warehouses adopt lake capabilities and lakehouses improve query performance.

Example with Iceberg

# Create Iceberg table
spark.sql("""
    CREATE TABLE catalog.db.sales (
        transaction_id STRING,
        timestamp TIMESTAMP,
        customer_id STRING,
        product_id STRING,
        quantity INT,
        amount DOUBLE
    )
    USING iceberg
    PARTITIONED BY (days(timestamp))
""")
 
# Write data
df.writeTo("catalog.db.sales").append()
 
# Change partition scheme (no data rewrite needed!)
spark.sql("""
    ALTER TABLE catalog.db.sales
    SET PARTITION SPEC (months(timestamp))
""")
 
# Time travel with Iceberg
spark.read \
    .option("snapshot-id", "3751520553990309440") \
    .table("catalog.db.sales")
 
# Or by timestamp
spark.read \
    .option("as-of-timestamp", "1642425600000") \
    .table("catalog.db.sales")

Part 6: When to Use What?

Data Warehouse (Snowflake, BigQuery, Redshift)

Best for:

  • Structured data with stable schemas
  • Heavy BI and reporting workloads
  • SQL-centric teams
  • Compliance-heavy industries requiring certified platforms
  • When simplicity and managed services are priorities

Not ideal for:

  • Semi-structured or unstructured data
  • ML/data science workflows requiring raw data access
  • Cost-sensitive applications with massive data volumes
  • Real-time streaming ingestion

Data Lake (S3/Parquet, Azure Data Lake)

Best for:

  • Archival and long-term storage
  • Raw data landing zone
  • Diverse data types and formats
  • Cost optimization (storage is cheap)
  • When you need maximum flexibility

Not ideal for:

  • Production analytics requiring consistency
  • Applications needing transactional guarantees
  • When data quality is critical
  • Teams without strong data engineering capabilities

Lakehouse (Delta Lake, Iceberg, Hudi)

Best for:

  • Modern data platforms combining analytics and ML
  • Organizations transitioning from data lakes
  • Scenarios requiring both batch and streaming
  • When you need ACID guarantees at data lake scale
  • Multi-engine, open-format requirements

Not ideal for:

  • Very small datasets (overhead not justified)
  • When your team lacks Spark/distributed computing expertise
  • Legacy systems deeply integrated with traditional warehouses

Let me insist on this point: if your total data volume is under 100 GB, the management overhead of Spark and Delta is likely overkill compared to a simple Postgres instance or a small BigQuery project.

Conclusion: The Modern Data Stack

In practice, most organizations don’t choose just one approach. A typical modern architecture might look like:

Operational Data
    ↓
[Raw Data Lake (S3)]
    ↓
[Lakehouse Layer (Delta/Iceberg)]
    ↓
[Data Warehouse (Snowflake/BigQuery)] ← For critical BI
    ↓
[Semantic Layer / Metrics Store]
    ↓
[BI Tools / Applications]

The Governance Layer

One critical piece often missing from pure data lake implementations was comprehensive governance. Modern lakehouse platforms address this with tools like:

  • Unity Catalog (Databricks): Centralized governance for data, ML models, and AI assets
  • Apache Atlas (Open Source): Metadata management and data lineage tracking
  • AWS Lake Formation: Fine-grained access control and data catalog
  • Nessie (Project Nessie): Git-like versioning for data lakehouse tables

These governance layers solve the "Difficult Governance" problem that plagued early data lakes by providing:

  • Audit trails: Who accessed what data and when
  • Data lineage: Track transformations from source to consumption
  • Access control: Fine-grained permissions at table, column, and row levels
  • Data discovery: Searchable catalogs with business context
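As a concrete flavor of what access control looks like in these catalogs, permissions are typically expressed as SQL grants; a hedged sketch in Unity Catalog-style syntax (the catalog, schema, and group names here are made up):

# Illustrative only: grant and revoke read access on a governed table
spark.sql("GRANT SELECT ON TABLE main.analytics.sales TO `data_analysts`")
spark.sql("REVOKE SELECT ON TABLE main.analytics.sales FROM `contractors`")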

The key insight: Lakehouses don’t replace everything, but they solve many problems that plagued pure data lake architectures while maintaining flexibility and cost-effectiveness.

The evolution from warehouses β†’ lakes β†’ lakehouses reflects a maturation of the data engineering field. We’ve learned that:

  1. Structure matters, but rigidity is costly
  2. Flexibility is valuable, but chaos is expensive
  3. Governance is essential, not optional
  4. The middle path (lakehouse) often provides the best trade-offs

As the ecosystem continues to evolve, expect further convergence: warehouses adding lake capabilities (Snowflake’s Iceberg support), lakes adding warehouse features, and new abstractions that make these distinctions less relevant for end users. The future is likely to be multi-format and multi-engine, with Iceberg emerging as a neutral standard that works across platforms.