Concept · ~8 min read

Warehouse vs. Lake vs. Lakehouse

A data warehouse stores structured, processed data optimized for SQL queries; a data lake stores raw data in any format at low cost; a lakehouse combines both — raw storage with warehouse-quality query performance.

Why this appears in interviews

Every data engineering system design question involves choosing where to store data. This decision determines cost, query performance, schema flexibility, and what teams can do with the data.

The mental model — three kinds of libraries

Data warehouse — A curated library where every book is catalogued and indexed. Fast to find. Only books meeting standards are admitted, and adding new ones requires a cataloguing process.

Data lake — A storage unit where you throw everything — books, photos, handwritten notes, audio. Admission is instant and free. Finding something specific is hard unless you know where you put it.

Lakehouse — A storage unit where you can store everything, but a smart cataloguing system can read any of it quickly through a standard query interface.

Data warehouse

What it is: Structured repository of cleaned, transformed data optimized for SQL. Schema defined before data is loaded (schema-on-write).

Strengths: Fast SQL queries, strong governance, reliable data quality, easy for analysts.

Weaknesses: Expensive at large scale, rigid schema, poor support for unstructured data.

Examples: Snowflake, BigQuery, Redshift.
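Schema-on-write can be demonstrated with any relational database. Here is a minimal sketch using Python's built-in sqlite3 module (the table and column names are invented for the example): the schema is declared before any data arrives, and a row that violates it is rejected at load time.

```python
import sqlite3

# Schema is declared up front (schema-on-write); rows that violate it
# are rejected before they ever land in the table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT    NOT NULL,
        amount_usd REAL    NOT NULL
    )
""")

# Conforming row: accepted.
conn.execute("INSERT INTO orders VALUES (1, 'acme', 99.50)")

# Non-conforming row (missing the required customer): rejected on write.
try:
    conn.execute("INSERT INTO orders (order_id, amount_usd) VALUES (2, 10.0)")
except sqlite3.IntegrityError as err:
    print("rejected:", err)

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(row_count)  # only the valid row was stored
```

This enforcement at write time is what gives warehouses their reliable data quality, at the cost of a cataloguing step for every new data shape.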

Data lake

What it is: Centralized store of raw data in original format in cheap object storage (S3, GCS). Schema applied at read time (schema-on-read).

Strengths: Cheap storage, stores anything, good for ML training data.

Weaknesses: Slow queries without additional tooling, easy to degrade into a "data swamp," no ACID guarantees by default.
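Schema-on-read is the mirror image: anything is ingested as-is, and each consumer imposes a schema when it reads. A minimal stdlib sketch (the event fields are invented for the example):

```python
import io
import json

# Raw events land in whatever shape producers emit (schema-on-read):
# ingestion never rejects anything.
raw_events = io.StringIO(
    '{"user": "a", "clicks": 3}\n'
    '{"user": "b"}\n'                                # missing field: still ingested
    '{"user": "c", "clicks": "7", "extra": true}\n'  # wrong type, extra field: still ingested
)

def read_with_schema(lines):
    """Apply the consumer's schema (user: str, clicks: int) at read time."""
    for line in lines:
        rec = json.loads(line)
        yield {"user": str(rec.get("user", "")),
               "clicks": int(rec.get("clicks", 0))}

parsed = list(read_with_schema(raw_events))
print(parsed)
```

Note that nothing stopped the malformed rows from entering the lake; every reader must handle them, which is exactly how data swamps form.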

Lakehouse

What it is: Adds a metadata and governance layer on top of a data lake, enabling ACID transactions, schema enforcement, and fast queries while keeping object storage cost benefits.

How it works: Data stored in open formats (Parquet) in object storage. A table format layer (Delta Lake, Apache Iceberg) adds versioning and ACID. A query engine (Spark, Trino) reads metadata for fast queries.

Examples: Databricks Lakehouse, Apache Iceberg on S3, Delta Lake.
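The core trick of a table format is an ordered transaction log that records which files constitute the table at each version. The toy sketch below mimics that idea with plain JSON entries; real formats (Delta Lake's `_delta_log`, Iceberg's metadata and manifest files) are far richer, and all names here are invented for illustration.

```python
import json

# Ordered list of committed entries, standing in for a transaction log
# like Delta Lake's _delta_log. Appending an entry is the atomic commit.
log = []

def commit(action, files):
    """Append one atomic commit; readers only ever see committed entries."""
    log.append(json.dumps({"version": len(log), "action": action, "files": files}))

def snapshot():
    """Replay the log to compute the set of files in the current table."""
    live = set()
    for entry in map(json.loads, log):
        if entry["action"] == "add":
            live |= set(entry["files"])
        elif entry["action"] == "remove":
            live -= set(entry["files"])
    return sorted(live)

commit("add", ["part-000.parquet", "part-001.parquet"])
commit("remove", ["part-000.parquet"])   # e.g. compaction rewrote this file
commit("add", ["part-002.parquet"])

print(snapshot())  # ['part-001.parquet', 'part-002.parquet']
```

Because readers replay only committed log entries, a half-finished write is invisible to them — that is where the ACID behavior and time-travel (replaying the log up to an earlier version) come from.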

How to choose

  • Fast SQL for analysts, strong data quality → Data warehouse.
  • Cheap raw storage, ML training data → Data lake.
  • Both SQL analytics and ML on large datasets → Lakehouse.
  • Small company, budget conscious → Start with a managed warehouse.

Common interview mistakes

Mistake 1: Treating data lake as synonymous with S3. Without a catalog and governance layer, it is a data swamp.

Mistake 2: Not knowing why lakehouses emerged. Warehouses were too expensive at petabyte scale; data lakes were too hard for analysts to use.

Mistake 3: Not mentioning table formats. Any serious answer about lakehouses should reference Apache Iceberg or Delta Lake.

Key vocabulary

  • Schema-on-write — Schema must be defined before data is written (data warehouse). Enforces quality but reduces flexibility.
  • Schema-on-read — Schema is applied when data is read (data lake). Flexible ingestion but inconsistent quality.
  • Delta Lake / Apache Iceberg — Open table formats that add ACID transactions, versioning, and schema evolution to data lakes.
  • Parquet — A columnar file format used in data lakes and lakehouses. Efficient for analytical queries.