A Data Engineer builds and maintains the pipelines, infrastructure, and systems that move, transform, and store data so that analysts, data scientists, and ML engineers can use it reliably.
Why this appears in interviews
Data engineering is one of the most frequently misunderstood roles in tech. Articulating how it differs from adjacent roles before any technical question signals you know what the job requires.
The mental model — the plumbing behind the house
Think of a city's water system. Residents turn on a tap and water flows — they never think about the pipes, pumps, and treatment plants. When it works, nobody notices. When a pipe bursts, everything stops. Data engineers build the plumbing. You are responsible for data as infrastructure, not data as insight.
How the role differs
Data Analyst: Consumes clean data to answer business questions. Uses SQL, Tableau. Does not build pipelines.
Data Scientist: Builds models using data prepared by engineers. Relies on data engineers for reliable data access.
ML Engineer: Deploys models to production. Overlaps with data engineering on feature pipelines.
Data Engineer: Builds the systems that make all of the above possible. Owns the pipeline from raw source to clean, queryable tables. Cares about freshness, reliability, schema stability, and query performance.
What data engineers actually build
- Writing dbt (Data Build Tool, a SQL-based transformation layer that makes transformations version-controlled and modular) models that transform raw Stripe events into a clean payments table in Snowflake
- Debugging an Airflow DAG that fails because an upstream API changed its response schema
- Partitioning a Spark job that processes 500GB of clickstream data per day
- Setting up data quality checks that alert when a table has unexpected null rates
- Designing a schema for a new data product that three teams will depend on
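The null-rate check in the list above can be sketched in a few lines of plain Python. This is a toy sketch, not a real framework like Great Expectations; the column names, sample rows, and 0.05 default threshold are all illustrative assumptions:

```python
# Minimal null-rate check: flag columns whose null fraction exceeds a threshold.
# Rows, column names, and thresholds here are illustrative.

def null_rate(rows, column):
    """Fraction of rows where `column` is None or missing."""
    if not rows:
        return 0.0
    nulls = sum(1 for r in rows if r.get(column) is None)
    return nulls / len(rows)

def check_null_rates(rows, columns, threshold=0.05):
    """Return (column, rate) pairs that breach the threshold."""
    return [(c, null_rate(rows, c))
            for c in columns
            if null_rate(rows, c) > threshold]

payments = [
    {"id": 1, "amount": 10.0, "currency": "USD"},
    {"id": 2, "amount": None, "currency": "USD"},
    {"id": 3, "amount": 12.5, "currency": None},
    {"id": 4, "amount": 9.0,  "currency": "USD"},
]

breaches = check_null_rates(payments, ["amount", "currency"], threshold=0.2)
print(breaches)  # [('amount', 0.25), ('currency', 0.25)]
```

In production the same idea runs as a scheduled check against warehouse tables, and a breach pages the on-call engineer instead of printing.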
The modern data stack
Fivetran/Airbyte (ingestion) → Snowflake/BigQuery/Databricks (storage) → dbt (transformation) → Airflow/Prefect (orchestration) → Monte Carlo/Great Expectations (monitoring).
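The ELT shape of this stack — load raw data first, transform it inside the warehouse — can be sketched with plain Python standing in for each stage. Every name and function body here is an illustrative placeholder, not a real Fivetran, Snowflake, or dbt API:

```python
# Toy ELT flow: extract raw records, load them untouched, then transform.
# An in-memory dict stands in for the warehouse.

warehouse = {}

def extract():
    # In reality: an ingestion tool (Fivetran/Airbyte) pulling from an API.
    return [{"event": "charge", "amount_cents": 1250},
            {"event": "refund", "amount_cents": -300}]

def load(records):
    # ELT loads raw data as-is; transformation happens later, in the warehouse.
    warehouse["raw_events"] = records

def transform():
    # In reality: a dbt model (SQL) building a clean table from raw_events.
    warehouse["payments"] = [
        {"type": r["event"], "amount_usd": r["amount_cents"] / 100}
        for r in warehouse["raw_events"]
    ]

load(extract())
transform()
print(warehouse["payments"][0])  # {'type': 'charge', 'amount_usd': 12.5}
```

The key design point: because raw events are preserved in the warehouse, a buggy transformation can be fixed and re-run without re-extracting from the source.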
Common interview mistakes
Mistake 1: Describing analytics as data engineering. "I built dashboards in Tableau." That is analysis. Data engineering is what makes the data behind those dashboards reliable.
Mistake 2: Not thinking about reliability. What matters most about a pipeline is how it behaves when an upstream source is late, a schema changes, or a job fails at 3am.
Mistake 3: Treating SQL as the only skill. Production data engineering also requires distributed systems, orchestration, storage formats, and cost optimization.
Key vocabulary
- Pipeline — A sequence of data processing steps that moves data from a source to a destination, transforming it along the way.
- ETL / ELT — Extract-Transform-Load / Extract-Load-Transform, two patterns for moving data from source systems into a warehouse.
- Data warehouse — A centralized store of structured, queryable data organized for analytics.
- Orchestration — Managing the scheduling, dependencies, and failure handling of data pipelines.
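Orchestration, the last term above, can be illustrated with a minimal scheduler that runs tasks in dependency order and skips downstream work when an upstream task fails. This is a toy sketch of the idea only; Airflow and Prefect add scheduling, retries, and alerting on top. All task names and bodies are illustrative:

```python
# Toy orchestrator: run tasks in dependency order; skip any task whose
# upstream dependency failed or was skipped.

def run_dag(tasks, deps):
    """tasks: {name: callable}; deps: {name: [upstream names]}.
    Returns {name: 'success' | 'failed' | 'skipped'}."""
    status = {}
    remaining = dict(tasks)
    while remaining:
        for name in list(remaining):
            upstream = deps.get(name, [])
            if any(u in remaining for u in upstream):
                continue  # an upstream task has not finished yet
            if any(status[u] != "success" for u in upstream):
                status[name] = "skipped"
            else:
                try:
                    remaining[name]()
                    status[name] = "success"
                except Exception:
                    status[name] = "failed"
            del remaining[name]
    return status

status = run_dag(
    tasks={"ingest": lambda: None,
           "transform": lambda: 1 / 0,   # simulated failure
           "report": lambda: None},
    deps={"transform": ["ingest"], "report": ["transform"]},
)
print(status)  # {'ingest': 'success', 'transform': 'failed', 'report': 'skipped'}
```

Note the failure-handling choice: the report task is skipped rather than run against stale or missing data, which is exactly the behavior interviewers probe for in Mistake 2 above.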