Concept · ~8 min read

What Is A Data Engineer

A Data Engineer builds and maintains the pipelines, infrastructure, and systems that move, transform, and store data so that analysts, data scientists, and ML engineers can use it reliably.

Why this appears in interviews

Data engineering is one of the most frequently misunderstood roles in tech. Drawing a clear line between data engineering and the adjacent roles, before any technical question comes up, signals that you know what the job requires.

The mental model — the plumbing behind the house

Think of a city's water system. Residents turn on a tap and water flows — they never think about the pipes, pumps, and treatment plants. When it works, nobody notices. When a pipe bursts, everything stops. Data engineers build the plumbing. You are responsible for data as infrastructure, not data as insight.

How the role differs

Data Analyst: Consumes clean data to answer business questions. Uses SQL, Tableau. Does not build pipelines.

Data Scientist: Builds models using data prepared by engineers. Relies on data engineers for reliable data access.

ML Engineer: Deploys and operates models in production. Overlaps with data engineering on feature pipelines.

Data Engineer: Builds the systems that make all of the above possible. Owns the pipeline from raw source to clean, queryable tables. Cares about freshness, reliability, schema stability, and query performance.

What data engineers actually build

  • Writing transformation models that turn raw Stripe events into a clean payments table in Snowflake
  • Debugging an Airflow DAG that fails because an upstream API changed its response schema
  • Partitioning a Spark job that processes 500 GB of clickstream data per day
  • Setting up data quality checks that alert when a table has unexpected null rates
  • Designing a schema for a new data product that three teams will depend on
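
To make the data quality bullet concrete, here is a minimal sketch of a null-rate check in plain Python. The table contents, column names, and the 10% threshold are assumptions chosen for illustration; a real check would typically run as a query against the warehouse or through a dedicated data quality tool.

```python
from typing import Iterable, Mapping, Optional


def null_rate(rows: Iterable[Mapping[str, object]], column: str) -> float:
    """Return the fraction of rows where `column` is missing or None."""
    rows = list(rows)
    if not rows:
        return 0.0
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows)


def check_null_rate(
    rows: Iterable[Mapping[str, object]], column: str, max_rate: float
) -> Optional[str]:
    """Return an alert message if the null rate exceeds max_rate, else None."""
    rate = null_rate(rows, column)
    if rate > max_rate:
        return f"ALERT: {column} null rate {rate:.0%} exceeds threshold {max_rate:.0%}"
    return None


# Hypothetical sample of a payments table (invented for this sketch).
payments = [
    {"payment_id": 1, "customer_id": "c_1", "amount_cents": 1200},
    {"payment_id": 2, "customer_id": None, "amount_cents": 500},
    {"payment_id": 3, "customer_id": "c_3", "amount_cents": 900},
    {"payment_id": 4, "customer_id": None, "amount_cents": 300},
]

# Two of four customer_id values are null, so this prints an alert.
print(check_null_rate(payments, "customer_id", max_rate=0.10))
```

The check returns a message rather than raising, so the caller decides whether a breach should page someone, fail the job, or only log a warning.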

The modern data stack

Fivetran/Airbyte (ingestion) → Snowflake/BigQuery/Databricks (storage) → dbt (transformation) → Airflow/Prefect (orchestration) → Monte Carlo/Great Expectations (monitoring).

Common interview mistakes

Mistake 1: Describing analytics as data engineering. "I built dashboards in Tableau." That is analysis. Data engineering is what makes the data behind those dashboards reliable.

Mistake 2: Not thinking about reliability. The most important properties of a pipeline are how it behaves when an upstream source is late, a schema changes, or a job fails at 3am.
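
One way to show reliability thinking in an interview is to separate transient failures (a late or unreachable source) from permanent ones (a changed schema). The sketch below is illustrative only: `EXPECTED_FIELDS`, `SchemaChangedError`, `run_with_retries`, and `flaky_fetch` are hypothetical names, and the delays are kept tiny so the example runs quickly.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")

# Hypothetical fields this pipeline expects from the upstream source.
EXPECTED_FIELDS = {"id", "amount", "currency"}


class SchemaChangedError(Exception):
    """Raised when an upstream record no longer has the expected fields."""


def validate_schema(record: dict) -> dict:
    """Fail loudly on missing fields instead of loading bad data."""
    missing = EXPECTED_FIELDS - record.keys()
    if missing:
        raise SchemaChangedError(f"missing fields: {sorted(missing)}")
    return record


def run_with_retries(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.01,
    retry_on: Tuple[Type[Exception], ...] = (ConnectionError,),
) -> T:
    """Retry transient errors with exponential backoff; let others propagate."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except retry_on:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")


# Simulated upstream source that is unavailable for the first two calls.
attempts = {"count": 0}


def flaky_fetch() -> dict:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("upstream not ready")
    return validate_schema({"id": "evt_1", "amount": 1200, "currency": "usd"})


record = run_with_retries(flaky_fetch)
print(attempts["count"], record["id"])
```

A schema error is deliberately not in `retry_on`: retrying will not bring a missing field back, so the job should stop and alert rather than burn attempts.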

Mistake 3: Treating SQL as the only skill. Production data engineering also requires working knowledge of distributed systems, orchestration, storage formats, and cost optimization.

Key vocabulary

  • Pipeline — A sequence of data processing steps that moves data from a source to a destination, transforming it along the way.
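  • ETL / ELT — Extract-Transform-Load / Extract-Load-Transform. Two patterns for moving data into a warehouse: ETL transforms data before loading it, while ELT loads raw data first and transforms it inside the warehouse.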
  • Data warehouse — A centralized store of structured, queryable data organized for analytics.
  • Orchestration — Managing the scheduling, dependencies, and failure handling of data pipelines.
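
The orchestration entry can be sketched with Python's standard-library `graphlib` (Python 3.9+). This is a toy illustration, not how Airflow or Prefect are implemented: the task names and the `run_pipeline` helper are hypothetical, and the tasks load before they transform, following the Extract-Load-Transform pattern.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set


def run_pipeline(
    tasks: Dict[str, Callable[[], None]],
    dependencies: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Run tasks in dependency order; skip any task whose upstream did not succeed."""
    status: Dict[str, str] = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(name, set())
        if any(status[dep] != "success" for dep in upstream):
            status[name] = "skipped"
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status


def failing_transform() -> None:
    # Simulated failure, for example a broken SQL model.
    raise RuntimeError("transform failed")


# Hypothetical task names; each maps to the set of tasks it depends on.
dependencies = {
    "extract_stripe": set(),
    "load_raw_events": {"extract_stripe"},
    "transform_payments": {"load_raw_events"},
    "check_null_rates": {"transform_payments"},
}

tasks = {
    "extract_stripe": lambda: None,
    "load_raw_events": lambda: None,
    "transform_payments": failing_transform,
    "check_null_rates": lambda: None,
}

status = run_pipeline(tasks, dependencies)
print(status)
```

The transform fails, so the downstream quality check is marked skipped instead of running against stale or partial data. That covers the scheduling, dependency, and failure-handling parts of the definition in miniature.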