EasyData EngineeringSystem Design

Lakehouse Architecture for 50TB of JSON Events

A media company stores 50TB of raw user event data in S3 as JSON files. Their data science team wants to train recommendation models on the raw events. Their analyst team wants to run SQL queries on aggregations. Currently, queries against the raw JSON take hours and often fail.

(1) Diagnose what is wrong with the current architecture. Be specific about why JSON-on-S3 fails as a query target at this scale.

(2) Propose a lakehouse architecture serving both teams. What table format would you use (Delta Lake, Iceberg)? What query engine would you recommend? What file format and partitioning strategy?

(3) How would you handle nested JSON events that have evolved over time — fields added, types changed?

(4) Walk through the data flow for a single user event from app → analyst dashboard. What components transform it along the way?

Sign in to attempt this problem

Free account gives you full access to community problems with the complete solution reveal — golden answer, senior walkthrough, and score breakdown — after submission.

Start free →Already have an account? Sign in