A media company stores 50TB of raw user event data in S3 as JSON files. Their data science team wants to train recommendation models on the raw events. Their analyst team wants to run SQL queries on aggregations. Currently, queries against the raw JSON take hours and often fail.
(1) Diagnose what is wrong with the current architecture. Be specific about why JSON-on-S3 fails as a query target at this scale.
(2) Propose a lakehouse architecture serving both teams. What table format would you use (Delta Lake, Iceberg)? What query engine would you recommend? What file format and partitioning strategy?
(3) How would you handle nested JSON events that have evolved over time — fields added, types changed?
(4) Walk through the data flow for a single user event from app → analyst dashboard. What components transform it along the way?