You have a Spark job that joins a 5 TB orders table with a 2 TB customers table on customer_id. The job takes 4 hours. In the Spark UI, 195 of the 200 tasks complete in under 2 minutes, while the remaining 5 take over 3 hours. (1) Diagnose what is happening. (2) You discover that 40% of all orders come from your top 100 customers. Propose two specific solutions, describe the tradeoff between them, and write the PySpark code for one. (3) The Spark UI shows a shuffle write of 800 GB. What does this represent, and what would you try to reduce it?