You're deployed at a customer who keeps asking, "How do we know the agent is actually working?" Build a small, runnable evaluation harness to a loose spec: given a set of test cases (a question plus an expected answer or the key facts it must contain), run each through the agent, decide whether the response is acceptable (an LLM-as-judge is fine), and produce a summary report with an overall pass rate and the specific failures. Keep it simple and pragmatic, and state any assumptions where the spec is loose. AI tools are allowed. Afterward, you'll explain to a non-technical sponsor what "working" means and how to read the report.
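One possible shape for the harness described above, as a minimal sketch rather than a reference solution. Everything named here is an assumption made for illustration: `EvalCase`, `keyword_judge`, `run_eval`, `summarize`, and `fake_agent` are hypothetical, the agent is assumed to be a plain function from question to answer, and the judge is a case-insensitive key-fact substring check standing in for an LLM-as-judge call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    question: str
    key_facts: List[str]  # assumption: "acceptable" means every key fact appears


@dataclass
class Result:
    case: EvalCase
    response: str
    passed: bool
    missing: List[str]  # key facts the judge did not find


def keyword_judge(response: str, key_facts: List[str]) -> List[str]:
    """Return the key facts NOT found in the response (case-insensitive).

    Stand-in for an LLM-as-judge; swap in a model call with the same signature.
    """
    lowered = response.lower()
    return [fact for fact in key_facts if fact.lower() not in lowered]


def run_eval(agent: Callable[[str], str], cases: List[EvalCase]) -> List[Result]:
    """Run every case through the agent and judge each response."""
    results = []
    for case in cases:
        try:
            response = agent(case.question)
            missing = keyword_judge(response, case.key_facts)
        except Exception as exc:
            # Assumption: an agent crash counts as a failed case, not a harness crash.
            response = f"<agent error: {exc}>"
            missing = list(case.key_facts)
        results.append(Result(case, response, not missing, missing))
    return results


def summarize(results: List[Result]) -> str:
    """Build a plain-text report: overall pass rate, then one line per failure."""
    total = len(results)
    passed = sum(r.passed for r in results)
    rate = passed / total if total else 0.0
    lines = [f"Pass rate: {passed}/{total} ({rate:.0%})"]
    for r in results:
        if not r.passed:
            lines.append(f"FAIL: {r.case.question!r} missing {r.missing}")
    return "\n".join(lines)


def fake_agent(question: str) -> str:
    # Hypothetical stand-in for the real agent, with one canned answer.
    canned = {"What is the refund window?": "Refunds are accepted within 30 days."}
    return canned.get(question, "I don't know.")


if __name__ == "__main__":
    cases = [
        EvalCase("What is the refund window?", ["30 days"]),
        EvalCase("Which plan includes SSO?", ["Enterprise"]),
    ]
    print(summarize(run_eval(fake_agent, cases)))
```

Keeping the judge behind a single function makes the pass criterion explicit and easy to replace. For the sponsor, the report reads top down: the first line is the share of test questions answered acceptably, and each `FAIL` line names a question along with the facts the answer left out.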