Hard | AI Engineering | System Design

What is your batching and caching strategy to reduce LLM latency?

Design a batching and caching strategy for a high-traffic LLM application with p99 latency requirements of under 2 seconds. Cover: continuous batching vs static batching and when each applies, KV cache management and how context length affects it, semantic caching for near-duplicate queries (not just exact match), prefix caching for shared system prompts, how to implement request queuing without blowing your latency budget, and how these strategies interact (for example, how caching and batching can conflict). What does your architecture look like end-to-end?
