A Large Language Model is a neural network trained on massive amounts of text to predict the next token in a sequence — and that simple objective, applied at enormous scale, produces systems that can reason, write, and answer questions.
Why this appears in interviews
Every AI engineering problem eventually comes back to LLM mechanics. Interviewers want to know whether you understand what is actually happening under the hood — not because you need to implement a transformer, but because understanding tokens, context windows, and inference costs is essential for making good architectural decisions. A candidate who does not understand why context windowContext windowMaximum text an LLM can process at once, in tokens. Exceeding it causes earlier content to be forgotten.Learn more → size matters cannot design a sensible RAGRAGRetrieval-Augmented Generation — gives LLMs access to external knowledge by retrieving relevant documents before generating a response.Learn more → system.
The mental model
Imagine a very sophisticated autocomplete. You give it some text — "The capital of France is" — and it predicts what comes next: "Paris." Now scale that to billions of parameters trained on most of the internet, and the autocomplete becomes sophisticated enough to reason through complex problems, write code, and maintain coherent conversation.
The three things you must understand:
Tokens are not words. LLMs process tokens, which are roughly 3-4 characters each. "Hello world" is 2 tokens. "Supercalifragilistic" is 6 tokens. This matters because you are charged per token and context windows are measured in tokens, not words. Rule of thumb: multiply word count by 1.3 to estimate tokens.
The context windowContext windowMaximum text an LLM can process at once, in tokens. Exceeding it causes earlier content to be forgotten.Learn more → is everything the model can see at once. GPT-4 has a 128k token context windowContext windowMaximum text an LLM can process at once, in tokens. Exceeding it causes earlier content to be forgotten.Learn more →. Claude 3.5 Sonnet has a 200k token context windowContext windowMaximum text an LLM can process at once, in tokens. Exceeding it causes earlier content to be forgotten.Learn more →. Whatever is not in the context windowContext windowMaximum text an LLM can process at once, in tokens. Exceeding it causes earlier content to be forgotten.Learn more → does not exist to the model — it cannot access memories, prior conversations, or external information unless you put them in the context. This is why RAGRAGRetrieval-Augmented Generation — gives LLMs access to external knowledge by retrieving relevant documents before generating a response.Learn more → exists.
Inference is stateless. Every time you call an LLM API, it starts fresh. It has no memory of previous conversations unless you explicitly include those conversations in the current context.
How inference works step by step
- Your prompt (input tokens) is processed by the model in parallel.
- The model generates one token at a time — this is called autoregressive generation.
- Each new token is generated based on all previous tokens (your input + everything already generated).
- Generation stops when the model produces an end-of-sequence token or hits your
max_tokenslimit.
This explains why first-token latency is often fast (model starts quickly) while total latency scales with output length.
The KV cache
When generating a long response, the model needs to "remember" the context it has already seen. Rather than reprocessing the entire context from scratch for each new token, transformers use a Key-Value cache that stores intermediate computations. This dramatically speeds up generation but also uses significant GPU memory — which is why running long context windows requires expensive hardware.
Common interview mistakes
Mistake 1: Treating tokens as words. "This document is 10,000 words so it is 10,000 tokens." Wrong — it is likely 13,000-15,000 tokens. Always estimate with a 1.3x multiplier.
Mistake 2: Not understanding context windowContext windowMaximum text an LLM can process at once, in tokens. Exceeding it causes earlier content to be forgotten.Learn more → implications. "I'll just put the entire knowledge base in the context." For large knowledge bases this is either impossible or prohibitively expensive. This is why RAGRAGRetrieval-Augmented Generation — gives LLMs access to external knowledge by retrieving relevant documents before generating a response.Learn more → exists.
Mistake 3: Confusing training and inference. Training is when the model learns from data — expensive, done once. Inference is running the trained model — done on every request.
Key vocabulary
- Token — The basic unit of LLM processing, roughly 3-4 characters. Cost and context limits are measured in tokens.
- Context windowContext windowMaximum text an LLM can process at once, in tokens. Exceeding it causes earlier content to be forgotten.Learn more → — The maximum number of tokens an LLM can process in a single call (input + output combined).
- Inference — Running a trained model to generate output. Distinct from training.
- Autoregressive generation — Generating one token at a time, where each token depends on all previous tokens.
- Temperature — Controls randomness. Temperature 0 = deterministic. Temperature 1 = more creative and variable.