Concept · ~8 min read

What Is An LLM

A Large Language Model is a neural network trained on massive amounts of text to predict the next token in a sequence — and that simple objective, applied at enormous scale, produces systems that can reason, write, and answer questions.

Why this appears in interviews

Every AI engineering problem eventually comes back to LLM mechanics. Interviewers want to know whether you understand what is actually happening under the hood — not because you need to implement a transformer, but because understanding tokens, context windows, and inference costs is essential for making good architectural decisions. A candidate who does not understand why size matters cannot design a sensible system.

The mental model

Imagine a very sophisticated autocomplete. You give it some text — "The capital of France is" — and it predicts what comes next: "Paris." Now scale that to billions of parameters trained on most of the internet, and the autocomplete becomes sophisticated enough to reason through complex problems, write code, and maintain coherent conversation.

Comparing English words to tokens — words are not equivalent to tokens

The three things you must understand:

Tokens are not words. LLMs process tokens, which are roughly 3-4 characters each. "Hello world" is 2 tokens. "Supercalifragilistic" is 6 tokens. This matters because you are charged per token and context windows are measured in tokens, not words. Rule of thumb: multiply word count by 1.3 to estimate tokens.

The is everything the model can see at once. GPT-4 has a 128k token . Claude 3.5 Sonnet has a 200k token . Whatever is not in the does not exist to the model — it cannot access memories, prior conversations, or external information unless you put them in the context. This is why exists.

Inference is stateless. Every time you call an LLM API, it starts fresh. It has no memory of previous conversations unless you explicitly include those conversations in the current context.

How inference works step by step

  1. Your prompt (input tokens) is processed by the model in parallel.
  2. The model generates one token at a time — this is called autoregressive generation.
  3. Each new token is generated based on all previous tokens (your input + everything already generated).
  4. Generation stops when the model produces an end-of-sequence token or hits your max_tokens limit.

This explains why first-token latency is often fast (model starts quickly) while total latency scales with output length.

The KV cache

When generating a long response, the model needs to "remember" the context it has already seen. Rather than reprocessing the entire context from scratch for each new token, transformers use a Key-Value cache that stores intermediate computations. This dramatically speeds up generation but also uses significant GPU memory — which is why running long context windows requires expensive hardware.

Common interview mistakes

Mistake 1: Treating tokens as words. "This document is 10,000 words so it is 10,000 tokens." Wrong — it is likely 13,000-15,000 tokens. Always estimate with a 1.3x multiplier.

Mistake 2: Not understanding implications. "I'll just put the entire knowledge base in the context." For large knowledge bases this is either impossible or prohibitively expensive. This is why exists.

Mistake 3: Confusing training and inference. Training is when the model learns from data — expensive, done once. Inference is running the trained model — done on every request.

Key vocabulary

  • Token — The basic unit of LLM processing, roughly 3-4 characters. Cost and context limits are measured in tokens.
  • — The maximum number of tokens an LLM can process in a single call (input + output combined).
  • Inference — Running a trained model to generate output. Distinct from training.
  • Autoregressive generation — Generating one token at a time, where each token depends on all previous tokens.
  • Temperature — Controls randomness. Temperature 0 = deterministic. Temperature 1 = more creative and variable.
← Previous
Next · ProblemToken and Context Window Capacity Planning