Walk through the attention mechanism in transformers: what the Q, K, and V matrices represent, how scaled dot-product attention is computed, what multi-head attention achieves that single-head attention cannot, and why the formula softmax(QK^T / sqrt(d_k)) V works. Then explain positional encoding: why transformers need it, how fixed sinusoidal, learned absolute, and relative schemes (RoPE, ALiBi) differ, and how the choice of positional encoding affects long-context performance.
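
For reference, here is a minimal NumPy sketch of the scaled dot-product formula named in the question, for a single head with no masking or batching. The function names, the shapes, and the random inputs are illustrative assumptions rather than any particular library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Each row becomes a probability distribution over the keys.
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted average of the value vectors.
    return weights @ V, weights

# Hypothetical toy inputs: 3 queries, 5 keys/values, d_k = 4, d_v = 2.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```

Each row of `weights` sums to 1, so each output row is a convex combination of the rows of `V`. Nothing in this computation depends on the order of the keys, which is the property the positional-encoding half of the question is about.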