You are fine-tuning a pre-trained LLM for a domain-specific task. Explain what actually changes during fine-tuning vs pre-training: which layers to freeze and why, how learning rate scheduling differs (warmup, cosine decay, why you need lower LR than pre-training), which optimizer to use (AdamW vs Lion vs SGD), gradient accumulation for large batches, and how to detect and prevent catastrophic forgetting. Give a concrete example of a fine-tuning configuration for a 7B parameter model with limited GPU budget.
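As a starting point for the scheduling and gradient-accumulation parts of the question, here is a minimal sketch in plain Python of linear warmup followed by cosine decay, plus the effective batch size arithmetic. The function names (`lr_at_step`, `effective_batch_size`) and every numeric value are illustrative assumptions, not taken from the question and not tuned recommendations for any particular 7B model.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate for a 0-indexed optimizer step.

    Linear warmup from peak_lr / warmup_steps up to peak_lr,
    then cosine decay from peak_lr down to min_lr.
    """
    if step < warmup_steps:
        # Warmup: ramp up linearly so early updates stay small.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


def effective_batch_size(micro_batch, accumulation_steps, num_gpus=1):
    """Sequences contributing to one optimizer update."""
    return micro_batch * accumulation_steps * num_gpus


# Hypothetical values for illustration only (assumptions, not recommendations).
PEAK_LR = 2e-5        # assumed fine-tuning peak LR, lower than a typical pre-training LR
TOTAL_STEPS = 1000    # assumed number of optimizer steps
WARMUP_STEPS = 30     # assumed warmup length (3% of total)
MICRO_BATCH = 1       # assumed per-GPU batch that fits in memory
ACCUM_STEPS = 16      # assumed micro-batches accumulated per optimizer step

schedule = [lr_at_step(s, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR) for s in range(TOTAL_STEPS)]
batch = effective_batch_size(MICRO_BATCH, ACCUM_STEPS)
```

With gradient accumulation, the schedule is indexed by optimizer steps rather than micro-batches, so a step here corresponds to `ACCUM_STEPS` forward and backward passes. In an actual training loop the loss of each micro-batch is usually divided by the accumulation count so that the accumulated gradient matches the average over the larger batch.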