You are building a RAG pipeline for a corpus of mixed content: legal PDFs, code files, markdown documentation, and HTML web pages. Design your chunking strategy. Compare: fixed-size chunking (by tokens or characters), sentence/paragraph-based chunking, semantic chunking using embedding similarity, and structure-aware chunking (headings, sections, code blocks). For each content type in your corpus, which strategy works best and why? How does chunk size affect retrieval precision vs recall? What overlap strategy do you use and why?
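
To pin down the terms "chunk size" and "overlap" used in the question, here is a minimal sketch of fixed-size chunking with a sliding window. It assumes whitespace-split words stand in for tokens (a real pipeline would likely use the embedding model's own tokenizer), and the names `chunk_fixed`, `size`, and `overlap` are illustrative, not taken from any library.

```python
def chunk_fixed(tokens, size, overlap):
    """Split a token list into windows of `size` tokens.

    Consecutive windows share `overlap` tokens, so text that straddles a
    boundary appears whole in at least one chunk when it is short enough.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end of the input, so we do
        # not emit a trailing chunk made only of already-covered tokens.
        if start + size >= len(tokens):
            break
    return chunks


# Assumption: whitespace splitting as a stand-in for real tokenization.
tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_fixed(tokens, size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With `size=4` and `overlap=1`, each chunk repeats the last token of the previous one. Smaller `size` values tend to give tighter, more precise matches, while larger values keep more surrounding context in each retrieved chunk; that is the trade-off the question asks about.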