Semantic chunking is a document preprocessing technique that splits text into segments aligned with natural content boundaries — paragraphs, topics, or logical sections — rather than using fixed character or token counts. This produces chunks that preserve complete ideas and context, leading to better embedding quality and more relevant search results in RAG and retrieval systems.
To find these break points, semantic chunkers analyze the document's content and structure. Common approaches include embedding-based splitting (computing embeddings for sentences and splitting where similarity between neighboring sentences drops), topic boundary detection, and structural parsing (using headings, paragraphs, and formatting cues). Each resulting chunk then contains a coherent unit of information.
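To make structural parsing concrete, here is a minimal sketch that splits a Markdown document at its headings, keeping each heading together with the text that follows it. The function name `split_by_headings` and the regex are illustrative choices, not a reference to any particular library:

```python
import re

def split_by_headings(markdown: str) -> list[str]:
    """Structural chunking sketch: treat each Markdown heading as a boundary."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # An ATX heading (1-6 '#' characters) starts a new chunk,
        # unless it is the very first line of the document.
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks
```

For example, a document with an `# Intro` section and a `## Details` section yields two chunks, each carrying its heading as context for downstream embedding.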
Embedding-based semantic chunking computes sentence-level embeddings and measures cosine similarity between consecutive sentences; when similarity drops below a threshold, a chunk boundary is inserted. More advanced variants use sliding windows and set the threshold adaptively, for example at a percentile of the observed similarity distribution rather than a fixed value. For structured documents (HTML, Markdown, PDF), layout analysis identifies headings, sections, and visual boundaries as natural chunk points.
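The embedding-based approach can be sketched as follows. To keep the example self-contained, it uses a toy bag-of-words vector in place of a real sentence-embedding model (in practice you would substitute a model such as one from sentence-transformers); `embed`, `cosine`, and `semantic_chunks` are illustrative names, and the threshold value is arbitrary:

```python
import math
import re

def embed(sentence: str) -> dict[str, int]:
    # Toy bag-of-words "embedding" -- a stand-in for a real
    # sentence-embedding model, used only to make the sketch runnable.
    vec: dict[str, int] = {}
    for word in re.findall(r"[a-z']+", sentence.lower()):
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    # Insert a chunk boundary wherever similarity between
    # consecutive sentence embeddings falls below the threshold.
    chunks = [[sentences[0]]]
    prev = embed(sentences[0])
    for sent in sentences[1:]:
        cur = embed(sent)
        if cosine(prev, cur) < threshold:
            chunks.append([sent])
        else:
            chunks[-1].append(sent)
        prev = cur
    return [" ".join(c) for c in chunks]
```

With a real embedding model the same control flow applies; only `embed` and the similarity threshold change, and the percentile-based variant would derive the threshold from the distribution of consecutive-sentence similarities instead of hard-coding it.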