Semantic Chunking Strategies for Multimodal RAG
How semantic chunking improves RAG quality by splitting content at natural boundaries rather than fixed token counts. Covers text, documents, video, and audio.

The quality of a RAG system depends more on how you chunk your data than on which embedding model you use. Poor chunking — splitting mid-sentence, breaking apart related paragraphs, or creating chunks that are too large to be specific — produces embeddings that do not accurately represent their content, leading to irrelevant retrieval results.
Semantic chunking solves this by splitting content at natural boundaries: topic shifts, paragraph breaks, scene changes, and structural markers. This post covers chunking strategies for text, documents, video, and audio.
Why Fixed-Size Chunking Fails
The most common chunking approach splits text every N tokens (typically 256-512) with some overlap. This is simple to implement but creates several problems:
- Split sentences — A chunk boundary in the middle of a sentence produces two fragments, neither of which captures the complete idea
- Mixed topics — A 512-token chunk might contain the end of one topic and the beginning of another, creating an embedding that represents neither well
- Lost context — Important context like "the following table shows..." gets separated from the table it references
- Redundant overlap — Overlap-based approaches duplicate content, inflating index size and retrieval noise
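For comparison, the fixed-size baseline takes only a few lines. Here is a minimal sketch that uses whitespace-separated words as a stand-in for model tokens; the overlap value is an arbitrary illustrative choice:

```python
def fixed_size_chunk(text, chunk_size=512, overlap=64):
    """Split text every `chunk_size` tokens with `overlap` tokens of overlap."""
    tokens = text.split()  # stand-in for a real tokenizer
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
    return chunks
```

Every boundary here falls wherever the token count happens to land, which is exactly what produces the problems above.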
Embedding-Based Semantic Chunking
The most effective text chunking strategy uses embeddings to detect topic boundaries:
- Split the document into sentences
- Compute embeddings for each sentence
- Calculate cosine similarity between consecutive sentence embeddings
- When similarity drops below a threshold, insert a chunk boundary
- Merge very short chunks with their neighbors
The threshold is typically set as a percentile of all consecutive similarities. The 25th percentile works well for most content: the lowest 25% of consecutive similarities, which correspond to the sharpest topic shifts, become chunk boundaries.
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunk(text, percentile_threshold=25):
    # Naive sentence split; a production system would use a real sentence tokenizer
    sentences = [s.strip(" .") for s in text.split(". ") if s.strip(" .")]
    if len(sentences) < 2:
        return [text]

    # One embedding per sentence
    embeddings = model.encode(sentences)

    # Cosine similarity between consecutive sentence embeddings
    similarities = [
        np.dot(embeddings[i], embeddings[i + 1])
        / (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i + 1]))
        for i in range(len(embeddings) - 1)
    ]

    # Boundaries go wherever similarity falls below the chosen percentile
    threshold = np.percentile(similarities, percentile_threshold)

    # Split at low-similarity points
    chunks = []
    current_chunk = [sentences[0]]
    for i, sim in enumerate(similarities):
        if sim < threshold:
            chunks.append(". ".join(current_chunk) + ".")
            current_chunk = [sentences[i + 1]]
        else:
            current_chunk.append(sentences[i + 1])
    chunks.append(". ".join(current_chunk) + ".")
    # A production version would also merge very short chunks with their neighbors
    return chunks
```
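Usage is a single call; the file path below is just a placeholder:

```python
with open("docs/quarterly_report.txt") as f:
    chunks = semantic_chunk(f.read())

for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {len(chunk.split())} words")
```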
Document Layout Chunking
For structured documents (PDFs, HTML, Markdown), layout analysis provides even better chunking signals than embedding similarity:
- Headings — Each heading starts a new chunk. Nested headings create hierarchical chunks.
- Tables — Tables are kept as complete chunks, never split across boundaries.
- Lists — Bulleted/numbered lists are kept together with their introductory text.
- Code blocks — Code examples are preserved as atomic units.
- Page boundaries — In PDFs, page breaks are natural (though not always semantic) boundaries.
The best approach combines structural signals with semantic analysis: use document structure as primary split points, then apply embedding-based splitting within large sections.
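A minimal sketch of that hybrid approach for Markdown input might split on headings first and fall back to the `semantic_chunk` function from the previous section for oversized sections. The token budget is an assumption, and a full implementation would also keep tables, lists, and code blocks intact as described above:

```python
import re

def layout_chunk(markdown_text, max_tokens=800):
    """Split a Markdown document on headings, then semantically within large sections."""
    # Split at ATX headings (#, ##, ...) while keeping each heading with its section
    sections = re.split(r"\n(?=#{1,6} )", markdown_text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Rough whitespace token count as a stand-in for a real tokenizer
        if len(section.split()) <= max_tokens:
            chunks.append(section)
        else:
            # Large section: fall back to embedding-based splitting
            chunks.extend(semantic_chunk(section))
    return chunks
```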
Video Chunking: Scene-Based Segmentation
For video content, the equivalent of semantic chunking is scene-based segmentation. Instead of splitting every N seconds, detect visual and audio boundaries:
- Visual scene changes — Camera cuts, transitions, and significant visual shifts
- Speaker changes — When a different person starts speaking (via diarization)
- Topic shifts — Detected from the transcript using the same embedding-based approach as text
- Silence boundaries — Pauses in audio often correspond to topic transitions
Each scene becomes a retrieval unit with its own embedding, transcript chunk, and metadata. This enables frame-accurate search results rather than returning entire videos.
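As a sketch of the visual-scene-change signal only, the snippet below flags candidate boundaries by comparing color histograms of sampled frames with OpenCV. The sampling rate and difference threshold are assumptions; production systems typically rely on a dedicated scene-detection library plus the transcript- and audio-based signals above.

```python
import cv2

def detect_scene_boundaries(video_path, threshold=0.4, sample_every=10):
    """Return timestamps (seconds) where the frame histogram changes sharply."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    boundaries, prev_hist, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % sample_every == 0:
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
            cv2.normalize(hist, hist)
            if prev_hist is not None:
                # Correlation close to 1.0 means consecutive samples look alike
                similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
                if similarity < 1.0 - threshold:
                    boundaries.append(frame_idx / fps)
            prev_hist = hist
        frame_idx += 1
    cap.release()
    return boundaries
```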
Audio Chunking
Audio content (podcasts, calls, meetings) benefits from transcript-based semantic chunking combined with audio signals:
- Speaker turns — Split when the speaker changes
- Topic boundaries — Semantic chunking on the transcript
- Silence detection — Pauses longer than a threshold indicate segment boundaries
- Music/jingle detection — In podcasts, musical interludes separate segments
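Of these, silence detection is the simplest to sketch. The snippet below finds long pauses from raw audio samples using frame-level RMS energy; the frame length, silence threshold, and minimum pause duration are assumptions, and a real pipeline would combine this with diarization and transcript-based topic boundaries.

```python
import numpy as np

def silence_boundaries(samples, sample_rate, frame_ms=50, silence_db=-40, min_silence_s=0.7):
    """Return timestamps (seconds) of pauses long enough to act as segment boundaries.

    `samples` is a 1-D float array in [-1, 1], e.g. loaded with soundfile or librosa.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    boundaries, silent_run = [], 0
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-10
        db = 20 * np.log10(rms)
        if db < silence_db:
            silent_run += 1
        else:
            # A long enough run of quiet frames marks a boundary at its midpoint
            if silent_run * frame_ms / 1000 >= min_silence_s:
                boundaries.append((i - silent_run / 2) * frame_len / sample_rate)
            silent_run = 0
    return boundaries
```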
Chunk Size Guidelines
| Content Type | Target Chunk Size | Rationale |
|---|---|---|
| General text | 200-500 tokens | Balances specificity and context |
| Technical docs | 300-800 tokens | Preserves code examples and explanations |
| FAQs | 1 Q&A pair per chunk | Each pair is a complete retrieval unit |
| Video scenes | 10-60 seconds | Long enough for context, short enough for relevance |
| Audio segments | 30-120 seconds | Aligns with natural speech patterns |
Measuring Chunking Quality
Evaluate your chunking strategy by measuring downstream retrieval quality:
- Chunk coherence — Do chunks contain complete, self-contained ideas?
- Retrieval recall — When you search for a known answer, does the relevant chunk appear in the top results?
- Answer quality — Does the LLM produce better answers when using semantically chunked context vs. fixed-size chunks?
In our testing, semantic chunking typically improves retrieval recall by 15-25% compared to fixed-size chunking, with the largest gains on long documents with multiple topics.
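Retrieval recall is the easiest of these to automate. A minimal recall@k check, assuming a small labeled set of (query, relevant chunk ID) pairs and a `search(query, k)` function over your index (both placeholders here), might look like this:

```python
def recall_at_k(labeled_queries, search, k=5):
    """Fraction of queries whose known-relevant chunk appears in the top-k results.

    `labeled_queries` is a list of (query, relevant_chunk_id) pairs and
    `search(query, k)` returns a ranked list of chunk IDs from your index.
    """
    hits = 0
    for query, relevant_id in labeled_queries:
        if relevant_id in search(query, k):
            hits += 1
    return hits / len(labeled_queries)
```

Running the same labeled queries against a fixed-size index and a semantically chunked index gives a direct, apples-to-apples way to compare the two strategies on your own data.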
Learn more about chunking in our glossary entry on semantic chunking, or explore document understanding for more on layout-aware processing.
