Semantic Chunking Strategies for Multimodal RAG
How semantic chunking improves RAG quality by splitting content at natural boundaries rather than fixed token counts. Covers text, documents, video, and audio.

The quality of a RAG system depends more on how you chunk your data than on which embedding model you use. Poor chunking — splitting mid-sentence, breaking apart related paragraphs, or creating chunks that are too large to be specific — produces embeddings that do not accurately represent their content, leading to irrelevant retrieval results.
Semantic chunking solves this by splitting content at natural boundaries: topic shifts, paragraph breaks, scene changes, and structural markers. This post covers chunking strategies for text, documents, video, and audio.
Why Fixed-Size Chunking Fails
The most common chunking approach splits text every N tokens (typically 256-512) with some overlap. This is simple to implement but creates several problems:
- Split sentences — A chunk boundary in the middle of a sentence produces two fragments, neither of which captures the complete idea
- Mixed topics — A 512-token chunk might contain the end of one topic and the beginning of another, creating an embedding that represents neither well
- Lost context — Important context like "the following table shows..." gets separated from the table it references
- Redundant overlap — Overlap-based approaches duplicate content, inflating index size and retrieval noise
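For comparison, the fixed-size baseline takes only a few lines. Here is a minimal sketch that uses whitespace-separated words as a stand-in for model tokens; the overlap value is an arbitrary illustrative choice:

```python
def fixed_size_chunk(text, chunk_size=512, overlap=64):
    """Split text every `chunk_size` tokens with `overlap` tokens of overlap."""
    tokens = text.split()  # stand-in for a real tokenizer
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
    return chunks
```

Every boundary here falls wherever the token count happens to land, which is exactly what produces the problems above.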
Embedding-Based Semantic Chunking
The most effective text chunking strategy uses embeddings to detect topic boundaries:
- Split the document into sentences
- Compute embeddings for each sentence
- Calculate cosine similarity between consecutive sentence embeddings
- When similarity drops below a threshold, insert a chunk boundary
- Merge very short chunks with their neighbors
The threshold is typically set as a percentile of all consecutive similarities. The 25th percentile works well for most content: the lowest 25% of consecutive similarities, which correspond to the sharpest topic shifts, become chunk boundaries.
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunk(text, percentile_threshold=25):
    # Naive sentence split; a production system would use a real sentence tokenizer
    sentences = [s.strip(" .") for s in text.split(". ") if s.strip(" .")]
    if len(sentences) < 2:
        return [text]

    # One embedding per sentence
    embeddings = model.encode(sentences)

    # Cosine similarity between consecutive sentence embeddings
    similarities = [
        np.dot(embeddings[i], embeddings[i + 1])
        / (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i + 1]))
        for i in range(len(embeddings) - 1)
    ]

    # Boundaries go wherever similarity falls below the chosen percentile
    threshold = np.percentile(similarities, percentile_threshold)

    # Split at low-similarity points
    chunks = []
    current_chunk = [sentences[0]]
    for i, sim in enumerate(similarities):
        if sim < threshold:
            chunks.append(". ".join(current_chunk) + ".")
            current_chunk = [sentences[i + 1]]
        else:
            current_chunk.append(sentences[i + 1])
    chunks.append(". ".join(current_chunk) + ".")
    # A production version would also merge very short chunks with their neighbors
    return chunks
```
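Usage is a single call; the file path below is just a placeholder:

```python
with open("docs/quarterly_report.txt") as f:
    chunks = semantic_chunk(f.read())

for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {len(chunk.split())} words")
```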
Document Layout Chunking
For structured documents (PDFs, HTML, Markdown), layout analysis provides even better chunking signals than embedding similarity:
- Headings — Each heading starts a new chunk. Nested headings create hierarchical chunks.
- Tables — Tables are kept as complete chunks, never split across boundaries.
- Lists — Bulleted/numbered lists are kept together with their introductory text.
- Code blocks — Code examples are preserved as atomic units.
- Page boundaries — In PDFs, page breaks are natural (though not always semantic) boundaries.
The best approach combines structural signals with semantic analysis: use document structure as primary split points, then apply embedding-based splitting within large sections.
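A minimal sketch of that hybrid approach for Markdown input might split on headings first and fall back to the `semantic_chunk` function from the previous section for oversized sections. The token budget is an assumption, and a full implementation would also keep tables, lists, and code blocks intact as described above:

```python
import re

def layout_chunk(markdown_text, max_tokens=800):
    """Split a Markdown document on headings, then semantically within large sections."""
    # Split at ATX headings (#, ##, ...) while keeping each heading with its section
    sections = re.split(r"\n(?=#{1,6} )", markdown_text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Rough whitespace token count as a stand-in for a real tokenizer
        if len(section.split()) <= max_tokens:
            chunks.append(section)
        else:
            # Large section: fall back to embedding-based splitting
            chunks.extend(semantic_chunk(section))
    return chunks
```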
Video Chunking: Scene-Based Segmentation
For video content, the equivalent of semantic chunking is scene-based segmentation. Instead of splitting every N seconds, detect visual and audio boundaries:
- Visual scene changes — Camera cuts, transitions, and significant visual shifts
- Speaker changes — When a different person starts speaking (via diarization)
- Topic shifts — Detected from the transcript using the same embedding-based approach as text
- Silence boundaries — Pauses in audio often correspond to topic transitions
Each scene becomes a retrieval unit with its own embedding, transcript chunk, and metadata. This enables frame-accurate search results rather than returning entire videos.
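As a sketch of the visual-scene-change signal only, the snippet below flags candidate boundaries by comparing color histograms of sampled frames with OpenCV. The sampling rate and difference threshold are assumptions; production systems typically rely on a dedicated scene-detection library plus the transcript- and audio-based signals above.

```python
import cv2

def detect_scene_boundaries(video_path, threshold=0.4, sample_every=10):
    """Return timestamps (seconds) where the frame histogram changes sharply."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    boundaries, prev_hist, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % sample_every == 0:
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
            cv2.normalize(hist, hist)
            if prev_hist is not None:
                # Correlation close to 1.0 means consecutive samples look alike
                similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
                if similarity < 1.0 - threshold:
                    boundaries.append(frame_idx / fps)
            prev_hist = hist
        frame_idx += 1
    cap.release()
    return boundaries
```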
Audio Chunking
Audio content (podcasts, calls, meetings) benefits from transcript-based semantic chunking combined with audio signals:
- Speaker turns — Split when the speaker changes
- Topic boundaries — Semantic chunking on the transcript
- Silence detection — Pauses longer than a threshold indicate segment boundaries
- Music/jingle detection — In podcasts, musical interludes separate segments
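Of these, silence detection is the simplest to sketch. The snippet below finds long pauses from raw audio samples using frame-level RMS energy; the frame length, silence threshold, and minimum pause duration are assumptions, and a real pipeline would combine this with diarization and transcript-based topic boundaries.

```python
import numpy as np

def silence_boundaries(samples, sample_rate, frame_ms=50, silence_db=-40, min_silence_s=0.7):
    """Return timestamps (seconds) of pauses long enough to act as segment boundaries.

    `samples` is a 1-D float array in [-1, 1], e.g. loaded with soundfile or librosa.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    boundaries, silent_run = [], 0
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-10
        db = 20 * np.log10(rms)
        if db < silence_db:
            silent_run += 1
        else:
            # A long enough run of quiet frames marks a boundary at its midpoint
            if silent_run * frame_ms / 1000 >= min_silence_s:
                boundaries.append((i - silent_run / 2) * frame_len / sample_rate)
            silent_run = 0
    return boundaries
```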
Chunk Size Guidelines
| Content Type | Target Chunk Size | Rationale |
|---|---|---|
| General text | 200-500 tokens | Balances specificity and context |
| Technical docs | 300-800 tokens | Preserves code examples and explanations |
| FAQs | 1 Q&A pair per chunk | Each pair is a complete retrieval unit |
| Video scenes | 10-60 seconds | Long enough for context, short enough for relevance |
| Audio segments | 30-120 seconds | Aligns with natural speech patterns |
Measuring Chunking Quality
Evaluate your chunking strategy by measuring downstream retrieval quality:
- Chunk coherence — Do chunks contain complete, self-contained ideas?
- Retrieval recall — When you search for a known answer, does the relevant chunk appear in the top results?
- Answer quality — Does the LLM produce better answers when using semantically chunked context vs. fixed-size chunks?
In our testing, semantic chunking typically improves retrieval recall by 15-25% compared to fixed-size chunking, with the largest gains on long documents with multiple topics.
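Retrieval recall is the easiest of these to automate. A minimal recall@k check, assuming a small labeled set of (query, relevant chunk ID) pairs and a `search(query, k)` function over your index (both placeholders here), might look like this:

```python
def recall_at_k(labeled_queries, search, k=5):
    """Fraction of queries whose known-relevant chunk appears in the top-k results.

    `labeled_queries` is a list of (query, relevant_chunk_id) pairs and
    `search(query, k)` returns a ranked list of chunk IDs from your index.
    """
    hits = 0
    for query, relevant_id in labeled_queries:
        if relevant_id in search(query, k):
            hits += 1
    return hits / len(labeled_queries)
```

Running the same labeled queries against a fixed-size index and a semantically chunked index gives a direct, apples-to-apples way to compare the two strategies on your own data.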
Learn more about chunking in our glossary entry on semantic chunking, or explore document understanding for more on layout-aware processing.
