
    Semantic Chunking Strategies for Multimodal RAG

    How semantic chunking improves RAG quality by splitting content at natural boundaries rather than fixed token counts. Covers text, documents, video, and audio.


    The quality of a RAG system depends more on how you chunk your data than on which embedding model you use. Poor chunking — splitting mid-sentence, breaking apart related paragraphs, or creating chunks that are too large to be specific — produces embeddings that do not accurately represent their content, leading to irrelevant retrieval results.

    Semantic chunking solves this by splitting content at natural boundaries: topic shifts, paragraph breaks, scene changes, and structural markers. This post covers chunking strategies for text, documents, video, and audio.

    Why Fixed-Size Chunking Fails

    The most common chunking approach splits text every N tokens (typically 256-512) with some overlap. This is simple to implement but creates several problems:

    • Split sentences — A chunk boundary in the middle of a sentence produces two fragments, neither of which captures the complete idea
    • Mixed topics — A 512-token chunk might contain the end of one topic and the beginning of another, creating an embedding that represents neither well
    • Lost context — Important context like "the following table shows..." gets separated from the table it references
    • Redundant overlap — Overlap-based approaches duplicate content, inflating index size and retrieval noise
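To make the first failure mode concrete, here is a minimal fixed-size chunker (whitespace tokens stand in for real tokenizer tokens; the sizes are illustrative):

```python
def fixed_size_chunk(text, chunk_size=8, overlap=2):
    """Split every `chunk_size` whitespace tokens, repeating
    `overlap` tokens between neighboring chunks."""
    tokens = text.split()
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

chunks = fixed_size_chunk("The table below shows latency results. Each row is one region.")
# The first chunk ends mid-sentence ("... results. Each row"), so neither
# fragment embeds the complete second sentence.
```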

    Embedding-Based Semantic Chunking

    The most effective text chunking strategy uses embeddings to detect topic boundaries:

    1. Split the document into sentences
    2. Compute embeddings for each sentence
    3. Calculate cosine similarity between consecutive sentence embeddings
    4. When similarity drops below a threshold, insert a chunk boundary
    5. Merge very short chunks with their neighbors

    The threshold is typically set as a percentile of all consecutive similarities — the 25th percentile works well for most content, meaning boundaries are placed wherever similarity falls in the lowest 25% of consecutive sentence pairs (the sharpest topic shifts).

    import re

    import numpy as np
    from sentence_transformers import SentenceTransformer
    
    model = SentenceTransformer("all-MiniLM-L6-v2")
    
    def semantic_chunk(text, percentile_threshold=25):
        # Split into sentences (naive regex; use a real sentence
        # tokenizer such as nltk or spacy for production text)
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        if len(sentences) < 2:
            return sentences
    
        # Compute an embedding per sentence
        embeddings = model.encode(sentences)
    
        # Cosine similarity between consecutive sentence embeddings
        similarities = [
            np.dot(embeddings[i], embeddings[i + 1]) /
            (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i + 1]))
            for i in range(len(embeddings) - 1)
        ]
    
        # Boundaries fall where similarity is in the lowest N percent
        threshold = np.percentile(similarities, percentile_threshold)
    
        # Split at low-similarity points
        chunks = []
        current_chunk = [sentences[0]]
    
        for i, sim in enumerate(similarities):
            if sim < threshold:
                chunks.append(" ".join(current_chunk))
                current_chunk = [sentences[i + 1]]
            else:
                current_chunk.append(sentences[i + 1])
    
        chunks.append(" ".join(current_chunk))
        return chunks
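The merge step (step 5 above) can be added as a small post-processing pass; a minimal sketch, with an illustrative token threshold:

```python
def merge_short_chunks(chunks, min_tokens=20):
    """Fold chunks shorter than min_tokens into the previous chunk."""
    merged = []
    for chunk in chunks:
        if merged and len(chunk.split()) < min_tokens:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged
```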
    

    Document Layout Chunking

    For structured documents (PDFs, HTML, Markdown), layout analysis provides even better chunking signals than embedding similarity:

    • Headings — Each heading starts a new chunk. Nested headings create hierarchical chunks.
    • Tables — Tables are kept as complete chunks, never split across boundaries.
    • Lists — Bulleted/numbered lists are kept together with their introductory text.
    • Code blocks — Code examples are preserved as atomic units.
    • Page boundaries — In PDFs, page breaks are natural (though not always semantic) boundaries.

    The best approach combines structural signals with semantic analysis: use document structure as primary split points, then apply embedding-based splitting within large sections.
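As a sketch of the structural pass for Markdown (regex-based and deliberately simple — a production pipeline would use a proper parser), sections split at headings while fenced code blocks stay atomic:

```python
import re

def split_markdown_sections(markdown):
    """Split a Markdown document into sections at headings,
    never splitting inside a fenced code block."""
    sections, current = [], []
    in_code_block = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_code_block = not in_code_block
        # A heading outside a code fence starts a new section
        if not in_code_block and re.match(r"#{1,6}\s", line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections
```

Each resulting section can then be passed through embedding-based splitting if it exceeds the target chunk size.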

    Video Chunking: Scene-Based Segmentation

    For video content, the equivalent of semantic chunking is scene-based segmentation. Instead of splitting every N seconds, detect visual and audio boundaries:

    • Visual scene changes — Camera cuts, transitions, and significant visual shifts
    • Speaker changes — When a different person starts speaking (via diarization)
    • Topic shifts — Detected from the transcript using the same embedding-based approach as text
    • Silence boundaries — Pauses in audio often correspond to topic transitions

    Each scene becomes a retrieval unit with its own embedding, transcript chunk, and metadata. This enables frame-accurate search results rather than returning entire videos.
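Visual scene-change detection can be approximated by thresholding per-frame differences; a toy version over pre-decoded frames (in practice you would reach for a dedicated tool such as PySceneDetect or ffmpeg's scene filter, and the threshold here is illustrative):

```python
import numpy as np

def detect_scene_boundaries(frames, threshold=30.0):
    """Return indices where a new scene starts, based on mean
    absolute pixel difference with the previous frame."""
    boundaries = []
    for i in range(1, len(frames)):
        diff = np.mean(np.abs(frames[i].astype(float) - frames[i - 1].astype(float)))
        if diff > threshold:
            boundaries.append(i)
    return boundaries
```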

    Audio Chunking

    Audio content (podcasts, calls, meetings) benefits from transcript-based semantic chunking combined with audio signals:

    • Speaker turns — Split when the speaker changes
    • Topic boundaries — Semantic chunking on the transcript
    • Silence detection — Pauses longer than a threshold indicate segment boundaries
    • Music/jingle detection — In podcasts, musical interludes separate segments
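Silence detection reduces to windowed energy thresholding; a sketch over raw samples (the frame size, RMS threshold, and minimum pause are illustrative, not tuned values):

```python
import numpy as np

def silence_boundaries(samples, sample_rate, frame_ms=50,
                       rms_threshold=0.01, min_pause_ms=300):
    """Return sample offsets where speech resumes after a pause
    longer than min_pause_ms -- candidate segment boundaries."""
    frame_len = int(sample_rate * frame_ms / 1000)
    min_silent_frames = max(1, min_pause_ms // frame_ms)
    boundaries, silent_run = [], 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        if rms < rms_threshold:
            silent_run += 1
        else:
            if silent_run >= min_silent_frames:
                boundaries.append(start)  # speech resumes here
            silent_run = 0
    return boundaries
```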

    Chunk Size Guidelines

    Content Type    | Target Chunk Size    | Rationale
    General text    | 200-500 tokens       | Balances specificity and context
    Technical docs  | 300-800 tokens       | Preserves code examples and explanations
    FAQs            | 1 Q&A pair per chunk | Each pair is a complete retrieval unit
    Video scenes    | 10-60 seconds        | Long enough for context, short enough for relevance
    Audio segments  | 30-120 seconds       | Aligns with natural speech patterns
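Such targets can be enforced as a post-processing step; a sketch that greedily packs chunks toward a token budget (whitespace tokens approximate real tokenizer counts):

```python
def pack_chunks(chunks, min_tokens=200, max_tokens=500):
    """Greedily merge adjacent chunks until each packed chunk
    reaches min_tokens, flushing before max_tokens is exceeded."""
    packed, current, current_len = [], [], 0
    for chunk in chunks:
        n = len(chunk.split())
        if current and current_len + n > max_tokens:
            packed.append(" ".join(current))
            current, current_len = [], 0
        current.append(chunk)
        current_len += n
        if current_len >= min_tokens:
            packed.append(" ".join(current))
            current, current_len = [], 0
    if current:
        packed.append(" ".join(current))
    return packed
```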

    Measuring Chunking Quality

    Evaluate your chunking strategy by measuring downstream retrieval quality:

    • Chunk coherence — Do chunks contain complete, self-contained ideas?
    • Retrieval recall — When you search for a known answer, does the relevant chunk appear in the top results?
    • Answer quality — Does the LLM produce better answers when using semantically chunked context vs. fixed-size chunks?
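Retrieval recall is straightforward to compute once you have a labeled set of (query, relevant chunk) pairs; a minimal recall@k helper (names are illustrative):

```python
def recall_at_k(results_per_query, relevant_per_query, k=5):
    """Fraction of queries whose relevant chunk id appears in the
    top-k ranked results.

    results_per_query: one ranked list of chunk ids per query
    relevant_per_query: the known-relevant chunk id per query
    """
    hits = sum(
        1 for ranked, relevant in zip(results_per_query, relevant_per_query)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_per_query)
```

Running the same labeled queries against indexes built with fixed-size and semantic chunking gives a direct apples-to-apples comparison.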

    In our testing, semantic chunking typically improves retrieval recall by 15-25% compared to fixed-size chunking, with the largest gains on long documents with multiple topics.

    Learn more about chunking in our glossary entry on semantic chunking, or explore document understanding for more on layout-aware processing.