    Beyond Text Splitting

    Chunking Strategies for Multimodal RAG

    Chunking is the single biggest lever for retrieval quality. But most guides only cover text. Learn how to chunk video into scenes, images into regions, audio into speaker segments, and documents into layout-aware sections.

    Why Chunking Matters

    The quality of your RAG pipeline is determined by the quality of your chunks. Bad chunking means bad retrieval — no amount of reranking or prompt engineering can fix it.

40%+ retrieval accuracy improvement

    Switching from naive fixed-size chunking to semantic or layout-aware strategies typically improves retrieval precision by 40% or more on complex documents.

80% of enterprise data is non-text

    Video, images, audio, and complex documents make up the majority of enterprise data — yet most chunking guides only cover plain text splitting.

1 query across all chunk types

    With multimodal chunking, a single retrieval query can surface relevant video scenes, image regions, document sections, and audio segments together.

    Text Chunking Strategies

    The foundation. Every RAG pipeline needs a text chunking strategy — here are the four main approaches and when to use each.

1. Fixed-Size Chunking

    Split text into chunks of a fixed token or character count with optional overlap. Simple, predictable, and works well for homogeneous documents.

    Pros
    • Easy to implement
    • Predictable chunk sizes
    • Fast processing
    Cons
    • Splits mid-sentence or mid-paragraph
    • No semantic awareness
    • Poor for structured documents
    Best for: Uniform text documents, logs, transcripts
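A minimal sketch of fixed-size chunking with overlap (character-based here for simplicity; a token-based version works the same way with a tokenizer):

```python
def fixed_size_chunks(text, size=512, overlap=50):
    """Split text into chunks of `size` characters; consecutive chunks share
    `overlap` characters so content cut at a boundary appears in both."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = fixed_size_chunks("x" * 1000, size=400, overlap=100)
# Chunks start at 0, 300, 600, 900: four chunks, the last only 100 chars.
```

Note the trade-off the sketch makes visible: the stride is `size - overlap`, so higher overlap means more chunks (and more embedding cost) for the same text.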
2. Semantic Chunking

    Use embedding similarity to detect natural topic boundaries. Adjacent sentences are grouped until the semantic similarity drops below a threshold.

    Pros
    • Preserves meaning within chunks
    • Adapts to content structure
    • Better retrieval accuracy
    Cons
    • Requires embedding model
    • Slower than fixed-size
    • Variable chunk sizes
    Best for: Research papers, articles, knowledge bases
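The grouping logic can be sketched in a few lines. This assumes sentence embeddings are already computed (the toy 2-D vectors below stand in for real model output); the function and threshold value are illustrative:

```python
import math

def semantic_chunks(sentences, embeddings, threshold=0.7):
    """Group consecutive sentences into one chunk; start a new chunk when the
    cosine similarity between adjacent sentence embeddings drops below threshold."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cos(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

sents = ["Dogs bark.", "Puppies play.", "Rates rose."]
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy vectors, not real embeddings
print(semantic_chunks(sents, embs, threshold=0.5))
# -> ['Dogs bark. Puppies play.', 'Rates rose.']
```

The threshold is the main tuning knob: lower it and chunks grow; raise it and nearly every sentence becomes its own chunk.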
3. Recursive / Hierarchical Chunking

    Split by document structure first (sections, paragraphs, sentences), then fall back to fixed-size splits if chunks are still too large. Preserves document hierarchy.

    Pros
    • Respects document structure
    • Handles nested content
    • Good balance of size and meaning
    Cons
    • Requires parsing structure
    • More complex implementation
    • Depends on consistent formatting
    Best for: Technical docs, legal contracts, manuals
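The recursion is simple to sketch: try the coarsest separator first, recurse on oversized pieces, and hard-cut only as a last resort. The separator list here is an assumption; real documents may need format-specific separators (e.g. Markdown headings):

```python
def recursive_split(text, separators=("\n\n", "\n", ". "), max_len=500):
    """Split by the coarsest separator first; recurse on pieces that are still
    too large, and fall back to a fixed-size cut when no separators remain."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if part:  # skip empty fragments left by trailing separators
            pieces.extend(recursive_split(part, rest, max_len))
    return pieces

doc = ("Sentence about topic one. " * 20) + "\n\n" + "Short closing paragraph."
out = recursive_split(doc, max_len=120)
```

Short paragraphs pass through whole; only the oversized paragraph gets split further, which is the structural behavior described above.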
4. Layout-Aware Chunking

    Parse document layout (tables, headers, figures, columns) and chunk by visual/structural regions. Essential for PDFs and scanned documents where text flow isn't linear.

    Pros
    • Handles tables and figures correctly
    • Works with scanned documents
    • Preserves spatial relationships
    Cons
    • Requires OCR and layout detection
    • Computationally expensive
    • Complex for multi-column layouts
    Best for: PDFs, invoices, forms, scanned documents
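Once a layout parser has produced typed blocks, the chunking step reduces to grouping. A sketch of that step, assuming block dicts with `type`, `text`, and `page` fields (the field names and parser output are hypothetical):

```python
# Blocks as a layout parser (OCR + detection) might emit them; fields assumed.
blocks = [
    {"type": "heading", "text": "Q3 Results", "page": 1},
    {"type": "paragraph", "text": "Revenue grew 12% year over year.", "page": 1},
    {"type": "table", "text": "region,revenue\nEMEA,4.1M", "page": 1},
    {"type": "paragraph", "text": "Outlook remains positive.", "page": 1},
]

def layout_chunks(blocks):
    """Merge contiguous text blocks into one chunk; emit tables and figures
    as standalone chunks so they are never split mid-structure."""
    chunks, buf = [], []
    for b in blocks:
        if b["type"] in ("table", "figure"):
            if buf:
                chunks.append({"kind": "text", "text": "\n".join(buf)})
                buf = []
            chunks.append({"kind": b["type"], "text": b["text"]})
        else:
            buf.append(b["text"])
    if buf:
        chunks.append({"kind": "text", "text": "\n".join(buf)})
    return chunks

print([c["kind"] for c in layout_chunks(blocks)])
# -> ['text', 'table', 'text']
```

The hard part in practice is the parsing itself (OCR, column detection); the grouping above only works once block boundaries are reliable.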

    Multimodal Chunking

    Text chunking is table stakes. Real-world data includes video, images, audio, and complex documents — each requiring modality-specific chunking strategies.

    Video

    Scene & Segment Chunking

    Videos aren't text — you can't split them by character count. Mixpeek decomposes video into semantically meaningful segments using scene detection, shot boundary analysis, and temporal embedding similarity.

    Scene Detection

    Detect visual scene changes using frame-level embeddings. Each scene becomes a chunk with its own embedding, transcript, and metadata.

    Fixed Interval

    Split video into uniform time windows (e.g., every 30 seconds). Simple but effective for surveillance footage, lectures, and live streams.

    Speaker Turns

    Chunk by speaker diarization boundaries. Each speaker turn becomes a segment — ideal for meetings, podcasts, and interviews.

    Action Segments

    Use activity recognition to detect action boundaries. Each distinct action or event becomes its own retrievable chunk.
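The scene-detection idea above can be sketched with frame embeddings: compare consecutive frames and cut wherever similarity drops. This is an illustrative outline, not Mixpeek's implementation, and the toy 2-D vectors stand in for real frame embeddings:

```python
import math

def scene_boundaries(frame_embeddings, fps=1.0, threshold=0.8):
    """Cut a scene wherever cosine similarity between consecutive frame
    embeddings drops below threshold; return (start_s, end_s) spans."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    cuts = [0]
    for i in range(1, len(frame_embeddings)):
        if cos(frame_embeddings[i - 1], frame_embeddings[i]) < threshold:
            cuts.append(i)
    cuts.append(len(frame_embeddings))
    return [(s / fps, e / fps) for s, e in zip(cuts, cuts[1:])]

# Three visually similar frames, then an abrupt change of scene.
frames = [[1, 0], [0.99, 0.1], [0.98, 0.15], [0, 1], [0.1, 0.95]]
print(scene_boundaries(frames, fps=1.0, threshold=0.8))
# -> [(0.0, 3.0), (3.0, 5.0)]
```

Each returned span maps back to a time range in the video, which is what makes a scene individually retrievable.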

    Images

    Region & Object Chunking

    A single image can contain multiple retrievable concepts. Mixpeek extracts regions, objects, text zones, and faces — each as its own searchable chunk with coordinates and embeddings.

    Object Detection

    Detect and crop individual objects. Each object gets its own embedding and bounding box — search for 'fire extinguisher' and find the exact region.

    Text Regions (OCR)

    Extract text zones from images — signs, labels, documents within photos. Each text region becomes a searchable chunk.

    Tile Grid

    Split large images (satellite, microscopy, art) into overlapping tiles. Each tile gets embedded independently for fine-grained spatial search.

    Face Crops

    Detect and extract face regions for identity search, content moderation, or demographic analysis. Each face is a separate chunk.
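The tile-grid strategy is pure coordinate arithmetic. A minimal sketch, with tile size and overlap values chosen for illustration:

```python
def tile_grid(width, height, tile=512, overlap=64):
    """Return (x, y, w, h) boxes covering an image with overlapping tiles.
    Overlap keeps objects that straddle a tile edge intact in at least one tile."""
    step = tile - overlap
    boxes = []
    for y in range(0, max(height - overlap, 1), step):
        for x in range(0, max(width - overlap, 1), step):
            boxes.append((x, y, min(tile, width - x), min(tile, height - y)))
    return boxes

boxes = tile_grid(1024, 768, tile=512, overlap=64)
# 3 columns x 2 rows = 6 tiles; edge tiles are clipped to the image bounds.
```

Each box would then be cropped and embedded independently, giving fine-grained spatial search over large images.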

    Documents

    Structure-Aware Chunking

    Documents have internal structure — sections, tables, figures, footnotes. Mixpeek parses layout and chunks by semantic regions, not just text flow.

    Section-Based

    Parse heading hierarchy and split by sections. Each section retains its heading chain as context — 'Chapter 3 > 3.2 Risk Factors > Market Risk'.

    Table Extraction

    Detect and extract tables as standalone chunks with structured data. Tables are embedded as both visual and textual representations.

    Figure + Caption

    Extract embedded figures with their captions. Each figure-caption pair becomes a multimodal chunk — visual embedding + text embedding.

    Paragraph Sliding Window

    Slide a window across paragraphs with configurable overlap. Simpler than full layout parsing but respects paragraph boundaries.
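The paragraph sliding window can be sketched directly; window and overlap counts here are illustrative defaults:

```python
def paragraph_windows(paragraphs, window=3, overlap=1):
    """Slide a window of `window` paragraphs, advancing by window - overlap,
    so adjacent chunks share `overlap` paragraphs of context."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    out = []
    for i in range(0, len(paragraphs), step):
        out.append("\n\n".join(paragraphs[i:i + window]))
        if i + window >= len(paragraphs):
            break
    return out

paras = [f"Paragraph {n}" for n in range(1, 8)]  # 7 paragraphs
chunks = paragraph_windows(paras, window=3, overlap=1)
# -> 3 chunks: paragraphs 1-3, 3-5, 5-7
```

Because splits land on paragraph boundaries, no chunk ever starts mid-sentence, which is the main advantage over character-based windows.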

    Audio

    Temporal & Speaker Chunking

    Audio content needs chunking strategies that respect speech patterns, speaker changes, and silence boundaries — not arbitrary time splits.

    Speaker Diarization

    Identify who spoke when and chunk by speaker turns. Each chunk includes speaker ID, transcript, and audio embedding.

    Silence Detection

    Split audio at natural pause boundaries. Effective for podcasts, lectures, and dictations where pauses signal topic shifts.

    Semantic Audio Segments

    Use transcript embeddings to detect topic boundaries in speech, grouping contiguous utterances by semantic similarity.

    Fixed Duration Windows

    Split into uniform time windows (e.g., 60-second segments). Works well for ambient audio, call center recordings, and continuous monitoring.
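A post-processing step often follows diarization: merging very short turns (backchannels like "mm-hmm") into the preceding turn so they don't become standalone chunks. A sketch of what an option like `merge_short_turns` might do; the helper below is illustrative, not Mixpeek's implementation:

```python
def merge_short_turns(turns, min_duration=2.0):
    """Given (start_s, end_s, speaker) diarization turns, fold turns shorter
    than min_duration into the preceding turn."""
    merged = []
    for start, end, speaker in turns:
        if merged and (end - start) < min_duration:
            prev_start, _, prev_speaker = merged[-1]
            merged[-1] = (prev_start, end, prev_speaker)  # absorb the short turn
        else:
            merged.append((start, end, speaker))
    return merged

turns = [(0.0, 5.2, "A"), (5.2, 6.0, "B"), (6.0, 14.5, "A"), (14.5, 20.0, "B")]
print(merge_short_turns(turns))
# -> [(0.0, 6.0, 'A'), (6.0, 14.5, 'A'), (14.5, 20.0, 'B')]
```

Without this step, a two-word interjection gets its own embedding, which is rarely useful at retrieval time.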

    Chunking Support: Mixpeek vs. Alternatives

    Most tools only chunk text. Mixpeek chunks every modality and handles the full pipeline — chunking, embedding, indexing, and retrieval.

| Capability | Mixpeek | Pinecone | LangChain | LlamaIndex |
| --- | --- | --- | --- | --- |
| Text Chunking | All text strategies + layout-aware document parsing | Comprehensive guide (fixed, recursive, semantic) | Built-in text splitters (8+ strategies) | Node parsers (sentence, semantic, hierarchical) |
| Video Chunking | Scene detection, shot boundaries, speaker turns, action segments | Not covered | Not supported | Not supported |
| Image Chunking | Object detection, OCR regions, tile grids, face crops | Not covered | Not supported | Not supported |
| Audio Chunking | Speaker diarization, silence detection, semantic segments | Not covered | Not supported | Not supported |
| Cross-Modal Context | Chunks from all modalities in one index with cross-modal retrieval | N/A (text only) | Manual integration | Limited (some multimodal nodes) |
| Infrastructure | Managed pipeline: chunking, embedding, indexing, retrieval | Guide only (BYO code) | Framework (BYO infrastructure) | Framework (BYO infrastructure) |

    Multimodal Chunking in One API

    Define chunking strategies per modality. Mixpeek handles the rest — extraction, embedding, indexing, and retrieval.

    chunking_pipeline.py
    from mixpeek import Mixpeek
    
    client = Mixpeek(api_key="YOUR_API_KEY")
    
    # Define a collection with multimodal chunking strategies
    collection = client.collections.create(
        namespace_id="ns_abc123",
        bucket_id="bucket_xyz",
        extractors=[
            # Document chunking: layout-aware with semantic fallback
            {
                "type": "text_embedding",
                "chunking": {
                    "strategy": "layout_aware",
                    "fallback": "semantic",
                    "max_tokens": 512,
                    "overlap": 50
                }
            },
            # Video chunking: scene detection with transcription
            {
                "type": "video_keyframe",
                "chunking": {
                    "strategy": "scene_detection",
                    "min_scene_duration": 2.0,
                    "embed_frames": True
                }
            },
            {
                "type": "audio_transcription",
                "chunking": {
                    "strategy": "speaker_diarization",
                    "merge_short_turns": True
                }
            },
            # Image chunking: object detection + OCR regions
            {
                "type": "object_detection",
                "chunking": {
                    "strategy": "per_object",
                    "min_confidence": 0.7
                }
            },
            {"type": "ocr"}
        ]
    )
    
    # Every file uploaded is automatically chunked, embedded, and indexed.
    # Query across all chunk types with one retriever:
    results = client.retrievers.execute(
        namespace_id="ns_abc123",
        stages=[
            {
                "type": "feature_search",
                "method": "hybrid",
                "query": {"text": "safety violation near conveyor belt"},
                "limit": 20
            },
            {"type": "rerank", "model": "cross-encoder", "limit": 5}
        ]
    )
    
    # Results span video scenes, image regions, document sections, audio segments
    for r in results:
        print(f"{r.modality} chunk: {r.content[:80]}  (score: {r.score})")

    Chunking Best Practices

    Practical guidelines for getting the most out of your chunking strategy, regardless of modality.

    Right-Size Your Chunks

    Text: 256-512 tokens for Q&A, 512-1024 for summarization. Video: 5-30 second scenes. Audio: speaker turns or 30-60s segments. Match chunk size to your retrieval use case.

    Use Overlap Wisely

    10-20% overlap for fixed-size text chunks. Not needed for semantic, layout-aware, or modality-specific chunking where boundaries are naturally meaningful.

    Preserve Hierarchy

    Attach parent context to each chunk — section headings for documents, video title for scenes, speaker identity for audio segments. Context improves embedding quality.

    Match Strategy to Data

    Don't use one strategy for everything. PDFs need layout-aware parsing. Markdown needs recursive splitting. Video needs scene detection. Configure per-modality.

    Evaluate End-to-End

    Test chunking strategies by their downstream retrieval quality, not in isolation. The best chunking strategy is the one that produces the best search results for your queries.

    Index All Modalities Together

    Chunks from different modalities should land in the same index. A text query about 'safety violations' should retrieve matching video scenes, image regions, and document passages.

    Frequently Asked Questions

    What is chunking in the context of RAG and AI retrieval?

    Chunking is the process of breaking large documents or media files into smaller, semantically meaningful pieces (chunks) that can be individually embedded and indexed for retrieval. Good chunking ensures that each piece contains enough context to be useful on its own, while being small enough for accurate vector similarity matching. It's a critical step in any RAG pipeline — poor chunking leads to irrelevant retrieval results.

    What is the best chunking strategy for RAG?

    There is no single best strategy — it depends on your data. For well-structured text documents, recursive/hierarchical chunking that respects heading structure works well. For unstructured text, semantic chunking based on embedding similarity produces the most coherent chunks. For PDFs and scanned documents, layout-aware chunking is essential. For multimodal data (video, images, audio), you need modality-specific strategies like scene detection, object cropping, and speaker diarization.

    What chunk size should I use?

    For text, 256-512 tokens is a common sweet spot — large enough for context, small enough for precise matching. But the right size depends on your use case: factual Q&A benefits from smaller chunks (128-256 tokens), while summarization and analysis work better with larger chunks (512-1024 tokens). For non-text modalities, chunk 'size' is defined differently — video scenes might be 5-30 seconds, image regions are defined by bounding boxes, and audio segments are determined by speaker turns or silence boundaries.

    What is multimodal chunking?

    Multimodal chunking extends traditional text chunking to work across all data types. Instead of only splitting text into pieces, multimodal chunking decomposes video into scenes, images into regions, audio into speaker segments, and documents into layout-aware sections. Each chunk gets its own embedding and can be retrieved independently. This is essential for building RAG systems that work with real-world enterprise data, which is overwhelmingly non-text.

    How does Mixpeek handle chunking differently from LangChain or LlamaIndex?

    LangChain and LlamaIndex provide text splitter utilities that you run in your own code. Mixpeek is managed infrastructure that handles chunking as part of an end-to-end pipeline — from file ingestion through chunking, embedding, indexing, and retrieval. More importantly, Mixpeek natively chunks video (scene detection), images (object detection), and audio (speaker diarization) — modalities that text-only frameworks don't address at all.

    Should I use overlapping chunks?

    Overlap helps prevent important context from being split across chunk boundaries. For fixed-size text chunks, 10-20% overlap (e.g., 50-100 tokens for a 512-token chunk) is a good starting point. Semantic and recursive chunking strategies are less affected by boundary issues since they split at natural boundaries. For video and audio, overlap is less common — scenes and speaker turns have natural boundaries that don't benefit from artificial overlap.

    How does chunking affect retrieval quality?

    Chunking is the single biggest lever for retrieval quality. Chunks that are too large dilute the embedding with irrelevant content, reducing precision. Chunks that are too small lose context, reducing recall. Chunks that split mid-thought or mid-table produce incoherent embeddings. The goal is chunks that represent exactly one retrievable concept — a complete thought, a single table, one scene, one speaker turn — so the embedding accurately represents what's in the chunk.

    Can I use different chunking strategies for different file types?

    Yes, and you should. Mixpeek lets you define different extractors with different chunking configurations per collection. PDFs might use layout-aware chunking while plain text uses semantic chunking. Videos use scene detection while audio uses speaker diarization. All chunks from all modalities land in the same index and are retrievable through a single query — the chunking strategy is per-modality, but retrieval is unified.

    Stop chunking text. Start chunking everything.

    Mixpeek handles multimodal chunking, embedding, indexing, and retrieval as a managed pipeline. Define your strategy, we run the infrastructure.