Multimodal Chunking Strategies: How to Decompose Video, Audio, Images, and Documents for Search

Four chunking strategies applied to the same document: fixed-size cuts mid-sentence, semantic splits on topic shifts, hierarchical nests children under retrievable parents, document-aware respects headings, code blocks, and tables.

Why Chunking Is the First Decision That Matters

Every retrieval system answers the same question: given a query, which pieces of content are relevant? The quality of the answer depends entirely on what constitutes a "piece." If pieces are too large, retrieval returns bloated results where the relevant signal is buried in noise. If too small, individual pieces lack the context needed for an LLM to generate a useful response. If pieces break across semantic boundaries, you get fragments that are neither self-contained nor representative.

For text, the chunking problem is well-studied. Split on paragraphs or sentences, use a sliding window with overlap, maybe embed each chunk and merge adjacent chunks with high cosine similarity. Libraries like LangChain and LlamaIndex ship with a dozen text splitters. The problem is largely solved.

For video, audio, images, and documents, it is not. Each modality has different structure, different information density, different natural boundaries, and different failure modes when you chunk incorrectly. A 45-minute meeting recording has no paragraph breaks. A scanned invoice has spatial layout that a linear text splitter destroys. A product image contains multiple regions of interest at different scales. A podcast has speaker turns, topic shifts, and background music that all demand different boundary strategies.

This guide covers the chunking algorithms and trade-offs for each modality, then addresses the hardest problem of all: aligning chunks across modalities so that a single retrieval unit contains the video frames, the transcript, and the audio features that correspond to the same moment.

Text Chunking: The Baseline

Text chunking is the foundation because most multimodal pipelines eventually produce text (transcripts, captions, OCR output) that must be chunked for retrieval.

Fixed-Size Windowing

The simplest approach: split text into N-token windows with M tokens of overlap. Common defaults are 512 tokens with 50-token overlap.

[tokens 0..511] [tokens 462..973] [tokens 924..1435] ...

When it works: Uniform-density text like technical documentation where every paragraph has roughly equal information content. Benchmarks show 85--89% recall at top-5 depending on token size.

When it fails: Splits mid-sentence, mid-paragraph, or mid-argument. A chunk boundary that falls between "The system crashes when..." and "...the database connection pool is exhausted" produces two useless chunks instead of one useful one.

Recursive Character Splitting

Split on the largest structural delimiter first (double newline), then fall back to smaller delimiters (single newline, period, comma, space) until chunks fit the size budget. LangChain's RecursiveCharacterTextSplitter popularized this approach.

When it works: Well-formatted documents with consistent structure (Markdown, HTML, structured reports).

When it fails: Unstructured text like raw transcripts or OCR output where newlines are arbitrary. Also fails on documents where structural boundaries do not align with semantic boundaries -- a section header might be on its own line but semantically belongs with the paragraph below it.

Semantic Chunking

Embed each sentence, then group adjacent sentences whose embeddings are similar. When the cosine similarity between consecutive sentence embeddings drops below a threshold, start a new chunk.

The algorithm:

1. Split text into sentences 2. Embed each sentence (a lightweight model like bge-small is fast enough) 3. Compute cosine similarity between consecutive sentence embeddings 4. When similarity drops below a threshold (typically the 80th--95th percentile of all pairwise similarities in the document), insert a chunk boundary 5. Merge resulting groups into chunks, applying a maximum size constraint

Benchmarks show LLMSemanticChunker reaching 0.919 recall and ClusterSemanticChunker reaching 0.913 -- roughly 3--7% above recursive splitting. However, NVIDIA's 2024 benchmark found that page-level chunking with no splitting at all achieved the lowest variance across document types, suggesting the optimal strategy depends heavily on document structure.

Strengths: Produces chunks that are semantically coherent -- each chunk is about one topic or idea.

Weaknesses: Requires an embedding forward pass for every sentence before chunking begins. This is fast for a single document but adds latency at scale. Also, embedding model choice has a larger impact on retrieval quality than chunking strategy -- spending effort on a better embedding model often outperforms spending effort on better chunking.

Late Chunking

Jina AI introduced late chunking, which addresses a fundamental problem: traditional chunking embeds each chunk independently, losing context from surrounding text. A sentence like "He refused the offer" is ambiguous without knowing who "he" is.

Late chunking works by:

1. Pass the entire document through a long-context embedding model (e.g., jina-embeddings-v3 supports 8192 tokens) 2. Apply mean pooling over token ranges corresponding to each chunk 3. Each chunk embedding now incorporates information from the full document context

A related approach is Anthropic's contextual retrieval (2024): prepend a document-level summary to each chunk before embedding, reducing retrieval failures by 35% in their benchmarks. Both methods address the same root cause -- chunks that lose context from their surrounding document.

Video Chunking: Temporal Decomposition

Video is a continuous stream with no inherent structure. The first task is converting it into discrete segments that are meaningful for retrieval.

Fixed-Interval Windowing

Split video into N-second windows (e.g., every 10 seconds). Extract keyframes from each window and treat each window as a chunk.

When it works: Surveillance footage, dashcam recordings, and other continuous capture where scenes change gradually and there are no editorial cuts.

When it fails: Edited content with variable shot lengths. A 10-second window might straddle a scene boundary, producing a chunk that shows the end of a kitchen scene and the beginning of a car chase -- semantically incoherent.

Shot Boundary Detection

Detect camera cuts by measuring visual similarity between consecutive frames. When the similarity drops sharply, a new shot begins.

Three levels of sophistication:

Histogram difference: Compare color histograms between frames. Chi-squared distance or Bhattacharyya distance above a threshold indicates a cut. Fast (>1000fps on CPU) but misses gradual transitions like dissolves and fades.

Adaptive thresholds: Maintain a rolling window of frame differences and trigger on deviations from the local mean. PySceneDetect's ContentDetector uses this approach -- it adapts to both static dialogue scenes (low baseline variation) and action sequences (high baseline variation). F1 below 0.6 on gradual transitions is the main limitation.

Learned detection: TransNetV2 is a dilated 3D-CNN trained on ClipShots and BBC Planet Earth datasets. It processes ~200fps on GPU and achieves F1 above 0.95 on hard cuts, dissolves, and fades. AutoShot (2023) improves on TransNetV2 by 4.2% F1 on its benchmark.

# PySceneDetect adaptive detection
from scenedetect import open_video, SceneManager
from scenedetect.detectors import ContentDetector

video = open_video("input.mp4")
scene_manager = SceneManager()
scene_manager.add_detector(ContentDetector(threshold=27.0))
scene_manager.detect_scenes(video)
scenes = scene_manager.get_scene_list()
# scenes = [(start_time, end_time), ...]

Scene Grouping

Shots are too granular for most retrieval tasks. A conversation between two people might alternate between camera angles every 2-3 seconds, producing dozens of shots that all belong to a single semantic scene.

Scene grouping merges consecutive shots into semantic scenes:

1. Detect shot boundaries 2. Extract a representative feature vector from each shot (e.g., average CLIP embedding of sampled frames) 3. Merge consecutive shots whose feature similarity exceeds a threshold 4. Apply minimum/maximum scene duration constraints

This reduces a 30-minute video from hundreds of shots to 15--30 scenes, each representing a coherent narrative unit.

Keyframe Selection

Once scenes are defined, you need representative frames for embedding and captioning:

I-frame extraction: Use the codec's intra-coded frames. Fast and deterministic, but keyframe placement is driven by compression heuristics, not content

Content-based scoring: Weight frames by perceptual sharpness, luminance, and temporal diversity within the scene. Select the top K

K-means on frame embeddings: Embed all frames with a lightweight CNN, cluster into K groups, select the frame closest to each centroid. Produces diverse keyframes that cover visual variety within a scene

For most production systems, 1--3 keyframes per scene provides a good balance between coverage and storage cost.

Audio Chunking: Boundaries from Speech and Sound

Audio chunking strategies depend on whether you are processing speech, music, or environmental sound.

Voice Activity Detection (VAD)

VAD models distinguish speech from non-speech with higher accuracy than energy thresholds. Silero VAD 4.0 (~1M parameters, ONNX-exportable) runs at 100x real-time on CPU and is the de facto standard.

import torch
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, _, _, _, _) = utils
speech_timestamps = get_speech_timestamps(
    audio_tensor, model, sampling_rate=16000
)
# speech_timestamps = [{'start': 0, 'end': 48000}, ...]

The speech timestamps define natural chunk boundaries. Non-speech gaps between speech segments are where you split.

Transcript-Aligned Chunking with WhisperX

WhisperX chains Silero VAD, Whisper, and phoneme forced alignment (wav2vec2) to produce word-level timestamps accurate to ~50ms. This enables transcript-aligned chunking:

1. VAD segments audio into ~30-second speech regions with 1-second overlap 2. Whisper transcribes each region in parallel 3. Forced alignment maps each word to its exact audio timestamp 4. Semantic chunking on the transcript text determines chunk boundaries 5. Word timestamps map those text boundaries back to audio time codes

The result: chunks defined by topic boundaries in the transcript, with precise audio timestamps for each chunk.

Speaker Diarization as Chunk Boundaries

For multi-speaker content, chunk boundaries should align with speaker turns. pyannote.audio 3.x uses an ECAPA-TDNN speaker encoder with clustering to identify who speaks when.

When combined with ASR, each chunk contains the transcript, the speaker label, and the start/end timestamps: "Speaker A said '...' from 01:23 to 01:47." This produces rich retrieval units for meeting and interview search.

For single-speaker content (lectures, podcasts), speaker diarization provides no boundaries. Fall back to topic-based segmentation via semantic chunking of the transcript.

Music and Speech Separation

When audio contains both speech and music (podcasts with intro music, videos with soundtracks), separate them before chunking. Demucs v4 (hybrid transformer architecture) separates vocals from accompaniment at 44.1kHz. Speech chunks and music chunks are then processed with different strategies -- speech goes through VAD and transcription, music through audio fingerprinting or embedding.

Document Chunking: Layout-Aware Parsing

PDFs, slides, and scanned documents have spatial layout that carries semantic meaning. A two-column academic paper, a table in a financial report, and a figure with a caption all have structure that linear text extraction destroys.

The Naive Approach and Its Failures

# What most people do first
text = extract_text(pdf_path)
chunks = text_splitter.split_text(text)

This flattens the document into a single text stream. Problems:

Table destruction. A table becomes interleaved text: "Revenue 2024 Expenses 2024 Revenue 2025 Expenses 2025..." -- the row/column structure is lost, and the chunk is meaningless

Multi-column misalignment. Two-column layouts produce text that alternates between columns, creating nonsense at column boundaries

Figure/caption separation. A figure caption ends up in a different chunk than the figure itself

Layout-Aware Parsing

Modern document parsing uses vision models to understand page layout before extracting text:

LayoutLMv3 and DiT (Document Image Transformer) detect document elements -- title, paragraph, table, figure, caption, header, footer -- from the page image, producing bounding boxes with element type labels.

Docling (IBM, open-source) provides end-to-end parsing: PDF in, structured JSON out. It uses layout detection to identify elements, then applies specialized extractors for each type.

Marker converts PDFs to Markdown with layout preservation, handling multi-column layouts, headers, tables, and equations.

The chunking strategy for layout-parsed documents:

1. Parse the document into structural elements (title, section, paragraph, table, figure, list) 2. Group elements by section (a section heading and its child paragraphs form a unit) 3. If a section exceeds the chunk size budget, split at paragraph boundaries within the section 4. Keep tables as single chunks -- never split a table across chunk boundaries 5. Attach figure captions to their figures as a single chunk

Page-Level Visual Embedding

An alternative: treat each page as a chunk and embed the page image directly. ColPali uses late interaction (similar to ColBERT for text) to embed document pages visually. At query time, the query is compared against page-level representations without any OCR.

When page-level works: Fixed-format documents where each page is self-contained (slides, forms, invoices), or when queries target visual elements (charts, diagrams) that OCR cannot capture.

When section-level is better: Long-form documents where relevant information spans less than a page and you need sub-page retrieval precision.

Image Chunking: Spatial Decomposition

A single image is usually one retrieval unit. But high-resolution images, panoramas, and dense scenes benefit from spatial decomposition.

Region-of-Interest Detection

Object detection models (YOLO, Grounding DINO) identify specific regions within an image. Each detected object becomes a sub-image chunk with its own embedding.

For a warehouse inventory image, this means each individual product gets its own embedding rather than one embedding for the entire shelf. Queries like "red box with barcode 12345" can match the specific region.

Fixed-Grid Tiling

Split high-resolution images into overlapping tiles of fixed size. A 4096x4096 image becomes a grid of 512x512 tiles with 64 pixels of overlap.

This is also how modern vision-language models handle high resolution internally. LLaVA-NeXT, InternVL, and DeepSeek-VL2 split input images into tiles matching the vision encoder's native resolution (typically 336x336 or 448x448), processing a global thumbnail alongside tiles for context. HiRes-LLaVA (2025) addresses the "fragmentation problem" where naive tiling destroys spatial continuity across tile borders.

When it works: Satellite imagery, pathology slides, and other high-resolution content where relevant detail occupies a small region.

When it fails: Objects that span tile boundaries are split across chunks. Overlap mitigates this but does not eliminate it.

Multi-Scale Embedding

Embed the same image at multiple crop levels: full image (global context), center crop (main subject), quadrant crops (regional detail), and object-detection crops (specific elements). Store all embeddings linked to the same source image.

A query about the overall scene matches the full-image embedding; a query about a specific detail matches a crop embedding. All embeddings point back to the same image with spatial annotations.

Cross-Modal Alignment: The Hard Problem

A video segment is not just a set of frames. It is frames plus audio plus transcript plus metadata, all occurring in the same time window. If you chunk each modality independently, the boundaries will not align: scene boundaries do not coincide with speaker turns, which do not coincide with topic shifts in the transcript.

Misaligned chunks create retrieval failures. A query about "the CEO announcing the acquisition" might match the transcript chunk (which mentions the acquisition) but not the corresponding video chunk (which shows the CEO). If these are stored as separate, unlinked retrieval units, the system returns incomplete results.

The Temporal Manifest Pattern

The solution is a temporal manifest: a data structure that links all modality-specific representations to a shared timeline.

{
  "segment_id": "seg_042",
  "source": "quarterly_earnings.mp4",
  "time_range": {"start": 847.3, "end": 892.1},
  "visual": {
    "keyframes": ["kf_042a.jpg", "kf_042b.jpg"],
    "scene_embedding": [0.12, -0.34, "..."],
    "caption": "CEO at podium, slides showing acquisition details"
  },
  "audio": {
    "speech_embedding": [0.08, 0.41, "..."],
    "speaker": "speaker_01",
    "transcript": "We are announcing our acquisition of Acme Corp..."
  },
  "text": {
    "transcript_embedding": [-0.22, 0.15, "..."],
    "entities": ["Acme Corp", "$2.3 billion"],
    "topic": "acquisition_announcement"
  }
}

Each segment in the manifest is a multimodal chunk: a single retrieval unit that contains all modality representations for one time window. The embedding index stores one or more vectors per segment (visual, audio, text), all pointing to the same segment ID. When any modality matches a query, the system returns the full multimodal segment.

Choosing the Primary Modality

One modality must define the chunk boundaries. The others align to those boundaries after the fact.

Visual-primary: Scene boundaries from shot detection define chunks. Audio and transcript are clipped to match. Best for edited video content where visual cuts are the natural structure.

Audio-primary: Speaker turns or silence gaps define chunks. Video frames and transcript are clipped to match. Best for meetings, interviews, and conversations where who is speaking matters more than what is on screen.

Transcript-primary: Topic boundaries from semantic chunking of the transcript define chunks. Video and audio are clipped to match. Best for lectures and presentations where the spoken content carries the primary information.

There is no universal best choice. The right primary modality depends on what your users search for. Visual queries need visual-primary chunking. Spoken-content queries need transcript-primary chunking. If you do not know your query distribution in advance, start with transcript-primary -- text-based retrieval is the most mature and forgiving.

Overlap Across Modalities

Fixed overlap (e.g., 2 seconds at each boundary) helps retrieval near chunk edges. For multimodal chunks, overlap must be applied consistently across all modalities: include frames, audio, and transcript words from the overlap window in both adjacent chunks. This ensures content near boundaries is retrievable regardless of which chunk the system returns.

Chunk Size Trade-Offs

Granularity

Video

Text

Pros

Cons

Fine	2--5 sec	100--200 tokens	High precision per chunk	Loses context, more chunks to index
Medium	10--15 sec	300--500 tokens	Good precision/recall balance	Moderate index size
Coarse	30--60 sec	500--1000 tokens	Rich context per chunk	Relevant content diluted by noise

The optimal size depends on query specificity. Specific queries ("what time did the fire alarm go off?") need fine-grained chunks. Broad queries ("summarize the meeting discussion about Q3 targets") benefit from larger chunks.

A practical approach: chunk at medium granularity and use hierarchical retrieval. Store both medium chunks and their parent segments (full scenes, full sections). Retrieve at the fine level for precision, then expand to the parent level for context.

Putting It Together with Mixpeek

Mixpeek handles multimodal chunking as part of the ingest pipeline. When you ingest a video, the platform performs scene segmentation, keyframe extraction, transcription, and temporal alignment automatically -- producing the multimodal chunk structure described above.

from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

# Ingest a video with multimodal feature extraction
# Mixpeek handles chunking, alignment, and indexing
result = client.ingest.from_url(
    url="s3://media/quarterly-earnings.mp4",
    collection="earnings_calls",
    feature_extractors=[
        {
            "type": "embed",
            "model": "mixpeek://embed@v1/clip_large_v1"
        },
        {
            "type": "transcribe",
            "model": "mixpeek://audio_extractor@v1/whisper_large_v3"
        },
        {
            "type": "caption",
            "model": "mixpeek://video_extractor@v1/omni_tarsier2_7b_v1"
        }
    ]
)

Each ingested video produces temporally-aligned chunks with visual embeddings, transcript text, and scene captions. When you search, any modality can match:

results = client.search.text(
    collection="earnings_calls",
    query="acquisition announcement price and target",
    pipeline=[{
        "stage_type": "search",
        "stage_id": "semantic",
        "model": "mixpeek://text_extractor@v1/baai_bge_large_v1",
        "limit": 10
    }]
)

# Each result includes the full multimodal chunk:
# - video segment with start/end timestamps
# - transcript of that segment
# - keyframe thumbnails
# - scene description

For documents, Mixpeek applies layout-aware parsing before chunking:

result = client.ingest.from_url(
    url="s3://documents/annual-report.pdf",
    collection="financial_docs",
    feature_extractors=[
        {
            "type": "ocr",
            "model": "mixpeek://image_extractor@v1/rednote_dots_ocr_15_v1"
        },
        {
            "type": "embed",
            "model": "mixpeek://embed@v1/clip_large_v1"
        }
    ]
)

Tables remain intact as single chunks, figures are linked to their captions, and section boundaries are preserved -- avoiding the linearization problems described earlier.