
    Query Preprocessing: Semantic Search With Large Files

    How we built query preprocessing into Mixpeek's feature_search stage — decompose a 500MB video into chunks, embed in parallel, fuse results. Zero API surface change for callers.


    The problem: your query is bigger than your embeddings

    Most vector search systems assume queries are small. A sentence. An image. A short audio clip. The entire retrieval literature is built around this assumption: you embed a query into a single vector, search against an index of many vectors, return ranked results.

    This works until a user hands you a 500MB video and says "find me everything in my library that looks like this."

    We started seeing this pattern from multiple customers in Q4 2025. A media company wanted to search their archive using a raw broadcast clip. A legal team wanted to submit a full contract PDF as a query against a corpus of prior agreements. An IP safety product needed to scan uploaded videos for trademark violations by searching frame-by-frame against a brand index.

    The naive solutions all have obvious problems:

    • Reject large inputs — forces the client to pre-split, which breaks the API abstraction and requires them to implement fusion logic
    • Average all frame embeddings into one vector — destroys temporal structure. A 10-minute video becomes one meaningless centroid.
    • Limit query size — a 100MB video limit is arbitrary and still doesn't solve the composition problem

    What we wanted: pass a large file directly as a query input, have the system figure out how to search with it, get back a ranked list as if it were a simple query.


    The insight: ingestion and query are the same operation

    Here's the key observation that made this tractable: the decomposition logic we already use for ingestion is exactly what we need for query preprocessing.

    When a video gets ingested into Mixpeek, it goes through a feature extractor that:

    1. Splits the video into segments (keyframes, fixed intervals, or scene boundaries)
    2. Embeds each segment via the configured model
    3. Stores the resulting vectors in Qdrant alongside payload metadata

    Query preprocessing is the same pipeline, just routing the output differently. Instead of writing vectors to Qdrant, we use them to search Qdrant. The same extractor, the same chunking logic, the same embedding model. This matters because it guarantees that query embeddings and index embeddings are always in the same vector space — no distribution shift from using a different chunking strategy at query time.
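    The routing idea can be sketched in a few lines of Python. Everything here is a toy stand-in — `decompose`, `embed_batch`, and the list-based "index" are invented for illustration, not Mixpeek's actual extractor or Qdrant client — but the shape is the point: one decomposition pipeline, two sinks.

```python
def decompose(asset):
    """Same chunking used at ingest time: keyframes / intervals / scene cuts."""
    return [asset[i:i + 4] for i in range(0, len(asset), 4)]

def embed_batch(segments):
    """Same embedding model as the index, so vectors share one space."""
    return [[float(len(s)), float(sum(s))] for s in segments]

def ingest(asset, index):
    # Ingestion sink: write the vectors into the index.
    index.extend(embed_batch(decompose(asset)))

def search_with(asset, index, top_k=3):
    # Query sink: identical pipeline, but the vectors are used to search.
    out = []
    for q in embed_batch(decompose(asset)):
        dist = lambda v: sum((a - b) ** 2 for a, b in zip(q, v))
        out.append(sorted(index, key=dist)[:top_k])
    return out
```

    Because `ingest` and `search_with` share `decompose` and `embed_batch`, any change to the chunking or the model applies to both sides at once — which is exactly the no-distribution-shift guarantee described above.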

    The execution flow looks like this:

    feature_search stage
    │
    ├─ 1. Detect input type
    │     → video/500MB detected
    │     → route to query_preprocessing
    │
    ├─ 2. Decompose via extractor pipeline
    │     → same extractor that indexed the data
    │     → e.g. 20 keyframes from a 10-min video
    │
    ├─ 3. Batch embed (parallel)
    │     → 20 segments → inference service → 20 vectors
    │
    ├─ 4. Parallel Qdrant searches
    │     → 20 concurrent ANN queries
    │     → each returns top_k candidates
    │
    ├─ 5. Fuse results
    │     → RRF / max / avg across 20 result sets
    │     → deduplicate (same doc from multiple frames → keep best)
    │
    └─ Output: single ranked list, same shape as a simple query response
    

    From the caller's perspective, nothing changes. You pass a file URL, you get results back. The complexity is entirely internal.
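    Steps 3–5 of that flow can be sketched with `asyncio`. The `search_chunk` coroutine below is a placeholder for a real ANN query against Qdrant (invented doc IDs and scores), not Mixpeek's internal code; the point is that the N chunk searches run concurrently and feed a single fusion step.

```python
import asyncio

async def search_chunk(vector, top_k=10):
    # Stand-in for one ANN query against Qdrant; returns (doc_id, score) pairs.
    await asyncio.sleep(0)  # placeholder for the network round trip
    return [("doc-a", 0.9), ("doc-b", 0.7)][:top_k]

async def feature_search(vectors, top_k=10, fuse=None):
    # Step 4: N concurrent ANN queries, one per chunk vector.
    result_sets = await asyncio.gather(
        *(search_chunk(v, top_k) for v in vectors)
    )
    # Step 5: fuse N ranked lists into one (RRF / max / avg plugs in here).
    return fuse(result_sets) if fuse else result_sets

# Toy run: 3 "chunk vectors" produce 3 result sets, searched in parallel.
sets = asyncio.run(feature_search([[0.1], [0.2], [0.3]]))
```

    With `fuse=None` the sketch returns the raw per-chunk result sets; in the real stage the fusion strategy from the `query_preprocessing` config would be applied before anything reaches the caller.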


    API design

    We added a query_preprocessing object to the feature_search stage. It can live at the stage level (applies to all searches as a default) or per-search (overrides the default for that search).

    Zero-config usage — just pass a large file and the system figures out the rest:

    {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [{
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": {
            "input_mode": "content",
            "value": "s3://my-bucket/broadcast-clip.mp4"
          },
          "query_preprocessing": {
            "max_chunks": 20,
            "aggregation": "rrf"
          }
        }]
      }
    }
    

    Power user config — explicit chunking parameters and per-search preprocessing:

    {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
            "query": { "input_mode": "content", "value": "{{INPUT.video}}" },
            "query_preprocessing": {
              "max_chunks": 30,
              "aggregation": "max",
              "dedup_field": "metadata.document_id"
            }
          },
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": { "input_mode": "text", "value": "{{INPUT.caption}}" }
          }
        ]
      }
    }
    

    The second search in that example is a plain text search with no preprocessing. Multi-modal retrieval with heterogeneous query types, fused at the end.


    Fusion strategies

    Once you have N result sets from N chunk searches, you need to combine them. We support three strategies:

    RRF (Reciprocal Rank Fusion)

    Each document's score is the sum of 1 / (k + rank) across all chunk result sets where it appeared. k is a smoothing constant (typically 60).

    RRF is rank-based, so it's immune to score magnitude differences between chunks. A document that ranks 3rd in 5 different chunk searches beats one that ranks 1st in only 1. This is the right default for "find content that's generally similar to this video" queries.

    Max

    Keep the highest score a document received across all chunk searches. Use this when you want "find the moment in this video that best matches something in the index" — you care about the best alignment, not average alignment.

    Avg

    Average the scores across all chunk results where the document appeared. Documents that show up consistently across many chunks beat documents that match one chunk perfectly. Useful for "find videos with similar overall content distribution."

    The right strategy depends on the query semantics. For IP safety (does this video contain a specific brand?), max is correct — you want the single best match. For "find content similar to this video," rrf is more robust.
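    A toy comparison shows how the winner flips between max and avg on the same result sets (the doc IDs and scores are invented). "logo" nails one frame; "similar" matches every frame moderately well.

```python
def fuse(result_sets, mode):
    """result_sets: lists of (doc_id, score); returns a ranked list."""
    by_doc = {}
    for results in result_sets:
        for doc_id, score in results:
            by_doc.setdefault(doc_id, []).append(score)
    agg = max if mode == "max" else (lambda s: sum(s) / len(s))
    return sorted(((d, agg(s)) for d, s in by_doc.items()),
                  key=lambda kv: kv[1], reverse=True)

sets = [[("logo", 0.98), ("similar", 0.70)],   # one near-perfect frame match
        [("logo", 0.10), ("similar", 0.72)],
        [("logo", 0.12), ("similar", 0.71)]]

fuse(sets, "max")[0]  # "logo" wins: its single best frame scores 0.98
fuse(sets, "avg")[0]  # "similar" wins: consistent 0.70+ across all frames
```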


    What we didn't do: a "strategy: auto" mode

    Early in the design we considered a strategy: "auto" parameter that would detect file size and type and choose chunking parameters automatically. We prototyped it.

    The problem is that the right chunking depends on what you're trying to find, not just the file. A 5-second clip queried against a movie archive probably wants dense keyframe sampling. The same clip queried against a sports highlight reel probably wants scene-boundary splits. There's no way to infer this from the file alone.

    We removed auto mode. If we add it back, it'll be as a starting heuristic with explicit override support — not as a magic setting that hides what's actually happening. The full parameter reference is in the docs.


    Credit model

    Each chunk counts as one retrieval credit. A max_chunks: 20 config on a video that produces 20 keyframes costs 20 credits, same as running 20 separate single-vector searches. This is intentional — preprocessing is not a way to get bulk search at single-query pricing. The cost is transparent and predictable.

    The cap parameter (max_chunks, range 1–100) exists to bound the cost at query time. If an extractor would produce 50 chunks but you set max_chunks: 20, we take the first 20 by default. You can configure the sampling strategy via extractor params if you need uniform sampling instead.
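    The capping behavior can be sketched as follows — `cap_chunks` and the `"uniform"` strategy name are illustrative, not the actual extractor parameters:

```python
def cap_chunks(chunks, max_chunks, strategy="first"):
    """Bound query cost: keep at most max_chunks segments."""
    if len(chunks) <= max_chunks:
        return chunks
    if strategy == "first":
        # Default: take the first N in order.
        return chunks[:max_chunks]
    # "uniform": sample evenly across the whole asset instead.
    step = len(chunks) / max_chunks
    return [chunks[int(i * step)] for i in range(max_chunks)]

frames = list(range(50))          # an extractor that produced 50 chunks
capped = cap_chunks(frames, 20)   # first 20 → 20 credits, not 50
spread = cap_chunks(frames, 20, "uniform")  # 20 chunks spanning the asset
```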


    The IP safety case

    The use case that drove us to ship this quickly was our IP safety verification pipeline. The product takes a video (a YouTube upload, a broadcast clip, an ad creative) and checks it against a face index (93K embeddings across ~5K identities) and a brand logo index (25K brands).

    The query is the video. There's no text query, no image query — you're searching with the entire asset. Before query preprocessing, this required the caller to extract frames, embed them, run searches, and fuse results themselves. Now it's one API call:

    {
      "stage_id": "ip_safety_verify",
      "parameters": {
        "face_index_s3_uri": "s3://mixpeek-server-prod/ip-safety/face_index.npz",
        "brand_index_s3_uri": "s3://mixpeek-server-prod/ip-safety/logo_text_index_v2.npz",
        "image_url_field": "metadata.frame_url"
      }
    }
    

    The stage handles frame extraction, parallel embedding, and fusion internally. Callers pass a video URL and get back identified faces and brands with confidence scores.


    Limitations and known tradeoffs

    Latency scales with chunk count. Twenty parallel Qdrant searches are fast (we batch the embedding calls), but they're not as fast as one search. For latency-sensitive paths, set a low max_chunks or pre-extract a representative keyframe.

    The extractor must support the input type. Query preprocessing routes through the same extractor pipeline as ingestion. If your namespace uses a text-only extractor, you can't pass a video as a query. The feature URI determines what decomposition is possible.

    Chunk ordering is not preserved. The fused result list is ranked by similarity score, not temporal order. If you need results ordered by where in the query video they matched, you'd need to add that as post-processing (we don't have a stage for this yet).

    Deduplication is per-field. If two chunks both match the same 5-second clip but from different angles, they'll show up as different results unless you configure dedup_field to collapse by document ID. Know your data model.
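    A sketch of what dedup_field does internally — collapse hits that share a payload field, keeping the best-scoring one. The hit shape below is illustrative, not the exact response schema:

```python
def dedup(results, key=lambda r: r["metadata"]["document_id"]):
    """Collapse hits sharing a document ID, keeping the highest score."""
    best = {}
    for r in results:
        k = key(r)
        if k not in best or r["score"] > best[k]["score"]:
            best[k] = r
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)

hits = [
    {"id": "chunk-1", "score": 0.91, "metadata": {"document_id": "clip-7"}},
    {"id": "chunk-2", "score": 0.84, "metadata": {"document_id": "clip-7"}},
    {"id": "chunk-3", "score": 0.80, "metadata": {"document_id": "clip-9"}},
]
deduped = dedup(hits)  # clip-7 appears once, via its best-scoring chunk
```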


    What's next

    Query preprocessing is live in the feature_search stage today. Full docs here.

    The pattern — decompose input, embed in parallel, fuse results — generalizes beyond feature search. The same approach should work in rerank stages (LLM-score each chunk of a large document, take the max) and in apply stages (run a classifier on each frame of a video, return the worst-case result). We haven't built those yet, but the abstraction is the same.

    If you're building something where the query is a large file, we'd like to hear about it. The current implementation was shaped almost entirely by real production use cases. The next iteration will be too.
