Payload Projection for Agentic Vector Search: Field Selection, Evidence Handles, and Context Budgets

Why Payload Size Is an Agent Problem

Vector search usually gets discussed as a ranking problem: embed the query, find nearest neighbors, rerank, return results. That is only half of the system an agent experiences.

An agent does not consume a vector score. It consumes the payload attached to each result. That payload might include transcript text, OCR spans, image URLs, keyframe thumbnails, bounding boxes, model confidence, speaker labels, tenant metadata, access policy, and source lineage.

If the retriever returns too little, the agent cannot answer or cite evidence. If it returns too much, the agent burns context, leaks irrelevant fields into the prompt, increases latency, and makes tool output harder to reason over.

Payload projection is the retrieval-layer control that decides which fields come back for a query.

This matters most for unstructured content because each result can be a dense evidence object:

A video hit may contain a clip URI, keyframe URI, start time, end time, scene caption, OCR text, transcript span, objects, faces, and safety labels.

An audio hit may contain transcript text, speaker labels, timestamps, diarization confidence, language tags, and source handles.

A document hit may contain page image, text block, layout boxes, table cells, formulas, and redaction metadata.

An image hit may contain embeddings, prompt-generated captions, masks, crops, bounding boxes, and provenance.

For agents, retrieval is not only "top 10 nearest vectors." It is "top 10 evidence packets that fit the task."

What Payload Projection Means

Payload projection is field selection for retrieval results. The query still ranks over the indexed representation, but the response returns only the fields the caller asks for.

The basic shape:

{
  "query": "customer asked for a refund after an outage",
  "top_k": 10,
  "select_fields": [
    "source_uri",
    "text",
    "speaker",
    "start_ms",
    "end_ms"
  ]
}

The ranking engine can still use vectors, sparse terms, filters, and metadata. Projection controls the output payload.
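A minimal sketch of projection as a pure function over an already-ranked hit, assuming payloads are plain dictionaries. The name `project_payload` and the sample hit are illustrative and do not come from any particular library.

```python
from typing import Any


def project_payload(payload: dict[str, Any], select_fields: list[str]) -> dict[str, Any]:
    """Return only the requested fields, in the requested order.

    Fields missing from the payload are skipped rather than filled with None,
    so the caller can tell "absent" apart from "present but empty".
    """
    return {name: payload[name] for name in select_fields if name in payload}


# Hypothetical full payload stored alongside the vector.
hit = {
    "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
    "text": "I would like a refund because the outage affected our launch",
    "speaker": "customer",
    "start_ms": 822180,
    "end_ms": 826920,
    "dense_vector": [0.12, -0.03, 0.88],  # rank field, never needed by the agent
    "asr_model": "example-asr-model",     # lineage, not needed for a plain answer
}

projected = project_payload(hit, ["source_uri", "text", "speaker", "start_ms", "end_ms"])
print(projected)
```

Ranking has already happened by the time this function runs, so dropping `dense_vector` from the output has no effect on which hits were chosen.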

A useful way to separate the concerns:

Concern	Question it answers	Example

Ranking	Which items are most relevant?	Vector score, BM25 score, reranker score
Filtering	Which items are allowed?	customer_id, date, content_type, policy label
Projection	Which fields should come back?	source_uri, span text, timestamp, thumbnail
Expansion	What should be fetched after selection?	full transcript window, full image, full PDF page

Projection is not a replacement for ranking. It is the contract between the retriever and the agent.

Projection vs. Filtering vs. Reranking

These operations are often confused because they all appear near the query.

Filtering changes the candidate set. If an agent asks for "refund calls from enterprise customers last week," filters should restrict the search to enterprise accounts and the date window before ranking.

Reranking changes the order. A first-stage vector search may retrieve 200 candidate spans; a cross-encoder or late-interaction model then rescores the top candidates to improve precision.

Projection changes the returned fields. The retriever may rank over hidden internal fields and still return only a compact evidence envelope.

For example, a video search system may rank over:

CLIP or SigLIP frame embeddings.

ASR transcript embeddings.

OCR sparse terms.

Object labels.

Scene captions.

User metadata.

But the agent may only need:

source_uri

start_ms

end_ms

short_caption

transcript_excerpt

keyframe_url

That output is smaller, cleaner, and easier to cite.
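The three operations can be kept apart in code. The sketch below is a toy in-memory retriever, assuming a tiny hypothetical corpus and plain cosine similarity; the reranking stage is only marked with a comment because it needs a real model.

```python
import math

# Hypothetical corpus: each item has a rank field (vector), a filter field (tier),
# and fields an agent might want back.
ITEMS = [
    {"id": "a", "vector": [1.0, 0.0], "tier": "enterprise",
     "text": "refund after outage", "source_uri": "s3://calls/a.wav"},
    {"id": "b", "vector": [0.9, 0.1], "tier": "free",
     "text": "refund request", "source_uri": "s3://calls/b.wav"},
    {"id": "c", "vector": [0.0, 1.0], "tier": "enterprise",
     "text": "billing address change", "source_uri": "s3://calls/c.wav"},
]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def search(query_vector, tier, top_k, select_fields):
    # Filtering: shrink the candidate set before any scoring happens.
    candidates = [item for item in ITEMS if item["tier"] == tier]
    # Ranking: order candidates by similarity to the query vector.
    scored = [(cosine(query_vector, item["vector"]), item) for item in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # A reranker would reorder the head of `scored` here; omitted in this sketch.
    # Projection: shape the output; the vector never leaves the retriever.
    return [
        {"score": round(score, 3), **{f: item[f] for f in select_fields if f in item}}
        for score, item in scored[:top_k]
    ]


results = search([1.0, 0.0], tier="enterprise", top_k=2, select_fields=["source_uri", "text"])
for result in results:
    print(result)
```

Item `b` is the second-closest vector but never appears, because the filter removed it before ranking. The `vector` field drives the order and is absent from every result.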

The Evidence Envelope Pattern

Agents work best when retrievers return structured evidence envelopes instead of raw database rows.

{
  "id": "call_481:822180:826920",
  "score": 0.84,
  "evidence": {
    "text": "I would like a refund because the outage affected our launch",
    "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
    "start_ms": 822180,
    "end_ms": 826920,
    "speaker": "customer"
  },
  "expand": {
    "clip_uri": "mixpeek://clips/call_481/822180-826920",
    "nearby_context_uri": "mixpeek://spans/call_481/819000-830000"
  }
}

The envelope separates immediate evidence from expansion handles.

Immediate evidence is what the agent needs to answer now. Expansion handles let the agent fetch more if needed. This is the same idea behind good tool design: return enough structured output to act, but do not dump the entire object graph into the model context.

The envelope should usually contain five field classes.

Field class	Purpose	Examples

Identity	Let the system deduplicate and trace results	id, namespace, object_id, span_id
Citation	Let a human verify the answer	source_uri, page, start_ms, end_ms, keyframe_url
Evidence	Let the model answer	text, caption, ocr_excerpt, object_label
Confidence	Let the model handle uncertainty	score, model_confidence, speaker_overlap
Expansion	Let the agent fetch more	clip_uri, page_image_uri, full_payload_uri

Do not confuse evidence with expansion. The evidence field should be compact. The expansion field should point to richer material.
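One way to build such an envelope from a stored row is sketched below. The split of fields into `EVIDENCE_FIELDS` and `EXPANSION_FIELDS` is an assumption made for this example, as are the `tenant_id` and `dense_vector` fields on the sample row.

```python
from typing import Any

# Assumed field-to-section mapping for this sketch; a real schema would define its own.
EVIDENCE_FIELDS = ("text", "source_uri", "start_ms", "end_ms", "speaker")
EXPANSION_FIELDS = ("clip_uri", "nearby_context_uri")


def build_envelope(row: dict[str, Any], score: float) -> dict[str, Any]:
    """Split a raw stored row into immediate evidence and expansion handles."""
    return {
        "id": row["id"],
        "score": score,
        "evidence": {k: row[k] for k in EVIDENCE_FIELDS if k in row},
        "expand": {k: row[k] for k in EXPANSION_FIELDS if k in row},
    }


raw_row = {
    "id": "call_481:822180:826920",
    "text": "I would like a refund because the outage affected our launch",
    "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
    "start_ms": 822180,
    "end_ms": 826920,
    "speaker": "customer",
    "clip_uri": "mixpeek://clips/call_481/822180-826920",
    "nearby_context_uri": "mixpeek://spans/call_481/819000-830000",
    "dense_vector": [0.1, 0.2],  # rank field, stays out of the envelope
    "tenant_id": "tenant_42",    # hypothetical governance field, also left out
}

envelope = build_envelope(raw_row, score=0.84)
print(envelope)
```

Anything not named in either tuple is dropped, so adding a new stored field does not silently widen what the agent sees.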

Field Classes for Multimodal Retrieval

A practical schema separates fields by how often they should appear in agent responses.

1. Rank Fields

Rank fields are used by the retriever but usually not returned.

Examples:

dense_vector

sparse_vector

late_interaction_tokens

normalized_text

model-specific feature blobs

internal quality priors

These fields can be large and meaningless to an LLM. They should stay inside the retrieval engine unless the caller is debugging.

2. Cite Fields

Cite fields are small and almost always useful.

Examples:

source_uri

object_id

start_ms

end_ms

page_number

bounding_box

keyframe_url

model_id

extractor_version

For agents that search media, cite fields are not optional. They turn a generated answer into inspectable evidence.

3. Answer Fields

Answer fields are compact natural-language fields the model can reason over.

Examples:

transcript_excerpt

scene_caption

ocr_excerpt

object_summary

table_caption

alt_text

These fields should be clean text. Avoid embedding JSON blobs, timestamps, and unrelated metadata in the text sent to the model.

4. Governance Fields

Governance fields tell the agent whether it is allowed to use or reveal the evidence.

Examples:

tenant_id

acl_label

pii_level

retention_class

region

legal_hold

Some governance fields should be used for filtering but not returned to the model. Others should be returned so the tool caller can enforce policy outside the model.

5. Expansion Fields

Expansion fields are handles, not full data dumps.

Examples:

clip_uri

full_transcript_uri

page_image_uri

crop_uri

payload_uri

The first retrieval call should return handles. A second tool call can fetch the larger payload only when the agent needs it.
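The five classes can be recorded in the schema itself so that response shaping is driven by a tag rather than by hand-written field lists. This is a sketch under assumed names: `FieldClass`, `SCHEMA`, and `returnable_fields` are illustrative, and the default set of agent-visible classes is a design choice, not a rule.

```python
from enum import Enum


class FieldClass(Enum):
    RANK = "rank"
    CITE = "cite"
    ANSWER = "answer"
    GOVERNANCE = "governance"
    EXPANSION = "expansion"


# Hypothetical schema: every stored field is tagged with exactly one class.
SCHEMA = {
    "dense_vector": FieldClass.RANK,
    "source_uri": FieldClass.CITE,
    "start_ms": FieldClass.CITE,
    "end_ms": FieldClass.CITE,
    "transcript_excerpt": FieldClass.ANSWER,
    "tenant_id": FieldClass.GOVERNANCE,
    "clip_uri": FieldClass.EXPANSION,
}

# Classes an agent-facing response may include by default in this sketch.
AGENT_DEFAULT = frozenset({FieldClass.CITE, FieldClass.ANSWER, FieldClass.EXPANSION})


def returnable_fields(allowed=AGENT_DEFAULT, debug=False):
    """List fields whose class is allowed; rank fields appear only when debugging."""
    effective = set(allowed)
    if debug:
        effective.add(FieldClass.RANK)
    return sorted(name for name, cls in SCHEMA.items() if cls in effective)


agent_fields = returnable_fields()
debug_fields = returnable_fields(debug=True)
print(agent_fields)
print(debug_fields)
```

Governance fields are excluded from the default here; a deployment that enforces policy in the tool caller would add `FieldClass.GOVERNANCE` for that caller only.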

Late Materialization

Late materialization is the database pattern behind efficient projection.

In a naive system, the search engine loads full payloads for every candidate, ranks them, and then returns a subset. For unstructured data, those payloads can be large: thumbnails, transcripts, OCR blocks, JSON metadata, and nested feature objects.

Late materialization delays full payload fetch until after ranking.

query
  -> search index returns candidate IDs and scores
  -> reranker narrows candidates
  -> projection fetches selected fields only
  -> response returns compact evidence envelopes

This has three benefits:

1. Less data moves across the retrieval path.

2. Large fields are fetched only when they are actually needed.

3. Agent context receives a predictable payload shape.

Late materialization is especially valuable when the vector index sits near object storage. You can keep large source objects and payload blobs in cheap storage, while the hot query path returns only the fields required by the current tool call.
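The flow can be sketched with two hypothetical stand-ins: an index that knows only IDs and scores, and a payload store that holds the heavy fields. `fetch_log` exists only to show how many payload reads happen.

```python
# Hypothetical index output: IDs and scores, no payloads.
INDEX_HITS = [("span_3", 0.91), ("span_1", 0.87), ("span_2", 0.42)]

# Hypothetical payload store keyed by ID; thumbnail_bytes stands in for a large field.
PAYLOAD_STORE = {
    "span_1": {"text": "refund after outage", "source_uri": "s3://calls/1.wav",
               "thumbnail_bytes": b"\x00" * 50_000},
    "span_2": {"text": "password reset", "source_uri": "s3://calls/2.wav",
               "thumbnail_bytes": b"\x00" * 50_000},
    "span_3": {"text": "outage credit request", "source_uri": "s3://calls/3.wav",
               "thumbnail_bytes": b"\x00" * 50_000},
}

fetch_log = []


def fetch_fields(item_id, fields):
    """Read only the named fields for one ID (a column-style read)."""
    fetch_log.append((item_id, tuple(fields)))
    row = PAYLOAD_STORE[item_id]
    return {f: row[f] for f in fields if f in row}


def search_late(top_k, select_fields):
    # Rank on IDs and scores only; no payload has been read at this point.
    top = sorted(INDEX_HITS, key=lambda hit: hit[1], reverse=True)[:top_k]
    # Materialize the selected fields for the surviving candidates only.
    return [
        {"id": item_id, "score": score, **fetch_fields(item_id, select_fields)}
        for item_id, score in top
    ]


results = search_late(top_k=2, select_fields=["text", "source_uri"])
print(results)
print(f"payload reads: {len(fetch_log)}")
```

Three candidates were scored, but only two payload reads occurred, and neither read touched `thumbnail_bytes`.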

Context Budget Math

Projection is often a bigger context win than another prompt rewrite.

Assume a support-call search returns 20 transcript spans. Each full payload has:

250 tokens of transcript context.

80 tokens of metadata.

50 tokens of model lineage.

40 tokens of policy and sync metadata.

30 tokens of URLs and IDs.

That is about 450 tokens per result, or 9,000 tokens for 20 results.

If the agent only needs text, speaker, timestamp, source URI, and score, each result might be 90 tokens. The same 20 results become about 1,800 tokens.

That difference changes the retrieval plan:

You can retrieve more candidates for recall.

You can fit more sources into the model context.

You can keep citations without carrying irrelevant metadata.

You can reduce tool latency and network cost.

Context engineering is not only prompt design. It starts at the retrieval payload.
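The arithmetic above, worked through in a few lines. The per-field token counts are the estimates from this section, not measurements.

```python
# Estimated tokens per field group in one full payload (figures from the text).
FULL_PAYLOAD_TOKENS = {
    "transcript_context": 250,
    "metadata": 80,
    "model_lineage": 50,
    "policy_and_sync": 40,
    "urls_and_ids": 30,
}
PROJECTED_TOKENS_PER_RESULT = 90  # text, speaker, timestamp, source URI, score
TOP_K = 20

full_per_result = sum(FULL_PAYLOAD_TOKENS.values())
full_total = full_per_result * TOP_K
projected_total = PROJECTED_TOKENS_PER_RESULT * TOP_K
savings_ratio = 1 - projected_total / full_total

# How many projected results fit in the budget that 20 full results consumed?
results_in_same_budget = full_total // PROJECTED_TOKENS_PER_RESULT

print(f"full: {full_per_result} tokens/result, {full_total} total")
print(f"projected: {PROJECTED_TOKENS_PER_RESULT} tokens/result, {projected_total} total")
print(f"savings: {savings_ratio:.0%}, results in same budget: {results_in_same_budget}")
```

Under these estimates the projected payload is 80% smaller, and the budget that held 20 full results holds 100 projected ones.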

Query-Time Projection for Agents

Agents should request fields based on the task.

A question-answering task needs compact answer fields:

{
  "select_fields": [
    "source_uri",
    "text",
    "start_ms",
    "end_ms",
    "speaker",
    "score"
  ]
}

A visual inspection task needs media handles:

{
  "select_fields": [
    "source_uri",
    "keyframe_url",
    "caption",
    "objects",
    "timestamp_ms",
    "score"
  ]
}

A compliance task needs governance and provenance:

{
  "select_fields": [
    "source_uri",
    "policy_label",
    "evidence_text",
    "model_id",
    "extractor_version",
    "confidence",
    "review_uri"
  ]
}

A debugging task may need internal fields:

{
  "select_fields": [
    "id",
    "score",
    "vector_score",
    "bm25_score",
    "rerank_score",
    "payload_size_bytes"
  ]
}

The agent should not use one universal payload shape for every query. Different tools can expose different safe projections.

Tool Design Pattern

A retrieval tool can make projection explicit in its schema.

{
  "name": "search_media_evidence",
  "description": "Search indexed video, audio, image, and document evidence. Returns compact citeable spans and expansion handles.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {"type": "string"},
      "content_type": {"type": "string", "enum": ["video", "audio", "image", "document", "any"]},
      "top_k": {"type": "integer", "minimum": 1, "maximum": 50},
      "projection": {
        "type": "string",
        "enum": ["answer", "visual", "compliance", "debug"]
      }
    },
    "required": ["query"]
  }
}

The tool can map projection presets to field lists.

Preset	Fields

answer	source_uri, text, speaker, start_ms, end_ms, score
visual	source_uri, keyframe_url, caption, objects, timestamp_ms, score
compliance	source_uri, policy_label, evidence_text, model_id, extractor_version, confidence, review_uri
debug	score components, payload size, index partition, model version

This makes the model choose intent, while the application controls the exact fields.
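A sketch of the application-side mapping follows. The field lists are illustrative and keep `source_uri` in every agent-facing preset so citations survive. The default preset and the default `top_k` of 10 are assumptions, since the schema above leaves both unspecified.

```python
# Illustrative preset-to-fields mapping owned by the application, not the model.
PRESETS = {
    "answer": ["source_uri", "text", "speaker", "start_ms", "end_ms", "score"],
    "visual": ["source_uri", "keyframe_url", "caption", "objects", "timestamp_ms", "score"],
    "compliance": ["source_uri", "policy_label", "evidence_text", "model_id",
                   "extractor_version", "confidence", "review_uri"],
    "debug": ["id", "score", "vector_score", "bm25_score", "rerank_score",
              "payload_size_bytes"],
}
DEFAULT_PRESET = "answer"  # assumed default when the model omits `projection`
DEFAULT_TOP_K = 10         # assumed default when the model omits `top_k`
MAX_TOP_K = 50             # matches the schema's maximum


def resolve_tool_call(args: dict) -> dict:
    """Validate tool arguments and translate the preset into a concrete field list."""
    query = args.get("query")
    if not isinstance(query, str) or not query:
        raise ValueError("query is required")
    preset = args.get("projection", DEFAULT_PRESET)
    if preset not in PRESETS:
        raise ValueError(f"unknown projection preset: {preset!r}")
    top_k = int(args.get("top_k", DEFAULT_TOP_K))
    if not 1 <= top_k <= MAX_TOP_K:
        raise ValueError(f"top_k must be between 1 and {MAX_TOP_K}")
    return {"query": query, "top_k": top_k, "select_fields": list(PRESETS[preset])}


request = resolve_tool_call(
    {"query": "forklift near loading dock", "projection": "visual", "top_k": 5}
)
print(request)
```

The model never supplies a field name, so a hallucinated or sensitive field cannot reach the retriever through this tool.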

Failure Modes

Returning full payloads by default. This makes prototyping easy and production agents noisy. Default to compact evidence.

Dropping citation fields. If the projected result omits source URI, timestamp, page, or bounding box, the agent cannot produce verifiable answers.

Embedding metadata into answer text. Text like "[00:13:42] speaker=customer policy=internal" pollutes embeddings and model context. Keep clean answer text and structured metadata separate.

Using projection as authorization. Projection can hide fields from a response, but it is not access control. Authorization must happen before retrieval and before expansion.

Returning fields the agent cannot interpret. Raw vectors, model logits, and large nested feature blobs are useful for debugging but poor answer context.

No expansion path. If the first result is compact but there is no way to fetch the source clip, page image, or full transcript, the agent gets stuck.

One projection for every tool. Search, compliance review, visual QA, and debugging need different payload shapes.
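The "projection as authorization" failure is easy to show with a toy example. The rows, tenant names, and helper functions below are hypothetical; the point is only the order of the two steps.

```python
# Hypothetical rows with a governance field used for access control.
ROWS = [
    {"id": "r1", "tenant_id": "acme", "text": "refund approved",
     "source_uri": "s3://acme/r1.wav"},
    {"id": "r2", "tenant_id": "globex", "text": "contract terms",
     "source_uri": "s3://globex/r2.wav"},
]


def authorized_rows(rows, caller_tenant):
    """Access control: drop rows the caller may not see at all."""
    return [row for row in rows if row["tenant_id"] == caller_tenant]


def project(rows, select_fields):
    """Payload shaping: pick fields from rows that already passed authorization."""
    return [{f: row[f] for f in select_fields if f in row} for row in rows]


# Wrong: projection hides tenant_id but still returns another tenant's text.
leaky = project(ROWS, ["text", "source_uri"])

# Right: authorize first, then project.
safe = project(authorized_rows(ROWS, "acme"), ["text", "source_uri"])

print(len(leaky), len(safe))
```

In the leaky version the `tenant_id` column is gone from the output, which makes the cross-tenant row harder to notice rather than safer.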

Evaluation

Evaluate projection separately from ranking.

Ranking asks: did the retriever find the right evidence?

Projection asks: did the response include the right fields, and only the right fields, for the task?

Useful metrics:

Metric	What it measures

Citation completeness	Percentage of results with source handle and time/page/box when needed
Payload bytes per result	Network and serialization cost
Prompt tokens per result	Context budget cost
Expansion rate	How often agents need a second fetch
Answer success with projection	Whether compact fields still let the model answer correctly
Leakage rate	Whether irrelevant or policy-sensitive fields are returned
Debug sufficiency	Whether debugging projections expose enough scoring information

For agent tools, add task-level tests:

1. Ask the agent a question requiring media evidence.

2. Require citations in the final answer.

3. Check that every cited answer maps to a returned source handle.

4. Check that compact projection succeeds without full payloads.

5. Run the same task with larger projections and compare answer quality, latency, and token use.

The goal is not the smallest possible payload. The goal is the smallest payload that lets the agent answer and cite correctly.
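Three of the metrics above can be computed directly from returned results. This sketch assumes results are plain dictionaries; the required and forbidden field lists are examples and would differ per task. Prompt tokens per result is omitted because it depends on the tokenizer in use.

```python
import json


def citation_completeness(results, required=("source_uri", "start_ms", "end_ms")):
    """Fraction of results carrying every required cite field."""
    if not results:
        return 0.0
    # `is not None` keeps a legitimate start_ms of 0 from counting as missing.
    complete = sum(all(r.get(f) is not None for f in required) for r in results)
    return complete / len(results)


def payload_bytes_per_result(results):
    """Mean UTF-8 size of each result serialized as compact JSON."""
    if not results:
        return 0.0
    sizes = [len(json.dumps(r, separators=(",", ":")).encode("utf-8")) for r in results]
    return sum(sizes) / len(sizes)


def leakage_rate(results, forbidden=("tenant_id", "dense_vector")):
    """Fraction of results that include any field the task should not see."""
    if not results:
        return 0.0
    return sum(any(f in r for f in forbidden) for r in results) / len(results)


sample = [
    {"source_uri": "s3://calls/1.wav", "start_ms": 0, "end_ms": 900, "text": "refund"},
    {"source_uri": "s3://calls/2.wav", "text": "outage", "tenant_id": "acme"},
]

print(citation_completeness(sample))
print(leakage_rate(sample))
print(payload_bytes_per_result(sample))
```

The second sample result fails both checks: it has no timestamps and it carries `tenant_id`, so completeness and leakage each come out at 0.5.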

Mixpeek MVS Example

In MVS, store clean searchable fields and structured citation metadata.

from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

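# span_embedding is assumed to be the dense vector for this transcript span,
# computed upstream by the embedding model before this call.
mx.mvs.upsert(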
    namespace="support-call-memory",
    vectors=[
        {
            "id": "call_481:822180:826920:bge_m3",
            "values": span_embedding,
            "metadata": {
                "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
                "text": "I would like a refund because the outage affected our launch",
                "speaker": "customer",
                "start_ms": 822180,
                "end_ms": 826920,
                "language": "en-US",
                "asr_model": "nvidia/nemotron-3.5-asr-streaming-0.6b",
                "aligner_model": "Qwen/Qwen3-ForcedAligner-0.6B",
                "clip_uri": "mixpeek://clips/call_481/822180-826920"
            }
        }
    ]
)

Then query with a compact projection for the agent answer:

# query_embedding is assumed to come from the same embedding model as the stored spans.
results = mx.mvs.search_dense(
    namespace="support-call-memory",
    vector=query_embedding,
    top_k=20,
    filter={
        "language": {"$eq": "en-US"}
    },
    select_fields=[
        "source_uri",
        "text",
        "speaker",
        "start_ms",
        "end_ms",
        "clip_uri"
    ]
)

For a visual evidence namespace, project only the fields the visual agent needs:

results = mx.mvs.search_dense(
    namespace="video-scene-memory",
    vector=query_embedding,
    top_k=10,
    select_fields=[
        "source_uri",
        "keyframe_url",
        "caption",
        "objects",
        "start_ms",
        "end_ms"
    ]
)

The first call answers "what evidence should the model read?" The expansion handle answers "where can the system fetch more if needed?"

Design Checklist

Separate rank fields, cite fields, answer fields, governance fields, and expansion fields.

Return source handles and time/page/box citations for media evidence.

Keep embedding text clean and store timestamps, speakers, and policies as metadata.

Use compact default projections for agent tools.

Expose projection presets rather than asking the model to invent arbitrary field lists.

Add expansion handles for full clips, full pages, full transcripts, and source payloads.

Treat projection as payload shaping, not authorization.

Evaluate prompt tokens per result and citation completeness alongside retrieval quality.

Use debug projections for scoring analysis, not for normal agent answers.

Key Takeaways

1. Payload projection is the retrieval contract between a vector store and an agent.

2. Ranking decides which results matter. Projection decides what evidence the agent sees.

3. Cite fields are mandatory for media agents. Without source handles, timestamps, pages, or boxes, answers cannot be verified.

4. Late materialization keeps large unstructured payloads out of the hot path until the agent actually needs them.

5. Good projection reduces prompt tokens, network cost, and irrelevant context, and task-level tests should confirm that answer quality holds.

6. The best default is a compact evidence envelope with expansion handles.