    Retrieval
    17 min read
    Updated 2026-06-10

    Payload Projection for Agentic Vector Search: Field Selection, Evidence Handles, and Context Budgets

    Learn how field projection keeps agent retrieval fast, citeable, and context-efficient. Covers payload design, late materialization, evidence envelopes, select_fields, and Mixpeek MVS examples for unstructured media search.

    Agent Retrieval
    Vector Search
    Payload Projection
    Context Engineering
    MVS
    Unstructured Data

    Why Payload Size Is an Agent Problem



    Vector search usually gets discussed as a ranking problem: embed the query, find nearest neighbors, rerank, return results. That is only half of the system an agent experiences.

    An agent does not consume a vector score. It consumes the payload attached to each result. That payload might include transcript text, OCR spans, image URLs, keyframe thumbnails, bounding boxes, model confidence, speaker labels, tenant metadata, access policy, and source lineage.

    If the retriever returns too little, the agent cannot answer or cite evidence. If it returns too much, the agent burns context, leaks irrelevant fields into the prompt, increases latency, and makes tool output harder to reason over.

    Payload projection is the retrieval-layer control that decides which fields come back for a query.

    This matters most for unstructured content because each result can be a dense evidence object:

  - A video hit may contain a clip URI, keyframe URI, start time, end time, scene caption, OCR text, transcript span, objects, faces, and safety labels.
  - An audio hit may contain transcript text, speaker labels, timestamps, diarization confidence, language tags, and source handles.
  - A document hit may contain page image, text block, layout boxes, table cells, formulas, and redaction metadata.
  - An image hit may contain embeddings, prompt-generated captions, masks, crops, bounding boxes, and provenance.

    For agents, retrieval is not only "top 10 nearest vectors." It is "top 10 evidence packets that fit the task."

    What Payload Projection Means



    Payload projection is field selection for retrieval results. The query still ranks over the indexed representation, but the response returns only the fields the caller asks for.

    The basic shape:

    {
      "query": "customer asked for a refund after an outage",
      "top_k": 10,
      "select_fields": [
        "source_uri",
        "text",
        "speaker",
        "start_ms",
        "end_ms"
      ]
    }
    


    The ranking engine can still use vectors, sparse terms, filters, and metadata. Projection controls the output payload.
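
    As a minimal sketch of what projection does to a single hit, the snippet below keeps only the requested keys from a stored payload. The `project` helper, the `dense_vector` and `asr_model` fields, and all values are assumptions for illustration, not part of any particular vector store's API.

```python
from typing import Any


def project(payload: dict[str, Any], select_fields: list[str]) -> dict[str, Any]:
    """Keep only the requested fields, in request order; absent fields are skipped."""
    return {name: payload[name] for name in select_fields if name in payload}


# Hypothetical stored payload for one transcript span.
stored_payload = {
    "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
    "text": "I would like a refund because the outage affected our launch",
    "speaker": "customer",
    "start_ms": 822180,
    "end_ms": 826920,
    "dense_vector": [0.12, -0.03, 0.88],  # used for ranking, not requested below
    "asr_model": "example-asr-model",     # lineage, not requested below
}

hit = project(stored_payload, ["source_uri", "text", "speaker", "start_ms", "end_ms"])
print(list(hit))
```

    The ranking fields stay in the store; the caller sees only the five fields it asked for.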

    A useful way to separate the concerns:

    | Concern | Question it answers | Example |
    | --- | --- | --- |
    | Ranking | Which items are most relevant? | Vector score, BM25 score, reranker score |
    | Filtering | Which items are allowed? | customer_id, date, content_type, policy label |
    | Projection | Which fields should come back? | source_uri, span text, timestamp, thumbnail |
    | Expansion | What should be fetched after selection? | full transcript window, full image, full PDF page |

    Projection is not a replacement for ranking. It is the contract between the retriever and the agent.

    Projection vs. Filtering vs. Reranking



    These operations are often confused because they all appear near the query.

    Filtering changes the candidate set. If an agent asks for "refund calls from enterprise customers last week," filters should restrict the search to enterprise accounts and the date window before ranking.

    Reranking changes the order. A first-stage vector search may retrieve 200 candidate spans; a cross-encoder or late-interaction model then rescores the top candidates to improve precision.

    Projection changes the returned fields. The retriever may rank over hidden internal fields and still return only a compact evidence envelope.

    For example, a video search system may rank over:

  - CLIP or SigLIP frame embeddings.
  - ASR transcript embeddings.
  - OCR sparse terms.
  - Object labels.
  - Scene captions.
  - User metadata.

    But the agent may only need:

  - source_uri
  - start_ms
  - end_ms
  - short_caption
  - transcript_excerpt
  - keyframe_url

    That output is smaller, cleaner, and easier to cite.
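
    To make the three operations concrete, here is a toy in-memory sketch in which each step touches a different thing: the candidate set, the order, and the returned fields. The data, the `tier` field, and the keyword check standing in for a cross-encoder are all assumptions for illustration.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Toy in-memory index; every value here is made up for illustration.
SPANS = [
    {"id": "s1", "vector": [1.0, 0.0], "tier": "enterprise",
     "text": "we want a refund after the outage", "source_uri": "s3://calls/1.wav"},
    {"id": "s2", "vector": [0.9, 0.1], "tier": "free",
     "text": "refund please", "source_uri": "s3://calls/2.wav"},
    {"id": "s3", "vector": [0.8, 0.2], "tier": "enterprise",
     "text": "the dashboard was slow", "source_uri": "s3://calls/3.wav"},
    {"id": "s4", "vector": [0.7, 0.3], "tier": "enterprise",
     "text": "refund for the downtime", "source_uri": "s3://calls/4.wav"},
]


def search(query_vector, tier, top_k, select_fields):
    # Filtering: shrink the candidate set before any scoring happens.
    candidates = [s for s in SPANS if s["tier"] == tier]
    # Ranking: order candidates by vector similarity.
    ranked = sorted(candidates, key=lambda s: dot(query_vector, s["vector"]), reverse=True)
    # Reranking: reorder with a second signal. A keyword check stands in for a
    # cross-encoder; sorted() is stable, so ties keep their vector-score order.
    reranked = sorted(ranked, key=lambda s: "refund" in s["text"], reverse=True)
    # Projection: return only the requested fields for the survivors.
    return [{f: s[f] for f in select_fields if f in s} for s in reranked[:top_k]]


results = search([1.0, 0.0], tier="enterprise", top_k=2, select_fields=["id", "source_uri"])
print(results)
```

    The vectors and the `tier` field drive filtering and ranking but never appear in the output, which contains only `id` and `source_uri`.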

    The Evidence Envelope Pattern



    Agents work best when retrievers return structured evidence envelopes instead of raw database rows.

    {
      "id": "call_481:822180:826920",
      "score": 0.84,
      "evidence": {
        "text": "I would like a refund because the outage affected our launch",
        "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
        "start_ms": 822180,
        "end_ms": 826920,
        "speaker": "customer"
      },
      "expand": {
        "clip_uri": "mixpeek://clips/call_481/822180-826920",
        "nearby_context_uri": "mixpeek://spans/call_481/819000-830000"
      }
    }
    


    The envelope separates immediate evidence from expansion handles.

    Immediate evidence is what the agent needs to answer now. Expansion handles let the agent fetch more if needed. This is the same idea behind good tool design: return enough structured output to act, but do not dump the entire object graph into the model context.

    The envelope should usually contain five field classes.

    | Field class | Purpose | Examples |
    | --- | --- | --- |
    | Identity | Let the system deduplicate and trace results | id, namespace, object_id, span_id |
    | Citation | Let a human verify the answer | source_uri, page, start_ms, end_ms, keyframe_url |
    | Evidence | Let the model answer | text, caption, ocr_excerpt, object_label |
    | Confidence | Let the model handle uncertainty | score, model_confidence, speaker_overlap |
    | Expansion | Let the agent fetch more | clip_uri, page_image_uri, full_payload_uri |

    Do not confuse evidence with expansion. The evidence field should be compact. The expansion field should point to richer material.
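
    One way to assemble such an envelope from a raw stored row is sketched below. The `to_envelope` helper and the shape of `raw_row` (including its `object_id`, `dense_vector`, and `full_transcript` keys) are assumptions for illustration; the output shape follows the example envelope above.

```python
from typing import Any


def to_envelope(row: dict[str, Any], score: float) -> dict[str, Any]:
    """Split a raw stored row into immediate evidence and expansion handles."""
    return {
        # Identity: stable ID built from the object and span boundaries.
        "id": f'{row["object_id"]}:{row["start_ms"]}:{row["end_ms"]}',
        # Confidence: the retrieval score, rounded for readability.
        "score": round(score, 2),
        # Evidence and citation: compact fields the model reads and cites.
        "evidence": {
            "text": row["text"],
            "source_uri": row["source_uri"],
            "start_ms": row["start_ms"],
            "end_ms": row["end_ms"],
            "speaker": row["speaker"],
        },
        # Expansion: handles only, never the large payload itself.
        "expand": {"clip_uri": row["clip_uri"]},
    }


# Hypothetical raw row, including large fields that must not reach the model.
raw_row = {
    "object_id": "call_481",
    "text": "I would like a refund because the outage affected our launch",
    "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
    "start_ms": 822180,
    "end_ms": 826920,
    "speaker": "customer",
    "clip_uri": "mixpeek://clips/call_481/822180-826920",
    "dense_vector": [0.1] * 8,
    "full_transcript": "long transcript text " * 100,
}

envelope = to_envelope(raw_row, 0.8432)
print(envelope["id"], envelope["score"])
```

    Because the envelope is built from an explicit list of keys, adding a new large field to the stored row does not silently enlarge what the agent receives.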

    Field Classes for Multimodal Retrieval



    A practical schema separates fields by how often they should appear in agent responses.

    1. Rank Fields



    Rank fields are used by the retriever but usually not returned.

    Examples:

  - dense_vector
  - sparse_vector
  - late_interaction_tokens
  - normalized_text
  - model-specific feature blobs
  - internal quality priors

    These fields can be large and meaningless to an LLM. They should stay inside the retrieval engine unless the caller is debugging.

    2. Cite Fields



    Cite fields are small and almost always useful.

    Examples:

  - source_uri
  - object_id
  - start_ms
  - end_ms
  - page_number
  - bounding_box
  - keyframe_url
  - model_id
  - extractor_version

    For agents that search media, cite fields are not optional. They turn a generated answer into inspectable evidence.

    3. Answer Fields



    Answer fields are compact natural-language fields the model can reason over.

    Examples:

  - transcript_excerpt
  - scene_caption
  - ocr_excerpt
  - object_summary
  - table_caption
  - alt_text

    These fields should be clean text. Avoid embedding JSON blobs, timestamps, and unrelated metadata in the text sent to the model.

    4. Governance Fields



    Governance fields tell the agent whether it is allowed to use or reveal the evidence.

    Examples:

  - tenant_id
  - acl_label
  - pii_level
  - retention_class
  - region
  - legal_hold

    Some governance fields should be used for filtering but not returned to the model. Others should be returned so the tool caller can enforce policy outside the model.

    5. Expansion Fields



    Expansion fields are handles, not full data dumps.

    Examples:

  - clip_uri
  - full_transcript_uri
  - page_image_uri
  - crop_uri
  - payload_uri

    The first retrieval call should return handles. A second tool call can fetch the larger payload only when the agent needs it.
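
    The five classes can be encoded as a small registry that a retrieval tool checks before honoring a field request. This is a sketch under assumptions: the registry holds only a subset of the example fields, and the policy of withholding rank and governance fields by default is one possible choice, not a rule stated by any library.

```python
# Subset of the example fields above, grouped by class.
FIELD_CLASSES: dict[str, set[str]] = {
    "rank": {"dense_vector", "sparse_vector", "late_interaction_tokens"},
    "cite": {"source_uri", "object_id", "start_ms", "end_ms", "page_number"},
    "answer": {"transcript_excerpt", "scene_caption", "ocr_excerpt"},
    "governance": {"tenant_id", "acl_label", "pii_level"},
    "expansion": {"clip_uri", "full_transcript_uri", "page_image_uri"},
}

# Assumed policy: only these classes reach the agent unless debugging.
DEFAULT_RETURNABLE = {"cite", "answer", "expansion"}


def split_selection(select_fields: list[str], debug: bool = False) -> tuple[list[str], list[str]]:
    """Return (allowed, rejected) field names for a requested projection."""
    classes = set(DEFAULT_RETURNABLE)
    if debug:
        classes.add("rank")  # rank fields are exposed only to debugging callers
    allowed_names = set().union(*(FIELD_CLASSES[c] for c in classes))
    allowed = [f for f in select_fields if f in allowed_names]
    rejected = [f for f in select_fields if f not in allowed_names]
    return allowed, rejected


allowed, rejected = split_selection(
    ["source_uri", "transcript_excerpt", "dense_vector", "tenant_id", "clip_uri"]
)
print(allowed, rejected)
```

    Unknown field names land in `rejected` as well, so a typo in a request fails closed instead of returning something unexpected.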

    Late Materialization



    Late materialization is the database pattern behind efficient projection.

    In a naive system, the search engine loads full payloads for every candidate, ranks them, and then returns a subset. For unstructured data, those payloads can be large: thumbnails, transcripts, OCR blocks, JSON metadata, and nested feature objects.

    Late materialization delays full payload fetch until after ranking.

    query
      -> search index returns candidate IDs and scores
      -> reranker narrows candidates
      -> projection fetches selected fields only
      -> response returns compact evidence envelopes
    


    This has three benefits:

    1. Less data moves across the retrieval path.
    2. Large fields are fetched only when they are actually needed.
    3. Agent context receives a predictable payload shape.

    Late materialization is especially valuable when the vector index sits near object storage. You can keep large source objects and payload blobs in cheap storage, while the hot query path returns only the fields required by the current tool call.
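
    The flow can be sketched with two in-memory dictionaries: one standing in for the vector index (IDs and vectors only) and one for the payload store. All names and data are hypothetical, and a real payload store would push the field selection down to storage rather than loading a full record first.

```python
# Hot path: IDs and vectors only. All data here is made up for illustration.
VECTOR_INDEX = {
    "span_1": [1.0, 0.0],
    "span_2": [0.6, 0.4],
    "span_3": [0.1, 0.9],
}

# Cold path: full payloads, standing in for a payload or object store.
PAYLOAD_STORE = {
    "span_1": {"text": "refund after the outage", "source_uri": "s3://calls/1.wav",
               "ocr_blocks": ["block"] * 50},
    "span_2": {"text": "billing question", "source_uri": "s3://calls/2.wav",
               "ocr_blocks": ["block"] * 50},
    "span_3": {"text": "password reset", "source_uri": "s3://calls/3.wav",
               "ocr_blocks": ["block"] * 50},
}

fetch_log: list[str] = []  # records which payloads were actually loaded


def search_ids(query_vector: list[float], top_k: int) -> list[tuple[str, float]]:
    """Rank using the index alone; no payload is touched here."""
    scored = [
        (sum(q * v for q, v in zip(query_vector, vec)), span_id)
        for span_id, vec in VECTOR_INDEX.items()
    ]
    scored.sort(reverse=True)
    return [(span_id, score) for score, span_id in scored[:top_k]]


def materialize(hits: list[tuple[str, float]], select_fields: list[str]) -> list[dict]:
    """Fetch payloads only for surviving hits, keeping only the selected fields."""
    results = []
    for span_id, score in hits:
        fetch_log.append(span_id)
        payload = PAYLOAD_STORE[span_id]
        projected = {f: payload[f] for f in select_fields if f in payload}
        results.append({"id": span_id, "score": score, **projected})
    return results


hits = search_ids([1.0, 0.0], top_k=2)
results = materialize(hits, ["text", "source_uri"])
print(fetch_log)
```

    Only the two surviving spans appear in `fetch_log`; the third payload is never read, and `ocr_blocks` never leaves the store.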

    Context Budget Math



    Projection is often a bigger context win than another prompt rewrite.

    Assume a support-call search returns 20 transcript spans. Each full payload has:

  - 250 tokens of transcript context.
  - 80 tokens of metadata.
  - 50 tokens of model lineage.
  - 40 tokens of policy and sync metadata.
  - 30 tokens of URLs and IDs.

    That is about 450 tokens per result, or 9,000 tokens for 20 results.

    If the agent only needs text, speaker, timestamp, source URI, and score, each result might be 90 tokens. The same 20 results become about 1,800 tokens.

    That difference changes the retrieval plan:

  - You can retrieve more candidates for recall.
  - You can fit more sources into the model context.
  - You can keep citations without carrying irrelevant metadata.
  - You can reduce tool latency and network cost.

    Context engineering is not only prompt design. It starts at the retrieval payload.
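
    The arithmetic from the support-call example can be written out directly. The token counts are the illustrative figures used above, not measurements, and the variable names are chosen for this sketch.

```python
# Illustrative per-result token counts from the support-call example.
FULL_PAYLOAD_TOKENS = {
    "transcript_context": 250,
    "metadata": 80,
    "model_lineage": 50,
    "policy_and_sync": 40,
    "urls_and_ids": 30,
}
COMPACT_TOKENS_PER_RESULT = 90
TOP_K = 20

full_per_result = sum(FULL_PAYLOAD_TOKENS.values())   # 450 tokens
full_total = full_per_result * TOP_K                  # 9,000 tokens
compact_total = COMPACT_TOKENS_PER_RESULT * TOP_K     # 1,800 tokens
saved = full_total - compact_total                    # 7,200 tokens
saved_fraction = saved / full_total                   # 0.8

# How many compact results fit in the budget that 20 full payloads used.
compact_results_in_full_budget = full_total // COMPACT_TOKENS_PER_RESULT  # 100

print(full_total, compact_total, saved_fraction, compact_results_in_full_budget)
```

    Under these figures, projection cuts the retrieval payload by 80 percent, or equivalently lets about 100 compact results fit where 20 full payloads did.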

    Query-Time Projection for Agents



    Agents should request fields based on the task.

    A question-answering task needs compact answer fields:

    {
      "select_fields": [
        "source_uri",
        "text",
        "start_ms",
        "end_ms",
        "speaker",
        "score"
      ]
    }
    


    A visual inspection task needs media handles:

    {
      "select_fields": [
        "source_uri",
        "keyframe_url",
        "caption",
        "objects",
        "timestamp_ms",
        "score"
      ]
    }
    


    A compliance task needs governance and provenance:

    {
      "select_fields": [
        "source_uri",
        "policy_label",
        "evidence_text",
        "model_id",
        "extractor_version",
        "confidence",
        "review_uri"
      ]
    }
    


    A debugging task may need internal fields:

    {
      "select_fields": [
        "id",
        "score",
        "vector_score",
        "bm25_score",
        "rerank_score",
        "payload_size_bytes"
      ]
    }
    


    The agent should not use one universal payload shape for every query. Different tools can expose different safe projections.

    Tool Design Pattern



    A retrieval tool can make projection explicit in its schema.

    {
      "name": "search_media_evidence",
      "description": "Search indexed video, audio, image, and document evidence. Returns compact citeable spans and expansion handles.",
      "input_schema": {
        "type": "object",
        "properties": {
          "query": {"type": "string"},
          "content_type": {"type": "string", "enum": ["video", "audio", "image", "document", "any"]},
          "top_k": {"type": "integer", "minimum": 1, "maximum": 50},
          "projection": {
            "type": "string",
            "enum": ["answer", "visual", "compliance", "debug"]
          }
        },
        "required": ["query"]
      }
    }
    


    The tool can map projection presets to field lists.

    | Preset | Fields |
    | --- | --- |
    | answer | text, source_uri, start_ms, end_ms, score |
    | visual | keyframe_url, caption, objects, timestamp_ms, score |
    | compliance | policy_label, evidence_text, model_id, confidence, review_uri |
    | debug | score components, payload size, index partition, model version |

    This makes the model choose intent, while the application controls the exact fields.
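
    On the application side, the preset mapping can be a plain lookup. In this sketch the `debug` field names are borrowed from the debugging projection shown earlier, and falling back to `answer` when `projection` is omitted is an assumption (the schema above marks only `query` as required).

```python
PROJECTION_PRESETS: dict[str, list[str]] = {
    "answer": ["text", "source_uri", "start_ms", "end_ms", "score"],
    "visual": ["keyframe_url", "caption", "objects", "timestamp_ms", "score"],
    "compliance": ["policy_label", "evidence_text", "model_id", "confidence", "review_uri"],
    "debug": ["id", "score", "vector_score", "bm25_score", "rerank_score", "payload_size_bytes"],
}
DEFAULT_PRESET = "answer"  # assumed default when the model omits "projection"


def resolve_select_fields(tool_input: dict) -> list[str]:
    """Translate the model-chosen preset into an application-controlled field list."""
    preset = tool_input.get("projection", DEFAULT_PRESET)
    if preset not in PROJECTION_PRESETS:
        raise ValueError(f"unknown projection preset: {preset!r}")
    # Return a copy so callers cannot mutate the shared preset definition.
    return list(PROJECTION_PRESETS[preset])


fields = resolve_select_fields({"query": "refund after outage", "projection": "visual"})
print(fields)
```

    The model never supplies field names, so a hallucinated or sensitive field cannot be requested through this tool.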

    Failure Modes



    Returning full payloads by default. This makes prototyping easy and production agents noisy. Default to compact evidence.

    Dropping citation fields. If the projected result omits source URI, timestamp, page, or bounding box, the agent cannot produce verifiable answers.

    Embedding metadata into answer text. Text like "[00:13:42] speaker=customer policy=internal" pollutes embeddings and model context. Keep clean answer text and structured metadata separate.

    Using projection as authorization. Projection can hide fields from a response, but it is not access control. Authorization must happen before retrieval and before expansion.

    Returning fields the agent cannot interpret. Raw vectors, model logits, and large nested feature blobs are useful for debugging but poor answer context.

    No expansion path. If the first result is compact but there is no way to fetch the source clip, page image, or full transcript, the agent gets stuck.

    One projection for every tool. Search, compliance review, visual QA, and debugging need different payload shapes.
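
    The authorization point is the easiest to get wrong, so here is a small sketch of the ordering: access is enforced on the candidate set first, and projection only shapes rows the caller was already allowed to see. The rows, the `tenant_id` values, and the `search_for_tenant` helper are hypothetical.

```python
# Hypothetical rows belonging to two tenants.
ROWS = [
    {"id": "a1", "tenant_id": "acme", "text": "refund after outage", "source_uri": "s3://acme/1.wav"},
    {"id": "g1", "tenant_id": "globex", "text": "refund for downtime", "source_uri": "s3://globex/1.wav"},
    {"id": "a2", "tenant_id": "acme", "text": "invoice question", "source_uri": "s3://acme/2.wav"},
]


def search_for_tenant(caller_tenant: str, select_fields: list[str]) -> list[dict]:
    # Authorization: rows from other tenants never become candidates.
    visible = [row for row in ROWS if row["tenant_id"] == caller_tenant]
    # Projection: shape the payload of rows the caller may already see.
    return [{f: row[f] for f in select_fields if f in row} for row in visible]


results = search_for_tenant("acme", ["id", "text", "source_uri"])
print([r["id"] for r in results])
```

    Dropping `tenant_id` from the response hides the field, but it is the filter on `visible` that keeps the other tenant's rows out.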

    Evaluation



    Evaluate projection separately from ranking.

    Ranking asks: did the retriever find the right evidence?

    Projection asks: did the response include the right fields, and only the right fields, for the task?

    Useful metrics:

    | Metric | What it measures |
    | --- | --- |
    | Citation completeness | Percentage of results with source handle and time/page/box when needed |
    | Payload bytes per result | Network and serialization cost |
    | Prompt tokens per result | Context budget cost |
    | Expansion rate | How often agents need a second fetch |
    | Answer success with projection | Whether compact fields still let the model answer correctly |
    | Leakage rate | Whether irrelevant or policy-sensitive fields are returned |
    | Debug sufficiency | Whether debugging projections expose enough scoring information |

    For agent tools, add task-level tests:

    1. Ask the agent a question requiring media evidence.
    2. Require citations in the final answer.
    3. Check that every cited answer maps to a returned source handle.
    4. Check that compact projection succeeds without full payloads.
    5. Run the same task with larger projections and compare answer quality, latency, and token use.

    The goal is not the smallest possible payload. The goal is the smallest payload that lets the agent answer and cite correctly.
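
    Three of these metrics can be computed from returned results alone. The sketch below assumes a result counts as citation-complete when it has a `source_uri` plus a time span, page number, or bounding box; that rule, the helper names, and the sample results are assumptions for illustration.

```python
import json


def has_citation(result: dict) -> bool:
    """A source handle plus at least one locator (time span, page, or box)."""
    has_locator = (
        ("start_ms" in result and "end_ms" in result)
        or "page_number" in result
        or "bounding_box" in result
    )
    return "source_uri" in result and has_locator


def citation_completeness(results: list[dict]) -> float:
    return sum(has_citation(r) for r in results) / len(results) if results else 0.0


def leakage_rate(results: list[dict], allowed_fields: set[str]) -> float:
    """Share of results carrying at least one field outside the allowed set."""
    if not results:
        return 0.0
    return sum(bool(set(r) - allowed_fields) for r in results) / len(results)


def payload_bytes(result: dict) -> int:
    return len(json.dumps(result).encode("utf-8"))


# Hypothetical results returned by an "answer" projection.
sample = [
    {"source_uri": "s3://calls/1.wav", "text": "refund request", "start_ms": 100, "end_ms": 900},
    {"source_uri": "s3://docs/1.pdf", "text": "refund policy", "page_number": 4},
    {"text": "orphaned excerpt with no source"},
    {"source_uri": "s3://calls/2.wav", "text": "outage", "start_ms": 5, "end_ms": 50, "tenant_id": "acme"},
]
allowed = {"source_uri", "text", "start_ms", "end_ms", "page_number"}

print(citation_completeness(sample), leakage_rate(sample, allowed))
```

    Tracking `payload_bytes` per result alongside these two rates makes it visible when a schema change quietly enlarges or degrades the projection.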

    Mixpeek MVS Example



    In MVS, store clean searchable fields and structured citation metadata.

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_API_KEY")

    mx.mvs.upsert(
        namespace="support-call-memory",
        vectors=[
            {
                "id": "call_481:822180:826920:bge_m3",
                "values": span_embedding,
                "metadata": {
                    "source_uri": "s3://support-calls/2026/06/09/call_481.wav",
                    "text": "I would like a refund because the outage affected our launch",
                    "speaker": "customer",
                    "start_ms": 822180,
                    "end_ms": 826920,
                    "language": "en-US",
                    "asr_model": "nvidia/nemotron-3.5-asr-streaming-0.6b",
                    "aligner_model": "Qwen/Qwen3-ForcedAligner-0.6B",
                    "clip_uri": "mixpeek://clips/call_481/822180-826920"
                }
            }
        ]
    )


    Then query with a compact projection for the agent answer:

    results = mx.mvs.search_dense(
        namespace="support-call-memory",
        vector=query_embedding,
        top_k=20,
        filter={
            "language": {"$eq": "en-US"}
        },
        select_fields=[
            "source_uri",
            "text",
            "speaker",
            "start_ms",
            "end_ms",
            "clip_uri"
        ]
    )
    


    For a visual evidence namespace, project only the fields the visual agent needs:

    results = mx.mvs.search_dense(
        namespace="video-scene-memory",
        vector=query_embedding,
        top_k=10,
        select_fields=[
            "source_uri",
            "keyframe_url",
            "caption",
            "objects",
            "start_ms",
            "end_ms"
        ]
    )
    


    The first call answers "what evidence should the model read?" The expansion handle answers "where can the system fetch more if needed?"

    Design Checklist



  - Separate rank fields, cite fields, answer fields, governance fields, and expansion fields.
  - Return source handles and time/page/box citations for media evidence.
  - Keep embedding text clean and store timestamps, speakers, and policies as metadata.
  - Use compact default projections for agent tools.
  - Expose projection presets rather than asking the model to invent arbitrary field lists.
  - Add expansion handles for full clips, full pages, full transcripts, and source payloads.
  - Treat projection as payload shaping, not authorization.
  - Evaluate prompt tokens per result and citation completeness alongside retrieval quality.
  - Use debug projections for scoring analysis, not for normal agent answers.

    Key Takeaways



    1. Payload projection is the retrieval contract between a vector store and an agent.

    2. Ranking decides which results matter. Projection decides what evidence the agent sees.

    3. Cite fields are mandatory for media agents. Without source handles, timestamps, pages, or boxes, answers cannot be verified.

    4. Late materialization keeps large unstructured payloads out of the hot path until the agent actually needs them.

    5. Good projection reduces prompt tokens, network cost, and irrelevant context without reducing answer quality.

    6. The best default is a compact evidence envelope with expansion handles.
