    Agent Perception
    18 min read
    Updated 2026-06-08

    Mask-Aware Retrieval for AI Agents: Segment First, Search Crops, Then Reason

    Learn how segmentation masks, foreground crops, region embeddings, and spatial metadata make visual search more precise for agents that need to inspect images, video frames, screenshots, and product media.


    Why Whole-Image Retrieval Fails Agents



    Whole-image embeddings are useful for broad visual similarity. They can find photos with similar style, product category, scene type, or layout. But many agent tasks are not about the whole image. They are about a region:

  - Find the small warning label on the lower-right corner.
  - Search frames where the foreground product is visible without background clutter.
  - Compare only the shoe, not the person wearing it.
  - Find screenshots where the error toast appears on top of the dashboard.
  - Inspect video frames where the actor holds the package, not frames where the package sits in the background.

    A single image vector compresses everything into one point. Large background regions, lighting, camera angle, and unrelated objects can dominate the representation. The agent may retrieve an image that feels visually similar while missing the actual evidence region.

    Mask-aware retrieval fixes the unit of search. Instead of indexing only the whole image, the system indexes foreground objects, regions, crops, masks, and spatial metadata. The agent can then search what matters, cite where it appears, and hand a focused crop to a downstream vision-language model.

    That is the core test for agent perception: the system must help an AI agent see and search unstructured visual content.

    The Basic Pattern



    Mask-aware retrieval has five stages:

    1. Generate candidate regions. Use segmentation, detection, OCR, or saliency models to find regions that may matter.
    2. Store the mask and crop. Preserve pixel mask, bounding box, foreground ratio, crop URI, and source lineage.
    3. Embed multiple views. Embed the whole image, the masked foreground, and the crop when each view may answer a different query.
    4. Search in stages. Use metadata filters, whole-image recall, crop search, and reranking instead of one nearest-neighbor call.
    5. Return inspectable evidence. Give the agent source URI, region coordinates, mask ID, crop URI, model version, and confidence.

    The important shift is from file-level search to evidence-region search.

    What a Mask Represents



    A segmentation mask is usually a binary or probabilistic image aligned to the source image. Each pixel says whether it belongs to a region.

    The mask can be represented several ways:

    Representation       What it stores                       Best use
    Binary bitmap        One bit or byte per pixel            Precise local processing
    Probability map      Float confidence per pixel           Thresholding and uncertainty
    Run-length encoding  Compact spans of foreground pixels   Storage and transport
    Polygon              Boundary vertices                    UI overlays and approximate geometry
    Bounding box         x, y, width, height                  Fast filters and crop creation

    Agents usually do not need the full mask in context. They need handles: a region ID, a crop URI, a box, a confidence score, and a way to request the mask when deeper inspection is needed.
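    To make the representations above concrete, here is a minimal pure-Python sketch that derives three of them from a binary bitmap: a run-length encoding, a normalized bounding box, and a foreground ratio. The function names are illustrative, the run format (start, length over the row-major flattened mask) is an assumption rather than any particular library's format, and foreground ratio is assumed to mean mask pixels divided by image pixels.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]  # rows of 0/1 pixels, row-major


def rle_encode(mask: Mask) -> List[Tuple[int, int]]:
    """Return (start, length) runs of foreground pixels over the flattened mask."""
    flat = [pixel for row in mask for pixel in row]
    runs: List[Tuple[int, int]] = []
    start = None
    for index, pixel in enumerate(flat):
        if pixel and start is None:
            start = index  # a foreground run begins here
        elif not pixel and start is not None:
            runs.append((start, index - start))  # run ended on the previous pixel
            start = None
    if start is not None:  # run reaches the last pixel
        runs.append((start, len(flat) - start))
    return runs


def rle_decode(runs: List[Tuple[int, int]], height: int, width: int) -> Mask:
    """Rebuild the binary bitmap from (start, length) runs."""
    flat = [0] * (height * width)
    for start, length in runs:
        flat[start:start + length] = [1] * length
    return [flat[r * width:(r + 1) * width] for r in range(height)]


def mask_to_box(mask: Mask) -> Dict[str, float]:
    """Tightest box around the foreground, normalized to the 0..1 range."""
    height, width = len(mask), len(mask[0])
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(width) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask has no foreground pixels")
    return {
        "x": cols[0] / width,
        "y": rows[0] / height,
        "w": (cols[-1] + 1 - cols[0]) / width,
        "h": (rows[-1] + 1 - rows[0]) / height,
    }


def foreground_ratio(mask: Mask) -> float:
    """Fraction of image pixels covered by the mask (assumed definition)."""
    return sum(sum(row) for row in mask) / (len(mask) * len(mask[0]))


example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
example_runs = rle_encode(example_mask)
```

    The box and ratio are cheap to store as metadata on every region, while the run-length form is the one worth keeping behind a mask URI for on-demand inspection.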

    Four Ways to Create Regions



    1. Detector First



    Object detectors return boxes and labels. This is the fastest path when the query space is known:

  - product
  - logo
  - face
  - vehicle
  - package
  - screen
  - table

    The detector gives a box. A segmentation model can refine the box into a tighter mask. This two-step pattern works well when you want class labels and clean crops.
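    Turning a detector box into a crop is mostly coordinate arithmetic. The sketch below assumes boxes are normalized x, y, w, h values in the 0..1 range (as in the region records in this article) and converts one to clamped pixel coordinates with optional padding. The function name and the `pad` parameter are illustrative.

```python
from typing import Dict, Tuple


def box_to_pixels(
    box: Dict[str, float],
    image_width: int,
    image_height: int,
    pad: float = 0.0,
) -> Tuple[int, int, int, int]:
    """Convert a normalized box to (left, top, right, bottom) pixel coordinates.

    `pad` grows the box by that fraction of its own width and height on each
    side, then the result is clamped to the image bounds.
    """
    pad_x = box["w"] * pad
    pad_y = box["h"] * pad
    left = max(0.0, box["x"] - pad_x)
    top = max(0.0, box["y"] - pad_y)
    right = min(1.0, box["x"] + box["w"] + pad_x)
    bottom = min(1.0, box["y"] + box["h"] + pad_y)
    # Round to the nearest pixel so tiny floating-point errors do not shift edges.
    return (
        round(left * image_width),
        round(top * image_height),
        round(right * image_width),
        round(bottom * image_height),
    )


tight = box_to_pixels({"x": 0.25, "y": 0.25, "w": 0.5, "h": 0.5}, 200, 100)
padded = box_to_pixels({"x": 0.25, "y": 0.25, "w": 0.5, "h": 0.5}, 200, 100, pad=0.1)
```

    The returned tuple uses the left, upper, right, lower ordering that Pillow's `Image.crop` expects, so it can be passed to an image library directly. A small pad keeps some surrounding context; a large one reintroduces the background the crop was meant to remove.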

    2. Segmentation First



    Prompt-free foreground segmentation returns the most salient object or foreground region without requiring a label. This is useful for product images, thumbnails, screenshots, and creative assets where the foreground object matters more than its class.

    Models like BiRefNet are useful in this layer because they target foreground/background and salient-object masks. The output is not "cat" or "chair." The output is "this is the visible region worth isolating."

    3. Promptable Segmentation



    SAM-style models can segment from points, boxes, or prompts. Use this when an agent or UI already knows the approximate region:

  - a user clicks a product
  - an OCR model finds text and asks for the surrounding panel
  - an object detector proposes a box
  - an agent asks to inspect "the red item on the left"

    Promptable segmentation is strong as a second pass. It is less useful as the only discovery mechanism unless another system proposes prompts.

    4. OCR and Layout Regions



    For screenshots, documents, dashboards, and ads, text regions are often the best masks. OCR supplies boxes for words, lines, and blocks. Layout models supply panels, cards, forms, and tables.

    These are visual regions even though they come from text extraction. A search for "error message near payment button" needs OCR text, spatial layout, and maybe a cropped UI panel.

    The Region Record



    A useful region record is more than a crop. It is a small evidence object.

    {
      "region_id": "img_924:frame_00018:mask_003:birefnet:v1",
      "source_uri": "s3://brand-assets/launch/ad_17.mp4",
      "frame_time_ms": 6000,
      "modality": "image",
      "region_type": "foreground_mask",
      "box": {"x": 0.18, "y": 0.22, "w": 0.41, "h": 0.52},
      "foreground_ratio": 0.31,
      "mask_uri": "s3://derived/ad_17/frame_00018/mask_003.rle",
      "crop_uri": "s3://derived/ad_17/frame_00018/crop_003.webp",
      "model_id": "ZhengPeng7/BiRefNet",
      "extractor_version": "segmentation@v1",
      "confidence": 0.91
    }
    


    This record lets an agent do several things:

  - filter by foreground size
  - retrieve similar crops
  - inspect the crop with a VLM
  - overlay the mask in a UI
  - cite the exact frame and region
  - rerun the region with a newer extractor

    Without this metadata, a crop becomes an orphaned image. It may be searchable, but it is not reliable evidence.
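    As a small illustration of how such a record gets used, the sketch below filters records by foreground size and confidence and reduces each survivor to the compact handle an agent would carry in context. It assumes records are plain dicts shaped like the example above; `to_handle`, `filter_regions`, and the choice of handle fields are hypothetical.

```python
from typing import Any, Dict, List

# Fields an agent is assumed to need in context; everything else stays in storage.
HANDLE_FIELDS = ("region_id", "source_uri", "frame_time_ms", "crop_uri", "box", "confidence")


def to_handle(record: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a full region record to a compact evidence handle."""
    return {field: record[field] for field in HANDLE_FIELDS if field in record}


def filter_regions(
    records: List[Dict[str, Any]],
    min_foreground_ratio: float = 0.0,
    min_confidence: float = 0.0,
) -> List[Dict[str, Any]]:
    """Keep regions that are large and confident enough, returned as handles."""
    return [
        to_handle(record)
        for record in records
        if record.get("foreground_ratio", 0.0) >= min_foreground_ratio
        and record.get("confidence", 0.0) >= min_confidence
    ]


example_record = {
    "region_id": "img_924:frame_00018:mask_003:birefnet:v1",
    "source_uri": "s3://brand-assets/launch/ad_17.mp4",
    "frame_time_ms": 6000,
    "region_type": "foreground_mask",
    "box": {"x": 0.18, "y": 0.22, "w": 0.41, "h": 0.52},
    "foreground_ratio": 0.31,
    "mask_uri": "s3://derived/ad_17/frame_00018/mask_003.rle",
    "crop_uri": "s3://derived/ad_17/frame_00018/crop_003.webp",
    "confidence": 0.91,
}
handles = filter_regions([example_record], min_foreground_ratio=0.2, min_confidence=0.5)
```

    The mask URI is deliberately left out of the handle: the agent can ask for it later through a follow-up tool instead of paying for it on every result.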

    Which Views to Embed



    Mask-aware retrieval usually stores more than one embedding per visual source.

    Whole Image



    Whole-image embeddings preserve context. They are good for broad questions:

  - "outdoor construction scene"
  - "dashboard screenshot"
  - "studio product photo"
  - "crowded retail shelf"

    Whole-image search is a good first-stage recall channel, but it often misses small regions.

    Foreground Crop



    Crop embeddings focus the model on the object or panel. They are good for:

  - product matching
  - visual duplicate detection
  - small object search
  - logo-like shapes
  - UI widget search

    Crops can overfocus. If the agent needs surrounding context, return the crop plus the parent frame.

    Masked Image



    A masked image keeps the original image dimensions but removes or fades background pixels. This preserves some layout and scale while reducing background dominance.
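    A minimal sketch of the masked-image view, assuming the image is a grid of RGB tuples and the mask is a same-sized grid of 0/1 values. The `fade` parameter is illustrative: 0.0 blacks out the background, and values closer to 1.0 only dim it.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def apply_mask(
    image: List[List[Pixel]],
    mask: List[List[int]],
    fade: float = 0.0,
) -> List[List[Pixel]]:
    """Keep foreground pixels unchanged and scale background pixels by `fade`."""
    if len(image) != len(mask) or any(len(i) != len(m) for i, m in zip(image, mask)):
        raise ValueError("image and mask must have the same dimensions")
    masked = []
    for image_row, mask_row in zip(image, mask):
        masked.append([
            pixel if is_foreground else tuple(int(channel * fade) for channel in pixel)
            for pixel, is_foreground in zip(image_row, mask_row)
        ])
    return masked


example_image = [[(200, 100, 50), (200, 100, 50)]]
example_mask = [[1, 0]]
faded = apply_mask(example_image, example_mask, fade=0.5)
removed = apply_mask(example_image, example_mask)
```

    Because the output keeps the source dimensions, the same box coordinates and spatial metadata remain valid for the masked view.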

    Masked images work well when the crop alone loses context:

  - object position matters
  - surrounding text matters
  - scale relative to the image matters
  - the background is distracting but not irrelevant

    Region Caption



    Some regions benefit from a caption or structured label:

  - "red handheld scanner"
  - "subtotal row in invoice table"
  - "blue warning toast"
  - "logo on cardboard box"

    Captions are not replacements for embeddings. They are a sparse or lexical channel that helps exact search, filters, and agent explanations.

    Retrieval Plans



    Do not route every visual query through the same index. Match the retrieval plan to the question.

    Query: "Find similar product photos"



    Plan:

    1. Search crop embeddings from foreground masks.
    2. Filter for foreground_ratio above a threshold.
    3. Rerank by product metadata, color, or category if available.
    4. Return crop and parent image.

    Why: the product matters more than the studio background.
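    The plan above can be sketched as a brute-force search over in-memory records. This is an illustration under assumptions, not a production index: each region is a dict with a hypothetical `crop_embedding` field, the foreground filter is applied before scoring (which gives the same result as filtering afterward in an exhaustive search), and the metadata rerank step is omitted.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_product_crops(
    query_vector: Sequence[float],
    regions: List[Dict[str, Any]],
    min_foreground_ratio: float = 0.2,
    limit: int = 10,
) -> List[Dict[str, Any]]:
    """Rank foreground crops by similarity and return crop plus parent source."""
    candidates = [r for r in regions if r["foreground_ratio"] >= min_foreground_ratio]
    scored = [(cosine(query_vector, r["crop_embedding"]), r) for r in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {
            "region_id": region["region_id"],
            "crop_uri": region["crop_uri"],
            "source_uri": region["source_uri"],  # parent image for context
            "crop_similarity": score,
        }
        for score, region in scored[:limit]
    ]


toy_regions = [
    {"region_id": "a", "crop_uri": "crop_a", "source_uri": "img_a",
     "foreground_ratio": 0.30, "crop_embedding": [1.0, 0.0]},
    {"region_id": "b", "crop_uri": "crop_b", "source_uri": "img_b",
     "foreground_ratio": 0.50, "crop_embedding": [0.0, 1.0]},
    {"region_id": "c", "crop_uri": "crop_c", "source_uri": "img_c",
     "foreground_ratio": 0.05, "crop_embedding": [1.0, 0.1]},  # too small, filtered out
]
results = search_product_crops([1.0, 0.0], toy_regions)
```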

    Query: "Find screenshots with a payment error"



    Plan:

    1. Search OCR text for "payment", "declined", "failed", and related terms.
    2. Search region captions for error panels and toast messages.
    3. Join OCR regions to nearby UI-panel masks.
    4. Return the panel crop and full screenshot.

    Why: exact text and spatial layout matter more than broad visual similarity.
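    The join between OCR regions and UI panels can be sketched as a containment test. The version below assumes normalized x, y, w, h boxes, matches terms by simple substring, and attaches an OCR region to a panel when the center of the text box falls inside the panel box. The field names `text`, `panel_id`, and `crop_uri` are hypothetical, and the caption search from step 2 is left out.

```python
from typing import Any, Dict, List, Sequence

ERROR_TERMS = ("payment", "declined", "failed")


def center_inside(inner: Dict[str, float], outer: Dict[str, float]) -> bool:
    """True when the center of `inner` lies within `outer`."""
    cx = inner["x"] + inner["w"] / 2
    cy = inner["y"] + inner["h"] / 2
    return (
        outer["x"] <= cx <= outer["x"] + outer["w"]
        and outer["y"] <= cy <= outer["y"] + outer["h"]
    )


def find_error_panels(
    ocr_regions: List[Dict[str, Any]],
    panel_regions: List[Dict[str, Any]],
    terms: Sequence[str] = ERROR_TERMS,
) -> List[Dict[str, Any]]:
    """Join OCR hits to the panel that contains them on the same screenshot."""
    hits = []
    for ocr in ocr_regions:
        matched = [term for term in terms if term in ocr["text"].lower()]
        if not matched:
            continue
        for panel in panel_regions:
            same_source = panel["source_uri"] == ocr["source_uri"]
            if same_source and center_inside(ocr["box"], panel["box"]):
                hits.append({
                    "panel_id": panel["panel_id"],
                    "panel_crop_uri": panel["crop_uri"],
                    "source_uri": panel["source_uri"],  # full screenshot
                    "matched_terms": matched,
                })
    return hits


ocr = [
    {"source_uri": "shot_1.png", "text": "Payment failed. Try another card.",
     "box": {"x": 0.62, "y": 0.12, "w": 0.20, "h": 0.04}},
    {"source_uri": "shot_1.png", "text": "Dashboard overview",
     "box": {"x": 0.05, "y": 0.05, "w": 0.20, "h": 0.04}},
]
panels = [
    {"source_uri": "shot_1.png", "panel_id": "toast", "crop_uri": "toast.webp",
     "box": {"x": 0.60, "y": 0.10, "w": 0.30, "h": 0.10}},
    {"source_uri": "shot_1.png", "panel_id": "sidebar", "crop_uri": "sidebar.webp",
     "box": {"x": 0.00, "y": 0.00, "w": 0.20, "h": 1.00}},
]
error_panels = find_error_panels(ocr, panels)
```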

    Query: "Find frames where the package is visible during the spoken CTA"



    Plan:

    1. Search transcript spans for CTA language.
    2. Search foreground crops for package-like regions.
    3. Join regions and transcript spans by overlapping timestamps.
    4. Rerank clips where the crop occupies enough area and persists across frames.

    Why: the query is temporal and cross-modal. A visual match alone is insufficient.
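    The timestamp join in step 3 can be sketched as an interval test. The version below assumes each region carries a `frame_time_ms` and each transcript span carries hypothetical `start_ms`, `end_ms`, and `span_id` fields; a region pairs with a span from the same source when its frame time falls inside the span widened by a tolerance window. The persistence rerank from step 4 is not shown.

```python
from typing import Any, Dict, List, Tuple


def join_by_time(
    regions: List[Dict[str, Any]],
    spans: List[Dict[str, Any]],
    window_ms: int = 2000,
) -> List[Tuple[str, str]]:
    """Pair region IDs with transcript span IDs that overlap in time."""
    pairs = []
    for region in regions:
        frame_time = region["frame_time_ms"]
        for span in spans:
            if span["source_uri"] != region["source_uri"]:
                continue  # never join evidence across different videos
            if span["start_ms"] - window_ms <= frame_time <= span["end_ms"] + window_ms:
                pairs.append((region["region_id"], span["span_id"]))
    return pairs


package_regions = [
    {"region_id": "r1", "source_uri": "ad_17.mp4", "frame_time_ms": 6000},
    {"region_id": "r2", "source_uri": "ad_17.mp4", "frame_time_ms": 20000},
    {"region_id": "r3", "source_uri": "ad_17.mp4", "frame_time_ms": 8500},
    {"region_id": "r4", "source_uri": "ad_18.mp4", "frame_time_ms": 6000},
]
cta_spans = [
    {"span_id": "s1", "source_uri": "ad_17.mp4", "start_ms": 5000, "end_ms": 7000},
]
joined = join_by_time(package_regions, cta_spans)
```

    The nested loop is fine for a few hundred candidates per channel. For larger candidate sets, sorting both sides by time and sweeping once would avoid the quadratic comparison.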

    Query: "Show ads where the background distracts from the product"



    Plan:

    1. Compute foreground ratio and background salience.
    2. Search whole-image embeddings for busy scenes.
    3. Compare whole-image similarity against crop similarity.
    4. Flag images where the crop matches the product query but the whole image retrieves unrelated background concepts.

    Why: this is a quality-control query about foreground/background separation.

    Scoring and Fusion



    Mask-aware retrieval creates several scores:

  - whole-image vector similarity
  - crop vector similarity
  - caption or OCR score
  - detector confidence
  - mask confidence
  - foreground ratio
  - spatial relationship
  - temporal overlap for video

    Do not treat these scores as directly comparable. A cosine score from an image embedding is not the same thing as OCR BM25, detector confidence, or temporal overlap.

    Use a fusion strategy:

    Strategy                Use when
    Hard filters            The condition must be true, such as foreground_ratio >= 0.2
    Reciprocal rank fusion  Several retrieval channels produce ranked lists
    Weighted fusion         Query classes are known and weights can be validated
    Reranking               A smaller candidate set needs visual or multimodal inspection
    Rules plus retrieval    Policy requires exact constraints before semantic search

    For agents, return the component scores. The agent should know whether a result matched because the crop was similar, the OCR text matched, or the whole frame looked similar.
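    Reciprocal rank fusion is a good default because it only needs ranks, which sidesteps the score-comparability problem. A minimal sketch follows: each channel contributes 1 / (k + rank) per region, and the per-channel ranks are kept so the agent can see why a result matched. The constant k = 60 is the commonly used default; the function name and output shape are illustrative.

```python
from collections import defaultdict
from typing import Any, Dict, List


def reciprocal_rank_fusion(
    channels: Dict[str, List[str]],
    k: int = 60,
) -> List[Dict[str, Any]]:
    """Fuse ranked lists of region IDs; keep per-channel ranks for explanation."""
    fused: Dict[str, Dict[str, Any]] = defaultdict(lambda: {"score": 0.0, "ranks": {}})
    for channel, ranked_ids in channels.items():
        for rank, region_id in enumerate(ranked_ids, start=1):
            fused[region_id]["score"] += 1.0 / (k + rank)
            fused[region_id]["ranks"][channel] = rank
    results = [{"region_id": rid, **info} for rid, info in fused.items()]
    results.sort(key=lambda result: result["score"], reverse=True)
    return results


fused_results = reciprocal_rank_fusion({
    "crop_embedding": ["a", "b", "c"],
    "region_caption": ["b", "d"],
})
```

    In this toy input, region "b" wins because it appears in both channels, even though it is not first in the crop channel. The `ranks` field plays the same role as the component scores described above.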

    Failure Modes



    Background Leakage



    The crop includes too much background, so retrieval still matches on scene instead of object. Tighten the mask, pad the box less, or use a masked image instead of a loose crop.

    Overcropping



    The crop removes context needed for meaning. A product in a hand, a warning label on a machine, or an icon inside a UI panel may need surrounding pixels. Return parent context with every crop.

    Duplicate Regions



    Segmenters and detectors may produce overlapping regions. Use non-maximum suppression, mask IoU, or embedding deduplication to avoid indexing the same object many times.
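    A minimal sketch of box-level deduplication with greedy non-maximum suppression, assuming normalized x, y, w, h boxes and a `confidence` field on every region. It expects the caller to pass regions from a single image or frame. Mask IoU or embedding deduplication would follow the same keep-or-drop loop with a different overlap function.

```python
from typing import Any, Dict, List


def box_iou(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Intersection over union of two x, y, w, h boxes."""
    overlap_w = max(0.0, min(a["x"] + a["w"], b["x"] + b["w"]) - max(a["x"], b["x"]))
    overlap_h = max(0.0, min(a["y"] + a["h"], b["y"] + b["h"]) - max(a["y"], b["y"]))
    intersection = overlap_w * overlap_h
    union = a["w"] * a["h"] + b["w"] * b["h"] - intersection
    return intersection / union if union > 0 else 0.0


def deduplicate_regions(
    regions: List[Dict[str, Any]],
    iou_threshold: float = 0.5,
) -> List[Dict[str, Any]]:
    """Keep the most confident region among heavily overlapping ones."""
    kept: List[Dict[str, Any]] = []
    for region in sorted(regions, key=lambda r: r["confidence"], reverse=True):
        if all(box_iou(region["box"], other["box"]) < iou_threshold for other in kept):
            kept.append(region)
    return kept


frame_regions = [
    {"region_id": "a", "confidence": 0.9, "box": {"x": 0.00, "y": 0.0, "w": 0.5, "h": 0.5}},
    {"region_id": "b", "confidence": 0.8, "box": {"x": 0.05, "y": 0.0, "w": 0.5, "h": 0.5}},
    {"region_id": "c", "confidence": 0.7, "box": {"x": 0.60, "y": 0.6, "w": 0.3, "h": 0.3}},
]
unique_regions = deduplicate_regions(frame_regions)
```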

    Small Object Blindness



    Small objects may be missed by whole-frame keyframes. Increase frame resolution, use detector-first proposals, or index tiled regions.
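    Tiled indexing can be sketched as a grid of overlapping windows. The function below is an illustration with assumed parameters: `tile_size` is the tile edge in pixels and `overlap` is the fraction shared between neighboring tiles, so an object that straddles one tile boundary is more likely to sit fully inside an adjacent tile.

```python
from typing import List, Tuple


def tile_boxes(
    image_width: int,
    image_height: int,
    tile_size: int,
    overlap: float = 0.25,
) -> List[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) pixel tiles that cover the image."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    stride = max(1, int(tile_size * (1 - overlap)))

    def starts(length: int) -> List[int]:
        if length <= tile_size:
            return [0]  # one tile already spans this axis
        positions = list(range(0, length - tile_size, stride))
        positions.append(length - tile_size)  # final tile sits flush with the edge
        return positions

    return [
        (x, y, min(x + tile_size, image_width), min(y + tile_size, image_height))
        for y in starts(image_height)
        for x in starts(image_width)
    ]


tiles = tile_boxes(100, 50, tile_size=50, overlap=0.0)
overlapping_tiles = tile_boxes(100, 50, tile_size=50, overlap=0.5)
```

    Each tile can then be treated as a region of its own, with the tile box stored as lineage so hits map back to coordinates in the parent frame.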

    Mask Confidence Drift



    A mask model trained on clean product photos may perform poorly on CCTV, medical images, or low-light video. Track confidence distributions by source type and evaluate region quality on real data.

    Region Without Lineage



    If a crop loses its source URI, timestamp, model version, or box, it cannot support an agent answer. Treat missing lineage as an ingestion bug.
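    One way to enforce this at ingestion is a lineage check that runs before a region is indexed. The sketch below assumes records are dicts using the field names from the region record shown earlier; `page` is a hypothetical alternative to `frame_time_ms` for document sources, and the exact set of required fields is an assumption to adapt.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = (
    "region_id", "source_uri", "box", "mask_uri",
    "crop_uri", "model_id", "extractor_version",
)
POSITION_FIELDS = ("frame_time_ms", "page")  # at least one must be present


def missing_lineage(record: Dict[str, Any]) -> List[str]:
    """List the lineage fields that are absent or empty in a region record."""
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    # A frame time of 0 is valid, so check for None rather than falsiness.
    if all(record.get(field) is None for field in POSITION_FIELDS):
        missing.append("frame_time_ms or page")
    return missing


def assert_lineage(record: Dict[str, Any]) -> None:
    """Fail ingestion loudly instead of indexing an orphaned crop."""
    missing = missing_lineage(record)
    if missing:
        raise ValueError(f"region record is missing lineage: {', '.join(missing)}")


complete = {
    "region_id": "img_924:frame_00018:mask_003:birefnet:v1",
    "source_uri": "s3://brand-assets/launch/ad_17.mp4",
    "frame_time_ms": 0,
    "box": {"x": 0.18, "y": 0.22, "w": 0.41, "h": 0.52},
    "mask_uri": "s3://derived/ad_17/frame_00018/mask_003.rle",
    "crop_uri": "s3://derived/ad_17/frame_00018/crop_003.webp",
    "model_id": "ZhengPeng7/BiRefNet",
    "extractor_version": "segmentation@v1",
}
orphan = {k: v for k, v in complete.items() if k not in ("crop_uri", "frame_time_ms")}
```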

    Evaluation



    Evaluate mask-aware retrieval at the region level, not only the image level.

    Useful metrics:

  - Region recall@k: did the correct region appear in the top k?
  - Parent recall@k: did the correct source image or frame appear in the top k?
  - Mask IoU: does the predicted mask overlap the human-labeled region?
  - Box IoU: does the returned box localize the evidence?
  - Crop usefulness: can a VLM answer the question from the crop alone?
  - Context sufficiency: can a VLM answer when given crop plus parent frame?
  - False-region rate: how often does retrieval return the right image but wrong object?

    The false-region rate is especially important. If the agent gets the right image but cites the wrong region, the final answer can be confidently wrong.
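    The three retrieval metrics can be computed together from labeled queries. The sketch below makes two assumptions that may not hold in every setup: gold regions are matched by ID (a box-IoU match would be needed if labels and indexed regions do not share IDs), and the false-region rate is defined as the fraction of queries where the correct parent appears in the top k but the correct region does not. The field names are hypothetical.

```python
from typing import Any, Dict, List


def evaluate_region_retrieval(queries: List[Dict[str, Any]], k: int = 10) -> Dict[str, float]:
    """Compute region recall@k, parent recall@k, and false-region rate."""
    if not queries:
        return {"region_recall": 0.0, "parent_recall": 0.0, "false_region_rate": 0.0}
    region_hits = parent_hits = false_regions = 0
    for query in queries:
        top_k = query["results"][:k]
        region_hit = any(r["region_id"] == query["gold_region_id"] for r in top_k)
        parent_hit = any(r["parent_id"] == query["gold_parent_id"] for r in top_k)
        region_hits += region_hit
        parent_hits += parent_hit
        false_regions += parent_hit and not region_hit  # right image, wrong object
    total = len(queries)
    return {
        "region_recall": region_hits / total,
        "parent_recall": parent_hits / total,
        "false_region_rate": false_regions / total,
    }


labeled_queries = [
    {   # correct region retrieved
        "gold_region_id": "img_1:mask_2", "gold_parent_id": "img_1",
        "results": [{"region_id": "img_1:mask_2", "parent_id": "img_1"}],
    },
    {   # correct image, but a different object in it
        "gold_region_id": "img_7:mask_0", "gold_parent_id": "img_7",
        "results": [{"region_id": "img_7:mask_3", "parent_id": "img_7"}],
    },
]
metrics = evaluate_region_retrieval(labeled_queries, k=5)
```

    A large gap between parent recall and region recall is the signal to look at: it means image-level search is working while region selection is not.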

    Agent Tool Design



    Expose region search as a bounded tool. A good tool returns evidence handles, not raw pixels in the main context.

    {
      "tool": "search_visual_regions",
      "input": {
        "query": "red handheld scanner",
        "collection": "retail_media",
        "region_types": ["foreground_mask", "object_box"],
        "min_foreground_ratio": 0.15,
        "limit": 10
      },
      "output": {
        "regions": [
          {
            "region_id": "img_924:frame_00018:mask_003:birefnet:v1",
            "source_uri": "s3://brand-assets/launch/ad_17.mp4",
            "frame_time_ms": 6000,
            "crop_uri": "s3://derived/ad_17/frame_00018/crop_003.webp",
            "box": {"x": 0.18, "y": 0.22, "w": 0.41, "h": 0.52},
            "matched_channels": ["crop_embedding", "region_caption"],
            "scores": {"crop_rank": 1, "caption_rank": 4}
          }
        ]
      }
    }
    


    The agent can then call a follow-up tool:

  - inspect_region
  - open_parent_frame
  - compare_regions
  - expand_time_window
  - retrieve_neighboring_regions

    This keeps visual reasoning iterative without flooding the context window with pixels.

    How This Maps to Mixpeek and MVS



    Mixpeek's managed system can extract segmentation, object, OCR, caption, and embedding features from the media already in object storage. MVS can store and search the vector layer for teams bringing their own crops or embeddings.

    Managed ingestion example:

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_API_KEY")

    mx.collections.ingest(
        collection_id="visual-evidence",
        source={"url": "s3://brand-assets/launch/"},
        feature_extractors=[
            {
                "name": "segmentation",
                "version": "v1",
                "params": {
                    "model_id": "ZhengPeng7/BiRefNet",
                    "output_masks": True,
                    "store_crops": True
                }
            },
            {
                "name": "visual_embeddings",
                "version": "v1",
                "params": {
                    "model_id": "nomic-ai/nomic-embed-vision-v1.5",
                    "embed_views": ["whole_image", "foreground_crop", "masked_image"]
                }
            },
            {
                "name": "ocr",
                "version": "v1"
            }
        ]
    )


    Retriever example:

    results = mx.retrievers.retrieve(
        retriever_id="visual-region-agent",
        query="frames where the package is visible during the spoken call to action",
        pipeline=[
            {
                "stage_type": "search",
                "stage_id": "cta_transcript",
                "feature": "transcription",
                "limit": 100
            },
            {
                "stage_type": "search",
                "stage_id": "package_crops",
                "feature": "visual_embeddings",
                "view": "foreground_crop",
                "limit": 100
            },
            {
                "stage_type": "join",
                "stage_id": "same_time_window",
                "on": "timestamp_overlap",
                "window_seconds": 2
            },
            {
                "stage_type": "fusion",
                "stage_id": "ranked_evidence",
                "method": "reciprocal_rank_fusion",
                "limit": 20
            }
        ]
    )
    


    MVS standalone example for bring-your-own crops:

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_API_KEY")

    # crop_embedding is the vector for this crop, computed upstream by your own embedding model.
    mx.mvs.upsert(
        namespace="visual-region-memory",
        vectors=[
            {
                "id": "img_924:frame_00018:mask_003:nomic_v15",
                "values": crop_embedding,
                "metadata": {
                    "source_uri": "s3://brand-assets/launch/ad_17.mp4",
                    "crop_uri": "s3://derived/ad_17/frame_00018/crop_003.webp",
                    "mask_uri": "s3://derived/ad_17/frame_00018/mask_003.rle",
                    "frame_time_ms": 6000,
                    "model_id": "nomic-ai/nomic-embed-vision-v1.5",
                    "region_type": "foreground_mask",
                    "foreground_ratio": 0.31
                }
            }
        ]
    )


    The same design principle applies either way: preserve region lineage, search the right visual unit, and return evidence the agent can inspect.

    Design Checklist



  - Whole-image, crop, and masked-image views are stored separately.
  - Every region has source URI, timestamp or page, box, mask ID, crop URI, model ID, and extractor version.
  - Foreground ratio and confidence are available as filters.
  - Overlapping masks are deduplicated before indexing.
  - Region search returns parent context, not only cropped pixels.
  - Query planners can choose whole-image, crop, OCR, object, or hybrid retrieval.
  - Scores from different channels are fused by rank or validated weights.
  - Region-level recall, mask IoU, and false-region rate are tracked.
  - Agents receive evidence handles and follow-up inspection tools.
  - Backfills can regenerate masks and crops without changing source object identity.

    Key Takeaways



    1. Whole-image embeddings are useful, but they are often too coarse for agent evidence.

    2. Segmentation masks change the search unit from file to region.

    3. Store masks, boxes, crops, foreground ratios, model versions, and source lineage together.

    4. Embed whole images, foreground crops, and masked images when each view answers different queries.

    5. Retrieval should be planned by query type. OCR, object filters, crop search, whole-image search, and reranking solve different parts of visual reasoning.

    6. Agents need region evidence they can inspect and cite, not only a ranked list of image files.

    Further Reading



  - Object Decomposition and Layered Indexing
  - Open-Vocabulary Object Detection
  - Visual Document Retrieval
  - Agent Perception Evals
  - BiRefNet on Hugging Face
  - Nomic Embed Vision v1.5 on Hugging Face
  - MCP tools specification