Mask-Aware Retrieval for AI Agents: Segment First, Search Crops, Then Reason

Why Whole-Image Retrieval Fails Agents

Whole-image embeddings are useful for broad visual similarity. They can find photos with similar style, product category, scene type, or layout. But many agent tasks are not about the whole image. They are about a region:

Find the small warning label on the lower-right corner.

Search frames where the foreground product is visible without background clutter.

Compare only the shoe, not the person wearing it.

Find screenshots where the error toast appears on top of the dashboard.

Inspect video frames where the actor holds the package, not frames where the package sits in the background.

A single image vector compresses everything into one point. Large background regions, lighting, camera angle, and unrelated objects can dominate the representation. The agent may retrieve an image that feels visually similar while missing the actual evidence region.

Mask-aware retrieval fixes the unit of search. Instead of indexing only the whole image, the system indexes foreground objects, regions, crops, masks, and spatial metadata. The agent can then search what matters, cite where it appears, and hand a focused crop to a downstream vision-language model.

This is what makes the pattern relevant to agent perception: it helps an AI agent see and search unstructured visual content.

The Basic Pattern

Mask-aware retrieval has five stages:

1. Generate candidate regions. Use segmentation, detection, OCR, or saliency models to find regions that may matter.

2. Store the mask and crop. Preserve pixel mask, bounding box, foreground ratio, crop URI, and source lineage.

3. Embed multiple views. Embed the whole image, the masked foreground, and the crop when each view may answer a different query.

4. Search in stages. Use metadata filters, whole-image recall, crop search, and reranking instead of one nearest-neighbor call.

5. Return inspectable evidence. Give the agent source URI, region coordinates, mask ID, crop URI, model version, and confidence.

The important shift is from file-level search to evidence-region search.

What a Mask Represents

A segmentation mask is usually a binary or probabilistic image aligned to the source image. Each pixel says whether it belongs to a region.

The mask can be represented several ways:

Representation	What it stores	Best use
Binary bitmap	One bit or byte per pixel	Precise local processing
Probability map	Float confidence per pixel	Thresholding and uncertainty
Run-length encoding	Compact spans of foreground pixels	Storage and transport
Polygon	Boundary vertices	UI overlays and approximate geometry
Bounding box	x, y, width, height	Fast filters and crop creation

Agents usually do not need the full mask in context. They need handles: a region ID, a crop URI, a box, a confidence score, and a way to request the mask when deeper inspection is needed.
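To make the representations in the table concrete, here is a minimal sketch that derives a run-length encoding, a normalized bounding box, and a foreground ratio from a binary bitmap. The mask is assumed to be a plain list of rows of 0/1 values, and the `(start, length)` row-major run format is an illustrative choice, not the format of any particular library.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]  # rows of 0/1 pixels (assumed in-memory format)


def rle_encode(mask: Mask) -> List[Tuple[int, int]]:
    """Encode foreground pixels as (start, length) runs over the row-major flattened mask."""
    flat = [px for row in mask for px in row]
    runs, start = [], None
    for i, px in enumerate(flat):
        if px and start is None:
            start = i                      # a foreground run begins
        elif not px and start is not None:
            runs.append((start, i - start))  # the run ended on the previous pixel
            start = None
    if start is not None:
        runs.append((start, len(flat) - start))
    return runs


def rle_decode(runs: List[Tuple[int, int]], height: int, width: int) -> Mask:
    """Rebuild the bitmap from (start, length) runs."""
    flat = [0] * (height * width)
    for start, length in runs:
        for i in range(start, start + length):
            flat[i] = 1
    return [flat[r * width:(r + 1) * width] for r in range(height)]


def mask_summary(mask: Mask) -> Dict[str, object]:
    """Return a normalized x/y/w/h box and the share of pixels that are foreground."""
    height, width = len(mask), len(mask[0])
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(width) if any(row[x] for row in mask)]
    if not rows:
        return {"box": None, "foreground_ratio": 0.0}
    x0, x1, y0, y1 = cols[0], cols[-1] + 1, rows[0], rows[-1] + 1
    area = sum(sum(row) for row in mask)
    return {
        "box": {"x": x0 / width, "y": y0 / height,
                "w": (x1 - x0) / width, "h": (y1 - y0) / height},
        "foreground_ratio": area / (height * width),
    }
```

The box and foreground ratio are the cheap handles an agent can filter on; the run-length form is what would sit behind a mask URI until deeper inspection is requested.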

Four Ways to Create Regions

1. Detector First

Object detectors return boxes and labels. This is the fastest path when the query space is known:

product

logo

face

vehicle

package

screen

table

The detector gives a box. A segmentation model can refine the box into a tighter mask. This two-step pattern works well when you want class labels and clean crops.

2. Segmentation First

Prompt-free foreground segmentation returns the most salient object or foreground region without requiring a label. This is useful for product images, thumbnails, screenshots, and creative assets where the foreground object matters more than its class.

Models like BiRefNet are useful in this layer because they target foreground/background and salient-object masks. The output is not "cat" or "chair." The output is "this is the visible region worth isolating."

3. Promptable Segmentation

SAM-style models can segment from points, boxes, or prompts. Use this when an agent or UI already knows the approximate region:

a user clicks a product

an OCR model finds text and asks for the surrounding panel

an object detector proposes a box

an agent asks to inspect "the red item on the left"

Promptable segmentation is strong as a second pass. It is less useful as the only discovery mechanism unless another system proposes prompts.

4. OCR and Layout Regions

For screenshots, documents, dashboards, and ads, text regions are often the best masks. OCR supplies boxes for words, lines, and blocks. Layout models supply panels, cards, forms, and tables.

These are visual regions even though they come from text extraction. A search for "error message near payment button" needs OCR text, spatial layout, and maybe a cropped UI panel.

The Region Record

A useful region record is more than a crop. It is a small evidence object.

{
  "region_id": "img_924:frame_00018:mask_003:birefnet:v1",
  "source_uri": "s3://brand-assets/launch/ad_17.mp4",
  "frame_time_ms": 6000,
  "modality": "image",
  "region_type": "foreground_mask",
  "box": {"x": 0.18, "y": 0.22, "w": 0.41, "h": 0.52},
  "foreground_ratio": 0.31,
  "mask_uri": "s3://derived/ad_17/frame_00018/mask_003.rle",
  "crop_uri": "s3://derived/ad_17/frame_00018/crop_003.webp",
  "model_id": "ZhengPeng7/BiRefNet",
  "extractor_version": "segmentation@v1",
  "confidence": 0.91
}

This record lets an agent do several things:

filter by foreground size

retrieve similar crops

inspect the crop with a VLM

overlay the mask in a UI

cite the exact frame and region

rerun the region with a newer extractor

Without this metadata, a crop becomes an orphaned image. It may be searchable, but it is not reliable evidence.
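One way to keep crops from becoming orphaned is to validate lineage when the record is ingested. The sketch below mirrors the field names of the example record; the `RegionRecord` class, the `LINEAGE_FIELDS` tuple, and both helper functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Optional


@dataclass
class RegionRecord:
    region_id: str
    source_uri: str
    box: Dict[str, float]
    crop_uri: str
    model_id: str
    extractor_version: str
    confidence: float
    foreground_ratio: float
    mask_uri: Optional[str] = None
    frame_time_ms: Optional[int] = None  # only set for video frames


# Fields a crop needs before it can be cited as evidence (an assumed policy).
LINEAGE_FIELDS = ("source_uri", "box", "crop_uri", "model_id", "extractor_version")


def missing_lineage(record: dict) -> List[str]:
    """List the lineage fields that are absent or empty in a raw record."""
    return [name for name in LINEAGE_FIELDS if not record.get(name)]


def parse_region(record: dict) -> RegionRecord:
    """Build a RegionRecord, rejecting records with incomplete lineage."""
    missing = missing_lineage(record)
    if missing:
        raise ValueError(f"region {record.get('region_id')!r} is missing lineage: {missing}")
    known = {f.name for f in fields(RegionRecord)}
    # Extra keys such as "modality" are ignored in this sketch.
    return RegionRecord(**{k: v for k, v in record.items() if k in known})
```

Raising at ingestion time, rather than at query time, means a region that reaches the index can always be traced back to its source.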

Which Views to Embed

Mask-aware retrieval usually stores more than one embedding per visual source.

Whole Image

Whole-image embeddings preserve context. They are good for broad questions:

"outdoor construction scene"

"dashboard screenshot"

"studio product photo"

"crowded retail shelf"

Whole-image search is a good first-stage recall channel, but it often misses small regions.

Foreground Crop

Crop embeddings focus the model on the object or panel. They are good for:

product matching

visual duplicate detection

small object search

logo-like shapes

UI widget search

Crops can overfocus. If the agent needs surrounding context, return the crop plus the parent frame.

Masked Image

A masked image keeps the original image dimensions but removes or fades background pixels. This preserves some layout and scale while reducing background dominance.

Masked images work well when the crop alone loses context:

object position matters

surrounding text matters

scale relative to the image matters

the background is distracting but not irrelevant
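The difference between a crop and a masked image is easiest to see side by side. This is a minimal sketch on a grayscale image stored as nested lists; `apply_mask`, `crop_to_mask`, and the `fade` and `pad` parameters are illustrative names, and a real pipeline would more likely use an array or imaging library.

```python
from typing import List

Image = List[List[float]]  # grayscale rows (assumed format)
Mask = List[List[int]]     # rows of 0/1 pixels, same shape as the image


def apply_mask(image: Image, mask: Mask, fade: float = 0.0) -> Image:
    """Keep original dimensions; scale background pixels by `fade` (0.0 removes them)."""
    return [
        [px if m else px * fade for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def crop_to_mask(image: Image, mask: Mask, pad: int = 0) -> Image:
    """Cut the image down to the mask's bounding box, padded by `pad` pixels and clamped."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not rows:
        return []
    y0 = max(rows[0] - pad, 0)
    y1 = min(rows[-1] + 1 + pad, len(image))
    x0 = max(cols[0] - pad, 0)
    x1 = min(cols[-1] + 1 + pad, len(image[0]))
    return [row[x0:x1] for row in image[y0:y1]]
```

A `fade` between 0 and 1 keeps a dimmed background for cases where it is distracting but not irrelevant, while a smaller `pad` gives a tighter crop at the cost of context.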

Region Caption

Some regions benefit from a caption or structured label:

"red handheld scanner"

"subtotal row in invoice table"

"blue warning toast"

"logo on cardboard box"

Captions are not replacements for embeddings. They are a sparse or lexical channel that helps exact search, filters, and agent explanations.

Retrieval Plans

Do not route every visual query through the same index. Match the retrieval plan to the question.

Query: "Find similar product photos"

Plan:

1. Search crop embeddings from foreground masks.

2. Filter for foreground_ratio above a threshold.

3. Rerank by product metadata, color, or category if available.

4. Return crop and parent image.

Why: the product matters more than the studio background.
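A brute-force sketch of this plan is shown below, assuming each region is a dict with `embedding`, `foreground_ratio`, `crop_uri`, and `source_uri` keys. It applies the foreground filter before scoring, which gives the same result as filtering afterwards in an exhaustive search, and it leaves out the metadata rerank step. A production system would use a vector index rather than a Python loop.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_product_crops(
    query_vec: Sequence[float],
    regions: List[Dict],
    min_foreground_ratio: float = 0.2,
    limit: int = 10,
) -> List[Dict]:
    """Rank crop embeddings against the query, skipping regions with too little foreground."""
    candidates = [r for r in regions if r["foreground_ratio"] >= min_foreground_ratio]
    scored = [(cosine(query_vec, r["embedding"]), r) for r in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {
            "region_id": r["region_id"],
            "crop_uri": r["crop_uri"],
            "parent_uri": r["source_uri"],  # return the parent image along with the crop
            "crop_score": score,
        }
        for score, r in scored[:limit]
    ]
```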

Query: "Find screenshots with a payment error"

Plan:

1. Search OCR text for "payment", "declined", "failed", and related terms.

2. Search region captions for error panels and toast messages.

3. Join OCR regions to nearby UI-panel masks.

4. Return the panel crop and full screenshot.

Why: exact text and spatial layout matter more than broad visual similarity.
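The join between OCR regions and UI panels can be sketched as a containment test. The example assumes OCR boxes and panels share normalized `x`, `y`, `w`, `h` coordinates, and it assigns each matching text box to the smallest panel that contains its center. The function name, the record keys, and the substring term match are all illustrative simplifications.

```python
from typing import Dict, List, Sequence


def find_error_panels(
    ocr_boxes: List[Dict],
    panels: List[Dict],
    terms: Sequence[str] = ("payment", "declined", "failed"),
) -> List[Dict]:
    """Match OCR text against error terms and attach the smallest enclosing panel."""
    hits = []
    for text_box in ocr_boxes:
        text = text_box["text"].lower()
        matched = [t for t in terms if t in text]
        if not matched:
            continue
        # Use the center of the text box to decide which panels contain it.
        cx = text_box["x"] + text_box["w"] / 2
        cy = text_box["y"] + text_box["h"] / 2
        enclosing = [
            p for p in panels
            if p["x"] <= cx <= p["x"] + p["w"] and p["y"] <= cy <= p["y"] + p["h"]
        ]
        if enclosing:
            panel = min(enclosing, key=lambda p: p["w"] * p["h"])
            hits.append({
                "panel_id": panel["panel_id"],
                "text": text_box["text"],
                "matched_terms": matched,
            })
    return hits
```

Choosing the smallest enclosing panel keeps the result focused on the toast or dialog rather than the whole dashboard that also happens to contain the text.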

Query: "Find frames where the package is visible during the spoken CTA"

Plan:

1. Search transcript spans for CTA language.

2. Search foreground crops for package-like regions.

3. Join regions and transcript spans by overlapping timestamps.

4. Rerank clips where the crop occupies enough area and persists across frames.

Why: the query is temporal and cross-modal. A visual match alone is insufficient.
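The timestamp join in step 3 reduces to interval overlap. The sketch below assumes region detections have already been grouped into tracks with `start_ms` and `end_ms`, and that transcript spans carry the same fields. The 500 ms minimum overlap and the record keys are placeholder assumptions.

```python
from typing import Dict, List


def overlap_ms(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Length of the intersection of two [start, end) intervals, in milliseconds."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def join_tracks_to_spans(
    region_tracks: List[Dict],
    transcript_spans: List[Dict],
    min_overlap_ms: int = 500,
) -> List[Dict]:
    """Pair each region track with transcript spans it overlaps, longest overlap first."""
    matches = []
    for track in region_tracks:
        for span in transcript_spans:
            shared = overlap_ms(
                track["start_ms"], track["end_ms"], span["start_ms"], span["end_ms"]
            )
            if shared >= min_overlap_ms:
                matches.append({
                    "track_id": track["track_id"],
                    "span_id": span["span_id"],
                    "overlap_ms": shared,
                })
    matches.sort(key=lambda m: m["overlap_ms"], reverse=True)
    return matches
```

Sorting by overlap is a crude stand-in for the rerank in step 4; a fuller version would also weigh how much of the frame the crop occupies.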

Query: "Show ads where the background distracts from the product"

Plan:

1. Compute foreground ratio and background salience.

2. Search whole-image embeddings for busy scenes.

3. Compare whole-image similarity against crop similarity.

4. Flag images where the crop matches the product query but the whole image retrieves unrelated background concepts.

Why: this is a quality-control query about foreground/background separation.
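Steps 3 and 4 can be approximated by comparing two similarity scores per image. In this sketch, a large gap between crop similarity and whole-image similarity is used as a proxy for "the whole image retrieves unrelated background concepts"; that proxy, the thresholds, and the record keys are assumptions that would need validation on real data.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def flag_background_distraction(
    items: List[Dict],
    product_query_vec: Sequence[float],
    min_crop_sim: float = 0.6,
    min_gap: float = 0.3,
) -> List[Dict]:
    """Flag images whose crop matches the product but whose whole image does not."""
    flagged = []
    for item in items:
        crop_sim = cosine(product_query_vec, item["crop_embedding"])
        whole_sim = cosine(product_query_vec, item["whole_embedding"])
        if crop_sim >= min_crop_sim and crop_sim - whole_sim >= min_gap:
            flagged.append({
                "image_id": item["image_id"],
                "crop_sim": crop_sim,
                "whole_sim": whole_sim,
            })
    return flagged
```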

Scoring and Fusion

Mask-aware retrieval creates several scores:

whole-image vector similarity

crop vector similarity

caption or OCR score

detector confidence

mask confidence

foreground ratio

spatial relationship

temporal overlap for video

Do not treat these scores as directly comparable. A cosine score from an image embedding is not the same thing as OCR BM25, detector confidence, or temporal overlap.

Use a fusion strategy:

Strategy	Use when
Hard filters	The condition must be true, such as foreground_ratio >= 0.2
Reciprocal rank fusion	Several retrieval channels produce ranked lists
Weighted fusion	Query classes are known and weights can be validated
Reranking	A smaller candidate set needs visual or multimodal inspection
Rules plus retrieval	Policy requires exact constraints before semantic search

For agents, return the component scores. The agent should know whether a result matched because the crop was similar, the OCR text matched, or the whole frame looked similar.
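Reciprocal rank fusion is a common choice when channels are not comparable, because it uses only ranks. The sketch below sums `1 / (k + rank)` across channels and keeps the per-channel ranks so the agent can see why a region matched. The channel names are taken from the tool example in this article; `k=60` is the constant commonly used for RRF, not a tuned value.

```python
from typing import Dict, List


def reciprocal_rank_fusion(ranked_lists: Dict[str, List[str]], k: int = 60) -> List[Dict]:
    """Fuse ranked ID lists from several channels into one ranking.

    ranked_lists maps a channel name to region IDs ordered best first.
    """
    scores: Dict[str, float] = {}
    ranks: Dict[str, Dict[str, int]] = {}
    for channel, region_ids in ranked_lists.items():
        for rank, region_id in enumerate(region_ids, start=1):
            scores[region_id] = scores.get(region_id, 0.0) + 1.0 / (k + rank)
            ranks.setdefault(region_id, {})[channel] = rank
    ordered = sorted(scores, key=lambda rid: scores[rid], reverse=True)
    return [
        {"region_id": rid, "rrf_score": scores[rid], "ranks": ranks[rid]}
        for rid in ordered
    ]
```

A region that ranks moderately well in two channels can outrank one that tops a single channel, which is usually the behavior you want when evidence from independent channels agrees.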

Failure Modes

Background Leakage

The crop includes too much background, so retrieval still matches on scene instead of object. Tighten the mask, pad the box less, or use a masked image instead of a loose crop.

Overcropping

The crop removes context needed for meaning. A product in a hand, a warning label on a machine, or an icon inside a UI panel may need surrounding pixels. Return parent context with every crop.

Duplicate Regions

Segmenters and detectors may produce overlapping regions. Use non-maximum suppression, mask IoU, or embedding deduplication to avoid indexing the same object many times.
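A greedy, NMS-style pass over mask IoU is one way to do this. The sketch assumes each region carries its mask as a set of `(row, col)` pixel coordinates under a hypothetical `pixels` key; this is convenient for illustration but memory-hungry for large masks, and the 0.8 threshold is a placeholder.

```python
from typing import Dict, List, Set, Tuple

Pixels = Set[Tuple[int, int]]


def mask_iou(a: Pixels, b: Pixels) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dedupe_regions(regions: List[Dict], iou_threshold: float = 0.8) -> List[Dict]:
    """Keep the highest-confidence region from each group of heavily overlapping masks."""
    kept: List[Dict] = []
    for region in sorted(regions, key=lambda r: r["confidence"], reverse=True):
        # A region survives only if it does not overlap any already-kept region too much.
        if all(mask_iou(region["pixels"], k["pixels"]) < iou_threshold for k in kept):
            kept.append(region)
    return kept
```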

Small Object Blindness

Small objects may be missed when only whole keyframes are embedded, because a small region contributes little to a full-frame vector. Increase frame resolution, use detector-first proposals, or index tiled regions.
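Tiled indexing can be sketched as a function that covers the frame with overlapping windows, each of which is then cropped and embedded like any other region. The tile size and overlap here are placeholder values, and the function name is illustrative.

```python
from typing import List, Tuple


def _starts(length: int, tile: int, stride: int) -> List[int]:
    """Start offsets along one axis so that tiles cover the full length."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] + tile < length:
        starts.append(length - tile)  # final tile sits flush with the edge
    return starts


def tile_boxes(width: int, height: int, tile: int = 512, overlap: int = 64) -> List[Tuple[int, int, int, int]]:
    """Return (x, y, w, h) pixel boxes for overlapping tiles covering the image."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than the tile size")
    stride = tile - overlap
    return [
        (x, y, min(tile, width - x), min(tile, height - y))
        for y in _starts(height, tile, stride)
        for x in _starts(width, tile, stride)
    ]
```

The overlap keeps an object that straddles a tile boundary fully visible in at least one tile, as long as the object is smaller than the overlap.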

Mask Confidence Drift

A mask model trained on clean product photos may perform poorly on CCTV, medical images, or low-light video. Track confidence distributions by source type and evaluate region quality on real data.
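Tracking confidence by source type can start as a simple grouped summary. The sketch assumes each region record has a `source_type` label and a `confidence` score; the 0.5 cutoff for "low confidence" is an arbitrary placeholder, and confidence alone does not replace evaluating masks against labeled data.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Dict, List


def confidence_by_source(regions: List[Dict], low_cutoff: float = 0.5) -> Dict[str, Dict[str, float]]:
    """Summarize mask confidence per source type to surface drift between sources."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for region in regions:
        groups[region["source_type"]].append(region["confidence"])
    return {
        source: {
            "count": len(values),
            "mean": mean(values),
            "median": median(values),
            "low_share": sum(c < low_cutoff for c in values) / len(values),
        }
        for source, values in groups.items()
    }
```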

Region Without Lineage

If a crop loses its source URI, timestamp, model version, or box, it cannot support an agent answer. Treat missing lineage as an ingestion bug.

Evaluation

Evaluate mask-aware retrieval at the region level, not only the image level.

Useful metrics:

Region recall@k: did the correct region appear in the top k?

Parent recall@k: did the correct source image or frame appear in the top k?

Mask IoU: does the predicted mask overlap the human-labeled region?

Box IoU: does the returned box localize the evidence?

Crop usefulness: can a VLM answer the question from the crop alone?

Context sufficiency: can a VLM answer when given crop plus parent frame?

False-region rate: how often does retrieval return the right image but wrong object?

The false-region rate is especially important. If the agent gets the right image but cites the wrong region, the final answer can be confidently wrong.
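Several of these metrics can be computed from the same labeled query set. In the sketch below, each query carries a gold region ID, a gold parent ID, and a ranked result list. The false-region rate is defined here as the share of parent hits where the correct region was not in the top k, which is one reasonable reading of the metric rather than a standard definition.

```python
from typing import Dict, List


def box_iou(a: Dict[str, float], b: Dict[str, float]) -> float:
    """IoU of two boxes given as x, y, w, h in the same coordinate system."""
    ix = max(0.0, min(a["x"] + a["w"], b["x"] + b["w"]) - max(a["x"], b["x"]))
    iy = max(0.0, min(a["y"] + a["h"], b["y"] + b["h"]) - max(a["y"], b["y"]))
    inter = ix * iy
    union = a["w"] * a["h"] + b["w"] * b["h"] - inter
    return inter / union if union > 0 else 0.0


def evaluate_region_retrieval(queries: List[Dict], k: int = 10) -> Dict[str, float]:
    """Compute region recall@k, parent recall@k, and false-region rate."""
    region_hits = parent_hits = false_regions = 0
    for query in queries:
        top = query["results"][:k]
        region_hit = any(r["region_id"] == query["gold_region_id"] for r in top)
        parent_hit = any(r["parent_id"] == query["gold_parent_id"] for r in top)
        region_hits += region_hit
        parent_hits += parent_hit
        false_regions += parent_hit and not region_hit  # right image, wrong object
    total = len(queries)
    return {
        "region_recall_at_k": region_hits / total if total else 0.0,
        "parent_recall_at_k": parent_hits / total if total else 0.0,
        "false_region_rate": false_regions / parent_hits if parent_hits else 0.0,
    }
```

A large gap between parent recall and region recall is the signal to look for: it means the system finds the right files but points the agent at the wrong evidence.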

Agent Tool Design

Expose region search as a bounded tool. A good tool returns evidence handles, not raw pixels in the main context.

{
  "tool": "search_visual_regions",
  "input": {
    "query": "red handheld scanner",
    "collection": "retail_media",
    "region_types": ["foreground_mask", "object_box"],
    "min_foreground_ratio": 0.15,
    "limit": 10
  },
  "output": {
    "regions": [
      {
        "region_id": "img_924:frame_00018:mask_003:birefnet:v1",
        "source_uri": "s3://brand-assets/launch/ad_17.mp4",
        "frame_time_ms": 6000,
        "crop_uri": "s3://derived/ad_17/frame_00018/crop_003.webp",
        "box": {"x": 0.18, "y": 0.22, "w": 0.41, "h": 0.52},
        "matched_channels": ["crop_embedding", "region_caption"],
        "scores": {"crop_rank": 1, "caption_rank": 4}
      }
    ]
  }
}

The agent can then call a follow-up tool:

inspect_region

open_parent_frame

compare_regions

expand_time_window

retrieve_neighboring_regions

This keeps visual reasoning iterative without flooding the context window with pixels.

How This Maps to Mixpeek and MVS

Mixpeek's managed system can extract segmentation, object, OCR, caption, and embedding features from the media already in object storage. MVS can store and search the vector layer for teams bringing their own crops or embeddings.

Managed ingestion example:

from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

mx.collections.create(
    namespace_id="my-namespace",
    collection_name="my-collection",
    source={"type": "bucket", "bucket_ids": ["bkt_your_bucket"]},
    feature_extractor={"feature_extractor_name": "segmentation", "version": "v1"},
)

Retriever example:

results = mx.retrievers.execute(
    retriever_id="your-retriever-id",
    query="frames where the package is visible during the spoken call to action",
)

MVS standalone example for bring-your-own crops:

from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

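# crop_embedding is assumed to be a precomputed vector (list of floats) for the crop,
# produced by the embedding model named in model_id below.
mx.mvs.upsert(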
    namespace="visual-region-memory",
    vectors=[
        {
            "id": "img_924:frame_00018:mask_003:nomic_v15",
            "values": crop_embedding,
            "metadata": {
                "source_uri": "s3://brand-assets/launch/ad_17.mp4",
                "crop_uri": "s3://derived/ad_17/frame_00018/crop_003.webp",
                "mask_uri": "s3://derived/ad_17/frame_00018/mask_003.rle",
                "frame_time_ms": 6000,
                "model_id": "nomic-ai/nomic-embed-vision-v1.5",
                "region_type": "foreground_mask",
                "foreground_ratio": 0.31
            }
        }
    ]
)

The same design principle applies either way: preserve region lineage, search the right visual unit, and return evidence the agent can inspect.

Design Checklist

Whole-image, crop, and masked-image views are stored separately.

Every region has source URI, timestamp or page, box, mask ID, crop URI, model ID, and extractor version.

Foreground ratio and confidence are available as filters.

Overlapping masks are deduplicated before indexing.

Region search returns parent context, not only cropped pixels.

Query planners can choose whole-image, crop, OCR, object, or hybrid retrieval.

Scores from different channels are fused by rank or validated weights.

Region-level recall, mask IoU, and false-region rate are tracked.

Agents receive evidence handles and follow-up inspection tools.

Backfills can regenerate masks and crops without changing source object identity.

Key Takeaways

1. Whole-image embeddings are useful, but they are often too coarse for agent evidence.

2. Segmentation masks change the search unit from file to region.

3. Store masks, boxes, crops, foreground ratios, model versions, and source lineage together.

4. Embed whole images, foreground crops, and masked images when each view answers different queries.

5. Retrieval should be planned by query type. OCR, object filters, crop search, whole-image search, and reranking solve different parts of visual reasoning.

6. Agents need region evidence they can inspect and cite, not only a ranked list of image files.