Why Whole-Image Retrieval Fails Agents
Whole-image embeddings are useful for broad visual similarity. They can find photos with similar style, product category, scene type, or layout. But many agent tasks are not about the whole image. They are about a specific region within it.
A single image vector compresses everything into one point. Large background regions, lighting, camera angle, and unrelated objects can dominate the representation. The agent may retrieve an image that feels visually similar while missing the actual evidence region.
Mask-aware retrieval fixes the unit of search. Instead of indexing only the whole image, the system indexes foreground objects, regions, crops, masks, and spatial metadata. The agent can then search what matters, cite where it appears, and hand a focused crop to a downstream vision-language model.
This is what makes the approach relevant to agent perception: it helps an AI agent see and search unstructured visual content.
The Basic Pattern
Mask-aware retrieval has five stages:
1. Generate candidate regions. Use segmentation, detection, OCR, or saliency models to find regions that may matter.
2. Store the mask and crop. Preserve pixel mask, bounding box, foreground ratio, crop URI, and source lineage.
3. Embed multiple views. Embed the whole image, the masked foreground, and the crop when each view may answer a different query.
4. Search in stages. Use metadata filters, whole-image recall, crop search, and reranking instead of one nearest-neighbor call.
5. Return inspectable evidence. Give the agent source URI, region coordinates, mask ID, crop URI, model version, and confidence.
The important shift is from file-level search to evidence-region search.
What a Mask Represents
A segmentation mask is usually a binary or probabilistic image aligned to the source image. Each pixel says whether it belongs to a region.
The mask can be represented several ways:
| Representation | What it stores | Best use |
| Binary bitmap | One bit or byte per pixel | Precise local processing |
| Probability map | Float confidence per pixel | Thresholding and uncertainty |
| Run-length encoding | Compact spans of foreground pixels | Storage and transport |
| Polygon | Boundary vertices | UI overlays and approximate geometry |
| Bounding box | x, y, width, height | Fast filters and crop creation |
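The representations in the table can be derived from one another. The sketch below uses plain Python and assumes a mask is a list of rows of 0/1 values. The function names and the row-major (start, length) run format are illustrative choices, not a standard format such as COCO RLE.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]  # rows of 0/1 values aligned to the source image


def threshold(prob_map: List[List[float]], cutoff: float = 0.5) -> Mask:
    """Turn a per-pixel probability map into a binary mask."""
    return [[1 if p >= cutoff else 0 for p in row] for row in prob_map]


def rle_encode(mask: Mask) -> List[Tuple[int, int]]:
    """Encode foreground pixels as (start, length) runs over row-major order."""
    flat = [v for row in mask for v in row]
    runs, start = [], None
    for i, v in enumerate(flat):
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(flat) - start))
    return runs


def rle_decode(runs: List[Tuple[int, int]], height: int, width: int) -> Mask:
    """Rebuild the binary mask from (start, length) runs."""
    flat = [0] * (height * width)
    for start, length in runs:
        flat[start:start + length] = [1] * length
    return [flat[r * width:(r + 1) * width] for r in range(height)]


def bounding_box(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return (x, y, width, height) in pixels, or None for an empty mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)


def foreground_ratio(mask: Mask) -> float:
    """Fraction of pixels that belong to the foreground."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total if total else 0.0
```

The bounding box and foreground ratio are cheap to compute once at ingestion, which is why they work well as filter fields, while the run-length form is the one worth storing and shipping.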
Four Ways to Create Regions
1. Detector First
Object detectors return boxes and labels. This is the fastest path when the query space is known in advance.
The detector gives a box. A segmentation model can refine the box into a tighter mask. This two-step pattern works well when you want class labels and clean crops.
2. Segmentation First
Prompt-free foreground segmentation returns the most salient object or foreground region without requiring a label. This is useful for product images, thumbnails, screenshots, and creative assets where the foreground object matters more than its class.
Models like BiRefNet are useful in this layer because they target foreground/background and salient-object masks. The output is not "cat" or "chair." The output is "this is the visible region worth isolating."
3. Promptable Segmentation
SAM-style models can segment from points, boxes, or prompts. Use this when an agent or UI already knows the approximate region.
Promptable segmentation is strong as a second pass. It is less useful as the only discovery mechanism unless another system proposes prompts.
4. OCR and Layout Regions
For screenshots, documents, dashboards, and ads, text regions are often the best masks. OCR supplies boxes for words, lines, and blocks. Layout models supply panels, cards, forms, and tables.
These are visual regions even though they come from text extraction. A search for "error message near payment button" needs OCR text, spatial layout, and maybe a cropped UI panel.
The Region Record
A useful region record is more than a crop. It is a small evidence object.
{
  "region_id": "img_924:frame_00018:mask_003:birefnet:v1",
  "source_uri": "s3://brand-assets/launch/ad_17.mp4",
  "frame_time_ms": 6000,
  "modality": "image",
  "region_type": "foreground_mask",
  "box": {"x": 0.18, "y": 0.22, "w": 0.41, "h": 0.52},
  "foreground_ratio": 0.31,
  "mask_uri": "s3://derived/ad_17/frame_00018/mask_003.rle",
  "crop_uri": "s3://derived/ad_17/frame_00018/crop_003.webp",
  "model_id": "ZhengPeng7/BiRefNet",
  "extractor_version": "segmentation@v1",
  "confidence": 0.91
}
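This record lets an agent trace a crop back to its source file and timestamp, cite the exact box, fetch the mask or crop for closer inspection, and check which model and version produced the region.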
Without this metadata, a crop becomes an orphaned image. It may be searchable, but it is not reliable evidence.
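One way to keep a record like this usable in code is to load it into a typed object. The sketch below is illustrative only: the class names are hypothetical, and it assumes the `box` values are normalized to the range 0 to 1 relative to the frame, which the record suggests but does not state.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Box with x, y, w, h assumed to be normalized to [0, 1]."""
    x: float
    y: float
    w: float
    h: float

    def to_pixels(self, image_width: int, image_height: int,
                  pad: float = 0.0) -> Tuple[int, int, int, int]:
        """Return (left, top, right, bottom) in pixels, padded and clamped."""
        left = max(0.0, self.x - pad)
        top = max(0.0, self.y - pad)
        right = min(1.0, self.x + self.w + pad)
        bottom = min(1.0, self.y + self.h + pad)
        return (round(left * image_width), round(top * image_height),
                round(right * image_width), round(bottom * image_height))


@dataclass(frozen=True)
class RegionRecord:
    region_id: str
    source_uri: str
    box: Box
    model_id: str
    extractor_version: str
    region_type: Optional[str] = None
    frame_time_ms: Optional[int] = None
    mask_uri: Optional[str] = None
    crop_uri: Optional[str] = None
    foreground_ratio: Optional[float] = None
    confidence: Optional[float] = None

    @classmethod
    def from_dict(cls, data: dict) -> "RegionRecord":
        """Build a record from a JSON-style dict, ignoring unknown keys."""
        known = {k: v for k, v in data.items()
                 if k in cls.__dataclass_fields__ and k != "box"}
        return cls(box=Box(**data["box"]), **known)


record = RegionRecord.from_dict({
    "region_id": "img_924:frame_00018:mask_003:birefnet:v1",
    "source_uri": "s3://brand-assets/launch/ad_17.mp4",
    "frame_time_ms": 6000,
    "modality": "image",
    "box": {"x": 0.18, "y": 0.22, "w": 0.41, "h": 0.52},
    "model_id": "ZhengPeng7/BiRefNet",
    "extractor_version": "segmentation@v1",
    "confidence": 0.91,
})
# Pixel coordinates for re-cropping a 1920x1080 parent frame.
crop_bounds = record.box.to_pixels(1920, 1080)
```

Keeping the box normalized in storage and converting at read time means the same record stays valid if the parent frame is later re-encoded at a different resolution.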
Which Views to Embed
Mask-aware retrieval usually stores more than one embedding per visual source.
Whole Image
Whole-image embeddings preserve context. They are good for broad questions about style, scene type, product category, or layout.
Whole-image search is a good first-stage recall channel, but it often misses small regions.
Foreground Crop
Crop embeddings focus the model on the object or panel. They are good for queries about a specific object or UI element rather than the scene around it.
Crops can overfocus. If the agent needs surrounding context, return the crop plus the parent frame.
Masked Image
A masked image keeps the original image dimensions but removes or fades background pixels. This preserves some layout and scale while reducing background dominance.
Masked images work well when the crop alone loses context, such as the position or scale of an object within the frame.
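Producing the masked view is a per-pixel operation. The sketch below assumes an image stored as rows of RGB tuples and a binary mask of the same shape; `apply_mask` and `background_fade` are hypothetical names, and a real pipeline would use an array library rather than nested lists.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def apply_mask(image: List[List[Pixel]], mask: List[List[int]],
               background_fade: float = 0.0) -> List[List[Pixel]]:
    """Keep foreground pixels and scale background pixels by background_fade.

    background_fade=0.0 blacks out the background; 0.5 dims it by half.
    The output has the same dimensions as the input image.
    """
    if len(image) != len(mask) or any(
            len(irow) != len(mrow) for irow, mrow in zip(image, mask)):
        raise ValueError("image and mask must have the same shape")
    return [
        [px if m else tuple(round(c * background_fade) for c in px)
         for px, m in zip(irow, mrow)]
        for irow, mrow in zip(image, mask)
    ]
```

A non-zero fade keeps a faint trace of the surroundings, which may help an embedding model retain layout cues without letting the background dominate.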
Region Caption
Some regions benefit from a caption or structured label.
Captions are not replacements for embeddings. They are a sparse or lexical channel that helps exact search, filters, and agent explanations.
Retrieval Plans
Do not route every visual query through the same index. Match the retrieval plan to the question.
Query: "Find similar product photos"
Plan:
1. Search crop embeddings from foreground masks.
2. Filter for foreground_ratio above a threshold.
3. Rerank by product metadata, color, or category if available.
4. Return crop and parent image.
Why: the product matters more than the studio background.
Query: "Find screenshots with a payment error"
Plan:
1. Search OCR text for "payment", "declined", "failed", and related terms.
2. Search region captions for error panels and toast messages.
3. Join OCR regions to nearby UI-panel masks.
4. Return the panel crop and full screenshot.
Why: exact text and spatial layout matter more than broad visual similarity.
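The join between OCR regions and UI-panel regions in this plan can be sketched as a containment test. The example below assumes normalized `x, y, w, h` boxes and treats an OCR line as belonging to a panel when the line's center falls inside the panel box. The field names and the matching rule are assumptions for illustration, not a prescribed method.

```python
from typing import Dict, List, Sequence, Tuple

Box = Dict[str, float]


def center(box: Box) -> Tuple[float, float]:
    """Center point of an x, y, w, h box."""
    return (box["x"] + box["w"] / 2, box["y"] + box["h"] / 2)


def contains(panel: Box, point: Tuple[float, float]) -> bool:
    """True when the point lies inside the panel box."""
    px, py = point
    return (panel["x"] <= px <= panel["x"] + panel["w"]
            and panel["y"] <= py <= panel["y"] + panel["h"])


def panels_with_terms(ocr_lines: List[dict], panels: List[dict],
                      terms: Sequence[str]) -> List[dict]:
    """Return panels whose contained OCR text mentions any query term."""
    hits = []
    for panel in panels:
        texts = [line["text"] for line in ocr_lines
                 if contains(panel["box"], center(line["box"]))]
        joined = " ".join(texts).lower()
        matched = [t for t in terms if t in joined]
        if matched:
            hits.append({"panel_id": panel["id"],
                         "matched_terms": matched,
                         "texts": texts})
    return hits
```

Returning the panel ID alongside the matched terms lets the agent request the panel crop and the full screenshot as separate pieces of evidence.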
Query: "Find frames where the package is visible during the spoken CTA"
Plan:
1. Search transcript spans for CTA language.
2. Search foreground crops for package-like regions.
3. Join regions and transcript spans by overlapping timestamps.
4. Rerank clips where the crop occupies enough area and persists across frames.
Why: the query is temporal and cross-modal. A visual match alone is insufficient.
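The timestamp join in this plan can be illustrated with a small function. This is a sketch under assumptions: transcript spans carry hypothetical `start_ms` and `end_ms` fields, region hits carry `frame_time_ms` as in the region record, and the number of matching frames stands in as a crude persistence signal.

```python
from typing import List


def join_regions_to_spans(spans: List[dict], regions: List[dict],
                          window_ms: int = 0) -> List[dict]:
    """Pair each transcript span with region hits inside its time range.

    window_ms widens each span on both sides to tolerate small
    misalignment between audio and frame timestamps.
    """
    matches = []
    for span in spans:
        lo = span["start_ms"] - window_ms
        hi = span["end_ms"] + window_ms
        hits = [r for r in regions if lo <= r["frame_time_ms"] <= hi]
        if hits:
            matches.append({"span": span, "regions": hits})
    # Spans supported by more frames rank first.
    matches.sort(key=lambda m: len(m["regions"]), reverse=True)
    return matches
```

A stricter reranker could additionally require a minimum `foreground_ratio` on each hit before counting it toward persistence.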
Query: "Show ads where the background distracts from the product"
Plan:
1. Compute foreground ratio and background salience.
2. Search whole-image embeddings for busy scenes.
3. Compare whole-image similarity against crop similarity.
4. Flag images where the crop matches the product query but the whole image retrieves unrelated background concepts.
Why: this is a quality-control query about foreground/background separation.
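The comparison of whole-image similarity against crop similarity can be approximated with two cosine scores. The sketch below is a simplification: it flags an image when the crop is close to the query embedding but the whole image is much further away. The function name and both thresholds are hypothetical and would need tuning on real data.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def background_dominates(query_vec: Sequence[float],
                         crop_vec: Sequence[float],
                         whole_vec: Sequence[float],
                         min_crop_sim: float = 0.6,
                         min_gap: float = 0.2) -> bool:
    """Flag images whose crop matches the query but whose whole frame does not."""
    crop_sim = cosine(query_vec, crop_vec)
    whole_sim = cosine(query_vec, whole_vec)
    return crop_sim >= min_crop_sim and (crop_sim - whole_sim) >= min_gap
```

Both scores come from the same embedding model here, so subtracting them is reasonable; scores from different models should not be compared this way.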
Scoring and Fusion
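Mask-aware retrieval creates several scores, such as embedding similarity, OCR lexical scores, detector or mask confidence, and temporal overlap.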
Do not treat these scores as directly comparable. A cosine score from an image embedding is not the same thing as OCR BM25, detector confidence, or temporal overlap.
Use a fusion strategy:
| Strategy | Use when |
| Hard filters | The condition must be true, such as foreground_ratio >= 0.2 |
| Reciprocal rank fusion | Several retrieval channels produce ranked lists |
| Weighted fusion | Query classes are known and weights can be validated |
| Reranking | A smaller candidate set needs visual or multimodal inspection |
| Rules plus retrieval | Policy requires exact constraints before semantic search |
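Reciprocal rank fusion is a common choice because it uses only rank positions, so the incomparable raw scores never need to be normalized. A minimal sketch follows; the channel names are illustrative, and `k=60` is the commonly used default constant rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(ranked_lists: Dict[str, List[str]],
                           k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked lists of region IDs into one ranking.

    Each channel contributes 1 / (k + rank) for every region it returns,
    with rank starting at 1. Regions missing from a channel get nothing.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists.values():
        for rank, region_id in enumerate(ranking, start=1):
            scores[region_id] += 1.0 / (k + rank)
    # Sort by fused score, then by ID so ties are deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fused = reciprocal_rank_fusion({
    "crop_embedding": ["r1", "r2", "r3"],
    "region_caption": ["r3", "r1", "r4"],
    "ocr": ["r1"],
})
```

Hard filters such as a minimum `foreground_ratio` are best applied before fusion, so that regions failing a required condition never enter the ranked lists.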
Failure Modes
Background Leakage
The crop includes too much background, so retrieval still matches on scene instead of object. Tighten the mask, pad the box less, or use a masked image instead of a loose crop.
Overcropping
The crop removes context needed for meaning. A product in a hand, a warning label on a machine, or an icon inside a UI panel may need surrounding pixels. Return parent context with every crop.
Duplicate Regions
Segmenters and detectors may produce overlapping regions. Use non-maximum suppression, mask IoU, or embedding deduplication to avoid indexing the same object many times.
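A box-level version of this deduplication can be sketched with greedy non-maximum suppression. The example assumes each region carries a `confidence` value and an `x, y, w, h` box as in the region record; the 0.7 threshold is an arbitrary starting point. Mask IoU would follow the same pattern using pixel overlap instead of box overlap.

```python
from typing import Dict, List

Box = Dict[str, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two x, y, w, h boxes."""
    ix = max(0.0, min(a["x"] + a["w"], b["x"] + b["w"]) - max(a["x"], b["x"]))
    iy = max(0.0, min(a["y"] + a["h"], b["y"] + b["h"]) - max(a["y"], b["y"]))
    inter = ix * iy
    union = a["w"] * a["h"] + b["w"] * b["h"] - inter
    return inter / union if union > 0 else 0.0


def suppress_duplicates(regions: List[dict],
                        iou_threshold: float = 0.7) -> List[dict]:
    """Keep the highest-confidence region among heavily overlapping ones."""
    kept: List[dict] = []
    for region in sorted(regions, key=lambda r: r["confidence"], reverse=True):
        if all(box_iou(region["box"], k["box"]) < iou_threshold for k in kept):
            kept.append(region)
    return kept
```

Running this per source frame before embedding avoids paying for embeddings of regions that will be discarded.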
Small Object Blindness
Small objects may be missed when only whole keyframes are embedded. Increase frame resolution, use detector-first proposals, or index tiled regions.
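Tiled indexing can be sketched as a grid of overlapping windows over the frame. The tile size and overlap below are placeholder values; the function name is hypothetical, and each tile would be stored as its own region record with a box pointing back to the parent frame.

```python
from typing import List, Tuple


def tile_boxes(width: int, height: int, tile: int = 512,
               overlap: int = 64) -> List[Tuple[int, int, int, int]]:
    """Return (x, y, w, h) pixel boxes that cover the whole frame.

    Neighboring tiles share at least `overlap` pixels so an object on a
    tile boundary is fully visible in at least one tile when it is
    smaller than the overlap.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    step = tile - overlap

    def starts(size: int) -> List[int]:
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, step))
        positions.append(size - tile)  # final tile sits flush with the edge
        return positions

    return [(x, y, min(tile, width), min(tile, height))
            for y in starts(height) for x in starts(width)]
```

Tiling multiplies the number of indexed regions, so it is usually worth restricting to sources where small objects are known to matter.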
Mask Confidence Drift
A mask model trained on clean product photos may perform poorly on CCTV, medical images, or low-light video. Track confidence distributions by source type and evaluate region quality on real data.
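Tracking confidence by source type can start with a simple grouped summary. The sketch below assumes each region record carries a hypothetical `source_type` field alongside `confidence`; the 0.5 cutoff for "low" confidence is an arbitrary example.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List


def confidence_by_source(records: List[dict],
                         low_cutoff: float = 0.5) -> Dict[str, dict]:
    """Summarize mask confidence per source type."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for record in records:
        grouped[record["source_type"]].append(record["confidence"])
    summary = {}
    for source_type, values in grouped.items():
        low = sum(1 for v in values if v < low_cutoff)
        summary[source_type] = {
            "count": len(values),
            "median": median(values),
            "low_share": low / len(values),  # fraction below the cutoff
        }
    return summary
```

A source type whose median drops or whose low-confidence share climbs over time is a candidate for manual review of region quality.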
Region Without Lineage
If a crop loses its source URI, timestamp, model version, or box, it cannot support an agent answer. Treat missing lineage as an ingestion bug.
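Treating missing lineage as an ingestion bug can be enforced with a check before indexing. This is a sketch only: which fields count as required is an assumption based on the region record shown earlier, and the function names are hypothetical.

```python
from typing import List, Tuple

# Assumed minimum lineage; adjust to the fields your records actually carry.
REQUIRED_LINEAGE = ("source_uri", "box", "model_id", "extractor_version")


def missing_lineage(record: dict, is_video: bool = False) -> List[str]:
    """List the lineage fields that are absent or empty in a record."""
    required = REQUIRED_LINEAGE + (("frame_time_ms",) if is_video else ())
    return [name for name in required if record.get(name) in (None, "", {})]


def split_for_ingestion(records: List[dict],
                        is_video: bool = False) -> Tuple[List[dict], List[dict]]:
    """Separate indexable records from ones that must be fixed upstream."""
    accepted, rejected = [], []
    for record in records:
        gaps = missing_lineage(record, is_video)
        if gaps:
            rejected.append({"record": record, "missing": gaps})
        else:
            accepted.append(record)
    return accepted, rejected
```

Rejecting at ingestion keeps orphaned crops out of the index entirely, rather than relying on the agent to notice missing fields at query time.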
Evaluation
Evaluate mask-aware retrieval at the region level, not only the image level.
Useful metrics:
The false-region rate is especially important. If the agent gets the right image but cites the wrong region, the final answer can be confidently wrong.
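Region-level evaluation can be sketched by comparing each top result against a labeled ground-truth box. The definitions below are assumptions for illustration: a result counts as a region hit when it names the right image and its box overlaps the labeled box with IoU at or above a threshold, and the false-region rate is the share of right-image results that fail that overlap test.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two (x, y, w, h) boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def region_metrics(results: Dict[str, dict], truths: Dict[str, dict],
                   iou_threshold: float = 0.5) -> Dict[str, float]:
    """Compare the top result per query against its labeled region."""
    right_image = right_region = 0
    for query_id, truth in truths.items():
        top = results.get(query_id)
        if top is None or top["image_id"] != truth["image_id"]:
            continue
        right_image += 1
        if iou(top["box"], truth["box"]) >= iou_threshold:
            right_region += 1
    total = len(truths)
    return {
        "image_hit_rate": right_image / total if total else 0.0,
        "region_hit_rate": right_region / total if total else 0.0,
        "false_region_rate": ((right_image - right_region) / right_image
                              if right_image else 0.0),
    }
```

Reporting the image hit rate next to the region hit rate makes the gap visible: a system can look strong at the image level while still handing the agent the wrong evidence.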
Agent Tool Design
Expose region search as a bounded tool. A good tool returns evidence handles, not raw pixels in the main context.
{
  "tool": "search_visual_regions",
  "input": {
    "query": "red handheld scanner",
    "collection": "retail_media",
    "region_types": ["foreground_mask", "object_box"],
    "min_foreground_ratio": 0.15,
    "limit": 10
  },
  "output": {
    "regions": [
      {
        "region_id": "img_924:frame_00018:mask_003:birefnet:v1",
        "source_uri": "s3://brand-assets/launch/ad_17.mp4",
        "timestamp_ms": 6000,
        "crop_uri": "s3://derived/ad_17/frame_00018/crop_003.webp",
        "box": {"x": 0.18, "y": 0.22, "w": 0.41, "h": 0.52},
        "matched_channels": ["crop_embedding", "region_caption"],
        "scores": {"crop_rank": 1, "caption_rank": 4}
      }
    ]
  }
}
The agent can then call a follow-up tool on a returned region_id, for example to fetch the crop or its parent frame for closer inspection.
This keeps visual reasoning iterative without flooding the context window with pixels.
How This Maps to Mixpeek and MVS
Mixpeek's managed system can extract segmentation, object, OCR, caption, and embedding features from the media already in object storage. MVS can store and search the vector layer for teams bringing their own crops or embeddings.
Managed ingestion example:
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

mx.collections.ingest(
    collection_id="visual-evidence",
    source={"url": "s3://brand-assets/launch/"},
    feature_extractors=[
        {
            "name": "segmentation",
            "version": "v1",
            "params": {
                "model_id": "ZhengPeng7/BiRefNet",
                "output_masks": True,
                "store_crops": True
            }
        },
        {
            "name": "visual_embeddings",
            "version": "v1",
            "params": {
                "model_id": "nomic-ai/nomic-embed-vision-v1.5",
                "embed_views": ["whole_image", "foreground_crop", "masked_image"]
            }
        },
        {
            "name": "ocr",
            "version": "v1"
        }
    ]
)
Retriever example:
results = mx.retrievers.retrieve(
    retriever_id="visual-region-agent",
    query="frames where the package is visible during the spoken call to action",
    pipeline=[
        {
            "stage_type": "search",
            "stage_id": "cta_transcript",
            "feature": "transcription",
            "limit": 100
        },
        {
            "stage_type": "search",
            "stage_id": "package_crops",
            "feature": "visual_embeddings",
            "view": "foreground_crop",
            "limit": 100
        },
        {
            "stage_type": "join",
            "stage_id": "same_time_window",
            "on": "timestamp_overlap",
            "window_seconds": 2
        },
        {
            "stage_type": "fusion",
            "stage_id": "ranked_evidence",
            "method": "reciprocal_rank_fusion",
            "limit": 20
        }
    ]
)
MVS standalone example for bring-your-own crops:
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

# crop_embedding is assumed to be a list of floats computed elsewhere
# for this crop with the embedding model named in the metadata.
mx.mvs.upsert(
    namespace="visual-region-memory",
    vectors=[
        {
            "id": "img_924:frame_00018:mask_003:nomic_v15",
            "values": crop_embedding,
            "metadata": {
                "source_uri": "s3://brand-assets/launch/ad_17.mp4",
                "crop_uri": "s3://derived/ad_17/frame_00018/crop_003.webp",
                "mask_uri": "s3://derived/ad_17/frame_00018/mask_003.rle",
                "frame_time_ms": 6000,
                "model_id": "nomic-ai/nomic-embed-vision-v1.5",
                "region_type": "foreground_mask",
                "foreground_ratio": 0.31
            }
        }
    ]
)
The same design principle applies either way: preserve region lineage, search the right visual unit, and return evidence the agent can inspect.
Design Checklist
Key Takeaways
1. Whole-image embeddings are useful, but they are often too coarse for agent evidence.
2. Segmentation masks change the search unit from file to region.
3. Store masks, boxes, crops, foreground ratios, model versions, and source lineage together.
4. Embed whole images, foreground crops, and masked images when each view answers different queries.
5. Retrieval should be planned by query type. OCR, object filters, crop search, whole-image search, and reranking solve different parts of visual reasoning.
6. Agents need region evidence they can inspect and cite, not only a ranked list of image files.