What an Ad Analysis Agent Needs to Perceive
An ad is not one modality. A 30-second creative can contain a hook, product shots, scene transitions, spoken claims, on-screen text, logos, faces, music, pacing, and a final call to action. If an AI agent only receives a transcript or a single video embedding, it misses most of the evidence that a strategist, compliance reviewer, or media buyer would use.
The agent's job is not simply to summarize the ad. It needs to answer grounded questions:
Those questions require a perception pipeline. The pipeline decomposes the creative into searchable signals, indexes those signals, and returns evidence with timestamps when an agent asks for context.
The Core Architecture
Use this mental model:
1. Segment the creative. Split video into shots, scenes, audio spans, OCR spans, and business moments such as hook, demo, proof, offer, and CTA.
2. Extract signals. Run different models for visual semantics, motion, speech, text, objects, faces, logos, music, and layout.
3. Store evidence records. Keep every extracted feature tied to asset id, timestamp, model version, confidence, and source URI.
4. Search in stages. Combine dense search, sparse search, multi-vector retrieval, filters, and rerankers.
5. Return grounded context. Give the agent a compact evidence bundle, not the raw video.
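Step 3 is easier to reason about with a concrete record shape. The sketch below is one possible layout, not a prescribed schema: the class name, the field names, the `signal` labels, and the `ocr-v1` version string are all assumptions made for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvidenceRecord:
    """One extracted feature, tied back to where and how it was produced."""

    asset_id: str        # which creative the evidence came from
    signal: str          # hypothetical labels such as "ocr", "speech", "visual"
    start: float         # span start, in seconds from the beginning of the ad
    end: float           # span end, in seconds
    value: str           # extracted text or a short description of the feature
    model_version: str   # extractor that produced the record
    confidence: float    # extractor confidence in [0, 1]
    source_uri: str      # pointer to the keyframe, audio clip, or raw asset

    def __post_init__(self) -> None:
        # Reject records that could not be cited back to a valid time span.
        if self.end < self.start:
            raise ValueError("end must not precede start")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


record = EvidenceRecord(
    asset_id="ad_4831",
    signal="ocr",
    start=0.8,
    end=4.6,
    value="20% OFF",
    model_version="ocr-v1",  # made-up version string
    confidence=0.94,
    source_uri="s3://creative-library/ad_4831/keyframes/0001.jpg",
)
print(asdict(record))
```

Keeping the model version on every record means an index can be partially rebuilt when one extractor changes, without guessing which rows are stale.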
The important design choice: do not force all ad meaning into one embedding. Ads are compositional. The retrieval layer should preserve that composition.
Signal Families
| Signal | Best model family | What it captures | Example agent query |
| --- | --- | --- | --- |
| Visual semantics | CLIP, SigLIP, omnimodal embeddings | Products, scenes, style, broad concepts | "Find ads with a kitchen product demo" |
| Motion and physical dynamics | JEPA-style video encoders, VideoPrism, action models | Movement, action, pacing, cause and effect | "Find ads where the user opens the package before the product reveal" |
| Speech | ASR models, diarization | Voiceover, claims, speakers, timing | "Where does the ad mention free returns?" |
| On-screen text | OCR, visual document models | Captions, prices, disclaimers, promo codes | "Find ads with a visible limited-time offer" |
| Objects and logos | YOLO, Grounding DINO, logo detectors | Products, brand marks, restricted objects | "Does a competitor logo appear in the background?" |
| Face and identity | Face detection, recognition, attributes | Talent presence, spokespeople, likeness risk | "Which ads include this creator?" |
| Audio events | CLAP-style embeddings, music classifiers | Mood, music, effects, silence | "Find upbeat ads with applause or crowd noise" |
| Performance metadata | BI tables, campaign systems | CTR, conversions, spend, placement | "Compare hooks from ads above 3% CTR" |
Why JEPA-Style Video Features Matter
Contrastive image-text models are strong for visual concepts: product, room, color, style, and scene. They are weaker when the meaning depends on time. For example:
JEPA-style video encoders learn by predicting latent representations of missing or future parts of a video rather than reconstructing pixels. That makes them useful for motion, temporal continuity, and physical dynamics. For ad analysis, they are a good fit for questions about pacing, action, transitions, and event order.
The practical pattern is not "replace CLIP with JEPA." The pattern is:
1. Use CLIP or SigLIP-style embeddings for broad visual retrieval.
2. Use JEPA or VideoPrism-style features for temporal and action-sensitive retrieval.
3. Use ASR and OCR for exact claims and visible text.
4. Fuse the evidence at query time.
The agent gets better because it can ask the right index instead of hoping one model captured every meaning.
Why Multi-Vector Retrieval Matters
A single dense vector compresses an entire scene into one point. That is efficient, but it loses token-level and patch-level detail. A query like "blue bottle next to a handwritten discount code while the narrator says subscribe" contains several conditions. A single vector may match the general scene but miss one requirement.
Multi-vector models keep many vectors per item:
Late interaction scoring then compares query vectors to item vectors and keeps the best matches. A simplified version:
score(query, item) =
    sum over query_parts(
        max similarity(query_part, item_parts)
    )
This preserves fine-grained matches. The tradeoff is cost. Multi-vector indexes store far more vectors and scoring is more expensive.
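The scoring rule above can be written in a few lines of NumPy. This is a minimal sketch that assumes cosine similarity as the similarity function and uses toy 2-dimensional vectors. The function names are illustrative, not taken from any library.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def maxsim_score(query_vecs: np.ndarray, item_vecs: np.ndarray) -> float:
    """For each query vector, take its best-matching item vector, then sum."""
    q = l2_normalize(query_vecs)
    d = l2_normalize(item_vecs)
    sims = q @ d.T  # shape: (num_query_parts, num_item_parts)
    return float(sims.max(axis=1).sum())


# Toy query with two conditions, one per axis.
query = np.array([[1.0, 0.0], [0.0, 1.0]])

# item_a has a part matching each condition; item_b only matches the first well.
item_a = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
item_b = np.array([[1.0, 0.0], [1.0, 0.1]])

score_a = maxsim_score(query, item_a)
score_b = maxsim_score(query, item_b)
print(score_a, score_b)
```

Because each query part is matched independently, `item_b` cannot make up for the missing second condition by matching the first one twice.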
Algorithms such as MUVERA address this by creating fixed-dimensional encodings for multi-vector sets. Those encodings let a system retrieve candidate items with fast single-vector search, then rerank the candidates with exact multi-vector similarity. The design principle is useful even if you do not implement MUVERA directly:
1. Use a cheap approximation to get candidates.
2. Use precise late interaction only on the candidate set.
3. Return evidence spans showing which parts matched.
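The candidate-then-rerank principle can be sketched without MUVERA itself. The example below swaps in mean pooling as the cheap approximation, which is a much cruder stand-in than MUVERA's encodings and is used here only to show the two-stage shape. All names and the toy vectors are assumptions.

```python
import numpy as np


def _unit(x: np.ndarray) -> np.ndarray:
    """Normalize along the last axis; works for a single vector or a matrix."""
    return x / np.clip(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12, None)


def maxsim(query_vecs: np.ndarray, item_vecs: np.ndarray) -> float:
    """Exact late interaction: best item match per query part, summed."""
    sims = _unit(query_vecs) @ _unit(item_vecs).T
    return float(sims.max(axis=1).sum())


def two_stage_search(query_vecs, items, n_candidates=2, top_k=2):
    """items maps item id -> array of shape (num_parts, dim)."""
    # Stage 1: one pooled vector per item, scored with a single dot product.
    q_pooled = _unit(query_vecs.mean(axis=0))
    coarse = {
        item_id: float(q_pooled @ _unit(vecs.mean(axis=0)))
        for item_id, vecs in items.items()
    }
    candidates = sorted(coarse, key=coarse.get, reverse=True)[:n_candidates]

    # Stage 2: exact multi-vector scoring, only on the surviving candidates.
    reranked = sorted(
        ((maxsim(query_vecs, items[i]), i) for i in candidates), reverse=True
    )
    return [(item_id, score) for score, item_id in reranked[:top_k]]


query = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
items = {
    "ad_a": np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),  # matches both parts
    "ad_b": np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]),  # matches one part
    "ad_c": np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]),  # matches neither
}
results = two_stage_search(query, items)
print(results)
```

In a real index, stage 1 would be an approximate nearest-neighbor lookup rather than a loop over every item. The point of the sketch is that the expensive scoring never touches `ad_c`.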
For creative analysis, this is the difference between "this ad is generally similar" and "the hook, product shot, and CTA each match the brief."
A Retrieval Plan for Ad Questions
Different questions should trigger different retrieval plans.
Question: "Find ads with a strong opening hook"
Use:
Return:
Question: "Is this ad compliant?"
Use:
Return:
Question: "Find creatives similar to this winning ad"
Use:
Return:
Fusion: How to Combine Signals
Signal fusion is where most ad search systems become brittle. The problem is that scores from different models are not comparable. A CLIP cosine score of 0.31, a BM25 score of 17, and an OCR confidence of 0.94 do not live on the same scale.
Common fusion methods:
1. Reciprocal rank fusion. Merge ranked lists by rank position instead of raw score. This is robust when score scales differ.
2. Weighted rank fusion. Give more weight to signals that matter for the query. For compliance, OCR and ASR outrank style embeddings. For mood matching, visual and audio embeddings outrank OCR.
3. Learned reranking. Train a model that sees query, candidate evidence, and business metadata, then predicts final relevance.
4. Rule gates. Require a hard condition before ranking. Example: "must contain visible CTA" or "must be first 5 seconds."
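Methods 1, 2, and 4 fit in one small function. The sketch below assumes each signal returns a list of item ids ordered best first, and uses the conventional smoothing constant `k = 60` for reciprocal rank fusion. The signal names, weights, and ad ids are invented for the example.

```python
from collections import defaultdict


def rank_fusion(ranked_lists, weights=None, k=60, allowed=None):
    """Fuse ranked lists by rank position.

    ranked_lists: dict of signal name -> item ids, best first.
    weights: optional dict of signal name -> weight (defaults to 1.0 each).
    allowed: optional set of item ids that passed a hard rule gate.
    """
    weights = weights or {}
    fused = defaultdict(float)
    for signal, items in ranked_lists.items():
        # Rule gate: drop items that fail the hard condition before ranking.
        if allowed is not None:
            items = [i for i in items if i in allowed]
        w = weights.get(signal, 1.0)
        for rank, item_id in enumerate(items, start=1):
            fused[item_id] += w / (k + rank)
    # Highest fused score first; item id breaks exact ties deterministically.
    return sorted(fused, key=lambda i: (-fused[i], i))


lists = {
    "visual": ["ad_1", "ad_2", "ad_3"],
    "ocr": ["ad_3", "ad_2"],
    "speech": ["ad_2", "ad_1"],
}

plain = rank_fusion(lists)
compliance = rank_fusion(lists, weights={"ocr": 3.0})
gated = rank_fusion(lists, allowed={"ad_1", "ad_3"})
print(plain, compliance, gated)
```

Raw scores never enter the computation, so the incompatible scales described above stop mattering. Raising the OCR weight moves `ad_3` ahead of `ad_1`, which is the kind of shift a compliance query would want.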
In agent workflows, prefer explicit fusion plans over a hidden global relevance score. The agent should know why a result matched.
Evidence Bundles
An agent should not receive 200 raw search hits. It should receive a small evidence bundle:
{
  "asset_id": "ad_4831",
  "match_reason": "Opening hook uses product demo plus visible discount",
  "segments": [
    {
      "start": 0.8,
      "end": 4.6,
      "signals": {
        "visual": "hands open package on kitchen counter",
        "ocr": "20% OFF",
        "speech": "Meet the fastest way to meal prep"
      },
      "confidence": 0.91
    }
  ],
  "evidence_uris": [
    "s3://creative-library/ad_4831/keyframes/0001.jpg",
    "s3://creative-library/ad_4831/audio/0000-0005.wav"
  ]
}
This format lets the agent reason, cite, and ask for more detail only when needed.
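One way to get from raw hits to bundles of this shape is to group by asset and keep only the most confident segments. The sketch below assumes a flat hit format (`asset_id`, `start`, `end`, `signal`, `text`, `confidence`, `uri`) that is invented for the example. It leaves out `match_reason`, which would come from whatever retrieval plan produced the hits.

```python
from collections import defaultdict


def build_bundles(hits, max_assets=3, max_segments=2):
    """Collapse raw hits into a few per-asset evidence bundles."""
    by_asset = defaultdict(list)
    for hit in hits:
        by_asset[hit["asset_id"]].append(hit)

    bundles = []
    for asset_id, asset_hits in by_asset.items():
        # Keep the most confident segments, then present them in time order.
        best = sorted(asset_hits, key=lambda h: h["confidence"], reverse=True)
        best = sorted(best[:max_segments], key=lambda h: h["start"])
        bundles.append({
            "asset_id": asset_id,
            "segments": [
                {
                    "start": h["start"],
                    "end": h["end"],
                    "signals": {h["signal"]: h["text"]},
                    "confidence": h["confidence"],
                }
                for h in best
            ],
            "evidence_uris": sorted({h["uri"] for h in best}),
        })

    # Rank assets by their strongest segment and cap the bundle count.
    bundles.sort(
        key=lambda b: max(s["confidence"] for s in b["segments"]), reverse=True
    )
    return bundles[:max_assets]


# The last two hits use made-up values purely to exercise the trimming logic.
hits = [
    {"asset_id": "ad_4831", "start": 0.8, "end": 4.6, "signal": "ocr",
     "text": "20% OFF", "confidence": 0.94,
     "uri": "s3://creative-library/ad_4831/keyframes/0001.jpg"},
    {"asset_id": "ad_4831", "start": 0.0, "end": 5.0, "signal": "speech",
     "text": "Meet the fastest way to meal prep", "confidence": 0.91,
     "uri": "s3://creative-library/ad_4831/audio/0000-0005.wav"},
    {"asset_id": "ad_4831", "start": 20.0, "end": 22.0, "signal": "ocr",
     "text": "Shop now", "confidence": 0.40,
     "uri": "s3://creative-library/ad_4831/keyframes/0042.jpg"},
    {"asset_id": "ad_1001", "start": 1.0, "end": 3.0, "signal": "ocr",
     "text": "Free returns", "confidence": 0.55,
     "uri": "s3://creative-library/ad_1001/keyframes/0001.jpg"},
]
bundles = build_bundles(hits, max_assets=2, max_segments=2)
print(bundles[0]["asset_id"], len(bundles[0]["segments"]))
```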
Evaluation
Do not evaluate this pipeline with generic semantic search metrics alone. Use task-specific evals.
| Eval | What it measures | Good target |
| --- | --- | --- |
| Hook recall@10 | Whether known strong hooks appear in top results | High recall for first 5 seconds |
| Timestamp IoU | Whether retrieved spans overlap human-labeled spans | High overlap for moment search |
| Claim detection precision | Whether flagged claims are real | Low false positives for compliance |
| Cross-modal completeness | Whether visual, speech, and text evidence are all present | No missing required signal |
| Agent answer faithfulness | Whether the agent only uses retrieved evidence | No unsupported claims |
| Review time saved | Whether humans reach decisions faster | Lower median review time |
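Three of these evals reduce to short, standard formulas. The sketch below shows temporal IoU, recall@k, and precision under the assumption that spans are `(start, end)` tuples in seconds and that items and claims are identified by string ids. The sample values are invented.

```python
def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of known-relevant items that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def precision(flagged_ids, true_ids):
    """Fraction of flagged items that are actually correct."""
    flagged = set(flagged_ids)
    if not flagged:
        return 0.0
    return len(flagged & set(true_ids)) / len(flagged)


# Retrieved span vs. human-labeled span: 2 s overlap over a 6 s union.
iou = temporal_iou((0.0, 4.0), (2.0, 6.0))

# Only "b" of the three labeled hooks shows up in the top 3.
hook_recall = recall_at_k(["a", "b", "c", "d"], {"b", "d", "z"}, k=3)

# Three of the four flagged claims are real.
claim_precision = precision({"c1", "c2", "c3", "c4"}, {"c1", "c2", "c3", "c9"})
print(iou, hook_recall, claim_precision)
```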
Mixpeek Implementation Pattern
Mixpeek handles this pattern as an indexing and retrieval system over objects. You connect the creative library, run extractors, and expose a retriever the agent can call.
pip install mixpeek

from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

# Index creatives with multiple perception signals.
collection = mx.collections.create(
    collection_name="ad_creatives",
    source={"type": "bucket", "uri": "s3://brand-creative-library"},
    feature_extractors=[
        {"feature": "video_embedding", "model": "facebook/vjepa2-vitg-fpc64-256"},
        {"feature": "visual_embedding", "model": "google/siglip2-giant-opt-patch16-384"},
        {"feature": "transcription", "model": "CohereLabs/cohere-transcribe-03-2026"},
        {"feature": "ocr", "model": "PaddlePaddle/paddleocr"},
        {"feature": "object_detection", "model": "IDEA-Research/grounding-dino-base"},
        {"feature": "logo_detection"},
    ],
)

# Build a retriever for agent questions.
retriever = mx.retrievers.create(
    collection_id=collection.id,
    stages=[
        {"type": "attribute_filter", "where": {"duration_seconds": {"lte": 45}}},
        {"type": "feature_search", "feature": "visual_embedding", "top_k": 100},
        {"type": "feature_search", "feature": "transcription", "top_k": 100},
        {"type": "feature_search", "feature": "ocr", "top_k": 50},
        {"type": "rank_fusion", "method": "rrf"},
        {"type": "rerank", "model": "cross_encoder", "top_k": 10},
    ],
)

results = mx.retrievers.execute(
    retriever_id=retriever.id,
    query="opening hook shows a product demo with a visible discount code",
    return_fields=["asset_id", "timestamps", "transcript", "ocr", "keyframe_url"],
)
For standalone vector storage, the same extracted features can be stored in MVS, Mixpeek's vector store on object storage. Use MVS when you already have embeddings and want dense, sparse, and hybrid search with object-storage economics. Use managed Mixpeek indexing when you want the system to extract faces, scenes, transcripts, OCR, logos, and other features from the original objects.