Creative Ad Analysis for AI Agents: JEPA, Multi-Vector Retrieval, and Signal Fusion

What an Ad Analysis Agent Needs to Perceive

An ad is not one modality. A 30-second creative can contain a hook, product shots, scene transitions, spoken claims, on-screen text, logos, faces, music, pacing, and a final call to action. If an AI agent only receives a transcript or a single video embedding, it misses most of the evidence that a strategist, compliance reviewer, or media buyer would use.

The agent's job is not simply to summarize the ad. It needs to answer grounded questions:

What happens in the first three seconds?

Which product attributes are shown, not just mentioned?

Is the call to action visible, spoken, or both?

Which scenes look similar to high-performing ads from last quarter?

Does the ad contain a risky claim, competitor logo, restricted actor, or missing disclaimer?

Which exact timestamp should a human review?

Those questions require a perception pipeline. The pipeline decomposes the creative into searchable signals, indexes those signals, and returns evidence with timestamps when an agent asks for context.

The Core Architecture

Use this mental model:

1. Segment the creative. Split video into shots, scenes, audio spans, OCR spans, and business moments such as hook, demo, proof, offer, and CTA.

2. Extract signals. Run different models for visual semantics, motion, speech, text, objects, faces, logos, music, and layout.

3. Store evidence records. Keep every extracted feature tied to asset id, timestamp, model version, confidence, and source URI.

4. Search in stages. Combine dense search, sparse search, multi-vector retrieval, filters, and rerankers.

5. Return grounded context. Give the agent a compact evidence bundle, not the raw video.

The important design choice: do not force all ad meaning into one embedding. Ads are compositional. The retrieval layer should preserve that composition.
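
As a rough illustration of step 3, an evidence record could be modeled as a small immutable type. This is a minimal sketch, not a Mixpeek schema: the class name `EvidenceRecord`, its field names, the helper `records_in_window`, and the model version strings are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceRecord:
    """One extracted feature, tied back to where and how it was produced."""
    asset_id: str
    signal: str         # e.g. "ocr", "speech", "visual" (hypothetical labels)
    start: float        # span start, in seconds
    end: float          # span end, in seconds
    value: str          # extracted text or label
    confidence: float
    model_version: str
    source_uri: str

    def overlaps(self, start: float, end: float) -> bool:
        # True when the record's span intersects the half-open window [start, end).
        return self.start < end and start < self.end


def records_in_window(records, start, end):
    """Return every record whose span intersects the given time window."""
    return [r for r in records if r.overlaps(start, end)]


records = [
    EvidenceRecord("ad_4831", "ocr", 1.2, 3.0, "20% OFF", 0.94,
                   "ocr-v1", "s3://creative-library/ad_4831/keyframes/0001.jpg"),
    EvidenceRecord("ad_4831", "speech", 0.8, 4.6, "Meet the fastest way to meal prep", 0.91,
                   "asr-v1", "s3://creative-library/ad_4831/audio/0000-0005.wav"),
    EvidenceRecord("ad_4831", "speech", 24.0, 27.5, "Order today", 0.88,
                   "asr-v1", "s3://creative-library/ad_4831/audio/0000-0005.wav"),
]

# Everything the pipeline knows about the first five seconds of the ad.
hook = records_in_window(records, 0.0, 5.0)
```

Because every record carries its own span, model version, and source URI, a question such as "what happens in the first three seconds?" becomes a window filter over records rather than a re-run of the models.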

Signal Families

Signal	Best model family	What it captures	Example agent query
Visual semantics	CLIP, SigLIP, omnimodal embeddings	Products, scenes, style, broad concepts	"Find ads with a kitchen product demo"
Motion and physical dynamics	JEPA-style video encoders, VideoPrism, action models	Movement, action, pacing, cause and effect	"Find ads where the user opens the package before the product reveal"
Speech	ASR models, diarization	Voiceover, claims, speakers, timing	"Where does the ad mention free returns?"
On-screen text	OCR, visual document models	Captions, prices, disclaimers, promo codes	"Find ads with a visible limited-time offer"
Objects and logos	YOLO, Grounding DINO, logo detectors	Products, brand marks, restricted objects	"Does a competitor logo appear in the background?"
Face and identity	Face detection, recognition, attributes	Talent presence, spokespeople, likeness risk	"Which ads include this creator?"
Audio events	CLAP-style embeddings, music classifiers	Mood, music, effects, silence	"Find upbeat ads with applause or crowd noise"
Performance metadata	BI tables, campaign systems	CTR, conversions, spend, placement	"Compare hooks from ads above 3% CTR"

Each signal family answers a different question. The agent should choose the signal based on the task, then fuse results when the question spans modalities.

Why JEPA-Style Video Features Matter

Contrastive image-text models are strong for visual concepts: product, room, color, style, and scene. They are weaker when the meaning depends on time. For example:

A person picks up a product, hesitates, then smiles.

A before-and-after transition shows the product effect.

A fast jump cut changes from problem to solution.

A demo shows the sequence of steps needed to use the product.

JEPA-style video encoders learn by predicting latent representations of missing or future parts of a video rather than reconstructing pixels. That makes them useful for motion, temporal continuity, and physical dynamics. For ad analysis, they are a good fit for questions about pacing, action, transitions, and event order.

The practical pattern is not "replace CLIP with JEPA." The pattern is:

1. Use CLIP or SigLIP-style embeddings for broad visual retrieval.

2. Use JEPA or VideoPrism-style features for temporal and action-sensitive retrieval.

3. Use ASR and OCR for exact claims and visible text.

4. Fuse the evidence at query time.

The agent gets better because it can ask the right index instead of hoping one model captured every meaning.

Why Multi-Vector Retrieval Matters

A single dense vector compresses an entire scene into one point. That is efficient, but it loses token-level and patch-level detail. A query like "blue bottle next to a handwritten discount code while the narrator says subscribe" contains several conditions. A single vector may match the general scene but miss one requirement.

Multi-vector models keep many vectors per item:

Text: one vector per token or phrase.

Documents: one vector per visual patch.

Video: one vector per frame, patch, object, or segment.

Audio: one vector per time span or acoustic event.

Late interaction scoring then compares query vectors to item vectors and keeps the best matches. A simplified version:

score(query, item) =
  sum over query_parts(
    max similarity(query_part, item_parts)
  )

This preserves fine-grained matches. The tradeoff is cost. Multi-vector indexes store far more vectors and scoring is more expensive.
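
The scoring rule above can be sketched in plain Python. This is an illustrative version that uses cosine similarity and toy two-dimensional vectors; the function names `cosine` and `maxsim_score` are assumptions for the example, and a production system would use a vectorized implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def maxsim_score(query_vecs, item_vecs):
    """Sum, over query parts, of the best similarity to any item part.

    Also returns (query_index, item_index, similarity) triples so the caller
    can show which item part satisfied each query part.
    """
    if not item_vecs:
        return 0.0, []
    total, matches = 0.0, []
    for qi, q in enumerate(query_vecs):
        sims = [cosine(q, v) for v in item_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches.append((qi, best, sims[best]))
        total += sims[best]
    return total, matches


# Toy query with two conditions, e.g. "blue bottle" and "discount code".
query = [[1.0, 0.0], [0.0, 1.0]]
item_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has a part matching each condition
item_b = [[1.0, 0.0], [1.0, 0.0]]              # matches only the first condition

score_a, matches_a = maxsim_score(query, item_a)
score_b, matches_b = maxsim_score(query, item_b)
```

Here `item_a` outscores `item_b` because each query part finds its own best match, and the returned triples are the raw material for evidence spans.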

Algorithms such as MUVERA address this by creating fixed-dimensional encodings for multi-vector sets. Those encodings let a system retrieve candidate items with fast single-vector search, then rerank the candidates with exact multi-vector similarity. The design principle is useful even if you do not implement MUVERA directly:

1. Use a cheap approximation to get candidates.

2. Use precise late interaction only on the candidate set.

3. Return evidence spans showing which parts matched.

For creative analysis, this is the difference between "this ad is generally similar" and "the hook, product shot, and CTA each match the brief."
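
A minimal sketch of the candidate-then-rerank pattern follows. To be clear about the substitution: the cheap stage here is plain mean pooling, chosen only because it fits in a few lines. It is not MUVERA's encoding, and it approximates late interaction much more loosely. The function names and the toy vectors are assumptions for the example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vecs):
    """Collapse a set of vectors into one vector (the cheap stand-in)."""
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]


def maxsim(query_vecs, item_vecs):
    """Exact late interaction: best item match per query part, summed."""
    return sum(max(dot(q, v) for v in item_vecs) for q in query_vecs)


def two_stage_search(query_vecs, items, n_candidates, top_k):
    # Stage 1: rank every item by one dot product against its pooled vector.
    q_pooled = mean_pool(query_vecs)
    coarse = sorted(items, key=lambda k: dot(q_pooled, mean_pool(items[k])), reverse=True)
    candidates = coarse[:n_candidates]
    # Stage 2: run the expensive scoring on the shortlist only.
    reranked = sorted(candidates, key=lambda k: maxsim(query_vecs, items[k]), reverse=True)
    return reranked[:top_k]


query = [[1.0, 0.0], [0.0, 1.0]]
items = {
    "ad_a": [[1.0, 0.0], [0.0, 1.0]],    # one part per query condition
    "ad_b": [[0.9, 0.9]],                # one blended part
    "ad_c": [[-1.0, 0.0], [0.0, -1.0]],  # unrelated
}

result = two_stage_search(query, items, n_candidates=2, top_k=2)
```

In this toy data the pooled stage ranks `ad_b` first, and the exact rerank promotes `ad_a` because it satisfies each query part separately. The candidate count is the knob that trades recall against scoring cost.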

A Retrieval Plan for Ad Questions

Different questions should trigger different retrieval plans.

Question: "Find ads with a strong opening hook"

Use:

Segment filter: first 0-5 seconds.

Visual embedding search: find surprising or high-action opening scenes.

ASR search: find spoken hooks, questions, or problem statements.

OCR search: find large headline text.

Performance filter: optionally restrict to ads above a CTR or conversion threshold.

Return:

Top hook segments.

Timestamp.

Transcript excerpt.

Keyframe.

Prior performance metadata.

Question: "Is this ad compliant?"

Use:

ASR and OCR exact search for claims, pricing, disclaimers, medical language, financial language, or regulated terms.

Logo and object detection for restricted brands or products.

Face recognition for talent usage restrictions.

Policy metadata filters by market, channel, and campaign.

Return:

Flagged evidence only.

The rule that triggered the flag.

Confidence and timestamp.

Link to the original frame or audio span.

Question: "Find creatives similar to this winning ad"

Use:

Whole-ad visual embeddings for broad similarity.

Scene-level embeddings for reusable moments.

JEPA-style video features for action and pacing similarity.

Audio embeddings for music and energy.

Metadata filters for vertical, format, placement, and aspect ratio.

Reranking by performance lift or campaign objective.

Return:

Similar ads grouped by which signal matched.

Specific matching scenes.

Performance deltas.

Reuse suggestions for new briefs.
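
One way to make these plans explicit is to keep them as data the agent can inspect. The sketch below is illustrative only: the plan table, the signal and field names, and the keyword router `choose_plan` are hypothetical, and in practice the agent or a trained classifier would pick the plan rather than a keyword match.

```python
# Hypothetical plan table mirroring the three example questions above.
PLANS = {
    "hook": {
        "window": (0.0, 5.0),  # restrict to the opening seconds
        "signals": ["visual", "speech", "ocr"],
        "returns": ["timestamp", "transcript_excerpt", "keyframe", "performance"],
    },
    "compliance": {
        "window": None,        # scan the whole creative
        "signals": ["speech", "ocr", "logo", "face"],
        "returns": ["rule", "confidence", "timestamp", "evidence_uri"],
    },
    "similarity": {
        "window": None,
        "signals": ["visual", "scene", "video_dynamics", "audio"],
        "returns": ["matching_scenes", "matched_signal", "performance_delta"],
    },
}

# Crude keyword routing, used here only to keep the example self-contained.
KEYWORDS = {
    "hook": ("hook", "opening", "first three seconds"),
    "compliance": ("compliant", "compliance", "disclaimer", "claim"),
    "similarity": ("similar", "like this"),
}


def choose_plan(question: str) -> str:
    """Return the name of the first plan whose keywords appear in the question."""
    text = question.lower()
    for intent, words in KEYWORDS.items():
        if any(word in text for word in words):
            return intent
    return "similarity"  # fall back to broad similarity search


plan_name = choose_plan("Find ads with a strong opening hook")
plan = PLANS[plan_name]
```

Keeping the plan as data means the agent can report which signals it consulted and which it skipped.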

Fusion: How to Combine Signals

Signal fusion is where most ad search systems become brittle. The problem is that scores from different models are not comparable. A CLIP cosine score of 0.31, a BM25 score of 17, and an OCR confidence of 0.94 do not live on the same scale.

Common fusion methods:

1. Reciprocal rank fusion. Merge ranked lists by rank position instead of raw score. This is robust when score scales differ.

2. Weighted rank fusion. Give more weight to signals that matter for the query. For compliance, OCR and ASR outrank style embeddings. For mood matching, visual and audio embeddings outrank OCR.

3. Learned reranking. Train a model that sees query, candidate evidence, and business metadata, then predicts final relevance.

4. Rule gates. Require a hard condition before ranking. Example: "must contain visible CTA" or "must be first 5 seconds."

In agent workflows, prefer explicit fusion plans over a hidden global relevance score. The agent should know why a result matched.
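
Reciprocal rank fusion, its weighted variant, and a rule gate can be sketched together in a few lines. This is an illustrative version, not a specific library's API: the function name `weighted_rrf`, the `gate` parameter, and the ad ids are assumptions. The constant `k=60` is the value commonly used for RRF, and the per-signal weights are placeholders.

```python
def weighted_rrf(ranked_lists, weights=None, k=60, gate=None):
    """Fuse per-signal rankings by rank position rather than raw score.

    ranked_lists: {signal_name: [item_id, ...]} ordered best first.
    weights:      optional {signal_name: weight}; missing signals default to 1.0.
    gate:         optional predicate; items failing it are dropped before ranking.
    Returns [(item_id, fused_score, {signal_name: rank}), ...], best first.
    """
    weights = weights or {}
    scores, ranks = {}, {}
    for signal, ranking in ranked_lists.items():
        w = weights.get(signal, 1.0)
        for rank, item_id in enumerate(ranking, start=1):
            if gate is not None and not gate(item_id):
                continue
            scores[item_id] = scores.get(item_id, 0.0) + w / (k + rank)
            ranks.setdefault(item_id, {})[signal] = rank
    ordered = sorted(scores, key=lambda i: (-scores[i], i))
    return [(i, scores[i], ranks[i]) for i in ordered]


ranked = {
    "visual": ["ad_1", "ad_2", "ad_3"],
    "ocr": ["ad_2", "ad_4"],
    "speech": ["ad_2", "ad_1"],
}

fused = weighted_rrf(ranked)

# Compliance-style plan: text signals weigh more, and a hard gate applies.
has_visible_cta = {"ad_1", "ad_2", "ad_4"}  # hypothetical precomputed set
gated = weighted_rrf(
    ranked,
    weights={"ocr": 2.0, "speech": 2.0, "visual": 0.5},
    gate=lambda item_id: item_id in has_visible_cta,
)
```

The third element of each result records the rank each signal gave the item, which is what lets the agent explain why a result matched instead of quoting an opaque score.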

Evidence Bundles

An agent should not receive 200 raw search hits. It should receive a small evidence bundle:

{
  "asset_id": "ad_4831",
  "match_reason": "Opening hook uses product demo plus visible discount",
  "segments": [
    {
      "start": 0.8,
      "end": 4.6,
      "signals": {
        "visual": "hands open package on kitchen counter",
        "ocr": "20% OFF",
        "speech": "Meet the fastest way to meal prep"
      },
      "confidence": 0.91
    }
  ],
  "evidence_uris": [
    "s3://creative-library/ad_4831/keyframes/0001.jpg",
    "s3://creative-library/ad_4831/audio/0000-0005.wav"
  ]
}

This format lets the agent reason, cite, and ask for more detail only when needed.
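
A sketch of how raw hits might be compacted into that shape: drop low-confidence hits, keep the strongest few, and restore timeline order. The function `build_bundle`, its thresholds, and the layout of the input hits are assumptions for illustration; the output keys follow the example bundle above.

```python
def build_bundle(asset_id, match_reason, hits, max_segments=3, min_confidence=0.5):
    """Reduce a list of raw hits for one asset to a compact evidence bundle."""
    kept = [h for h in hits if h["confidence"] >= min_confidence]
    kept.sort(key=lambda h: h["confidence"], reverse=True)
    kept = kept[:max_segments]
    kept.sort(key=lambda h: h["start"])  # present evidence in timeline order

    uris = []
    for h in kept:
        for uri in h.get("evidence_uris", []):
            if uri not in uris:  # de-duplicate while preserving order
                uris.append(uri)

    return {
        "asset_id": asset_id,
        "match_reason": match_reason,
        "segments": [
            {"start": h["start"], "end": h["end"],
             "signals": h["signals"], "confidence": h["confidence"]}
            for h in kept
        ],
        "evidence_uris": uris,
    }


hits = [
    {"start": 25.0, "end": 28.0, "confidence": 0.77,
     "signals": {"speech": "Order today"},
     "evidence_uris": ["s3://creative-library/ad_4831/audio/0000-0005.wav"]},
    {"start": 12.0, "end": 14.0, "confidence": 0.30,
     "signals": {"visual": "blurred background"},
     "evidence_uris": []},
    {"start": 0.8, "end": 4.6, "confidence": 0.91,
     "signals": {"ocr": "20% OFF", "speech": "Meet the fastest way to meal prep"},
     "evidence_uris": ["s3://creative-library/ad_4831/keyframes/0001.jpg",
                       "s3://creative-library/ad_4831/audio/0000-0005.wav"]},
]

bundle = build_bundle("ad_4831", "Opening hook uses product demo plus visible discount", hits)
```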

Evaluation

Do not evaluate this pipeline with generic semantic search metrics alone. Use task-specific evals.

Eval	What it measures	Good target
Hook recall@10	Whether known strong hooks appear in top results	High recall for first 5 seconds
Timestamp IoU	Whether retrieved spans overlap human-labeled spans	High overlap for moment search
Claim detection precision	Whether flagged claims are real	Low false positives for compliance
Cross-modal completeness	Whether visual, speech, and text evidence are all present	No missing required signal
Agent answer faithfulness	Whether the agent only uses retrieved evidence	No unsupported claims
Review time saved	Whether humans reach decisions faster	Lower median review time

The most useful eval set contains real creatives, real briefs, and real review outcomes. Synthetic examples help with coverage, but they rarely capture the ambiguity of actual ad review.
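
Two of the metrics in the table are simple enough to sketch directly. The helper names `temporal_iou` and `recall_at_k` and the sample values are assumptions for illustration; spans are (start, end) pairs in seconds.

```python
def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) spans in seconds."""
    (ps, pe), (ts, te) = pred, truth
    inter = max(0.0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant items that appear in the top k retrieved."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


# Retrieved hook span versus a human-labeled span.
iou = temporal_iou((0.8, 4.6), (1.0, 5.0))

# Two known strong hooks, one of which appears in the top 2 results.
recall = recall_at_k(["ad_1", "ad_7", "ad_3"], ["ad_1", "ad_3"], k=2)
```

Averaging `temporal_iou` over labeled moments gives the timestamp metric, and running `recall_at_k` over segments restricted to the first five seconds gives hook recall@10.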

Mixpeek Implementation Pattern

Mixpeek handles this pattern as an indexing and retrieval system over objects. You connect the creative library, run extractors, and expose a retriever the agent can call.

pip install mixpeek

from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

# Index creatives with multiple perception signals.
collection = mx.collections.create(
    collection_name="ad_creatives",
    source={"type": "bucket", "uri": "s3://brand-creative-library"},
    feature_extractors=[
        {"feature": "video_embedding", "model": "facebook/vjepa2-vitg-fpc64-256"},
        {"feature": "visual_embedding", "model": "google/siglip2-giant-opt-patch16-384"},
        {"feature": "transcription", "model": "CohereLabs/cohere-transcribe-03-2026"},
        {"feature": "ocr", "model": "PaddlePaddle/paddleocr"},
        {"feature": "object_detection", "model": "IDEA-Research/grounding-dino-base"},
        {"feature": "logo_detection"},
    ],
)

# Build a retriever for agent questions.
retriever = mx.retrievers.create(
    collection_id=collection.id,
    stages=[
        {"type": "attribute_filter", "where": {"duration_seconds": {"lte": 45}}},
        {"type": "feature_search", "feature": "visual_embedding", "top_k": 100},
        {"type": "feature_search", "feature": "transcription", "top_k": 100},
        {"type": "feature_search", "feature": "ocr", "top_k": 50},
        {"type": "rank_fusion", "method": "rrf"},
        {"type": "rerank", "model": "cross_encoder", "top_k": 10},
    ],
)

results = mx.retrievers.execute(
    retriever_id=retriever.id,
    query="opening hook shows a product demo with a visible discount code",
    return_fields=["asset_id", "timestamps", "transcript", "ocr", "keyframe_url"],
)

For standalone vector storage, the same extracted features can be stored in MVS, Mixpeek's vector store on object storage. Use MVS when you already have embeddings and want dense, sparse, and hybrid search with object-storage economics. Use managed Mixpeek indexing when you want the system to extract faces, scenes, transcripts, OCR, logos, and other features from the original objects.

Production Checklist

Store raw objects and extracted features together.

Keep model name, model version, extractor config, and timestamp with every feature.

Segment videos before embedding them.

Use OCR and ASR for claims. Do not rely on visual embeddings for exact text.

Use JEPA-style or video-specific features when event order matters.

Use multi-vector retrieval or reranking for complex creative briefs.

Fuse rankings by query intent, not by one global score.

Return evidence bundles with timestamps and source URIs.

Evaluate against real review tasks, not only generic search benchmarks.

Give agents retrieval controls: top-k, filters, budgets, cancellation, and evidence-only response modes.

Key Takeaways

Creative analysis is an agent perception problem, not a summarization problem.

One embedding is not enough for ads because ad meaning is distributed across motion, speech, text, objects, audio, and metadata.

JEPA-style video encoders are useful for temporal dynamics, while CLIP and SigLIP-style models remain strong for broad visual semantics.

Multi-vector retrieval preserves fine-grained evidence, and MUVERA-style approximation makes the pattern more practical at scale.

The agent should receive grounded evidence bundles with timestamps, not raw media dumps.

Related Resources

Creative DNA -- Mixpeek's creative library workflow

Advertising Technology Solutions -- adtech use cases

Late Interaction Retrieval -- ColBERT, ColPali, and ColQwen architecture

Retrieval Control Planes for AI Agents -- streaming, cancellation, and budgets

Agent Perception Evals -- testing whether agents can see, hear, and search

Models -- compare current embedding, video, audio, and detection models

