    Agent Perception
    20 min read
    Updated 2026-06-05

    Creative Ad Analysis for AI Agents: JEPA, Multi-Vector Retrieval, and Signal Fusion

    A practical guide to building agent perception pipelines for creative and ad analysis. Learn how video world models, OCR, ASR, object detection, multi-vector retrieval, and rank fusion work together before wiring the pattern into Mixpeek.

    Ad Analysis
    Agent Perception
    Video Understanding
    Multi-Vector Retrieval
    Creative Intelligence

    What an Ad Analysis Agent Needs to Perceive



    An ad is not one modality. A 30-second creative can contain a hook, product shots, scene transitions, spoken claims, on-screen text, logos, faces, music, pacing, and a final call to action. If an AI agent only receives a transcript or a single video embedding, it misses most of the evidence that a strategist, compliance reviewer, or media buyer would use.

    The agent's job is not simply to summarize the ad. It needs to answer grounded questions:

  1. What happens in the first three seconds?
  2. Which product attributes are shown, not just mentioned?
  3. Is the call to action visible, spoken, or both?
  4. Which scenes look similar to high-performing ads from last quarter?
  5. Does the ad contain a risky claim, competitor logo, restricted actor, or missing disclaimer?
  6. Which exact timestamp should a human review?


    Those questions require a perception pipeline. The pipeline decomposes the creative into searchable signals, indexes those signals, and returns evidence with timestamps when an agent asks for context.

    The Core Architecture



    Use this mental model:

    1. Segment the creative. Split video into shots, scenes, audio spans, OCR spans, and business moments such as hook, demo, proof, offer, and CTA.
    2. Extract signals. Run different models for visual semantics, motion, speech, text, objects, faces, logos, music, and layout.
    3. Store evidence records. Keep every extracted feature tied to asset id, timestamp, model version, confidence, and source URI.
    4. Search in stages. Combine dense search, sparse search, multi-vector retrieval, filters, and rerankers.
    5. Return grounded context. Give the agent a compact evidence bundle, not the raw video.

    The important design choice: do not force all ad meaning into one embedding. Ads are compositional. The retrieval layer should preserve that composition.
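
    The evidence record from step 3 can be sketched as a small data structure. This is a minimal illustration, not Mixpeek's schema: the class name, field names, model labels, timestamps, and the third audio URI are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceRecord:
    """One extracted feature, tied back to where and how it was produced."""
    asset_id: str
    signal: str          # e.g. "ocr", "speech", "visual"
    start: float         # seconds from the start of the creative
    end: float
    value: str           # extracted text or a short description
    confidence: float
    model: str           # hypothetical model label
    model_version: str
    source_uri: str

    def overlaps(self, start: float, end: float) -> bool:
        """True when this record intersects the [start, end) window."""
        return self.start < end and start < self.end


def records_in_window(records, start, end):
    """Return records that overlap a time window, earliest first."""
    hits = [r for r in records if r.overlaps(start, end)]
    return sorted(hits, key=lambda r: (r.start, r.signal))


# Illustrative records only; values are made up for the example.
records = [
    EvidenceRecord("ad_4831", "ocr", 1.2, 3.9, "20% OFF", 0.94,
                   "example-ocr", "1.0",
                   "s3://creative-library/ad_4831/keyframes/0001.jpg"),
    EvidenceRecord("ad_4831", "speech", 0.8, 4.6,
                   "Meet the fastest way to meal prep", 0.91,
                   "example-asr", "1.0",
                   "s3://creative-library/ad_4831/audio/0000-0005.wav"),
    EvidenceRecord("ad_4831", "speech", 24.0, 28.5, "Order today", 0.88,
                   "example-asr", "1.0",
                   "s3://creative-library/ad_4831/audio/0020-0030.wav"),
]

# Everything the pipeline saw in the first five seconds, across signals.
hook = records_in_window(records, 0.0, 5.0)
print([(r.signal, r.value) for r in hook])
```

    Because every record carries its own timestamps and provenance, a question like "what happens in the first three seconds?" becomes a window query across signals rather than a new model call.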

    Signal Families



    | Signal | Best model family | What it captures | Example agent query |
    |---|---|---|---|
    | Visual semantics | CLIP, SigLIP, omnimodal embeddings | Products, scenes, style, broad concepts | "Find ads with a kitchen product demo" |
    | Motion and physical dynamics | JEPA-style video encoders, VideoPrism, action models | Movement, action, pacing, cause and effect | "Find ads where the user opens the package before the product reveal" |
    | Speech | ASR models, diarization | Voiceover, claims, speakers, timing | "Where does the ad mention free returns?" |
    | On-screen text | OCR, visual document models | Captions, prices, disclaimers, promo codes | "Find ads with a visible limited-time offer" |
    | Objects and logos | YOLO, Grounding DINO, logo detectors | Products, brand marks, restricted objects | "Does a competitor logo appear in the background?" |
    | Face and identity | Face detection, recognition, attributes | Talent presence, spokespeople, likeness risk | "Which ads include this creator?" |
    | Audio events | CLAP-style embeddings, music classifiers | Mood, music, effects, silence | "Find upbeat ads with applause or crowd noise" |
    | Performance metadata | BI tables, campaign systems | CTR, conversions, spend, placement | "Compare hooks from ads above 3% CTR" |

    Each signal family answers a different question. The agent should choose the signal based on the task, then fuse results when the question spans modalities.

    Why JEPA-Style Video Features Matter



    Contrastive image-text models are strong for visual concepts: product, room, color, style, and scene. They are weaker when the meaning depends on time. For example:

  - A person picks up a product, hesitates, then smiles.
  - A before-and-after transition shows the product effect.
  - A fast jump cut changes from problem to solution.
  - A demo shows the sequence of steps needed to use the product.

    JEPA-style video encoders learn by predicting latent representations of missing or future parts of a video rather than reconstructing pixels. That makes them useful for motion, temporal continuity, and physical dynamics. For ad analysis, they are a good fit for questions about pacing, action, transitions, and event order.

    The practical pattern is not "replace CLIP with JEPA." The pattern is:

    1. Use CLIP or SigLIP-style embeddings for broad visual retrieval.
    2. Use JEPA or VideoPrism-style features for temporal and action-sensitive retrieval.
    3. Use ASR and OCR for exact claims and visible text.
    4. Fuse the evidence at query time.

    The agent gets better because it can ask the right index instead of hoping one model captured every meaning.

    Why Multi-Vector Retrieval Matters



    A single dense vector compresses an entire scene into one point. That is efficient, but it loses token-level and patch-level detail. A query like "blue bottle next to a handwritten discount code while the narrator says subscribe" contains several conditions. A single vector may match the general scene but miss one requirement.

    Multi-vector models keep many vectors per item:

  - Text: one vector per token or phrase.
  - Documents: one vector per visual patch.
  - Video: one vector per frame, patch, object, or segment.
  - Audio: one vector per time span or acoustic event.

    Late interaction scoring then compares query vectors to item vectors and keeps the best matches. A simplified version:

    score(query, item) =
      sum over query_parts(
        max similarity(query_part, item_parts)
      )
    


    This preserves fine-grained matches. The tradeoff is cost. Multi-vector indexes store far more vectors and scoring is more expensive.
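
    The simplified score above can be written out in a few lines. This is a minimal sketch using plain Python lists and cosine similarity; the function names and the toy three-dimensional vectors are assumptions for illustration, and a production system would use a vectorized library and real model embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def maxsim_score(query_vecs, item_vecs):
    """Sum, over query vectors, of the best similarity to any item vector.

    item_vecs must be non-empty.
    """
    return sum(max(cosine(q, d) for d in item_vecs) for q in query_vecs)


def best_matches(query_vecs, item_vecs):
    """For each query vector, the index of the item vector it matched best."""
    return [
        max(range(len(item_vecs)), key=lambda j: cosine(q, item_vecs[j]))
        for q in query_vecs
    ]


# Toy query with two parts, e.g. "blue bottle" and "discount code".
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
item_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # contains both parts
item_b = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]   # contains only the first part

print(maxsim_score(query, item_a), maxsim_score(query, item_b))
print(best_matches(query, item_a))
```

    The per-part indices from `best_matches` are what make evidence spans possible: they record which frame, patch, or token satisfied each part of the query.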

    Algorithms such as MUVERA address this by creating fixed dimensional encodings for multi-vector sets. Those encodings let a system retrieve candidate items with fast single-vector search, then rerank the candidates with exact multi-vector similarity. The design principle is useful even if you do not implement MUVERA directly:

    1. Use a cheap approximation to get candidates.
    2. Use precise late interaction only on the candidate set.
    3. Return evidence spans showing which parts matched.

    For creative analysis, this is the difference between "this ad is generally similar" and "the hook, product shot, and CTA each match the brief."
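
    The candidate-then-rerank principle can be sketched as follows. This is not MUVERA: mean pooling stands in for its fixed dimensional encodings purely to keep the example short, and stage 1 is a linear scan where a real system would query an approximate nearest-neighbor index. All names and vectors are illustrative assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vecs):
    """Collapse a set of vectors into one by averaging each dimension."""
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]


def maxsim(query_vecs, item_vecs):
    """Exact late-interaction score: best item match per query vector, summed."""
    return sum(max(dot(q, d) for d in item_vecs) for q in query_vecs)


def two_stage_search(query_vecs, items, n_candidates, top_k):
    """items: dict of item id -> non-empty list of vectors."""
    q_pooled = mean_pool(query_vecs)
    # Stage 1: cheap single-vector scoring over every item.
    coarse = sorted(
        items, key=lambda i: dot(q_pooled, mean_pool(items[i])), reverse=True
    )
    candidates = coarse[:n_candidates]
    # Stage 2: exact late interaction, only on the shortlist.
    reranked = sorted(
        candidates, key=lambda i: maxsim(query_vecs, items[i]), reverse=True
    )
    return [(i, maxsim(query_vecs, items[i])) for i in reranked[:top_k]]


query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
items = {
    # Matches both query parts, but extra segments dilute its pooled vector.
    "ad_a": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
    # Matches only the first query part, yet pools to a strong single vector.
    "ad_b": [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    "ad_c": [[0.0, 0.0, 1.0]],
}

results = two_stage_search(query, items, n_candidates=2, top_k=2)
print(results)
```

    In this toy data the pooled stage ranks `ad_b` ahead of `ad_a`, and the exact rerank reverses that order. The shortlist only has to contain the right items; the precise ordering is the reranker's job.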

    A Retrieval Plan for Ad Questions



    Different questions should trigger different retrieval plans.

    Question: "Find ads with a strong opening hook"



    Use:

  - Segment filter: first 0-5 seconds.
  - Visual embedding search: find surprising or high-action opening scenes.
  - ASR search: find spoken hooks, questions, or problem statements.
  - OCR search: find large headline text.
  - Performance filter: optionally restrict to ads above a CTR or conversion threshold.

    Return:

  - Top hook segments.
  - Timestamp.
  - Transcript excerpt.
  - Keyframe.
  - Prior performance metadata.


    Question: "Is this ad compliant?"



    Use:

  - ASR and OCR exact search for claims, pricing, disclaimers, medical language, financial language, or regulated terms.
  - Logo and object detection for restricted brands or products.
  - Face recognition for talent usage restrictions.
  - Policy metadata filters by market, channel, and campaign.

    Return:

  - Flagged evidence only.
  - The rule that triggered the flag.
  - Confidence and timestamp.
  - Link to the original frame or audio span.
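
    The exact-search half of this plan can be sketched as a rule scan over ASR and OCR spans. The rule ids, regular expressions, span fields, and the audio URI below are hypothetical; real compliance rules would come from legal and policy teams and vary by market.

```python
import re

# Hypothetical rules: id -> pattern applied to transcript and OCR text.
RULES = {
    "unqualified_guarantee": re.compile(r"\bguaranteed?\b", re.IGNORECASE),
    "medical_claim": re.compile(r"\b(cures?|treats?)\b", re.IGNORECASE),
}


def flag_spans(spans, rules=RULES):
    """Return one flag per (span, rule) match; clean spans produce nothing."""
    flags = []
    for span in spans:
        for rule_id, pattern in rules.items():
            match = pattern.search(span["text"])
            if match:
                flags.append({
                    "rule": rule_id,
                    "matched_text": match.group(0),
                    "signal": span["signal"],
                    "start": span["start"],
                    "end": span["end"],
                    "confidence": span["confidence"],
                    "evidence_uri": span["uri"],
                })
    return flags


# Illustrative spans only.
spans = [
    {"signal": "speech", "text": "Results guaranteed in two weeks",
     "start": 10.2, "end": 12.9, "confidence": 0.93,
     "uri": "s3://creative-library/ad_4831/audio/0010-0015.wav"},
    {"signal": "ocr", "text": "20% OFF",
     "start": 1.2, "end": 3.9, "confidence": 0.94,
     "uri": "s3://creative-library/ad_4831/keyframes/0001.jpg"},
]

flags = flag_spans(spans)
print(flags)
```

    Each flag carries the rule, the matched text, the timestamp, and a link to the source span, so a reviewer can jump straight to the evidence instead of rewatching the ad.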


    Question: "Find creatives similar to this winning ad"



    Use:

  - Whole-ad visual embeddings for broad similarity.
  - Scene-level embeddings for reusable moments.
  - JEPA-style video features for action and pacing similarity.
  - Audio embeddings for music and energy.
  - Metadata filters for vertical, format, placement, and aspect ratio.
  - Reranking by performance lift or campaign objective.

    Return:

  - Similar ads grouped by which signal matched.
  - Specific matching scenes.
  - Performance deltas.
  - Reuse suggestions for new briefs.
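
    One way to keep these plans explicit is to store them as data and look them up by question intent. The sketch below is an assumption about how such a router might look, not a Mixpeek API: the intent keys, signal names, and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RetrievalPlan:
    signals: Tuple[str, ...]            # which indexes to query
    returns: Tuple[str, ...]            # fields to include in the evidence
    max_start: Optional[float] = None   # keep only segments starting by this second


# Hypothetical plans mirroring the three question types above.
PLANS = {
    "hook": RetrievalPlan(
        signals=("visual", "speech", "ocr"),
        returns=("timestamp", "transcript_excerpt", "keyframe", "performance"),
        max_start=5.0,
    ),
    "compliance": RetrievalPlan(
        signals=("speech", "ocr", "logo", "object", "face"),
        returns=("rule", "confidence", "timestamp", "evidence_uri"),
    ),
    "similar": RetrievalPlan(
        signals=("visual", "scene", "video", "audio"),
        returns=("matched_signal", "scenes", "performance_delta"),
    ),
}


def select_segments(intent, segments):
    """Pick the plan for an intent and apply its time-window filter."""
    if intent not in PLANS:
        raise ValueError(f"no retrieval plan for intent {intent!r}")
    plan = PLANS[intent]
    if plan.max_start is None:
        return plan, list(segments)
    kept = [s for s in segments if s["start"] <= plan.max_start]
    return plan, kept


segments = [{"start": 0.8, "end": 4.6}, {"start": 12.0, "end": 15.5}]
plan, kept = select_segments("hook", segments)
print(plan.signals, kept)
```

    Keeping the plan as plain data means the agent can inspect it, log it, and explain which signals were consulted for a given answer.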


    Fusion: How to Combine Signals



    Signal fusion is where most ad search systems become brittle. The problem is that scores from different models are not comparable. A CLIP cosine score of 0.31, a BM25 score of 17, and an OCR confidence of 0.94 do not live on the same scale.

    Common fusion methods:

    1. Reciprocal rank fusion. Merge ranked lists by rank position instead of raw score. This is robust when score scales differ.
    2. Weighted rank fusion. Give more weight to signals that matter for the query. For compliance, OCR and ASR outrank style embeddings. For mood matching, visual and audio embeddings outrank OCR.
    3. Learned reranking. Train a model that sees query, candidate evidence, and business metadata, then predicts final relevance.
    4. Rule gates. Require a hard condition before ranking. Example: "must contain visible CTA" or "must be first 5 seconds."

    In agent workflows, prefer explicit fusion plans over a hidden global relevance score. The agent should know why a result matched.
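
    Reciprocal rank fusion, per-signal weights, and a rule gate fit in one short function. This is a minimal sketch: each item scores the sum of `weight / (k + rank)` across the lists it appears in, with the commonly used constant `k = 60`. The signal names, ad ids, and the visible-CTA gate are assumptions for illustration.

```python
from collections import defaultdict


def rrf(ranked_lists, weights=None, k=60, gate=None):
    """Weighted reciprocal rank fusion over several ranked lists of ids.

    ranked_lists: dict of signal name -> list of ids, best first.
    weights: optional dict of signal name -> weight (defaults to 1.0).
    gate: optional predicate; ids that fail it are removed before ranking.
    Returns (id, fused_score, [(signal, rank), ...]) tuples, best first.
    """
    weights = weights or {}
    scores = defaultdict(float)
    reasons = defaultdict(list)
    for signal, ids in ranked_lists.items():
        w = weights.get(signal, 1.0)
        if gate is not None:
            ids = [i for i in ids if gate(i)]
        for rank, item_id in enumerate(ids, start=1):
            scores[item_id] += w / (k + rank)
            reasons[item_id].append((signal, rank))
    # Sort by score descending; break ties by id for stable output.
    order = sorted(scores, key=lambda i: (-scores[i], i))
    return [(i, scores[i], reasons[i]) for i in order]


lists = {
    "visual": ["ad_1", "ad_2", "ad_3"],
    "speech": ["ad_2", "ad_3", "ad_1"],
    "ocr": ["ad_3", "ad_2"],
}

plain = [i for i, _, _ in rrf(lists)]
ocr_heavy = [i for i, _, _ in rrf(lists, weights={"ocr": 3.0})]
has_visible_cta = {"ad_1", "ad_3"}          # hypothetical gate input
gated = [i for i, _, _ in rrf(lists, gate=lambda i: i in has_visible_cta)]

print(plain, ocr_heavy, gated)
```

    With equal weights `ad_2` ranks first, and tripling the OCR weight moves `ad_3` to the top. The returned `(signal, rank)` pairs give the agent the explanation of why each result matched.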

    Evidence Bundles



    An agent should not receive 200 raw search hits. It should receive a small evidence bundle:

    {
      "asset_id": "ad_4831",
      "match_reason": "Opening hook uses product demo plus visible discount",
      "segments": [
        {
          "start": 0.8,
          "end": 4.6,
          "signals": {
            "visual": "hands open package on kitchen counter",
            "ocr": "20% OFF",
            "speech": "Meet the fastest way to meal prep"
          },
          "confidence": 0.91
        }
      ],
      "evidence_uris": [
        "s3://creative-library/ad_4831/keyframes/0001.jpg",
        "s3://creative-library/ad_4831/audio/0000-0005.wav"
      ]
    }
    


    This format lets the agent reason, cite, and ask for more detail only when needed.

    Evaluation



    Do not evaluate this pipeline with generic semantic search metrics alone. Use task-specific evals.

    | Eval | What it measures | Good target |
    |---|---|---|
    | Hook recall@10 | Whether known strong hooks appear in top results | High recall for first 5 seconds |
    | Timestamp IoU | Whether retrieved spans overlap human-labeled spans | High overlap for moment search |
    | Claim detection precision | Whether flagged claims are real | Low false positives for compliance |
    | Cross-modal completeness | Whether visual, speech, and text evidence are all present | No missing required signal |
    | Agent answer faithfulness | Whether the agent only uses retrieved evidence | No unsupported claims |
    | Review time saved | Whether humans reach decisions faster | Lower median review time |

    The most useful eval set contains real creatives, real briefs, and real review outcomes. Synthetic examples help with coverage, but they rarely capture the ambiguity of actual ad review.
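
    Two of the metrics in the table are simple enough to sketch directly. The function names and sample values below are illustrative assumptions; spans are `(start, end)` pairs in seconds.

```python
def timestamp_iou(pred, truth):
    """Intersection over union of two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of relevant ids that appear in the top k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top = set(retrieved_ids[:k])
    return len(top & relevant) / len(relevant)


# A retrieved hook span against a human-labeled 0-5 second hook.
iou = timestamp_iou((0.8, 4.6), (0.0, 5.0))

# One of two labeled hooks shows up in the top 2 results.
recall = recall_at_k(["ad_2", "ad_3", "ad_1"], {"ad_3", "ad_9"}, k=2)

print(round(iou, 2), recall)
```

    Non-overlapping spans score 0.0, so a retriever that returns the right ad but the wrong moment is penalized, which is the behavior moment search needs.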

    Mixpeek Implementation Pattern



    Mixpeek handles this pattern as an indexing and retrieval system over objects. You connect the creative library, run extractors, and expose a retriever the agent can call.

    pip install mixpeek
    


    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_API_KEY")

    # Index creatives with multiple perception signals.
    collection = mx.collections.create(
        collection_name="ad_creatives",
        source={"type": "bucket", "uri": "s3://brand-creative-library"},
        feature_extractors=[
            {"feature": "video_embedding", "model": "facebook/vjepa2-vitg-fpc64-256"},
            {"feature": "visual_embedding", "model": "google/siglip2-giant-opt-patch16-384"},
            {"feature": "transcription", "model": "CohereLabs/cohere-transcribe-03-2026"},
            {"feature": "ocr", "model": "PaddlePaddle/paddleocr"},
            {"feature": "object_detection", "model": "IDEA-Research/grounding-dino-base"},
            {"feature": "logo_detection"},
        ],
    )

    # Build a retriever for agent questions.
    retriever = mx.retrievers.create(
        collection_id=collection.id,
        stages=[
            {"type": "attribute_filter", "where": {"duration_seconds": {"lte": 45}}},
            {"type": "feature_search", "feature": "visual_embedding", "top_k": 100},
            {"type": "feature_search", "feature": "transcription", "top_k": 100},
            {"type": "feature_search", "feature": "ocr", "top_k": 50},
            {"type": "rank_fusion", "method": "rrf"},
            {"type": "rerank", "model": "cross_encoder", "top_k": 10},
        ],
    )

    # Run an agent question against the retriever.
    results = mx.retrievers.execute(
        retriever_id=retriever.id,
        query="opening hook shows a product demo with a visible discount code",
        return_fields=["asset_id", "timestamps", "transcript", "ocr", "keyframe_url"],
    )


    For standalone vector storage, the same extracted features can be stored in MVS, Mixpeek's vector store on object storage. Use MVS when you already have embeddings and want dense, sparse, and hybrid search with object-storage economics. Use managed Mixpeek indexing when you want the system to extract faces, scenes, transcripts, OCR, logos, and other features from the original objects.

    Production Checklist



  - Store raw objects and extracted features together.
  - Keep model name, model version, extractor config, and timestamp with every feature.
  - Segment videos before embedding them.
  - Use OCR and ASR for claims. Do not rely on visual embeddings for exact text.
  - Use JEPA-style or video-specific features when event order matters.
  - Use multi-vector retrieval or reranking for complex creative briefs.
  - Fuse rankings by query intent, not by one global score.
  - Return evidence bundles with timestamps and source URIs.
  - Evaluate against real review tasks, not only generic search benchmarks.
  - Give agents retrieval controls: top-k, filters, budgets, cancellation, and evidence-only response modes.


    Key Takeaways



  - Creative analysis is an agent perception problem, not a summarization problem.
  - One embedding is not enough for ads because ad meaning is distributed across motion, speech, text, objects, audio, and metadata.
  - JEPA-style video encoders are useful for temporal dynamics, while CLIP and SigLIP-style models remain strong for broad visual semantics.
  - Multi-vector retrieval preserves fine-grained evidence, and MUVERA-style approximation makes the pattern more practical at scale.
  - The agent should receive grounded evidence bundles with timestamps, not raw media dumps.


    Related Resources



  - Creative DNA -- Mixpeek's creative library workflow
  - Advertising Technology Solutions -- adtech use cases
  - Late Interaction Retrieval -- ColBERT, ColPali, and ColQwen architecture
  - Retrieval Control Planes for AI Agents -- streaming, cancellation, and budgets
  - Agent Perception Evals -- testing whether agents can see, hear, and search
  - Models -- compare current embedding, video, audio, and detection models