    Guides

    Vendor-neutral, engineer-written guides to the concepts behind multimodal AI — perception, retrieval, embeddings, and the infrastructure agents use to see, hear, and search unstructured data. Learn the idea first; then see how Mixpeek applies it.

    76 guides across 14 topics

    Perception
    16 min read

    Multi-Object Tracking: How Agents Follow Objects Across Video Frames

    A vendor-neutral guide to tracking-by-detection — motion prediction with a Kalman filter, data association via IoU and the Hungarian algorithm, ByteTrack's low-score recovery, appearance re-identification, and the ID-switch problem — the pipeline that turns per-frame detections into stable per-object tracks an agent can reason over across time.

    Multi-Object Tracking
    ByteTrack
    Kalman Filter
    Jul 2026
    Perception
    15 min read

    Monocular Depth Estimation: How Models Infer 3D From a Single Image

    A vendor-neutral guide to inferring depth from one 2D image — why the problem is ill-posed, the pictorial cues models learn, relative vs metric depth and scale ambiguity, how self-supervision trains depth without ground truth, and why a depth channel lets an agent reason about scene geometry it otherwise can't see.

    Depth Estimation
    Monocular Depth
    Depth Anything
    Jul 2026
    Perception
    16 min read

    Instance-Level Visual Matching: Finding the Same Object, Not Just Similar Ones

    A vendor-neutral guide to geometric visual matching — keypoint detection, local descriptors, descriptor matching, and RANSAC geometric verification — the pipeline an agent uses to confirm two images contain the *same* physical object or scene, which a similarity embedding cannot decide on its own.

    Keypoint Matching
    SuperPoint
    RANSAC
    Jun 2026
    Perception
    17 min read

    Face Recognition and Identity Clustering: How Agents Recognize and Group People in Video

    A vendor-neutral walk through the face pipeline an agent uses to answer 'who is in this footage?' and 'find every clip with this person' — detection, alignment, metric-learned embeddings (ArcFace's angular margin), verification vs identification, and the unsupervised identity-clustering problem.

    Face Recognition
    ArcFace
    Metric Learning
    Jun 2026
    Retrieval
    20 min read

    Reasoning Rerankers: How Listwise LLM Rerankers Reorder Retrieval Results

    How listwise LLM rerankers (RankGPT-style) and reasoning rerankers (Qwen3-Reranker, Nemotron-style) reorder candidate sets by generating a permutation rather than scoring documents independently, why considering the whole list at once captures signals pointwise cross-encoders miss, the sliding-window strategy, positional bias and its fixes, distillation into cheap rerankers, and budget-aware per-query reranker selection for agents.

    Listwise Reranking
    LLM Reranker
    RankGPT
    Jun 2026
    Retrieval
    21 min read

    Retrieval Feedback Loops: Learning to Rank from Clicks, Outcomes, and Agent Interactions

    How a ranked list becomes a hypothesis that interactions test, why naive 'clicked = relevant' is wrong, and how click models, counterfactual learning-to-rank, and online reranking close the loop so agentic search gets better from its own outcomes.

    Feedback Loops
    Learning to Rank
    Click Models
    Jun 2026
    Embeddings
    20 min read

    Matryoshka Representation Learning: Nested Embeddings for Adaptive Multimodal Retrieval

    How a single embedding model can produce vectors that stay useful when truncated to fewer dimensions, and how AI agents exploit nested embeddings to run fast coarse shortlists then precise full-dimension reranks over huge unstructured corpora.

    Matryoshka
    Nested Embeddings
    Adaptive Retrieval
    Jun 2026
    Retrieval
    21 min read

    Filtered Vector Search: How Agents Combine Similarity with Hard Constraints

    Almost every agentic query is a vector search plus a constraint: 'clips from campaign X after May' or 'images of red cars in the EU bucket'. This guide explains the three filtering strategies (pre-filter, post-filter, in-place predicate-aware traversal), why each one silently breaks recall or latency at different selectivities, and how a query planner picks between them.

    Filtered Search
    Vector Search
    HNSW
    Jun 2026
    Agent Perception
    18 min read

    How Vision-Language Models Fuse Image and Text Tokens

    A VLM is the component that lets an agent actually see: it turns pixels into tokens an LLM can reason over alongside words. This guide opens up the architecture: how a vision encoder produces patch features, how a projector or resampler turns them into language tokens, and the real fusion strategies (prefix concatenation, cross-attention, Q-Former resampling) that decide whether your agent reads a frame accurately or hallucinates over it.

    Vision-Language Models
    VLM
    Multimodal Fusion
    Jun 2026
    Retrieval
    20 min read

    Hybrid Search Fusion: How to Combine Dense and Lexical Retrieval Without Breaking Ranking

    An agent searching transcripts, OCR text, and captions needs both meaning (dense vectors) and exact terms (BM25), but the two return scores on incompatible scales that you cannot simply add. This guide teaches the real fusion mechanics: why score distributions make naive normalization fail, the exact math of Reciprocal Rank Fusion and how its k parameter behaves, weighted convex combination with proper normalization, and how to choose and tune a fusion method against a labeled set.

    Hybrid Search
    Reciprocal Rank Fusion
    BM25
    Jun 2026
    Agent Perception
    21 min read

    Audio Fingerprinting: How Agents Recognize a Specific Recording in Noise

    A first-principles guide to audio fingerprinting, the algorithm behind Shazam-style recognition that identifies an exact recording even when it is noisy, pitch-shifted, or buried in other sound. Covers spectrogram peak picking, the constellation map, combinatorial landmark hashing, inverted-index voting with time-offset alignment, and how identity-level audio search differs from semantic similarity for AI agents.

    Audio
    Fingerprinting
    Agent Perception
    Jun 2026
    Embeddings
    22 min read

    Embedding Fine-Tuning and Distillation: Teaching an Agent to See and Hear Your Domain

    A generic embedding model puts your two near-identical product SKUs, your defect classes, or your domain jargon almost on top of each other, so an agent searching your content retrieves the wrong thing. This guide teaches the algorithms that fix that: contrastive fine-tuning, hard-negative mining for adaptation, cross-encoder to bi-encoder distillation, and parameter-efficient methods, plus how to deploy a new model without silently corrupting an existing index.

    Embeddings
    Fine-Tuning
    Distillation
    Jun 2026

    From concept to production

    These guides explain how multimodal perception and retrieval actually work. Mixpeek is the platform that runs them — point it at your storage and get back relevant, timestamped results.