
    Image Understanding: Vision Encoders & Multimodal Search

    14:32
    Multimodal University
    Ethan

    About this video

Master how computers see and search images. This video covers vision encoding models like CLIP and SigLIP, how images are converted into patches and embeddings, object detection with YOLO, and building multimodal search systems that support text-to-image, image-to-text, and image-to-image queries.

    What you'll learn:
    ⚡ How vision transformers convert images into embeddings
    ⚡ Image patches and mean pooling explained
    ⚡ CLIP vs SigLIP embedding models
    ⚡ Object detection and classification with YOLO
    ⚡ Cross-modal search: text queries on image datasets (see the sketch after this list)
    ⚡ Combining text + image queries with mean pooling
    ⚡ Feature URIs for image extractors
    ⚡ Live demo: National Gallery multimodal retriever
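
    As a rough illustration of the cross-modal search and mean-pooling ideas listed above, here is a minimal sketch using the open-source CLIP checkpoint from Hugging Face's transformers library. The model name, image paths, and query text below are placeholder assumptions for illustration, not the pipeline, dataset, or Feature URIs used in the video.

```python
# Minimal sketch: text-to-image search and a combined text+image query with CLIP.
# Assumes the Hugging Face checkpoint "openai/clip-vit-base-patch32"; the image
# paths and query strings are placeholders, not assets from the video demo.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Embed a small image collection: the vision transformer splits each image into
# patches internally and pools them into one vector per image.
images = [Image.open(p) for p in ["gallery1.jpg", "gallery2.jpg"]]
image_inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    image_embeds = model.get_image_features(**image_inputs)
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

# Text-to-image search: embed a text query into the same space and
# rank images by cosine similarity.
text_inputs = processor(text=["a portrait of a woman in a red dress"],
                        return_tensors="pt", padding=True)
with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
scores = (text_embeds @ image_embeds.T).squeeze(0)
print("best match:", int(scores.argmax()))

# Combined text + image query: mean-pool the two normalized embeddings,
# re-normalize, and search with the pooled vector.
combined = (text_embeds[0] + image_embeds[0]) / 2
combined = combined / combined.norm()
combined_scores = image_embeds @ combined
print("combined-query ranking:", combined_scores.argsort(descending=True).tolist())
```

    In a full retriever like the National Gallery demo mentioned above, the normalized image embeddings would typically be stored in a vector index rather than an in-memory tensor, but the similarity math is the same.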