
    Image Understanding: Vision Encoders & Multimodal Search

    14:32
    Multimodal University
    Ethan

    About this video

Master how computers see and search images. This video covers vision encoding models like CLIP and SigLIP, how images are converted into patches and embeddings, object detection with YOLO, and building multimodal search systems that support text-to-image, image-to-text, and image-to-image queries.

    What you'll learn:
    ⚡ How vision transformers convert images into embeddings
    ⚡ Image patches and mean pooling explained
    ⚡ CLIP vs SigLIP embedding models
    ⚡ Object detection and classification with YOLO
    ⚡ Cross-modal search: text queries on image datasets (see the sketch after this list)
    ⚡ Combining text + image queries with mean pooling
    ⚡ Feature URIs for image extractors
    ⚡ Live demo: National Gallery multimodal retriever
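
    As a rough illustration of the cross-modal search and mean-pooling ideas listed above, here is a minimal sketch using the open-source CLIP checkpoint from Hugging Face's transformers library. The model name, image paths, and query text below are placeholder assumptions for illustration, not the pipeline, dataset, or Feature URIs used in the video.

```python
# Minimal sketch: text-to-image search and a combined text+image query with CLIP.
# Assumes the Hugging Face checkpoint "openai/clip-vit-base-patch32"; the image
# paths and query strings are placeholders, not assets from the video demo.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Embed a small image collection: the vision transformer splits each image into
# patches internally and pools them into one vector per image.
images = [Image.open(p) for p in ["gallery1.jpg", "gallery2.jpg"]]
image_inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    image_embeds = model.get_image_features(**image_inputs)
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

# Text-to-image search: embed a text query into the same space and
# rank images by cosine similarity.
text_inputs = processor(text=["a portrait of a woman in a red dress"],
                        return_tensors="pt", padding=True)
with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
scores = (text_embeds @ image_embeds.T).squeeze(0)
print("best match:", int(scores.argmax()))

# Combined text + image query: mean-pool the two normalized embeddings,
# re-normalize, and search with the pooled vector.
combined = (text_embeds[0] + image_embeds[0]) / 2
combined = combined / combined.norm()
combined_scores = image_embeds @ combined
print("combined-query ranking:", combined_scores.argsort(descending=True).tolist())
```

    In a full retriever like the National Gallery demo mentioned above, the normalized image embeddings would typically be stored in a vector index rather than an in-memory tensor, but the similarity math is the same.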