Image embeddings are dense numerical vectors that encode the visual content, objects, scenes, and semantic meaning of an image into a fixed-dimensional representation. These vectors enable similarity search, clustering, classification, and cross-modal retrieval by converting visual information into a format compatible with standard machine learning operations.
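A minimal sketch of how similarity search works on such vectors, using cosine similarity; the 512-dimensional vectors here are random stand-ins for real model outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; higher means more visually similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 512-dimensional embeddings for two images.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=512)
emb_b = rng.normal(size=512)

print(cosine_similarity(emb_a, emb_b))   # near 0 for unrelated vectors
print(cosine_similarity(emb_a, emb_a))   # 1.0 for an identical image
```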
Image embedding models take a raw image as input, pass it through a deep neural network (CNN or Vision Transformer), and output a fixed-length vector. During training, the model learns to map visually similar images to nearby points in the embedding space and dissimilar images to distant points. At inference time, images are preprocessed (resized, normalized) and fed through the network in a single forward pass to extract the embedding vector from an intermediate layer.
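A sketch of that inference path, assuming torchvision and Pillow are available and using a pretrained ResNet-50 as an example backbone; the image path is hypothetical. Replacing the final classification layer with an identity exposes the pooled features as the embedding:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained ResNet-50; swapping the fc head for Identity yields the
# 2048-dimensional global-average-pooled features as the embedding.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

# Standard ImageNet preprocessing: resize, crop, normalize.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")   # hypothetical image file
batch = preprocess(image).unsqueeze(0)             # shape: (1, 3, 224, 224)

with torch.no_grad():
    embedding = model(batch)                       # shape: (1, 2048)
```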
Popular choices include ResNet, EfficientNet, Vision Transformers (ViT), and DINOv2 for visual-only embeddings, and CLIP and SigLIP (along with managed services such as Vertex AI's multimodal embeddings) for multimodal embeddings that align images with text. Embedding dimensions typically range from 256 to 2048. Models are pretrained on large-scale image datasets (ImageNet, LAION) using supervised classification, contrastive learning, or self-supervised objectives. The choice of pooling strategy (global average, CLS token, attention pooling) affects what visual information the embedding captures.
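For the multimodal case, a sketch of CLIP-style image embedding extraction with the Hugging Face transformers library; the checkpoint name and image path are illustrative:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

image = Image.open("example.jpg").convert("RGB")   # hypothetical image file
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # CLIP projects the vision encoder's pooled output into the shared
    # image-text space; for this checkpoint the vector is 512-dimensional.
    image_features = model.get_image_features(**inputs)   # shape: (1, 512)

# L2-normalize so cosine similarity reduces to a dot product.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
```

Because both image and text embeddings live in the same space, the same normalized dot product can rank images against a text query, which is what enables cross-modal retrieval.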