CLIP (Contrastive Language-Image Pre-training) is a multimodal model developed by OpenAI that learns visual concepts from natural language supervision, and it is a common component in vision-language systems.
CLIP works by jointly training an image encoder and a text encoder to predict the correct pairings of images and their text descriptions. This creates a shared embedding space in which similar concepts from both modalities sit close together.
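To make the pairing objective concrete, here is a minimal sketch of the symmetric contrastive loss in PyTorch. It assumes `image_embeds` and `text_embeds` are batches of encoder outputs already projected into the shared space; the function name and the temperature value are illustrative, not CLIP's exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot product becomes cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_embeds @ text_embeds.t() / temperature

    # Matching image-text pairs lie on the diagonal, so targets are 0..N-1.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: pick the right text for each image,
    # and the right image for each text.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```

The loss pulls each image toward its paired caption and pushes it away from every other caption in the batch, which is what produces the shared embedding space described above.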
CLIP consists of a vision transformer (ViT) or ResNet as the image encoder and a transformer as the text encoder. It was trained on 400 million image-text pairs collected from the internet, using a contrastive loss to align the representations from both modalities.
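For a sense of how a pretrained CLIP checkpoint is used in practice, the sketch below performs zero-shot image classification with the Hugging Face `transformers` library; the checkpoint name is a publicly released ViT-B/32 model, while the image path and candidate labels are placeholders.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Tokenize the candidate captions and preprocess the image together.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Because classification here is just "which caption is most similar to this image," new label sets can be used at inference time without any retraining.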