A multimodal model developed by OpenAI that learns visual concepts from natural language supervision. Common in vision-language systems.
CLIP works by jointly training an image encoder and a text encoder to predict the correct pairings of images and their text descriptions. This creates a shared embedding space where similar concepts across both modalities are positioned closely together.
CLIP consists of a vision transformer (ViT) or ResNet as the image encoder and a transformer as the text encoder. It's trained on 400 million image-text pairs from the internet, using contrastive loss to align representations from both modalities.
Connect a bucket and Mixpeek runs the whole multimodal search pipeline for you: extraction, indexing, and search over your own objects. No models to wire up, nothing to host.
Start with ManagedKeep your embeddings on your own cloud and run dense, sparse, and BM25 search directly on object storage. First 1M vectors free.
Start with MVS