Cross-modal tasks involving the generation or retrieval of one modality based on another (e.g., image captioning or text-guided image retrieval).
Text-to-image and image-to-text tasks take input in one modality and generate or retrieve content in the other. Image captioning is an image-to-text task: the model receives an image and produces a description. Text-guided image retrieval is a text-to-image task: the model receives a text query and returns the images that best match it.
Models for these tasks integrate text and image data, often using attention mechanisms and multimodal embeddings that represent both modalities in a shared vector space. Common techniques include transformer-based models and generative adversarial networks (GANs), the latter used to generate images.
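To make the retrieval side concrete, here is a minimal sketch of text-guided image retrieval over multimodal embeddings. It assumes a text encoder and an image encoder have already mapped their inputs into the same vector space; the three-dimensional vectors, the file names, and the helper names `cosine_similarity` and `rank_images` are hypothetical and chosen only for illustration. A real system would obtain the embeddings from a trained model and would likely use an approximate nearest-neighbor index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_images(query_embedding, image_embeddings):
    """Return (image_id, score) pairs, most similar image first."""
    scored = [
        (image_id, cosine_similarity(query_embedding, embedding))
        for image_id, embedding in image_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical image embeddings, written by hand for this example.
IMAGE_EMBEDDINGS = {
    "dog_on_beach.jpg": [0.9, 0.1, 0.3],
    "city_skyline.jpg": [0.1, 0.9, 0.2],
    "cat_on_sofa.jpg": [0.7, 0.2, 0.1],
}

# Stands in for the text encoder's output for a query such as
# "a dog playing by the sea" (also hand-written, not model output).
QUERY_EMBEDDING = [0.8, 0.0, 0.4]

ranking = rank_images(QUERY_EMBEDDING, IMAGE_EMBEDDINGS)
for image_id, score in ranking:
    print(f"{image_id}: {score:.3f}")
```

With these toy vectors the beach image ranks first and the skyline last. The same ranking function works in the opposite direction: embedding an image as the query and ranking candidate captions gives a retrieval-based form of image-to-text.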