Cross-modal tasks generate or retrieve content in one modality based on input from another (e.g., image captioning or text-guided image retrieval).
Text-to-image tasks produce or retrieve images from a text input; image-to-text tasks go the other way, producing or retrieving text for a given image. Typical applications include image captioning (image in, sentence out) and text-guided image retrieval (query in, ranked images out).
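Models for these tasks combine text and image data, typically by mapping both modalities into a shared multimodal embedding space and by using attention mechanisms to relate words to image regions. Common architectures include transformer-based models, used in both directions, and generative adversarial networks (GANs), used for generating images conditioned on text.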
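
As a rough illustration of how multimodal embeddings support text-guided image retrieval, the sketch below ranks images by cosine similarity to a text query. It assumes that text and images have already been encoded into the same vector space by some model; the toy vectors and the function names `l2_normalize` and `retrieve_images` are hypothetical and not taken from any particular library.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def retrieve_images(text_embedding, image_embeddings, top_k=2):
    """Return (image_index, cosine_similarity) pairs, best match first.

    Assumes both inputs live in the same embedding space (hypothetical setup).
    """
    text = l2_normalize(np.asarray(text_embedding, dtype=float))
    images = l2_normalize(np.asarray(image_embeddings, dtype=float))
    scores = images @ text                 # one cosine score per image
    order = np.argsort(-scores)[:top_k]    # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Toy 3-dimensional embeddings standing in for real encoder outputs.
image_embeddings = np.array([
    [0.9, 0.1, 0.0],   # image 0: close to the query direction
    [0.0, 1.0, 0.0],   # image 1: orthogonal to the query
    [0.7, 0.0, 0.7],   # image 2: partially aligned with the query
])
query = np.array([1.0, 0.0, 0.0])

results = retrieve_images(query, image_embeddings, top_k=2)
for index, score in results:
    print(f"image {index}: similarity {score:.3f}")
```

The same ranking step works in the other direction (image query, text candidates) by swapping the roles of the two embedding sets; in a real system the vectors would come from trained text and image encoders rather than hand-written arrays.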