A multimodal task that automatically produces natural language descriptions of image content. Image captioning creates searchable text representations of visual data, enabling keyword search over image collections in multimodal systems.
Image captioning models encode an image into a feature representation using a vision encoder, then decode that representation into a sequence of words using a language model. Modern approaches use vision-language models that combine pretrained visual encoders with large language models. The decoder generates words autoregressively, conditioned on the image features and previously generated tokens.
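The loop below is a minimal, illustrative sketch of this encode-then-decode pattern in PyTorch; the toy module sizes, GRU decoder, and greedy decoding are assumptions for brevity, not any specific published architecture.

```python
import torch
import torch.nn as nn

class ToyCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256):
        super().__init__()
        # Stand-in vision encoder; a real model would use a CNN or ViT producing patch features.
        self.vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, d_model))
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, tokens):
        img_feat = self.vision_encoder(image).unsqueeze(1)        # (B, 1, d) image prefix
        x = torch.cat([img_feat, self.token_emb(tokens)], dim=1)  # condition on image + previous tokens
        h, _ = self.decoder(x)
        return self.lm_head(h[:, -1])                             # logits for the next token

@torch.no_grad()
def greedy_caption(model, image, bos_id=1, eos_id=2, max_len=20):
    tokens = torch.full((image.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        next_id = model(image, tokens).argmax(-1, keepdim=True)  # pick most likely next word
        tokens = torch.cat([tokens, next_id], dim=1)
        if (next_id == eos_id).all():
            break
    return tokens

model = ToyCaptioner()
caption_ids = greedy_caption(model, torch.randn(1, 3, 32, 32))
print(caption_ids.shape)  # token IDs; a real system maps these back to words
```

In practice, greedy decoding is often replaced by beam search or sampling, but the conditioning structure is the same: every step sees the image features plus the tokens generated so far.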
State-of-the-art approaches such as BLIP-2, LLaVA, and CogVLM connect frozen vision encoders (e.g., ViT, EVA-CLIP) to large language models through lightweight adapter modules. These models can generate both short captions and detailed descriptions. Evaluation metrics include CIDEr, BLEU, METEOR, and human preference ratings. Models can be prompted for different levels of detail and focus areas.
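As a concrete illustration, the sketch below uses BLIP-2 through the Hugging Face transformers library, first without a prompt for a short caption and then with a text prompt to steer toward a more detailed description; the checkpoint name, example image URL, and prompt wording are assumptions, not a prescribed recipe.

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Example checkpoint; other BLIP-2 variants follow the same interface.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
model.to("cuda" if torch.cuda.is_available() else "cpu")

# Example image (a commonly used COCO validation photo).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Short caption: image only, no text prompt.
inputs = processor(images=image, return_tensors="pt").to(model.device)
ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(ids, skip_special_tokens=True)[0].strip())

# More detailed, focused description: prepend a text prompt.
prompt = "Question: Describe the scene in detail. Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
ids = model.generate(**inputs, max_new_tokens=80)
print(processor.batch_decode(ids, skip_special_tokens=True)[0].strip())
```

Generated captions are then typically scored against human-written references with metrics such as CIDEr, BLEU, or METEOR, or judged directly via human preference ratings.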