A multimodal task that automatically produces natural language descriptions of image content. Image captioning creates searchable text representations of visual data, enabling keyword search over image collections in multimodal systems.
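As a rough illustration of how captions make images keyword-searchable, the sketch below builds a small inverted index over caption text. The file names, captions, and the `build_index` and `search` helpers are all hypothetical, and a production system would use a proper search engine rather than this in-memory structure.

```python
from collections import defaultdict


def build_index(captions):
    """Map each lowercase word to the set of image ids whose caption contains it."""
    index = defaultdict(set)
    for image_id, caption in captions.items():
        for word in caption.lower().split():
            # Strip trailing punctuation so "beach." matches the query "beach".
            index[word.strip(".,!?")].add(image_id)
    return index


def search(index, query):
    """Return the ids of images whose captions contain every query word."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical captions, standing in for the output of a captioning model.
captions = {
    "img_001.jpg": "A dog runs across a sandy beach.",
    "img_002.jpg": "Two people ride bicycles on a city street.",
    "img_003.jpg": "A dog sleeps on a couch.",
}
index = build_index(captions)
matches = search(index, "dog beach")
```

Here `search(index, "dog beach")` intersects the image sets for each query word, so only images whose caption mentions both terms are returned.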
Image captioning models encode an image into a feature representation using a vision encoder, then decode that representation into a sequence of words using a language model. Modern approaches use vision-language models that combine pretrained visual encoders with large language models. The decoder generates words autoregressively, conditioned on the image features and previously generated tokens.
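The encode-then-decode loop can be sketched with toy stand-ins. In the example below, `encode_image` and `next_token_scores` are hypothetical placeholders for a real vision encoder and language model: the "encoder" reduces an image to a single brightness feature, and the "decoder" scores the next word from a hand-written table that looks only at the previous token, whereas a real decoder attends to all earlier tokens. Only the greedy autoregressive loop in `generate_caption` reflects how real captioners generate text.

```python
def encode_image(pixels):
    """Toy vision encoder (assumption): map pixel values in 0..255 to one brightness feature."""
    return sum(pixels) / (255 * len(pixels))


def next_token_scores(brightness, tokens):
    """Toy language model (assumption): score candidate next tokens.

    Scores depend on the image feature and on the previous token only.
    """
    previous = tokens[-1]
    table = {
        "<bos>": {"a": 1.0},
        "a": {"bright": brightness, "dark": 1.0 - brightness},
        "bright": {"scene": 1.0},
        "dark": {"scene": 1.0},
        "scene": {"<eos>": 1.0},
    }
    return table[previous]


def generate_caption(pixels, max_len=10):
    """Greedy autoregressive decoding: pick the best next token until <eos>."""
    features = encode_image(pixels)
    tokens = ["<bos>"]
    for _ in range(max_len):
        scores = next_token_scores(features, tokens)
        best = max(scores, key=scores.get)
        if best == "<eos>":
            break
        tokens.append(best)
    return " ".join(tokens[1:])


bright_caption = generate_caption([200, 220, 240])
dark_caption = generate_caption([10, 20, 30])
```

Each step feeds the tokens generated so far back into the scoring function, which is what makes the decoding autoregressive. Real systems often replace the greedy `max` with beam search or sampling.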
State-of-the-art models such as BLIP-2, LLaVA, and CogVLM connect frozen vision encoders (ViT, EVA-CLIP) to large language models via lightweight adapters. These models can generate both short captions and detailed descriptions, and they can be prompted for different levels of detail and focus areas. Evaluation metrics include CIDEr, BLEU, METEOR, and human preference ratings.
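To give a feel for how n-gram overlap metrics score a caption, the sketch below computes clipped (modified) n-gram precision, the building block of BLEU. It is not a full metric: BLEU additionally combines several n-gram orders and applies a brevity penalty, and CIDEr weights n-grams by TF-IDF. The `clipped_precision` helper and the example captions are illustrative assumptions, and tokenization here is plain whitespace splitting.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams found in the references.

    Each n-gram's count is clipped to its maximum count in any single reference,
    so repeating a matching word does not inflate the score.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        ref_counts = Counter(ngrams(ref.lower().split(), n))
        for gram, count in ref_counts.items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Hypothetical model output and human reference caption.
references = ["a dog runs on the beach"]
unigram_score = clipped_precision("a dog on the beach", references, n=1)
bigram_score = clipped_precision("a dog on the beach", references, n=2)
```

The candidate drops the word "runs", so every unigram still matches but one of its four bigrams ("dog on") does not appear in the reference. This is why higher-order n-grams are more sensitive to word order and fluency.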