blip2-opt-2.7b
by Salesforce
Bootstrapping Language-Image Pre-training with frozen LLMs
Salesforce/blip2-opt-2.7b
mixpeek://image_extractor@v1/salesforce_blip2_v1
Overview
BLIP-2 bridges the modality gap between vision and language using a lightweight Querying Transformer (Q-Former) that connects a frozen image encoder to a frozen large language model. Because both large models stay frozen, only the Q-Former needs training, and the combined system can perform visual question answering and image captioning.
On Mixpeek, BLIP-2 generates rich natural language descriptions of video frames and images, making visual content searchable with full-text queries.
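To illustrate why generated captions make visual content searchable, here is a minimal sketch of keyword search over caption text. It is not how Mixpeek implements retrieval; the `CaptionRecord` type and the `tokenize`, `buildIndex`, and `search` functions are hypothetical names introduced only for this example, and the captions are made up.

```typescript
// Hypothetical record: one generated description per frame or image.
interface CaptionRecord {
  assetId: string;
  caption: string;
}

// Lowercase the text and split it on anything that is not a letter or digit.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Build an inverted index: token -> set of asset ids whose caption contains it.
function buildIndex(records: CaptionRecord[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const record of records) {
    for (const token of tokenize(record.caption)) {
      let ids = index.get(token);
      if (ids === undefined) {
        ids = new Set<string>();
        index.set(token, ids);
      }
      ids.add(record.assetId);
    }
  }
  return index;
}

// Return the sorted asset ids whose captions contain every query token.
function search(index: Map<string, Set<string>>, query: string): string[] {
  const tokens = tokenize(query);
  if (tokens.length === 0) return [];
  const first = index.get(tokens[0]);
  let result: string[] = first ? Array.from(first) : [];
  for (const token of tokens.slice(1)) {
    const ids = index.get(token);
    result = result.filter((id) => ids !== undefined && ids.has(id));
  }
  return result.sort();
}
```

A query such as `search(index, "dog beach")` returns only the assets whose captions mention both words, which is the sense in which captions turn pixels into text a full-text engine can match.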
Architecture
BLIP-2 has three stages: (1) a frozen ViT-g/14 image encoder, (2) a Q-Former with 32 learnable query tokens that bridges vision and language, and (3) a frozen OPT 2.7B language model. Only the Q-Former is trained.
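The split between frozen and trained stages can be sketched as plain data. This is a descriptive model only, not code from BLIP-2 or Mixpeek; `Stage`, `stages`, `trainableStages`, and `visualTokensForLlm` are illustrative names. The one behavior it encodes is that the language model always receives one embedding per query token, however many patch features the image encoder emits.

```typescript
// Illustrative description of a pipeline stage (not a real API type).
interface Stage {
  name: string;
  frozen: boolean; // true if the weights are not updated during BLIP-2 training
}

// Number of learnable Q-Former query tokens, as stated in the description above.
const NUM_QUERY_TOKENS = 32;

const stages: Stage[] = [
  { name: "ViT image encoder", frozen: true },
  { name: "Q-Former", frozen: false },
  { name: "OPT 2.7B language model", frozen: true },
];

// Names of the stages whose weights are updated during training.
function trainableStages(all: Stage[]): string[] {
  return all.filter((s) => !s.frozen).map((s) => s.name);
}

// The Q-Former emits one output per query token, so the count of visual
// tokens passed to the language model does not depend on the patch count.
function visualTokensForLlm(_numPatchFeatures: number): number {
  return NUM_QUERY_TOKENS;
}
```

Keeping the visual prefix at a fixed 32 tokens is what lets the Q-Former act as a bottleneck: the language model's input length stays constant regardless of the encoder's output size.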
Mixpeek SDK Integration
    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [
        {
          name: "scene_description",
          version: "v1",
          params: {
            model_id: "Salesforce/blip2-opt-2.7b"
          }
        }
      ]
    });
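When ingesting many files, it can help to build the request payload in one place. The sketch below mirrors only the fields shown in the ingest call above; the `IngestRequest` and `FeatureExtractorConfig` types and the `buildBlip2IngestRequest` helper are hypothetical and are not part of the Mixpeek SDK, which may define its own types.

```typescript
// Shapes copied from the ingest call above; these are not official SDK types.
interface FeatureExtractorConfig {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractorConfig[];
}

const BLIP2_MODEL_ID = "Salesforce/blip2-opt-2.7b";

// Hypothetical helper: returns the same payload as the example above,
// parameterized by collection id and source URL.
function buildBlip2IngestRequest(collectionId: string, url: string): IngestRequest {
  return {
    collection_id: collectionId,
    source: { url },
    feature_extractors: [
      {
        name: "scene_description",
        version: "v1",
        params: { model_id: BLIP2_MODEL_ID },
      },
    ],
  };
}

// Build one request per source URL, all targeting the same collection.
function buildBatch(collectionId: string, urls: string[]): IngestRequest[] {
  return urls.map((url) => buildBlip2IngestRequest(collectionId, url));
}
```

Each resulting object could then be passed to `mx.collections.ingest(...)` in the same way as the literal payload in the example above.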
Capabilities
- Natural language scene descriptions
- Visual question answering
- Image-grounded text generation
- Zero-shot visual reasoning
Use Cases on Mixpeek
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| COCO Captioning | CIDEr | 145.8 | Li et al., 2023 — Table 3 |
| VQAv2 (test-dev) | Accuracy | 65.0% | Li et al., 2023 — Table 4 |
| NoCaps (val) | CIDEr | 121.6 | Li et al., 2023 — Table 3 |
Performance
Includes OPT-2.7B LLM decoder for caption generation
Specification
Research Paper
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
arxiv.org
Build a pipeline with blip2-opt-2.7b
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.