blip2-opt-2.7b
by Salesforce
Bootstrapping Language-Image Pre-training with frozen image encoders and frozen LLMs
Salesforce/blip2-opt-2.7b
mixpeek://image_extractor@v1/salesforce_blip2_v1
Overview
BLIP-2 bridges the modality gap between vision and language using a lightweight Querying Transformer (Q-Former) that connects a frozen image encoder to a frozen large language model. This enables powerful visual question answering and image captioning.
On Mixpeek, BLIP-2 generates rich natural language descriptions of video frames and images, making visual content searchable with full-text queries.
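To illustrate why generated descriptions make visual content full-text searchable, here is a minimal sketch. The frame captions and the keyword-matching logic are invented for illustration; Mixpeek's actual full-text search runs server-side over the indexed descriptions.

```typescript
// Toy sketch: BLIP-2-style frame descriptions (invented examples) indexed
// for keyword search. This only illustrates the idea; it is not Mixpeek's
// search implementation.
interface Frame {
  timestamp: number;   // seconds into the video
  description: string; // natural-language caption produced per frame
}

const frames: Frame[] = [
  { timestamp: 0, description: "a red car driving down a city street at night" },
  { timestamp: 12, description: "two people shaking hands in an office" },
  { timestamp: 30, description: "a dog running across a grassy field" },
];

// Return frames whose description contains every query term.
function searchFrames(query: string, index: Frame[]): Frame[] {
  const terms = query.toLowerCase().split(/\s+/);
  return index.filter((f) =>
    terms.every((t) => f.description.toLowerCase().includes(t))
  );
}

const hits = searchFrames("red car", frames);
```

Because every frame carries a caption, a plain keyword query like "red car" resolves to a timestamped location in the video without any manual tagging.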
Architecture
Three-component architecture: (1) a frozen ViT-G/14 image encoder, (2) a Q-Former with 32 learnable query tokens that bridges vision and language, (3) a frozen OPT-2.7B language model. Only the Q-Former is trained.
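The data flow through the three components can be sketched at the level of token counts alone. The dimensions below (1408-wide ViT features, 768-wide Q-Former queries, 2560-wide OPT embeddings) are taken from the BLIP-2 paper; this is a shape-level illustration, not an implementation.

```typescript
// Shape-level sketch of the BLIP-2 forward pass (not a real implementation).
// A 224x224 image becomes 257 visual tokens; the Q-Former compresses them
// into a fixed set of 32 query tokens, which are linearly projected to the
// frozen language model's embedding width.
type Shape = [tokens: number, width: number];

function frozenImageEncoder(imageSide: number): Shape {
  const patches = (imageSide / 14) ** 2; // 14x14-pixel patches
  return [patches + 1, 1408];           // +1 for the [CLS] token
}

function qFormer(_visualFeatures: Shape): Shape {
  return [32, 768];                     // 32 learnable queries, regardless of input size
}

function projectToLLM(queries: Shape): Shape {
  return [queries[0], 2560];            // projection to OPT-2.7B's hidden width
}

const visual = frozenImageEncoder(224); // [257, 1408]
const queries = qFormer(visual);        // [32, 768]
const llmInput = projectToLLM(queries); // [32, 2560]
```

The compression from 257 visual tokens to 32 queries is what keeps the frozen LLM's input short and the trainable parameter count small.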
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";
const mx = new Mixpeek({ apiKey: "API_KEY" });
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/video.mp4" },
  feature_extractors: [{
    name: "scene_description",
    version: "v1",
    params: {
      model_id: "Salesforce/blip2-opt-2.7b"
    }
  }]
});
Capabilities
- Natural language scene descriptions
- Visual question answering
- Image-grounded text generation
- Zero-shot visual reasoning
Use Cases on Mixpeek
Specification
Research Paper
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
arxiv.org
Build a pipeline with blip2-opt-2.7b
Add this model to a processing pipeline alongside other extractors, then combine it with retrieval stages for end-to-end search over the generated descriptions.
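A multi-extractor request can follow the same shape as the ingest example above. Note that the second extractor's name ("text_embedding") and its params are hypothetical placeholders; check Mixpeek's extractor catalog for actual identifiers.

```typescript
// Sketch of a multi-extractor ingest payload, mirroring the request shape
// shown earlier. The "text_embedding" extractor and its params are
// hypothetical and used only to illustrate chaining extractors.
const pipelineConfig = {
  collection_id: "my-collection",
  source: { url: "https://example.com/video.mp4" },
  feature_extractors: [
    {
      name: "scene_description",       // BLIP-2 caption per frame
      version: "v1",
      params: { model_id: "Salesforce/blip2-opt-2.7b" },
    },
    {
      name: "text_embedding",          // hypothetical downstream extractor
      version: "v1",
      params: { source_field: "scene_description" },
    },
  ],
};
```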