SmolVLM2-2.2B-Instruct
by HuggingFaceTB
2.2B video-native VLM fitting in 5.2 GB VRAM with strong document and science understanding
HuggingFaceTB/SmolVLM2-2.2B-Instruct
mixpeek://image_extractor@v1/hf_smolvlm2_22b_v1
Overview
SmolVLM2 is Hugging Face's lightweight multimodal model for efficient video, image, and text analysis at only 2.2B parameters. Built on a SigLIP vision encoder and a SmolLM2 text decoder, it processes videos natively while fitting in just 5.2 GB of GPU RAM, small enough for consumer GPUs and edge devices.
On Mixpeek, SmolVLM2 enables cost-efficient visual captioning and understanding for high-volume video pipelines where larger VLMs would be prohibitively expensive. It scores 72.9% on OCRBench and 90% on ScienceQA, making it effective for document understanding and structured content analysis at a fraction of the compute cost of 7B+ models.
Architecture
SmolVLM2 pairs a SigLIP vision encoder with a SmolLM2 text decoder in a Llama-style architecture, for a total of 2.2B parameters. It supports native video frame processing with temporal understanding and needs only 5.2 GB of GPU RAM for video inference. The model is released under the Apache 2.0 license.
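The 5.2 GB figure is easier to interpret next to a back-of-the-envelope estimate of weight memory. The sketch below assumes 16-bit weights (2 bytes per parameter); the precision is an assumption, since the text here does not state it, and `estimateWeightGB` is a hypothetical helper introduced only for illustration.

```typescript
// Rough weight-memory estimate: parameter count x bytes per parameter.
// Assumption: 16-bit (2-byte) weights; the source does not state the precision.
function estimateWeightGB(paramCount: number, bytesPerParam: number): number {
  return (paramCount * bytesPerParam) / 1e9; // decimal gigabytes
}

const PARAMS = 2.2e9;         // 2.2B parameters, from the text
const REPORTED_VRAM_GB = 5.2; // video inference footprint, from the text

const weightsGB = estimateWeightGB(PARAMS, 2);
const headroomGB = REPORTED_VRAM_GB - weightsGB;

console.log(
  `weights ~ ${weightsGB.toFixed(1)} GB, remaining ~ ${headroomGB.toFixed(1)} GB`
);
```

Under that assumption, the weights alone would account for about 4.4 GB, leaving roughly 0.8 GB of the reported footprint for everything else, such as activations and decoded frames. Actual usage depends on precision, frame count, and runtime.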
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/product-demo.mp4" },
  feature_extractors: [
    {
      name: "scene_caption",
      version: "v1",
      params: {
        model_id: "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
      }
    }
  ]
});
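For high-volume pipelines, the request body in the snippet above can be generated per video rather than written by hand. The following is a minimal sketch that assumes the payload shape shown in that snippet. The `FeatureExtractor` and `IngestRequest` types, the `buildIngestRequests` helper, and the second example URL are hypothetical names introduced here; they are not part of the Mixpeek SDK.

```typescript
// Hypothetical types mirroring the request shape used in the SDK snippet.
// Illustrative only: these are not exported by the Mixpeek SDK.
interface FeatureExtractor {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

const SMOLVLM2_MODEL_ID = "HuggingFaceTB/SmolVLM2-2.2B-Instruct";

// Build one ingest request per video URL, all using the same captioning extractor.
function buildIngestRequests(collectionId: string, urls: string[]): IngestRequest[] {
  return urls.map((url) => ({
    collection_id: collectionId,
    source: { url },
    feature_extractors: [
      { name: "scene_caption", version: "v1", params: { model_id: SMOLVLM2_MODEL_ID } },
    ],
  }));
}

// Placeholder URLs; the second one is made up for this example.
const requests = buildIngestRequests("my-collection", [
  "https://example.com/product-demo.mp4",
  "https://example.com/unboxing.mp4",
]);

console.log(JSON.stringify(requests[0], null, 2));
```

Each element of `requests` has the same fields as the object passed to `mx.collections.ingest` in the snippet above, so a loop over the array could submit them one at a time. Keeping the builder a pure function makes the payloads easy to inspect before any API call is made.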
Capabilities
- Native video understanding (Video-MME: 52.1%, MLVU: 55.2%)
- OCR and document understanding (OCRBench: 72.9%, DocVQA: 80.0%)
- Science reasoning (ScienceQA: 90%)
- Only 5.2 GB GPU RAM for video inference
- Apache 2.0 open-source license
Use Cases on Mixpeek
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| Video-MME | Accuracy | 52.1% | Hugging Face, 2025 — Model Card |
| OCRBench | Accuracy | 72.9% | Hugging Face, 2025 — Model Card |
| ScienceQA | Accuracy | 90.0% | Hugging Face, 2025 — Model Card |
Performance
Specification
Research Paper
SmolVLM2 Model Card
arxiv.org
Build a pipeline with SmolVLM2-2.2B-Instruct
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.