SmolVLM2-2.2B-Instruct

by HuggingFaceTB

2.2B video-native VLM fitting in 5.2 GB VRAM with strong document and science understanding

238K downloads/month

2.2B parameters

Identifiers

Model ID

HuggingFaceTB/SmolVLM2-2.2B-Instruct

Feature URI

mixpeek://image_extractor@v1/hf_smolvlm2_22b_v1
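
The Feature URI above appears to pack an extractor name, a version, and a feature name into one string. As a minimal sketch, assuming the layout `mixpeek://<extractor>@<version>/<feature>` (inferred from this single example, not from a published specification), a hypothetical `parseFeatureUri` helper could split it into its parts. Neither the helper nor the `FeatureUri` type is part of the Mixpeek SDK.

```typescript
// Hypothetical helper, not part of the Mixpeek SDK. The URI layout
// (mixpeek://<extractor>@<version>/<feature>) is an assumption inferred
// from the one example shown on this page.
interface FeatureUri {
  extractor: string;
  version: string;
  feature: string;
}

function parseFeatureUri(uri: string): FeatureUri {
  // Capture groups: 1 = extractor name, 2 = version, 3 = feature name
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Unrecognized feature URI: ${uri}`);
  }
  return { extractor: match[1], version: match[2], feature: match[3] };
}

const parsed = parseFeatureUri("mixpeek://image_extractor@v1/hf_smolvlm2_22b_v1");
console.log(parsed.extractor, parsed.version, parsed.feature);
```

Throwing on anything that does not match keeps a mistyped URI from silently producing empty fields.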

Overview

SmolVLM2 is Hugging Face's lightweight multimodal model for efficient video, image, and text analysis, with only 2.2B parameters. Built on a SigLIP vision encoder and a SmolLM2 text decoder, it processes video natively and needs just 5.2 GB of GPU RAM for video inference, which is small enough for consumer GPUs and edge devices.

On Mixpeek, SmolVLM2 enables cost-efficient visual captioning and understanding for high-volume video pipelines where larger VLMs would be prohibitively expensive. It scores 72.9% on OCRBench and 90% on ScienceQA, making it effective for document understanding and structured content analysis at a fraction of the compute cost of 7B+ models.

Architecture

SmolVLM2 pairs a SigLIP vision encoder with a SmolLM2 text decoder in a Llama-style architecture, at 2.2B parameters. It processes video frames natively with temporal understanding and needs only 5.2 GB of GPU RAM for video inference. The model is released under the Apache 2.0 license.
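
The 5.2 GB figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the weights are held in 16-bit precision (2 bytes per parameter) and counts 1 GB as 10^9 bytes. Both are assumptions made for illustration, not values stated on this page. The helper `weightMemoryGB` is hypothetical.

```typescript
// Rough estimate only. Assumptions (not from the model card):
//   - weights stored in 16-bit precision, i.e. 2 bytes per parameter
//   - 1 GB = 1e9 bytes
const BYTES_PER_PARAM = 2;
const REPORTED_VIDEO_INFERENCE_GB = 5.2; // figure quoted on this page

function weightMemoryGB(params: number, bytesPerParam: number): number {
  return (params * bytesPerParam) / 1e9;
}

// Weights for the 2.2B-parameter model
const smolWeightsGB = weightMemoryGB(2.2e9, BYTES_PER_PARAM);
// What remains of the reported figure for activations, caches, and frames
const remainderGB = REPORTED_VIDEO_INFERENCE_GB - smolWeightsGB;
// Weights alone for a 7B-parameter model under the same assumptions
const sevenBWeightsGB = weightMemoryGB(7e9, BYTES_PER_PARAM);

console.log(`2.2B weights: ${smolWeightsGB.toFixed(1)} GB`);
console.log(`remainder of 5.2 GB: ${remainderGB.toFixed(1)} GB`);
console.log(`7B weights: ${sevenBWeightsGB.toFixed(1)} GB`);
```

Under these assumptions the weights account for roughly 4.4 GB of the 5.2 GB, while a 7B model would need about 14 GB for weights alone. This is one way to read the comparison with 7B+ models.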

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

// Replace "API_KEY" with your Mixpeek API key
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_caption",
    version: "v1",
    parameters: { model_id: "HuggingFaceTB/SmolVLM2-2.2B-Instruct" },
  },
});

Capabilities

Native video understanding (Video-MME: 52.1%, MLVU: 55.2%)
OCR and document understanding (OCRBench: 72.9%, DocVQA: 80.0%)
Science reasoning (ScienceQA: 90%)
Only 5.2 GB GPU RAM for video inference
Apache 2.0 open-source license

Use Cases on Mixpeek

High-volume video captioning on consumer GPUs for content libraries at minimal cost

Edge-deployed visual QA for mobile apps and embedded devices at 2.2B parameters

Document understanding and OCR-driven indexing for lightweight processing pipelines

Benchmarks

Dataset	Metric	Score	Source
Video-MME	Accuracy	52.1%	Hugging Face, 2025: Model Card
OCRBench	Accuracy	72.9%	Hugging Face, 2025: Model Card
ScienceQA	Accuracy	90.0%	Hugging Face, 2025: Model Card