    HF | Scene Captioning | Apache 2.0

    SmolVLM2-2.2B-Instruct

    by HuggingFaceTB

    2.2B video-native VLM fitting in 5.2 GB VRAM with strong document and science understanding

    238K downloads/month
    2.2B params
    Identifiers
    Model ID
    HuggingFaceTB/SmolVLM2-2.2B-Instruct
    Feature URI
    mixpeek://image_extractor@v1/hf_smolvlm2_22b_v1
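The Feature URI appears to pack three pieces of information into one string: an extractor name, an extractor version, and a feature name. The sketch below splits it apart. The layout (`mixpeek://<extractor>@<version>/<feature>`) is an assumption inferred from the single example on this page, and `parseFeatureUri` is a hypothetical helper, not part of the Mixpeek SDK.

```typescript
// Hypothetical helper. The URI layout is inferred from one example,
// not taken from Mixpeek documentation.
interface FeatureUri {
  extractor: string;
  version: string;
  feature: string;
}

function parseFeatureUri(uri: string): FeatureUri {
  // scheme "mixpeek://", then "<extractor>@<version>/<feature>"
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Unrecognized feature URI: ${uri}`);
  }
  const [, extractor, version, feature] = match;
  return { extractor, version, feature };
}

const parsed = parseFeatureUri("mixpeek://image_extractor@v1/hf_smolvlm2_22b_v1");
console.log(parsed);
```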

    Overview

    SmolVLM2 is Hugging Face's lightweight multimodal model designed for efficient video, image, and text analysis at only 2.2B parameters. Built on a SigLIP vision encoder and SmolLM2 text decoder, it processes videos natively while fitting in just 5.2 GB of GPU RAM — small enough for consumer GPUs and edge devices.

    On Mixpeek, SmolVLM2 enables cost-efficient visual captioning and understanding for high-volume video pipelines where larger VLMs would be prohibitively expensive. It scores 72.9% on OCRBench and 90% on ScienceQA, making it effective for document understanding and structured content analysis at a fraction of the compute cost of 7B+ models.

    Architecture

    The model pairs a SigLIP vision encoder with a SmolLM2 text decoder in a Llama-style architecture, for 2.2B parameters in total. It processes video frames natively with temporal understanding and needs only about 5.2 GB of GPU RAM for video inference. It is released under the Apache 2.0 license.
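To put the 5.2 GB figure in context, the back-of-the-envelope sketch below estimates how much of it the weights alone would occupy. It assumes 16-bit weights (2 bytes per parameter) and decimal gigabytes; neither assumption is stated on this page, so treat the result as a rough illustration.

```typescript
// Assumption: weights are stored in 16-bit precision (2 bytes per parameter).
// Assumption: 1 GB = 1e9 bytes.
const PARAMS = 2.2e9;          // parameter count from this page
const BYTES_PER_PARAM = 2;     // assumed, not stated in the source
const REPORTED_TOTAL_GB = 5.2; // video inference footprint from this page

function weightMemoryGb(params: number, bytesPerParam: number): number {
  return (params * bytesPerParam) / 1e9;
}

const weightsGb = weightMemoryGb(PARAMS, BYTES_PER_PARAM);
// Whatever is left would have to cover activations, caches, and frame buffers.
const remainderGb = REPORTED_TOTAL_GB - weightsGb;

console.log(`weights: ${weightsGb.toFixed(1)} GB, remainder: ${remainderGb.toFixed(1)} GB`);
```

Under these assumptions the weights account for most of the reported footprint, which is consistent with the model fitting on consumer GPUs.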

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/product-demo.mp4" },
      feature_extractors: [{
        name: "scene_caption",
        version: "v1",
        params: {
          model_id: "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
        }
      }]
    });
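For high-volume pipelines it can help to build the ingest payload in one place instead of repeating the object literal per video. The sketch below is one possible way to do that. The `IngestRequest` and `FeatureExtractor` types and the `buildCaptionRequest` helper are hypothetical; they only mirror the fields used in the snippet above and are not taken from the Mixpeek SDK's type definitions.

```typescript
// Hypothetical types mirroring the fields shown in the SDK snippet above.
interface FeatureExtractor {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractor[];
}

const SMOLVLM2_MODEL_ID = "HuggingFaceTB/SmolVLM2-2.2B-Instruct";

// Hypothetical helper: builds one scene-caption ingest payload per video URL.
function buildCaptionRequest(collectionId: string, videoUrl: string): IngestRequest {
  // Reject obviously malformed sources before anything is sent.
  if (!/^https?:\/\//.test(videoUrl)) {
    throw new Error(`Expected an http(s) URL, got: ${videoUrl}`);
  }
  return {
    collection_id: collectionId,
    source: { url: videoUrl },
    feature_extractors: [
      { name: "scene_caption", version: "v1", params: { model_id: SMOLVLM2_MODEL_ID } },
    ],
  };
}

const videoUrls = [
  "https://example.com/product-demo.mp4",
  "https://example.com/onboarding.mp4",
];
const requests = videoUrls.map((url) => buildCaptionRequest("my-collection", url));
console.log(JSON.stringify(requests[0], null, 2));
```

Each element of `requests` has the same shape as the object passed to `mx.collections.ingest` in the snippet above, so it could be passed to that call directly.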

    Capabilities

    • Native video understanding (Video-MME: 52.1%, MLVU: 55.2%)
    • OCR and document understanding (OCRBench: 72.9%, DocVQA: 80.0%)
    • Science reasoning (ScienceQA: 90%)
    • Only 5.2 GB GPU RAM for video inference
    • Apache 2.0 open-source license

    Use Cases on Mixpeek

    • High-volume video captioning on consumer GPUs for content libraries at minimal cost
    • Edge-deployed visual QA for mobile apps and embedded devices at 2.2B parameters
    • Document understanding and OCR-driven indexing for lightweight processing pipelines

    Benchmarks

    Dataset     Metric     Score    Source
    Video-MME   Accuracy   52.1%    Hugging Face, 2025 (Model Card)
    OCRBench    Accuracy   72.9%    Hugging Face, 2025 (Model Card)
    ScienceQA   Accuracy   90.0%    Hugging Face, 2025 (Model Card)

    Performance

    Input Size: Images (variable); Video (native frame processing)
    GPU Latency: ~18 ms / frame (A100)
    GPU Throughput: ~55 frames/sec (A100)
    GPU Memory: ~5.2 GB
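The latency and throughput figures above are two views of the same number: at roughly 18 ms per frame, one GPU handles about 1000 / 18 ≈ 55 frames per second. The sketch below turns that into a rough processing-time estimate for a video. It assumes frames are processed one at a time at the listed latency and ignores decoding, batching, and any other overhead; the sampling rate and function names are illustrative choices, not Mixpeek settings.

```typescript
const LATENCY_MS_PER_FRAME = 18; // approximate A100 figure from this page

// Throughput implied by a per-frame latency, assuming sequential processing.
function framesPerSecond(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Rough GPU time (in seconds) to process a video sampled at `sampledFps`.
// Assumption: no batching and no decode overhead.
function estimateGpuSeconds(videoSeconds: number, sampledFps: number, latencyMs: number): number {
  const frames = Math.ceil(videoSeconds * sampledFps);
  return (frames * latencyMs) / 1000;
}

const fps = framesPerSecond(LATENCY_MS_PER_FRAME);
// One hour of video sampled at 1 frame per second (an illustrative rate).
const oneHourSeconds = estimateGpuSeconds(3600, 1, LATENCY_MS_PER_FRAME);

console.log(`throughput: ${fps.toFixed(1)} frames/sec`);
console.log(`1 h of video at 1 fps: ${oneHourSeconds.toFixed(1)} s of GPU time`);
```

Because the estimate scales linearly with the sampling rate, halving the number of sampled frames halves the estimated GPU time.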

    Specification

    Framework: HF
    Organization: HuggingFaceTB
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 2.2B
    License: Apache 2.0
    Downloads/mo: 238K

    Research Paper

    SmolVLM2 Model Card

    arxiv.org

    Build a pipeline with SmolVLM2-2.2B-Instruct

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
