    HF · Scene Captioning · MIT

    Kimi-VL-A3B-Thinking-2506

    by moonshotai

    Efficient MoE reasoning VLM with 2.8B activated parameters and SOTA video understanding

    10.3K downloads/month
    16B total / 2.8B active parameters
    Identifiers
    Model ID
    moonshotai/Kimi-VL-A3B-Thinking-2506
    Feature URI
    mixpeek://image_extractor@v1/moonshotai_kimi_vl_a3b_v1

    Overview

    Kimi-VL-A3B-Thinking is Moonshot AI's efficient Mixture-of-Experts (MoE) vision-language model. It activates only 2.8B of its 16B total parameters per forward pass. It achieves state-of-the-art video understanding among open-source models, supports native-resolution images up to 3.2 megapixels, and has a 131K-token context.

    On Mixpeek, Kimi-VL powers high-quality scene captioning, visual reasoning, and OCR extraction at a fraction of the compute cost of dense 7B+ models. Its MoE architecture makes it especially cost-effective for batch processing large video libraries.

    Architecture

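    A Mixture-of-Experts VLM with three components: the MoonViT vision encoder (native resolution, up to 3.2M pixels), an MLP projector, and the Moonlight-16B-A3B MoE language decoder. It has 16B total parameters, of which about 2.8B are activated per forward pass, and a 131K-token maximum context. Training combines long chain-of-thought (CoT) supervised fine-tuning with reinforcement learning, with 20% fewer thinking tokens.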
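
    To make the "activated parameters" idea concrete, here is a toy sketch of top-k expert routing, the general mechanism behind sparse MoE layers. The expert count, the value of k, and the gate scores are hypothetical, and this is not a description of Moonlight's actual router. It only illustrates why a small subset of the total parameters runs for each input.

```typescript
// Toy top-k Mixture-of-Experts routing. Expert count, k, and gate scores are
// hypothetical; this is NOT Moonlight's actual router.

type Expert = (x: number[]) => number[];

// Numerically stable softmax over a list of scores.
function softmax(scores: number[]): number[] {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Pick the k highest-scoring experts and renormalize their gate weights.
function routeTopK(gateScores: number[], k: number): { index: number; weight: number }[] {
  const top = gateScores
    .map((score, index) => ({ score, index }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
  const weights = softmax(top.map((t) => t.score));
  return top.map((t, i) => ({ index: t.index, weight: weights[i] }));
}

// Run only the selected experts and combine their outputs by gate weight.
// Experts that are not selected are never called, which is the source of the
// compute savings.
function moeForward(x: number[], experts: Expert[], gateScores: number[], k: number): number[] {
  const out = new Array<number>(x.length).fill(0);
  for (const { index, weight } of routeTopK(gateScores, k)) {
    const y = experts[index](x);
    for (let i = 0; i < out.length; i++) out[i] += weight * y[i];
  }
  return out;
}
```

    With, say, 4 experts and k = 2, only half of the expert weights take part in a given forward pass. The 2.8B-of-16B figure above reflects the same principle at a much larger scale.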

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a video and caption its scenes with Kimi-VL.
    await mx.collections.ingest({
      collection_id: "video-library",
      source: { url: "https://example.com/training-session.mp4" },
      feature_extractors: [{
        feature: "scene_caption",
        model: "moonshotai/Kimi-VL-A3B-Thinking-2506"
      }]
    });
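
    For batch processing, one possible approach is to wrap the ingest call in a small helper that limits how many requests are in flight at once. The sketch below is an assumption-laden illustration: `ingestAll`, `IngestFn`, and the concurrency default are hypothetical names and values, not part of the Mixpeek SDK. The request shape is copied from the snippet above. The ingest function is passed in as a parameter, so the helper does not depend on any SDK behavior beyond that one call.

```typescript
// Shape of one ingest request, copied from the SDK snippet on this page.
interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: { feature: string; model: string }[];
}

// Any function that submits one request,
// e.g. (req) => mx.collections.ingest(req).
type IngestFn = (req: IngestRequest) => Promise<unknown>;

// Hypothetical helper: ingest many URLs with at most `concurrency` in flight.
// Failures are recorded per URL instead of aborting the whole batch.
async function ingestAll(
  ingest: IngestFn,
  collectionId: string,
  urls: string[],
  concurrency = 4,
): Promise<PromiseSettledResult<unknown>[]> {
  const results: PromiseSettledResult<unknown>[] = new Array(urls.length);
  let next = 0;

  // Each worker claims the next unprocessed URL until none remain.
  async function worker(): Promise<void> {
    while (next < urls.length) {
      const i = next++;
      const req: IngestRequest = {
        collection_id: collectionId,
        source: { url: urls[i] },
        feature_extractors: [
          { feature: "scene_caption", model: "moonshotai/Kimi-VL-A3B-Thinking-2506" },
        ],
      };
      try {
        results[i] = { status: "fulfilled", value: await ingest(req) };
      } catch (reason) {
        results[i] = { status: "rejected", reason };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, urls.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

    A call might look like `await ingestAll((req) => mx.collections.ingest(req), "video-library", urls)`. Whether the platform applies its own rate limits or batching is not stated here, so treat the concurrency value as a tunable guess.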

    Capabilities

    • State-of-the-art video understanding among open-source models (65.2 on VideoMMMU)
    • Only 2.8B activated parameters (MoE efficiency)
    • Native high-resolution image support up to 3.2 megapixels
    • 131K-token context for long documents
    • Strong OCR (869 on OCRBench) and GUI grounding (91.4 on ScreenSpot-V2)
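
    The 3.2-megapixel ceiling listed above can be checked on the client before upload. The sketch below takes "3.2 megapixels" to mean 3,200,000 pixels, which is an assumption. The helper name `fitToPixelBudget` is hypothetical, and how the service itself handles oversized images is not stated on this page.

```typescript
// Assumed reading of "3.2 megapixels" as 3,200,000 pixels.
const MAX_PIXELS = 3_200_000;

// Returns dimensions that fit the pixel budget while preserving aspect ratio.
// Images already within the budget are returned unchanged (scale = 1).
function fitToPixelBudget(
  width: number,
  height: number,
  maxPixels: number = MAX_PIXELS,
): { width: number; height: number; scale: number } {
  const pixels = width * height;
  if (pixels <= maxPixels) return { width, height, scale: 1 };
  // Scaling both sides by sqrt(maxPixels / pixels) scales the area to maxPixels.
  const scale = Math.sqrt(maxPixels / pixels);
  return {
    width: Math.floor(width * scale),
    height: Math.floor(height * scale),
    scale,
  };
}
```

    For example, a 1920×1080 frame (about 2.07 megapixels) passes through unchanged, while a 4000×3000 photo (12 megapixels) would be scaled down by a factor of roughly 0.52.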

    Use Cases on Mixpeek

    Video scene captioning at scale: describe every scene in large video archives
    Document understanding: extract structured data from scanned documents and forms
    Visual reasoning: answer complex questions about image and video content
    GUI and screenshot analysis: extract information from application interfaces

    Benchmarks

    Dataset      Metric     Score   Source
    VideoMMMU    Accuracy   65.2    Moonshot AI, 2025 (arXiv:2504.07491)
    MMMU         Pass@1     64.0    Moonshot AI, 2025 (arXiv:2504.07491)
    MathVision   Pass@1     56.9    Moonshot AI, 2025 (arXiv:2504.07491)

    Performance

    Input Size: Up to 3.2M pixels (native resolution)
    GPU Latency: ~45 ms / image (A100)
    GPU Throughput: ~22 images/sec (A100)
    GPU Memory: ~8 GB (MoE sparse activation)
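
    The latency and throughput figures above are consistent with each other (1000 ms / 45 ms ≈ 22 images/sec), and they allow a rough estimate of batch time. The sketch below is a back-of-the-envelope calculation only. The frame sampling rate is an assumption, the function names are hypothetical, and real throughput depends on resolution, caption length, and batching.

```typescript
// Figures quoted in the Performance section (A100).
const LATENCY_MS_PER_IMAGE = 45;
const THROUGHPUT_IMAGES_PER_SEC = 22;

// Throughput implied by per-image latency when images are processed serially.
function impliedThroughput(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Rough GPU seconds needed to caption a video sampled at `framesPerSecond`.
function estimateCaptionSeconds(
  videoSeconds: number,
  framesPerSecond: number,
  imagesPerSec: number = THROUGHPUT_IMAGES_PER_SEC,
): number {
  const frames = Math.ceil(videoSeconds * framesPerSecond);
  return frames / imagesPerSec;
}
```

    Under these assumptions, a one-hour video sampled at 1 frame per second yields 3,600 frames, or about 164 seconds of GPU time at 22 images/sec.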

    Specification

    Framework: HF
    Organization: moonshotai
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 16B total / 2.8B active
    License: MIT
    Downloads/mo: 10.3K

    Research Paper

    Kimi-VL Technical Report

    arxiv.org

    Build a pipeline with Kimi-VL-A3B-Thinking-2506

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
