Kimi-VL-A3B-Thinking-2506

by moonshotai

Efficient MoE reasoning VLM with 2.8B activated parameters and SOTA video understanding

10.3K downloads/month

16B total / 2.8B active params


Identifiers

Model ID

moonshotai/Kimi-VL-A3B-Thinking-2506

Feature URI

mixpeek://image_extractor@v1/moonshotai_kimi_vl_a3b_v1
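The feature URI above appears to follow an `extractor@version/feature` layout. The sketch below splits a URI of that shape into its parts. The layout is inferred from this single example rather than from a published specification, and `parseFeatureUri` and the `FeatureUri` type are hypothetical names, not part of the Mixpeek SDK.

```typescript
// Hypothetical helper: the URI layout (extractor@version/feature) is inferred
// from the one example on this page, not from a documented spec.
interface FeatureUri {
  extractor: string;
  version: string;
  feature: string;
}

function parseFeatureUri(uri: string): FeatureUri {
  // scheme "mixpeek://", then <extractor>@<version>/<feature>
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Unrecognized feature URI: ${uri}`);
  }
  return { extractor: match[1], version: match[2], feature: match[3] };
}

const parsed = parseFeatureUri(
  "mixpeek://image_extractor@v1/moonshotai_kimi_vl_a3b_v1"
);
console.log(parsed.extractor, parsed.version, parsed.feature);
```

A regular expression is used instead of the built-in `URL` class because `URL` would treat the part before `@` as a username and the version as a host, which obscures the intent.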

Overview

Kimi-VL-A3B-Thinking-2506 is Moonshot AI's efficient Mixture-of-Experts vision-language model. It activates only 2.8B of its 16B total parameters per forward pass. It achieves state-of-the-art video understanding among open-source models while supporting native-resolution images up to 3.2 megapixels and a 131K-token context.

On Mixpeek, Kimi-VL powers high-quality scene captioning, visual reasoning, and OCR extraction at a fraction of the compute cost of dense 7B+ models. Its MoE architecture makes it especially cost-effective for batch processing large video libraries.

Architecture

Mixture-of-Experts VLM: MoonViT vision encoder (native-resolution, up to 3.2M pixels) + MLP projector + Moonlight-16B-A3B MoE language decoder. 16B total / ~2.8B activated parameters. 131K max context. Trained with long-CoT SFT + reinforcement learning; uses roughly 20% fewer thinking tokens on average than the previous Kimi-VL-A3B-Thinking release.
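To make the MoE efficiency figures concrete, the arithmetic can be sketched as follows. The 16B and 2.8B values come from this page. The dense 7B baseline and the rule of thumb that forward-pass compute per token scales with roughly twice the active parameter count are assumptions for illustration; the estimate ignores attention cost over long contexts and any vision-encoder overhead.

```typescript
// Figures from the architecture summary on this page (in billions).
const TOTAL_PARAMS_B = 16;
const ACTIVE_PARAMS_B = 2.8;

// Assumption: a dense 7B model as the comparison point.
const DENSE_BASELINE_PARAMS_B = 7;

// Share of the weights that participate in a single forward pass.
function activeFraction(activeB: number, totalB: number): number {
  return activeB / totalB;
}

// Rough per-token compute relative to a dense model, using the common
// approximation FLOPs/token ~ 2 x active parameters (an assumption here).
function relativeComputePerToken(activeB: number, denseB: number): number {
  return (2 * activeB) / (2 * denseB);
}

const fraction = activeFraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B);
const relative = relativeComputePerToken(ACTIVE_PARAMS_B, DENSE_BASELINE_PARAMS_B);

console.log(`Active share of weights: ${(fraction * 100).toFixed(1)}%`);
console.log(`Per-token compute vs dense 7B: ${(relative * 100).toFixed(0)}%`);
```

Only the per-token compute shrinks: all 16B parameters still have to be held in memory, because any expert may be selected for a given token.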

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_caption",
    version: "v1",
    parameters: { model_id: "moonshotai/Kimi-VL-A3B-Thinking-2506" },
  },
});

Capabilities

State-of-the-art video understanding among open-source models (65.2 on VideoMMMU)
Only 2.8B activated parameters (MoE efficiency)
Native high-resolution image support up to 3.2 megapixels
131K token context for long documents
Strong OCR (869 on OCRBench) and GUI grounding (91.4 on ScreenSpot-V2)
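The 3.2-megapixel limit listed above can be checked on the client before upload. The sketch below computes an aspect-preserving target size that stays within that budget. It assumes "3.2 megapixels" means 3,200,000 pixels; `fitToPixelBudget` is a hypothetical helper, and this page does not say whether Mixpeek or the model's own preprocessing already resizes larger inputs.

```typescript
interface Size {
  width: number;
  height: number;
}

// Assumption: "3.2 megapixels" is read as 3,200,000 pixels.
const MAX_PIXELS = 3200000;

// Returns the input size unchanged if it fits the budget; otherwise returns
// a uniformly scaled-down size whose pixel count does not exceed maxPixels.
function fitToPixelBudget(size: Size, maxPixels: number = MAX_PIXELS): Size {
  const pixels = size.width * size.height;
  if (pixels <= maxPixels) {
    return { width: size.width, height: size.height };
  }
  // Scaling both sides by sqrt(budget / pixels) scales the area by budget / pixels.
  const scale = Math.sqrt(maxPixels / pixels);
  return {
    // Flooring keeps the product at or below the budget.
    width: Math.max(1, Math.floor(size.width * scale)),
    height: Math.max(1, Math.floor(size.height * scale)),
  };
}

const target = fitToPixelBudget({ width: 4000, height: 3000 });
console.log(`${target.width} x ${target.height}`);
```

A 1920 x 1080 frame (about 2.07 megapixels) passes through unchanged, while a 12-megapixel photo is scaled down by about half on each side.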

Use Cases on Mixpeek

Video scene captioning at scale: describe every scene in large video archives

Document understanding: extract structured data from scanned documents and forms

Visual reasoning: answer complex questions about image and video content

GUI and screenshot analysis: extract information from application interfaces

Benchmarks

Dataset	Metric	Score	Source
VideoMMMU	Accuracy	65.2	Moonshot AI, 2025: arXiv:2504.07491
MMMU	Pass@1	64.0	Moonshot AI, 2025: arXiv:2504.07491
MathVision	Pass@1	56.9	Moonshot AI, 2025: arXiv:2504.07491