InternVL3_5-8B

by OpenGVLab

Roughly 4x faster successor to InternVL3, with Cascade Reinforcement Learning and dynamic resolution

N/A downloads/month

8.5B params

Identifiers

Model ID

OpenGVLab/InternVL3_5-8B

Feature URI

mixpeek://image_extractor@v1/opengvlab_internvl35_8b_v1

Overview

InternVL3.5 is a major upgrade over InternVL3. It adds Cascade Reinforcement Learning (Cascade RL), which the authors report improves overall reasoning by up to 16%; a Visual Resolution Router for dynamic resolution allocation; and Decoupled Vision-Language Deployment, which they report yields a roughly 4x (4.05x) inference speedup. The authors report state-of-the-art results among open-source VLMs on multimodal reasoning, and the 8.5B-parameter model fits on a single A100.
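The single-A100 claim can be sanity-checked with weight-only arithmetic. The sketch below is a rough estimate, not a deployment guide: the precision table and the `weightMemoryGB` helper are illustrative assumptions, since this page does not state the serving precision, and the figure ignores activations, KV cache, and framework overhead.

```typescript
// Rough weight-only memory estimate for a dense model.
// Assumption: bytes-per-parameter values below are the usual storage sizes;
// this page does not say which precision is actually served.
const BYTES_PER_PARAM = { fp32: 4, bf16: 2, int8: 1 } as const;
type Precision = keyof typeof BYTES_PER_PARAM;

// Hypothetical helper: returns decimal gigabytes (1 GB = 1e9 bytes).
function weightMemoryGB(params: number, precision: Precision): number {
  return (params * BYTES_PER_PARAM[precision]) / 1e9;
}

const TOTAL_PARAMS = 8.5e9; // total parameter count stated on this page

console.log(weightMemoryGB(TOTAL_PARAMS, "bf16")); // 17
console.log(weightMemoryGB(TOTAL_PARAMS, "fp32")); // 34
```

At 16-bit precision the weights take about 17 GB, which leaves headroom on either the 40 GB or 80 GB A100 variant for activations and cache.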

On Mixpeek, InternVL3.5 powers scene captioning, visual QA, and document understanding at lower latency than InternVL3. The Visual Resolution Router automatically allocates more pixels to complex images and fewer to simple ones.

Architecture

InternVL3.5-8B pairs an InternViT-300M vision encoder with a Qwen3-8B language model, for 8.5B total parameters. It is trained with Cascade RL using progressive difficulty. The Visual Resolution Router dynamically selects a resolution between 224 and 1024 px per image. Decoupled deployment separates vision and language inference, which accounts for the roughly 4x speedup.
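To make the idea of per-image resolution selection concrete, here is a toy sketch. It is not the model's Visual Resolution Router, which is a learned component: the `complexity` score in [0, 1], the linear mapping, and the 32 px step are all hypothetical choices made for illustration. Only the 224 to 1024 px range comes from the description above.

```typescript
// Toy illustration only: maps a hypothetical complexity score to a resolution.
// The real router is learned; nothing here reflects its actual logic.
const MIN_RES = 224; // lower bound from the description above
const MAX_RES = 1024; // upper bound from the description above
const STEP = 32; // assumed granularity (hypothetical)

function pickResolution(complexity: number): number {
  // Clamp the score so out-of-range inputs cannot escape the bounds.
  const c = Math.min(1, Math.max(0, complexity));
  // Interpolate linearly between the bounds, then snap to the step grid.
  const raw = MIN_RES + c * (MAX_RES - MIN_RES);
  const snapped = Math.round(raw / STEP) * STEP;
  return Math.min(MAX_RES, Math.max(MIN_RES, snapped));
}

console.log(pickResolution(0)); // 224
console.log(pickResolution(0.5)); // 640
console.log(pickResolution(1)); // 1024
```

Simple images land near the lower bound and cost fewer pixels to encode, while dense documents or charts move toward the upper bound.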

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_caption",
    version: "v1",
    parameters: { model_id: "OpenGVLab/InternVL3_5-8B" },
  },
});

Capabilities

Up to 16% better reasoning than InternVL3 via Cascade RL
Roughly 4x (4.05x reported) faster inference via Decoupled Vision-Language Deployment
Dynamic resolution: allocates pixels based on image complexity
GUI interaction and embodied agency capabilities
Thinking mode with explicit chain-of-thought reasoning

Use Cases on Mixpeek

Scene captioning at scale: describe video frames with higher quality and lower latency

Visual QA: answer complex questions about image and document content

GUI understanding: extract information from application screenshots

Chart and diagram interpretation: answer questions about visual data

Benchmarks

Dataset	Metric	Score	Source
Overall reasoning (vs InternVL3)	Improvement	+16.0%	OpenGVLab, 2025: arXiv:2508.18265
Inference speed (vs InternVL3)	Speedup	4.05x	OpenGVLab, 2025: arXiv:2508.18265
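The speedup figure converts to latency with simple arithmetic. The sketch below assumes the reported 4.05x applies uniformly to a request, which real workloads may not match; the helper names and the 810 ms baseline are hypothetical and are not measurements.

```typescript
// Speedup figure taken from the benchmark table above.
const REPORTED_SPEEDUP = 4.05;

// Hypothetical helper: latency after applying a speedup factor.
function projectedLatencyMs(baselineMs: number, speedup: number = REPORTED_SPEEDUP): number {
  if (speedup <= 0) throw new RangeError("speedup must be positive");
  return baselineMs / speedup;
}

// Hypothetical helper: the same speedup expressed as a percentage reduction.
function latencyReductionPct(speedup: number = REPORTED_SPEEDUP): number {
  return (1 - 1 / speedup) * 100;
}

// 810 ms is an invented baseline used only to show the arithmetic.
console.log(projectedLatencyMs(810).toFixed(0)); // "200"
console.log(latencyReductionPct().toFixed(1)); // "75.3"
```

A 4.05x speedup therefore corresponds to about a 75% reduction in latency, provided the baseline was measured under comparable conditions.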