    HF · Scene Captioning · Apache 2.0

    InternVL3_5-8B

    by OpenGVLab

    4x faster InternVL3 with cascade reinforcement learning and dynamic resolution

    Downloads: N/A per month
    Parameters: 8.5B

    Identifiers
    Model ID: OpenGVLab/InternVL3_5-8B
    Feature URI: mixpeek://image_extractor@v1/opengvlab_internvl35_8b_v1

    Overview

    InternVL 3.5 is a major upgrade over InternVL3. It adds Cascade Reinforcement Learning (Cascade RL), credited with a 16% gain in overall reasoning; a Visual Resolution Router that allocates resolution dynamically; and Decoupled Vision-Language Deployment, which gives a roughly 4x inference speedup. It achieves state-of-the-art results among open-source vision-language models (VLMs) on multimodal reasoning while fitting on a single A100 GPU.

    On Mixpeek, InternVL 3.5 powers high-quality scene captioning, visual QA, and document understanding at significantly lower latency than its predecessor. The dynamic resolution router automatically allocates more pixels to complex images and fewer to simple ones.

    Architecture

    The model pairs an InternViT-300M vision encoder with a Qwen3-8B language model, for 8.5B parameters in total. It is trained with Cascade RL using progressive difficulty. The Visual Resolution Router dynamically selects a resolution between 224px and 1024px per image. Decoupled deployment runs vision and language inference separately, which accounts for the roughly 4x speedup.
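To make the idea of complexity-dependent resolution concrete, here is a toy sketch. It is not the actual Visual Resolution Router, which is a learned component of the model. The complexity proxy (`complexityScore`) and the linear mapping (`pickResolution`) are assumptions invented for illustration; only the 224-1024px range comes from the text above.

```typescript
// Toy illustration only: the real Visual Resolution Router is learned;
// this hand-written heuristic is an assumption made for the example.
const MIN_SIDE = 224;  // lower bound quoted in the text
const MAX_SIDE = 1024; // upper bound quoted in the text

// Hypothetical complexity proxy: mean absolute difference between
// horizontally adjacent grayscale pixels (0-255), normalized to [0, 1].
function complexityScore(gray: number[][]): number {
  let total = 0;
  let count = 0;
  for (const row of gray) {
    for (let x = 1; x < row.length; x++) {
      total += Math.abs(row[x] - row[x - 1]);
      count++;
    }
  }
  return count === 0 ? 0 : total / (count * 255);
}

// Map a score in [0, 1] linearly onto the 224-1024px range.
function pickResolution(score: number): number {
  const clamped = Math.min(1, Math.max(0, score));
  return Math.round(MIN_SIDE + clamped * (MAX_SIDE - MIN_SIDE));
}

const flat = [[128, 128, 128], [128, 128, 128]]; // uniform image
const checker = [[0, 255, 0], [255, 0, 255]];    // maximal contrast

console.log(pickResolution(complexityScore(flat)));    // 224
console.log(pickResolution(complexityScore(checker))); // 1024
```

A uniform image lands at the low end of the range and a high-contrast one at the high end, which is the behaviour the text describes: more pixels for complex images, fewer for simple ones.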

    Mixpeek SDK Integration

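    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a video and caption its scenes with InternVL3_5-8B.
    // Top-level await requires an ES module context.
    await mx.collections.ingest({
      collection_id: "video-library",
      source: { url: "https://example.com/presentation.mp4" },
      feature_extractors: [{
        feature: "scene_caption",
        model: "OpenGVLab/InternVL3_5-8B"
      }]
    });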
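When captioning several files, it can help to build the request objects separately from the SDK call. The sketch below reuses the field names from the snippet above (`collection_id`, `source`, `feature_extractors`). The `IngestRequest` type, the `buildCaptionRequests` helper, and the second URL are hypothetical and not part of the Mixpeek SDK.

```typescript
// Request shape copied from the snippet above; the type name is an assumption.
interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: { feature: string; model: string }[];
}

const MODEL_ID = "OpenGVLab/InternVL3_5-8B";

// Hypothetical helper: one scene-caption request per source URL.
function buildCaptionRequests(collectionId: string, urls: string[]): IngestRequest[] {
  return urls.map((url) => ({
    collection_id: collectionId,
    source: { url },
    feature_extractors: [{ feature: "scene_caption", model: MODEL_ID }],
  }));
}

const requests = buildCaptionRequests("video-library", [
  "https://example.com/presentation.mp4",
  "https://example.com/keynote.mp4", // placeholder URL for illustration
]);

console.log(requests.length); // 2
```

Each entry could then be passed to `mx.collections.ingest(...)` in the same way as the single request above.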

    Capabilities

    • 16% better reasoning than InternVL3 via Cascade RL
    • 4x faster inference via Decoupled Vision-Language Deployment
    • Dynamic resolution: allocates pixels based on image complexity
    • GUI interaction and embodied agency capabilities
    • Thinking mode with explicit chain-of-thought reasoning

    Use Cases on Mixpeek

    • Scene captioning at scale: describe video frames with higher quality and lower latency
    • Visual QA: answer complex questions about image and document content
    • GUI understanding: extract information from application screenshots
    • Chart and diagram interpretation: answer questions about visual data

    Benchmarks

    Dataset                          | Metric      | Score  | Source
    Overall reasoning (vs InternVL3) | Improvement | +16.0% | OpenGVLab, 2025 (arXiv:2508.18265)
    Inference speed (vs InternVL3)   | Speedup     | 4.05x  | OpenGVLab, 2025 (arXiv:2508.18265)

    Performance

    Input Size: Dynamic resolution (224-1024px)
    GPU Latency: ~30 ms / image (A100, decoupled)
    GPU Throughput: ~33 images/sec (A100)
    GPU Memory: ~16 GB
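The latency and throughput figures above are consistent with each other: 1000 ms / 30 ms is about 33 images per second. The short sketch below works through that arithmetic and a rough captioning-time estimate. The workload (a 10-minute video sampled at 1 frame per second) is an assumption, and the estimate assumes strictly sequential processing on one GPU, ignoring video decoding, batching, and network overhead.

```typescript
// Approximate per-image latency from the table above (A100, decoupled).
const LATENCY_MS = 30;

// Throughput implied by a per-image latency, in images per second.
function imagesPerSecond(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Sequential captioning time in seconds for a given number of frames.
function captioningSeconds(frameCount: number, latencyMs: number): number {
  return (frameCount * latencyMs) / 1000;
}

// Assumed workload: a 10-minute video sampled at 1 frame per second.
const frames = 10 * 60 * 1;

console.log(imagesPerSecond(LATENCY_MS).toFixed(1)); // "33.3"
console.log(captioningSeconds(frames, LATENCY_MS));  // 18
```

Under these assumptions, captioning the 600 sampled frames takes about 18 seconds of GPU time.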

    Specification

    Framework: HF
    Organization: OpenGVLab
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 8.5B
    License: Apache 2.0
    Downloads/mo: N/A

    Research Paper

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    OpenGVLab, 2025. arXiv:2508.18265

    Build a pipeline with InternVL3_5-8B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
