    HF · Scene Captioning · MIT

    InternVL3-8B

    by OpenGVLab

    Open-source multimodal model rivaling GPT-4o on vision benchmarks

    1.6M downloads/month
    8B params
    Identifiers
    Model ID: OpenGVLab/InternVL3-8B
    Feature URI: mixpeek://image_extractor@v1/opengvlab_internvl3_8b_v1
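The Feature URI above appears to pack an extractor name, an extractor version, and a feature name into one string. The sketch below splits it into those parts. The layout `mixpeek://<extractor>@<version>/<feature>` is an assumption inferred from this single example, and `parseFeatureUri` and `FeatureUri` are hypothetical names, not part of the Mixpeek SDK.

```typescript
// Hypothetical helper: the URI layout (scheme://extractor@version/feature) is
// inferred from the one example on this page, not from a published spec.
interface FeatureUri {
  extractor: string;
  extractorVersion: string;
  feature: string;
}

function parseFeatureUri(uri: string): FeatureUri {
  // Capture groups: extractor name, extractor version, feature name.
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Unrecognized feature URI: ${uri}`);
  }
  const [, extractor, extractorVersion, feature] = match;
  return { extractor, extractorVersion, feature };
}

const parsedUri = parseFeatureUri(
  "mixpeek://image_extractor@v1/opengvlab_internvl3_8b_v1",
);
console.log(parsedUri);
```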

    Overview

    InternVL3-8B is an open-source vision-language model from the InternVL family. It follows the ViT-MLP-LLM paradigm, combining an InternViT vision encoder with a language model backbone via an MLP projector. The family's headline MMMU result of 72.2 (vs 70.7 for GPT-4o) is reported in the InternVL3 paper for the largest model, InternVL3-78B; the 8B variant follows the same design at a much smaller size and is fully open-source.

    On Mixpeek, InternVL3-8B is an open-source option for visual understanding, suited to scene captioning, visual reasoning, document analysis, and scientific image understanding.

    Architecture

    InternVL3-8B uses a ViT-MLP-LLM architecture: an InternViT vision encoder is connected to a Qwen2.5-series language model through a randomly initialized MLP projector (InternLM3-8B serves as the backbone elsewhere in the InternVL3 family). The model features Variable Visual Position Encoding (V2PE), Native Multimodal Pre-Training, and Mixed Preference Optimization (MPO) for enhanced multimodal reasoning.
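To make the ViT-MLP-LLM data flow concrete, here is a toy sketch of the projection step: visual tokens from the encoder pass through a small MLP so that they match the language model's embedding width, then join the text embeddings in one sequence. Everything here is illustrative. The dimensions, weights, two-layer GELU layout, and the choice to simply prepend the visual tokens are assumptions, not the real InternVL3-8B configuration.

```typescript
// Toy sketch of the ViT-MLP-LLM data flow. Dimensions and weights are made up
// for illustration; they are not the real InternVL3-8B values.
type Vector = number[];
type Matrix = number[][]; // shape [outputDim][inputDim]

// y = W x + b for a single vector x.
function linear(x: Vector, w: Matrix, b: Vector): Vector {
  return w.map((row, i) => row.reduce((acc, wij, j) => acc + wij * x[j], b[i]));
}

// tanh approximation of the GELU activation.
function gelu(v: number): number {
  return 0.5 * v * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (v + 0.044715 * v ** 3)));
}

interface MlpProjector {
  w1: Matrix;
  b1: Vector;
  w2: Matrix;
  b2: Vector;
}

// Map each visual token from the vision width to the LLM embedding width.
function projectVisualTokens(tokens: Vector[], p: MlpProjector): Vector[] {
  return tokens.map((t) => linear(linear(t, p.w1, p.b1).map(gelu), p.w2, p.b2));
}

// Simplification: projected visual tokens are prepended to the text embeddings.
function buildLlmInput(visual: Vector[], text: Vector[], p: MlpProjector): Vector[] {
  return [...projectVisualTokens(visual, p), ...text];
}

// Vision width 2 -> hidden width 3 -> LLM width 4.
const projector: MlpProjector = {
  w1: [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]],
  b1: [0, 0, 0],
  w2: [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0.5, 0.5, 0.5]],
  b2: [0, 0, 0, 0],
};
const visualTokens: Vector[] = [[1, 2], [0, 0]];
const textEmbeddings: Vector[] = [
  [0.1, 0.2, 0.3, 0.4],
  [0.5, 0.6, 0.7, 0.8],
  [0.9, 1.0, 1.1, 1.2],
];

const sequence = buildLlmInput(visualTokens, textEmbeddings, projector);
console.log(sequence.length, sequence[0].length); // 5 tokens, each of width 4
```

The point of the projector is only to bridge the two embedding spaces, which is why it can start from random weights while the encoder and language model start from pretrained ones.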

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your own Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest one media file and caption it with InternVL3-8B.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [{
        name: "scene_description",
        version: "v1",
        params: {
          model_id: "OpenGVLab/InternVL3-8B"
        }
      }]
    });
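When several files need the same extractor settings, it can help to build the request objects up front and check them before any network call. The sketch below does that for a list of URLs. The `IngestRequest` and `FeatureExtractorConfig` types and the `buildIngestRequests` helper are hypothetical: their field names mirror the example above rather than a published SDK type definition.

```typescript
// Hypothetical types: field names mirror the ingest example on this page,
// not a published SDK type definition.
interface FeatureExtractorConfig {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractorConfig[];
}

const INTERNVL3_8B_MODEL_ID = "OpenGVLab/InternVL3-8B";

// Build one request per URL, rejecting anything that is not an http(s) URL.
function buildIngestRequests(collectionId: string, urls: string[]): IngestRequest[] {
  return urls.map((url) => {
    if (!/^https?:\/\//.test(url)) {
      throw new Error(`Expected an http(s) URL, got: ${url}`);
    }
    return {
      collection_id: collectionId,
      source: { url },
      feature_extractors: [
        {
          name: "scene_description",
          version: "v1",
          params: { model_id: INTERNVL3_8B_MODEL_ID },
        },
      ],
    };
  });
}

const requests = buildIngestRequests("my-collection", [
  "https://example.com/video.mp4",
  "https://example.com/photo.jpg",
]);
console.log(JSON.stringify(requests[0], null, 2));
```

Each resulting object could then be passed to `mx.collections.ingest(...)` in a loop, in the same way as the single-file call.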

    Capabilities

    • InternVL3 family outperforms GPT-4o on MMMU (72.2% vs 70.7%, reported for InternVL3-78B)
    • Strong scientific and mathematical visual reasoning
    • Tool usage, GUI agents, and industrial image analysis
    • 3D vision perception and spatial understanding
    • Multi-language visual understanding

    Use Cases on Mixpeek

    • High-accuracy visual scene understanding rivaling proprietary models
    • Scientific and medical image analysis for specialized content libraries
    • Industrial visual inspection and quality control in manufacturing pipelines

    Benchmarks

    Dataset   | Metric   | Score                                | Source
    MMMU      | Accuracy | 72.2% (reported for InternVL3-78B)   | Chen et al., 2025 (InternVL3 paper)
    MathVista | Accuracy | 79.6%                                | Chen et al., 2025 (InternVL3 paper)
    DocVQA    | ANLS     | 92.7                                 | Chen et al., 2025 (InternVL3 paper)
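DocVQA is scored with ANLS (Average Normalized Levenshtein Similarity), which gives partial credit to answers that are close to a reference string. The sketch below shows the usual definition: for each question, take the best `1 - NL` over the reference answers, where `NL` is the edit distance divided by the longer string's length, and count it as 0 when `NL` reaches the 0.5 threshold. The lowercasing and trimming are a common normalization but are an assumption here, and the benchmark's official scorer may differ in detail. The table's 92.7 corresponds to this value scaled by 100.

```typescript
// Edit distance with a single rolling row of the DP table.
function levenshtein(a: string, b: string): number {
  const row: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0]; // dp[i-1][j-1]
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // dp[i-1][j], saved before it is overwritten
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diag + cost);
      diag = above;
    }
  }
  return row[b.length];
}

// ANLS over a set of questions; references[i] lists accepted answers for question i.
function anls(predictions: string[], references: string[][], threshold = 0.5): number {
  if (predictions.length !== references.length) {
    throw new Error("predictions and references must have the same length");
  }
  if (predictions.length === 0) return 0;
  let total = 0;
  predictions.forEach((pred, i) => {
    const p = pred.trim().toLowerCase();
    let best = 0;
    for (const ref of references[i]) {
      const r = ref.trim().toLowerCase();
      const maxLen = Math.max(p.length, r.length);
      const nl = maxLen === 0 ? 0 : levenshtein(p, r) / maxLen;
      // Answers too far from the reference get no credit.
      best = Math.max(best, nl < threshold ? 1 - nl : 0);
    }
    total += best;
  });
  return total / predictions.length;
}

// One exact match (after normalization) and one complete miss.
console.log(anls(["Invoice 42", "paris"], [["invoice 42"], ["london"]])); // 0.5
```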

    Performance

    Input Size: Text + variable-resolution images
    GPU Latency: ~50 ms / image (A100)
    GPU Throughput: ~20 images/sec (A100)
    GPU Memory: ~16 GB (bf16)
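The figures above are consistent with simple arithmetic, which is useful for rough capacity planning. The sketch below assumes a nominal 8 billion parameters at 2 bytes each in bf16 and one image processed at a time. It covers weights only, so activations and any cache would add to the memory figure, and real throughput depends on batching and output length.

```typescript
const BYTES_PER_PARAM_BF16 = 2;

// Memory for the weights alone, in GB (1 GB = 1e9 bytes).
function weightMemoryGB(params: number, bytesPerParam: number): number {
  return (params * bytesPerParam) / 1e9;
}

// Images per second when images are processed one after another.
function throughputPerSec(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Wall-clock hours to caption a library at a given sustained rate.
function hoursToProcess(imageCount: number, imagesPerSec: number): number {
  return imageCount / imagesPerSec / 3600;
}

console.log(weightMemoryGB(8e9, BYTES_PER_PARAM_BF16)); // 16
console.log(throughputPerSec(50)); // 20
console.log(hoursToProcess(1_000_000, throughputPerSec(50)).toFixed(1)); // "13.9"
```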

    Specification

    Framework: HF
    Organization: OpenGVLab
    Feature: Scene Captioning
    Output: text
    Modalities: video, image
    Retriever: Semantic Search
    Parameters: 8B
    License: MIT
    Downloads/mo: 1.6M

    Research Paper

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    arxiv.org

    Build a pipeline with InternVL3-8B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.