InternVL3-8B

by OpenGVLab

Open-source multimodal model rivaling GPT-4o on vision benchmarks

1.6M downloads/month

8B parameters

Identifiers

Model ID

OpenGVLab/InternVL3-8B

Feature URI

mixpeek://image_extractor@v1/opengvlab_internvl3_8b_v1
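
The Feature URI above appears to pack three pieces of information into one string: an extractor name, a version, and a feature name. The sketch below splits it into those parts. The `scheme://extractor@version/feature` layout is inferred from this single example, and the `FeatureUri` type and `parseFeatureUri` helper are hypothetical names introduced for illustration, not part of the Mixpeek SDK.

```typescript
// Hypothetical shape: field names are assumptions inferred from the URI's layout.
interface FeatureUri {
  extractor: string; // e.g. "image_extractor"
  version: string;   // e.g. "v1"
  feature: string;   // e.g. "opengvlab_internvl3_8b_v1"
}

// Split "mixpeek://<extractor>@<version>/<feature>" into its three parts.
// Throws if the string does not match that assumed layout.
function parseFeatureUri(uri: string): FeatureUri {
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Not a recognizable feature URI: ${uri}`);
  }
  return { extractor: match[1], version: match[2], feature: match[3] };
}

const parsed = parseFeatureUri(
  "mixpeek://image_extractor@v1/opengvlab_internvl3_8b_v1",
);
console.log(parsed.extractor, parsed.version, parsed.feature);
```

A helper like this can be handy for validating identifiers in configuration files before sending them to an API, though the authoritative format is whatever Mixpeek documents.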

Overview

InternVL3-8B is an open-source vision-language model from the InternVL family. It follows the ViT-MLP-LLM paradigm: an InternViT vision encoder is joined to a language model backbone through an MLP projector. It is fully open-source and is reported to exceed GPT-4o on several benchmarks, including MMMU (72.2 vs. 70.7).

On Mixpeek, InternVL3-8B is a top-tier open-source option for visual understanding that delivers near-proprietary-model quality for scene captioning, visual reasoning, document analysis, and scientific image understanding.

Architecture

InternVL3-8B uses a ViT-MLP-LLM architecture: an InternViT vision encoder is connected to the language model through an MLP projector that is randomly initialized before training. Across the InternVL3 family, the language backbone is a Qwen2.5 or InternLM3-8B model. The design adds Variable Visual Position Encoding (V2PE), Native Multimodal Pre-Training, and Mixed Preference Optimization (MPO) to improve multimodal reasoning.
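
To make the ViT-MLP-LLM data flow concrete, here is a toy sketch of the projector step: vision tokens from the encoder pass through a small two-layer MLP so that they land in the language model's embedding width, then join the text embeddings in one sequence. The dimensions, weights, GELU activation, and every identifier (`Projector`, `project`, `linear`) are illustrative assumptions; the real projector's layer sizes and any normalization are not described on this page.

```typescript
type Vector = number[];
type Matrix = number[][]; // shape [out][in]

// Tanh approximation of the GELU activation.
function gelu(x: number): number {
  return 0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (x + 0.044715 * x ** 3)));
}

// Affine map y = W x + b.
function linear(x: Vector, w: Matrix, b: Vector): Vector {
  return w.map((row, i) => row.reduce((sum, wij, j) => sum + wij * x[j], b[i]));
}

// Hypothetical two-layer MLP projector: Linear -> GELU -> Linear.
interface Projector {
  w1: Matrix;
  b1: Vector;
  w2: Matrix;
  b2: Vector;
}

// Map each vision token from the encoder width to the LLM embedding width.
function project(visionTokens: Vector[], p: Projector): Vector[] {
  return visionTokens.map((t) => linear(linear(t, p.w1, p.b1).map(gelu), p.w2, p.b2));
}

// Toy sizes: vision width 3 -> hidden width 4 -> LLM width 2.
const projector: Projector = {
  w1: [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]],
  b1: [0, 0, 0, 0],
  w2: [[1, 0, 0, 0], [0, 0, 0, 1]],
  b2: [0, 0],
};

const visionTokens: Vector[] = [[0.1, 0.2, 0.3], [0, 0, 0]]; // stand-in encoder output
const textEmbeddings: Vector[] = [[0.5, -0.5]];              // stand-in text token

// The LLM consumes projected vision tokens and text embeddings as one sequence.
const projected = project(visionTokens, projector);
const llmInput: Vector[] = [...projected, ...textEmbeddings];
console.log(llmInput.length, llmInput[0].length);
```

The point of the sketch is the shape contract: whatever the encoder emits, the projector's output width must equal the language model's embedding width so the two token streams can be concatenated.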

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_description",
    version: "v1",
    parameters: { model_id: "OpenGVLab/InternVL3-8B" },
  },
});

Capabilities

Outperforms GPT-4o on MMMU (72.2% vs 70.7%)
Strong scientific and mathematical visual reasoning
Tool usage, GUI agents, and industrial image analysis
3D vision perception and spatial understanding
Multi-language visual understanding

Use Cases on Mixpeek

High-accuracy visual scene understanding rivaling proprietary models

Scientific and medical image analysis for specialized content libraries

Industrial visual inspection and quality control in manufacturing pipelines

Benchmarks

Dataset	Metric	Score	Source
MMMU	Accuracy	72.2%	Chen et al., 2025: InternVL3 paper
MathVista	Accuracy	79.6%	Chen et al., 2025: InternVL3 paper
DocVQA	ANLS	92.7	Chen et al., 2025: InternVL3 paper
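
The DocVQA row uses ANLS (Average Normalized Levenshtein Similarity) rather than plain accuracy. A minimal sketch of how that metric is commonly computed follows: each prediction earns one minus its normalized edit distance to the closest ground-truth answer, or zero when that distance reaches the 0.5 threshold, and the per-question scores are averaged. The function names and the lowercase-and-trim normalization are assumptions for illustration; this is not the official evaluation script.

```typescript
// Classic dynamic-programming edit distance, keeping only two rows.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Score for one question: best similarity over all accepted answers,
// zeroed out when the normalized distance is at or above the threshold.
function answerScore(prediction: string, truths: string[], threshold = 0.5): number {
  const p = prediction.trim().toLowerCase();
  let best = 0;
  for (const truth of truths) {
    const t = truth.trim().toLowerCase();
    const maxLen = Math.max(p.length, t.length);
    const nl = maxLen === 0 ? 0 : levenshtein(p, t) / maxLen;
    best = Math.max(best, nl < threshold ? 1 - nl : 0);
  }
  return best;
}

// ANLS on a 0..1 scale; multiply by 100 to compare with the table.
function anls(predictions: string[], truths: string[][]): number {
  const total = predictions.reduce((sum, p, i) => sum + answerScore(p, truths[i]), 0);
  return total / predictions.length;
}

// One exact match (after normalization) and one complete miss.
console.log(anls(["invoice 42", "paris"], [["Invoice 42"], ["london"]])); // 0.5
```

Because near-misses such as a single OCR-style typo still earn partial credit, ANLS scores are not directly comparable to the exact-match accuracies in the other rows.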