InternVL3-8B
by OpenGVLab
Open-source multimodal model rivaling GPT-4o on vision benchmarks
OpenGVLab/InternVL3-8B
mixpeek://image_extractor@v1/opengvlab_internvl3_8b_v1
Overview
InternVL3-8B is an open-source vision-language model from the InternVL family. It follows the ViT-MLP-LLM paradigm: an InternViT vision encoder is connected to a language model backbone through an MLP projector. Its reported scores exceed GPT-4o on several benchmarks, including MMMU (72.2 vs 70.7), and the model is fully open-source.
On Mixpeek, InternVL3-8B is a top-tier open-source option for visual understanding that delivers near-proprietary-model quality for scene captioning, visual reasoning, document analysis, and scientific image understanding.
Architecture
InternVL3-8B uses a ViT-MLP-LLM architecture: an InternViT vision encoder is connected to a Qwen2.5/InternLM3-8B language model through a randomly initialized MLP projector. The model features Variable Visual Position Encoding, Native Multimodal Pre-Training, and Mixed Preference Optimization for enhanced multimodal reasoning.
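The data flow described above can be illustrated with a toy sketch. It is not InternVL3's implementation: the dimensions, the two-layer MLP with a GELU activation, the seeded weight initializer, and every function name (`matmul`, `projectVisionTokens`, `buildLlmInput`, `initWeights`) are assumptions chosen for illustration. It only shows the general idea that vision-encoder tokens are projected into the language model's embedding space and then joined with the text embeddings.

```typescript
// Toy sketch of the ViT-MLP-LLM data flow. Every size and weight here is an
// illustrative assumption, not InternVL3-8B's real configuration.

type Matrix = number[][]; // one row per token

// Multiply an (n x d) matrix by a (d x k) matrix.
function matmul(a: Matrix, b: Matrix): Matrix {
  const cols = b[0].length;
  return a.map((row) => {
    const out: number[] = new Array(cols).fill(0);
    for (let i = 0; i < row.length; i++) {
      for (let j = 0; j < cols; j++) out[j] += row[i] * b[i][j];
    }
    return out;
  });
}

// GELU activation (tanh approximation).
function gelu(x: number): number {
  return 0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (x + 0.044715 * x * x * x)));
}

// Assumed two-layer MLP projector: visionDim -> hiddenDim -> llmDim.
function projectVisionTokens(tokens: Matrix, w1: Matrix, w2: Matrix): Matrix {
  const hidden = matmul(tokens, w1).map((row) => row.map(gelu));
  return matmul(hidden, w2);
}

// Deterministic stand-in for random initialization (simple LCG), so the
// sketch produces the same numbers on every run.
function initWeights(rows: number, cols: number, seed: number): Matrix {
  let state = seed >>> 0;
  const weights: Matrix = [];
  for (let r = 0; r < rows; r++) {
    const row: number[] = [];
    for (let c = 0; c < cols; c++) {
      state = (state * 1664525 + 1013904223) % 4294967296;
      row.push((state / 4294967296 - 0.5) * 0.1);
    }
    weights.push(row);
  }
  return weights;
}

// The language model receives projected vision tokens followed by text embeddings.
function buildLlmInput(vision: Matrix, text: Matrix, w1: Matrix, w2: Matrix): Matrix {
  return [...projectVisionTokens(vision, w1, w2), ...text];
}

const visionDim = 6;
const hiddenDim = 8;
const llmDim = 5;
const patchFeatures = initWeights(4, visionDim, 1); // stand-in for vision encoder output
const promptEmbeddings = initWeights(3, llmDim, 2); // stand-in for embedded prompt tokens
const llmInput = buildLlmInput(
  patchFeatures,
  promptEmbeddings,
  initWeights(visionDim, hiddenDim, 3),
  initWeights(hiddenDim, llmDim, 4)
);
console.log(`${llmInput.length} tokens x ${llmInput[0].length} dims`); // 7 tokens x 5 dims
```

The point of the projector is only to make the two token streams shape-compatible: four vision tokens of width 6 come out at width 5, matching the three text embeddings they are concatenated with.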
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/video.mp4" },
  feature_extractors: [
    {
      name: "scene_description",
      version: "v1",
      params: {
        model_id: "OpenGVLab/InternVL3-8B"
      }
    }
  ]
});
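When the same ingest payload is needed for many sources, it can help to build it in one place. The sketch below only constructs the plain object passed to `mx.collections.ingest` in the snippet above; it does not call the SDK. The interface names (`FeatureExtractorConfig`, `IngestRequest`) and the helper `buildInternVlIngestRequest` are hypothetical and are not types exported by the Mixpeek SDK. The field names and values are copied from the snippet.

```typescript
// Hypothetical local types mirroring the fields used in the snippet above.
// They are NOT the SDK's own type definitions.
interface FeatureExtractorConfig {
  name: string;
  version: string;
  params: { model_id: string };
}

interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: FeatureExtractorConfig[];
}

const INTERNVL3_MODEL_ID = "OpenGVLab/InternVL3-8B";

// Build the request object for one source URL, failing early on obviously
// invalid input instead of sending a malformed payload.
function buildInternVlIngestRequest(collectionId: string, sourceUrl: string): IngestRequest {
  if (collectionId.trim() === "") {
    throw new Error("collectionId must not be empty");
  }
  if (!/^https?:\/\//.test(sourceUrl)) {
    throw new Error(`sourceUrl must be an http(s) URL, got: ${sourceUrl}`);
  }
  return {
    collection_id: collectionId,
    source: { url: sourceUrl },
    feature_extractors: [
      {
        name: "scene_description",
        version: "v1",
        params: { model_id: INTERNVL3_MODEL_ID },
      },
    ],
  };
}

const exampleRequest = buildInternVlIngestRequest(
  "my-collection",
  "https://example.com/video.mp4"
);
console.log(JSON.stringify(exampleRequest, null, 2));
```

The resulting object has the same shape as the literal in the snippet, so under that assumption it could be passed to the ingest call in its place.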
Capabilities
- Outperforms GPT-4o on MMMU (72.2% vs 70.7%)
- Strong scientific and mathematical visual reasoning
- Tool usage, GUI agents, and industrial image analysis
- 3D vision perception and spatial understanding
- Multi-language visual understanding
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| MMMU | Accuracy | 72.2% | Zhu et al., 2025 (InternVL3 paper) |
| MathVista | Accuracy | 79.6% | Zhu et al., 2025 (InternVL3 paper) |
| DocVQA | ANLS | 92.7 | Zhu et al., 2025 (InternVL3 paper) |
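The DocVQA row uses ANLS (Average Normalized Levenshtein Similarity) rather than exact-match accuracy, so near-miss answers such as OCR slips earn partial credit. The sketch below follows the commonly used definition: per question, take the best similarity over the accepted answers, zero it out when the normalized edit distance reaches a threshold of 0.5, then average over questions. The function names, the lowercase-and-trim normalization, and the sample data are assumptions for illustration; the evaluation script behind the reported 92.7 (shown on a 0 to 100 scale) may normalize answers differently.

```typescript
// Classic dynamic-programming edit distance, keeping a single row in memory.
function levenshtein(a: string, b: string): number {
  const row: number[] = [];
  for (let j = 0; j <= b.length; j++) row.push(j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // value of cell (i-1, j-1)
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // value of cell (i-1, j)
      const substitution = diagonal + (a[i - 1] === b[j - 1] ? 0 : 1);
      row[j] = Math.min(above + 1, row[j - 1] + 1, substitution);
      diagonal = above;
    }
  }
  return row[b.length];
}

// Similarity for one question: best match over all accepted answers.
// Similarities whose normalized distance is >= threshold count as 0.
function anlsForQuestion(prediction: string, answers: string[], threshold = 0.5): number {
  const pred = prediction.trim().toLowerCase();
  let best = 0;
  for (const answer of answers) {
    const gold = answer.trim().toLowerCase();
    const maxLen = Math.max(pred.length, gold.length);
    const normalized = maxLen === 0 ? 0 : levenshtein(pred, gold) / maxLen;
    const similarity = normalized < threshold ? 1 - normalized : 0;
    best = Math.max(best, similarity);
  }
  return best;
}

interface AnlsSample {
  prediction: string;
  answers: string[];
}

// Mean per-question similarity, in the range 0 to 1.
function anls(samples: AnlsSample[]): number {
  if (samples.length === 0) return 0;
  const total = samples.reduce(
    (sum, s) => sum + anlsForQuestion(s.prediction, s.answers),
    0
  );
  return total / samples.length;
}

// Made-up samples: one exact match (ignoring case and whitespace), one miss.
const demoAnls = anls([
  { prediction: "Invoice ", answers: ["invoice"] },
  { prediction: "abcd", answers: ["wxyz"] },
]);
console.log(demoAnls); // 0.5
```

Multiplying the result by 100 gives a score on the same scale as the table.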
Research Paper
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
arxiv.org
Build a pipeline with InternVL3-8B
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.