InternVL3-78B
by OpenGVLab
78B flagship multimodal LLM for image, video, and document understanding
OpenGVLab/InternVL3-78B
mixpeek://image_extractor@v1/opengvlab_internvl3_78b_v1
Overview
InternVL3-78B is OpenGVLab's flagship open-source multimodal LLM, scaling the InternVL3 architecture to 78B parameters for state-of-the-art performance across image understanding, video comprehension, document analysis, and chart interpretation.
InternVL3-78B achieves top results among open-source MLLMs on general multimodal benchmarks, reasoning tasks, and agentic evaluations. On Mixpeek, it serves as the highest-quality option for scene description, visual Q&A, and structured extraction from complex visual content where accuracy matters more than latency.
Architecture
InternViT-6B vision encoder paired with a Qwen2.5-72B language model, for 78B total parameters. Dynamic resolution support lets the model process images at up to 4K resolution by splitting them into tiles that are encoded separately. Supports interleaved image-text and multi-frame video input.
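The tile-based encoding mentioned above can be illustrated with a minimal sketch. It is not the model's actual preprocessing code: the 448-pixel tile edge, the 12-tile budget, and the helper names `pick_grid` and `tile_boxes` are assumptions chosen for illustration. The idea is to pick the tile grid whose aspect ratio best matches the image, then cut the resized image into equal tiles.

```python
from typing import List, Tuple

TILE = 448  # assumed tile edge in pixels (illustrative, not confirmed by this page)


def candidate_grids(max_tiles: int) -> List[Tuple[int, int]]:
    """All (cols, rows) grids that use at most max_tiles tiles, fewest tiles first."""
    grids = [
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if c * r <= max_tiles
    ]
    return sorted(grids, key=lambda g: g[0] * g[1])


def pick_grid(width: int, height: int, max_tiles: int = 12) -> Tuple[int, int]:
    """Pick the grid whose aspect ratio is closest to the image's.

    Ties go to the grid with the fewest tiles, a simplification made here.
    """
    target = width / height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(max_tiles):
        diff = abs(target - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
    return best


def tile_boxes(cols: int, rows: int, tile: int = TILE) -> List[Tuple[int, int, int, int]]:
    """Crop boxes (left, top, right, bottom) for an image resized to cols*tile by rows*tile."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    cols, rows = pick_grid(1920, 1080)
    print(cols, rows, tile_boxes(cols, rows))
```

A larger tile budget allows finer grids for high-resolution inputs at the cost of more visual tokens per image, which is one reason the model trades latency for accuracy.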
Mixpeek SDK Integration
    from mixpeek import Mixpeek

    # Create a client with your API key
    mixpeek = Mixpeek(api_key="YOUR_API_KEY")

    # Ingest from S3 and caption with InternVL3-78B
    mixpeek.ingest.videos(
        collection="documents",
        source={"type": "s3", "bucket": "visual-docs"},
        pipeline={
            "captioning": {
                "model": "mixpeek://image_extractor@v1/opengvlab_internvl3_78b_v1"
            }
        },
    )
Capabilities
- State-of-the-art open-source multimodal understanding
- High-resolution image analysis with dynamic tiling
- Complex document and chart comprehension
- Multi-frame video understanding
- Structured data extraction from visual content
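For multi-frame video understanding, a common approach is to sample a fixed number of frames spread evenly across the clip before passing them to the model. The sketch below shows one way to do that. The function name `sample_frame_indices` and the midpoint-of-segment strategy are assumptions for illustration, not a description of how Mixpeek or InternVL3 selects frames.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_segments: int) -> List[int]:
    """Return one frame index from the middle of each equal-length segment.

    Hypothetical helper: splits the video into num_segments equal spans
    and takes the frame nearest the center of each span.
    """
    if total_frames <= 0 or num_segments <= 0:
        raise ValueError("total_frames and num_segments must be positive")
    seg = total_frames / num_segments
    return [
        min(total_frames - 1, int(seg * i + seg / 2))
        for i in range(num_segments)
    ]


if __name__ == "__main__":
    print(sample_frame_indices(100, 4))
```

When the clip has fewer frames than requested segments, the same index can appear more than once, so callers may want to deduplicate the result.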
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| MMMU | Accuracy | 72.2 | Model card |
| MathVista | Score | 74.5 | Model card |
| DocVQA | Accuracy | 94.8 | Model card |
Research Paper
Model paper or technical report
arxiv.org
Build a pipeline with InternVL3-78B
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.