GME-VARCO-VISION-Embedding
by NCSOFT
Multimodal image and video embedding model for vision-heavy retrieval
NCSOFT/GME-VARCO-VISION-Embedding
mixpeek://image_extractor@v1/ncsoft_gme_varco_vision_embedding_v1
Overview
GME VARCO Vision Embedding is NCSOFT's multimodal embedding model for image, text, and video retrieval. It is based on Qwen2-VL-7B-Instruct and is tagged for video embedding and feature extraction.
On Mixpeek, it fits archives where agents need to search visual scenes and short clips by text, then pass the retrieved moments into a captioner, VLM, or workflow tool.
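To make the text-to-clip search step concrete, here is a minimal sketch of ranking stored clip embeddings against a query embedding. It assumes cosine similarity as the scoring function, which this page does not specify, and it uses toy 3-dimensional vectors in place of real model output. The `IndexedClip` type and the `cosineSimilarity` and `topK` helpers are hypothetical names for illustration, not part of the Mixpeek SDK.

```typescript
// Hypothetical shape: a clip with a precomputed embedding vector.
interface IndexedClip {
  id: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
// Assumption: the page does not state which metric the service uses.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("embedding dimensions do not match");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0; // treat a zero vector as matching nothing
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every clip against the query and keep the k best, highest score first.
function topK(
  query: number[],
  clips: IndexedClip[],
  k: number
): { id: string; score: number }[] {
  return clips
    .map((clip) => ({ id: clip.id, score: cosineSimilarity(query, clip.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy data standing in for real embeddings.
const clips: IndexedClip[] = [
  { id: "clip-a", embedding: [1, 0, 0] },
  { id: "clip-b", embedding: [0.6, 0.8, 0] },
  { id: "clip-c", embedding: [0, 0, 1] },
];

const hits = topK([1, 0, 0], clips, 2);
console.log(hits.map((h) => h.id));
```

In practice a vector index would replace the linear scan, but the ranking idea is the same: the query text and the stored clips are compared in one shared embedding space.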
Architecture
The model is a fine-tuned version of Qwen2-VL-7B-Instruct adapted for multimodal embedding. The model card lists the image-text-to-text and feature-extraction tags, along with video embedding support.
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "video-archive",
  source: { url: "https://example.com/field-video.mp4" },
  feature_extractors: [
    {
      feature: "video_embedding",
      model: "NCSOFT/GME-VARCO-VISION-Embedding"
    }
  ]
});
Capabilities
- Image, text, and video embedding
- Vision-language feature extraction
- Video retrieval support
- Useful as a first-stage retrieval model before detailed VLM analysis
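The first-stage role in the last item can be sketched as a two-stage flow: a cheap embedding lookup shortlists candidates, and only the shortlist reaches the more expensive analysis step. Everything in this sketch is hypothetical. The `Retriever` and `Analyzer` types, the `retrieveThenAnalyze` function, the `minScore` threshold, and both stubs are illustrative stand-ins, not Mixpeek or model APIs.

```typescript
// Hypothetical result of the embedding-based first stage.
interface Candidate {
  id: string;
  score: number;
}

// Hypothetical stage signatures; real implementations would call a search
// service and a VLM or captioner respectively.
type Retriever = (query: string, limit: number) => Promise<Candidate[]>;
type Analyzer = (candidateId: string, query: string) => Promise<string>;

// Run the first stage, drop weak matches, then analyze only what remains.
async function retrieveThenAnalyze(
  query: string,
  retrieve: Retriever,
  analyze: Analyzer,
  limit: number,
  minScore: number
): Promise<{ id: string; score: number; analysis: string }[]> {
  const candidates = await retrieve(query, limit);
  const shortlist = candidates.filter((c) => c.score >= minScore);
  const results: { id: string; score: number; analysis: string }[] = [];
  for (const c of shortlist) {
    // Sequential on purpose, to keep load on the expensive stage predictable.
    results.push({ id: c.id, score: c.score, analysis: await analyze(c.id, query) });
  }
  return results;
}

// Stubs so the sketch runs without any external service.
let analyzerCalls = 0;
const stubRetrieve: Retriever = async (_query, limit) =>
  [
    { id: "clip-1", score: 0.91 },
    { id: "clip-2", score: 0.74 },
    { id: "clip-3", score: 0.32 },
  ].slice(0, limit);
const stubAnalyze: Analyzer = async (candidateId, query) => {
  analyzerCalls += 1;
  return `analysis of ${candidateId} for "${query}"`;
};

const pending = retrieveThenAnalyze("forklift near loading dock", stubRetrieve, stubAnalyze, 3, 0.5);
pending.then((results) => console.log(results.map((r) => r.id)));
```

With the stub data, the third clip falls below the threshold, so the analyzer runs twice instead of three times. That saving is the point of putting an embedding model in front of a VLM.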
Use Cases on Mixpeek
Specification
Research Paper
GME VARCO Vision Embedding
arxiv.org
Build a pipeline with GME-VARCO-VISION-Embedding
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.