Lance
by bytedance-research
Unified 3B model for image and video understanding, generation, and editing
bytedance-research/Lance
mixpeek://video_extractor@v1/bytedance_lance_3b_v1
Overview
Lance is ByteDance's 3B-parameter unified vision model that handles image understanding, video understanding, image generation, video generation, and image/video editing in a single architecture. It uses a vision tokenizer to convert between continuous pixel space and discrete token space, enabling a shared transformer to reason across both modalities.
On Mixpeek, Lance serves as a compact video understanding model that can caption, describe, and answer questions about images and video. Because the architecture is unified, a single model can power scene description, visual Q&A, and content analysis pipelines.
Architecture
Unified autoregressive transformer with a learned vision tokenizer. 3B parameters. Supports text-to-image, text-to-video, image/video understanding, and editing through a shared token space.
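To make the idea of a shared token space concrete, here is a toy sketch of a vision tokenizer. It is not Lance's actual tokenizer: it assumes nearest-neighbour vector quantization against a small fixed codebook, which is one common way to map continuous features to discrete tokens. The codebook values, the `TEXT_VOCAB_SIZE` constant, and all function names are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(patches: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each continuous patch vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: squared_distance(patch, codebook[i]))
        for patch in patches
    ]


def decode(tokens: List[int], codebook: List[Vector]) -> List[Vector]:
    """Map discrete tokens back to their (approximate) continuous vectors."""
    return [codebook[t] for t in tokens]


# Hypothetical 2-D codebook with four entries; a learned tokenizer would
# train these values and use far more entries and dimensions.
CODEBOOK = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

# Hypothetical text vocabulary size; vision tokens are offset past it so
# text and vision ids can share one sequence without colliding.
TEXT_VOCAB_SIZE = 1000

patches = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.9)]
tokens = encode(patches, CODEBOOK)
reconstructed = decode(tokens, CODEBOOK)

text_ids = [17, 42]  # placeholder ids standing in for a tokenized prompt
sequence = text_ids + [TEXT_VOCAB_SIZE + t for t in tokens]
```

In this sketch a single autoregressive model would consume `sequence` as one stream of ids, which is the property that lets understanding and generation tasks share a transformer.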
Mixpeek SDK Integration
from mixpeek import Mixpeek

mixpeek = Mixpeek(api_key="YOUR_API_KEY")

mixpeek.ingest.videos(
    collection="media_library",
    source={"type": "s3", "bucket": "video-assets"},
    pipeline={
        "captioning": {
            "model": "mixpeek://video_extractor@v1/bytedance_lance_3b_v1"
        }
    },
)
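The model identifier in the snippet above appears to follow a `mixpeek://<extractor>@<version>/<model>` shape. That shape is an assumption inferred only from the single identifier shown on this page, not a documented format. Under that assumption, a small standalone helper can validate the identifier and build the `pipeline` dictionary before it is handed to the SDK. The helper names `parse_model_uri` and `captioning_pipeline` are hypothetical and are not part of the Mixpeek SDK.

```python
import re
from typing import Dict

# Assumed URI shape, inferred from the identifier on this page:
#   mixpeek://<extractor>@<version>/<model>
MODEL_URI_PATTERN = re.compile(
    r"^mixpeek://(?P<extractor>[\w-]+)@(?P<version>[\w.-]+)/(?P<model>[\w-]+)$"
)


def parse_model_uri(uri: str) -> Dict[str, str]:
    """Split a model URI into extractor, version, and model parts."""
    match = MODEL_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Unrecognized model URI: {uri!r}")
    return match.groupdict()


def captioning_pipeline(model_uri: str) -> Dict[str, Dict[str, str]]:
    """Build the captioning pipeline dict, rejecting malformed URIs early."""
    parse_model_uri(model_uri)  # raises ValueError if the URI is malformed
    return {"captioning": {"model": model_uri}}


LANCE_URI = "mixpeek://video_extractor@v1/bytedance_lance_3b_v1"
parts = parse_model_uri(LANCE_URI)
pipeline = captioning_pipeline(LANCE_URI)
```

Validating locally means a typo in the identifier fails fast with a clear error instead of surfacing later during ingestion.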
Capabilities
- Unified image and video understanding in one model
- Scene description and visual Q&A for both images and video
- Compact 3B parameter count suitable for GPU-constrained deployments
- Multi-task capability reduces pipeline complexity
Use Cases on Mixpeek
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| Video-MME | Accuracy | 62.1 | Model card |
| MMMU-Pro (vision) | Score | 38.4 | Model card |
Performance
Common Pipeline Companions
Specification
Research Paper
Model paper or technical report
arxiv.org
Build a pipeline with Lance
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.