Holo-3.1-4B
by Hcompany
4B vision-language model for GUI agents and computer-use perception
Hcompany/Holo-3.1-4B
mixpeek://image_extractor@v1/hcompany_holo_31_4b_v1
Overview
Holo-3.1-4B is a compact vision-language model tagged on Hugging Face for action, agent, computer use, and GUI agents. It is relevant to multimodal search because many agent traces are not documents: they are screenshots, browser states, UI elements, and before-and-after visual states from tool calls.
On Mixpeek, Holo can turn screenshots and UI recordings into searchable agent memory. That lets an agent retrieve prior visual states, inspect similar failures, and compare what the screen looked like before deciding whether to retry, stop, or ask for help.
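The retrieval loop described above can be sketched in plain Python. This is a toy illustration, not the Mixpeek API: the `ScreenState` record, the `outcome` labels, the word-overlap (Jaccard) similarity, and the decision policy are all assumptions made for the example. A real deployment would presumably rank states by model-derived features rather than word overlap.

```python
from dataclasses import dataclass


@dataclass
class ScreenState:
    """One remembered screen: a trace step plus a model-written description."""
    step_id: str
    description: str  # e.g. a caption produced by a vision-language model
    outcome: str      # hypothetical label: "ok" or "failed"


def tokens(text: str) -> set[str]:
    """Lowercase word set used for a crude similarity measure."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(memory: list[ScreenState], query: str, k: int = 1) -> list[ScreenState]:
    """Return the k stored states whose descriptions best match the query."""
    q = tokens(query)
    ranked = sorted(memory, key=lambda s: jaccard(tokens(s.description), q), reverse=True)
    return ranked[:k]


# Hypothetical memory built from earlier agent runs.
memory = [
    ScreenState("run1-step3", "login form with red error banner invalid password", "failed"),
    ScreenState("run1-step7", "dashboard page showing monthly sales chart", "ok"),
    ScreenState("run2-step2", "checkout page with disabled submit button", "failed"),
]

current = "login form showing error banner after password entry"
best = most_similar(memory, current, k=1)[0]

# Illustrative policy: if the closest prior state failed, ask for help instead of retrying.
decision = "ask_for_help" if best.outcome == "failed" else "retry"
print(best.step_id, decision)
```

Here the current screen matches the earlier failed login attempt most closely, so the sketch chooses to ask for help rather than retry.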
Architecture
Holo-3.1-4B is a Qwen-family image-text-to-text model. Its Hugging Face metadata tags it for action, agent, computer use, GUI agents, and conversational visual reasoning.
Mixpeek SDK Integration
from mixpeek import Mixpeek

mixpeek = Mixpeek(api_key="YOUR_API_KEY")

mixpeek.ingest.images(
    collection="computer_use_traces",
    source={"type": "s3", "bucket": "agent-screens"},
    pipeline={
        "captioning": {
            "model": "mixpeek://image_extractor@v1/hcompany_holo_31_4b_v1"
        }
    },
)
Capabilities
- GUI and computer-use visual reasoning
- Screenshot state description for agent memory
- Compact 4B model size for high-volume UI traces
- Apache 2.0 license, per the model card metadata on Hugging Face
Use Cases on Mixpeek
Performance
Best used with screenshot downsampling and UI event metadata filters.
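As a rough sketch of what that preprocessing might look like, the snippet below filters frames by UI event type and computes a downsampled target size that preserves aspect ratio. The frame dictionaries, the event names, and the 1280-pixel `max_side` default are hypothetical values chosen for illustration, not settings taken from the model card or from Mixpeek. The actual resize would be done by an image library.

```python
def downsample_size(width: int, height: int, max_side: int = 1280) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_side, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def filter_events(frames: list[dict], event_types: set[str]) -> list[dict]:
    """Keep only frames whose UI event type is in the allowed set."""
    return [f for f in frames if f.get("event") in event_types]


# Hypothetical trace: each frame carries an event label and its pixel size.
frames = [
    {"id": "f1", "event": "click", "size": (2560, 1440)},
    {"id": "f2", "event": "mouse_move", "size": (2560, 1440)},
    {"id": "f3", "event": "navigation", "size": (1024, 768)},
]

# Drop high-frequency events, then plan a resize for what remains.
kept = filter_events(frames, {"click", "navigation"})
plan = [(f["id"], downsample_size(*f["size"])) for f in kept]
print(plan)
```

Filtering before resizing keeps the volume of screenshots sent to the model low, which is the point of pairing a compact model with high-volume UI traces.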
Specification
Build a pipeline with Holo-3.1-4B
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.