LocateAnything-3B
by nvidia
Open-vocabulary visual grounding for locating arbitrary objects in images
nvidia/LocateAnything-3B
mixpeek://image_extractor@v1/nvidia_locateanything_3b_v1
Overview
LocateAnything-3B is an NVIDIA vision-language model for open-vocabulary localization. Rather than predicting only from a fixed detector label set, it uses a text prompt to identify and localize the requested visual target.
On Mixpeek, LocateAnything is useful when an agent needs structured evidence from images or frames but the target classes are not known when the pipeline is built. The agent can ask for objects, UI components, safety conditions, or domain-specific items and store the resulting boxes as searchable metadata.
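To make "store the resulting boxes as searchable metadata" concrete, here is a minimal sketch of flattening grounded boxes into one record per box. The `GroundedBox` and `BoxRecord` shapes, the `toBoxRecords` helper, and the pixel-space `[xMin, yMin, xMax, yMax]` layout are assumptions made for illustration; they are not taken from the Mixpeek SDK or from the model's documented output format.

```typescript
// Hypothetical shapes for illustration; not part of the Mixpeek SDK or the model card.
interface GroundedBox {
  label: string; // text prompt that produced the box
  box: [number, number, number, number]; // assumed [xMin, yMin, xMax, yMax] in pixels
}

interface BoxRecord {
  sourceId: string;
  label: string;
  xMin: number;
  yMin: number;
  xMax: number;
  yMax: number;
  area: number;
}

// Flatten grounded boxes into one record per box, dropping degenerate boxes
// (zero or negative width or height) and normalizing the label text.
function toBoxRecords(sourceId: string, boxes: GroundedBox[]): BoxRecord[] {
  return boxes
    .filter(({ box: [xMin, yMin, xMax, yMax] }) => xMax > xMin && yMax > yMin)
    .map(({ label, box: [xMin, yMin, xMax, yMax] }) => ({
      sourceId,
      label: label.trim().toLowerCase(),
      xMin,
      yMin,
      xMax,
      yMax,
      area: (xMax - xMin) * (yMax - yMin),
    }));
}

// Example input: one valid box and one zero-width box that gets filtered out.
const records = toBoxRecords("frame-001", [
  { label: " Cracked Insulator ", box: [10, 20, 110, 70] },
  { label: "loose cable", box: [5, 5, 5, 9] },
]);
```

Keeping each box as a flat record with a normalized label is one possible design; it lets a downstream filter match on `label` or `area` without parsing nested structures.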
Architecture
LocateAnything-3B is a 3B-class vision-language model exposed as an image-text-to-text Transformers checkpoint. It accepts an image plus a grounding prompt and returns localization-oriented outputs.
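The exact format of the localization output is not specified here. As a hedged sketch, suppose the model's boxes arrive as coordinates normalized to the range 0 to 1; that convention, along with the `NormalizedBox` and `PixelBox` types and the `toPixelBox` helper, is an assumption for illustration only. Mapping such a box back onto the source image could then look like this:

```typescript
// Assumed convention: [xMin, yMin, xMax, yMax], each value normalized to [0, 1].
type NormalizedBox = [number, number, number, number];
type PixelBox = [number, number, number, number];

// Clamp a value into [0, 1] so slightly out-of-range predictions stay inside the image.
function clamp01(value: number): number {
  return Math.min(1, Math.max(0, value));
}

// Scale a normalized box to integer pixel coordinates for an image of the given size.
function toPixelBox(box: NormalizedBox, width: number, height: number): PixelBox {
  const [xMin, yMin, xMax, yMax] = box;
  return [
    Math.round(clamp01(xMin) * width),
    Math.round(clamp01(yMin) * height),
    Math.round(clamp01(xMax) * width),
    Math.round(clamp01(yMax) * height),
  ];
}

// Example: the out-of-range yMax of 1.2 is clamped to the bottom edge of a 640x480 image.
const pixelBox = toPixelBox([0.25, 0.5, 0.75, 1.2], 640, 480);
```

If the checkpoint instead emits pixel coordinates or a different scale, only the scaling step would need to change.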
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "inspection-images",
  source: { url: "s3://field-inspections/" },
  feature_extractors: [
    {
      feature: "object_detection",
      model: "nvidia/LocateAnything-3B"
    }
  ]
});
Capabilities
- Open-vocabulary object localization
- Promptable image grounding
- Useful for long-tail object classes
- Transforms visual observations into structured metadata
Research Paper
LocateAnything 3B
arxiv.org
Build a pipeline with LocateAnything-3B
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.