Computer Vision API for Object Detection, Recognition, and Analysis
Mixpeek extracts computer vision features in its warehouse's Decompose layer and makes them queryable through multi-stage retrieval pipelines. Detect objects, classify images, extract text, and build visual similarity search across millions of images with managed GPU infrastructure.
What is Computer Vision?
Computer vision enables machines to interpret and analyze visual data -- images, video frames, and documents. Mixpeek transforms computer vision from a per-image API call into a searchable infrastructure layer where every visual feature becomes queryable.
From Analysis to Search
Traditional computer vision APIs analyze one image at a time and return labels. Mixpeek processes images through feature extraction pipelines, generates embeddings, and indexes everything into Qdrant namespaces. The result is not just labels -- it is a searchable visual knowledge base you can query with text, images, or metadata filters.
Multi-Model Pipelines
Run multiple vision models on every image in a single collection pipeline. Combine object detection with classification, OCR, and embedding generation. Each extractor adds a layer of understanding to your visual data, all indexed together in the same namespace for multi-faceted search.
Managed GPU Infrastructure
Computer vision models require GPU compute. Mixpeek runs all feature extractors on auto-scaling Ray GPU clusters, so you never manage GPU instances, model serving, or batch job infrastructure. Process millions of images with the same API you use for one.
Computer Vision Capabilities
Every visual analysis capability you need, from object detection to visual similarity search, running on managed GPU infrastructure.
Object Detection
Detect and localize objects within images and video frames with bounding boxes and confidence scores. Index detected objects as searchable metadata across your entire visual dataset.
- Multi-class object detection with bounding boxes
- Confidence scoring and threshold filtering
- Searchable object metadata in Qdrant namespaces
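Because detections land as structured metadata, they can be used as filter conditions in a retriever stage. A minimal sketch of building such a stage as plain data, reusing the `filter` stage shape and `metadata.objects` field path shown in the example later on this page; the `metadata.detection_confidence` field name is a hypothetical illustration, not a verified schema:

```python
# Sketch: build a retriever "filter" stage that keeps only documents
# where the object detector found a given class above a confidence floor.
# Field paths and operators ("$contains", "$gte") are assumptions modeled
# on the retriever example on this page, not a guaranteed schema.

def detection_filter_stage(required_object: str, min_confidence: float) -> dict:
    """Return a filter stage payload for object-detection metadata."""
    return {
        "type": "filter",
        "conditions": {
            "metadata.objects": {"$contains": required_object},
            "metadata.detection_confidence": {"$gte": min_confidence},
        },
    }

stage = detection_filter_stage("barcode", 0.5)
```

Building stages as plain dictionaries like this makes pipelines easy to compose and unit-test before passing them to a retriever execution call.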
Image Classification
Classify images into categories using pre-trained or custom models. Automatically tag images with labels, categories, and hierarchical taxonomies during ingestion.
- Multi-label classification with confidence scores
- Custom taxonomy support via Mixpeek taxonomies
- Hierarchical category tagging
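Attaching a custom taxonomy to classification happens in the extractor configuration. A small sketch, assuming the `taxonomy_id` config key used in the collection example on this page; the default model identifier is illustrative:

```python
# Sketch: an image-classification extractor entry bound to a custom
# taxonomy. The "taxonomy_id" key mirrors the collection example on this
# page; the model name is an illustrative placeholder.

def classification_extractor(taxonomy_id: str, model: str = "vit-large") -> dict:
    """Build an extractor entry that tags images against a taxonomy."""
    return {
        "type": "image_classification",
        "model": model,
        "config": {"taxonomy_id": taxonomy_id},
    }

extractor = classification_extractor("product-categories")
```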
Scene Understanding
Analyze the overall context of images and video frames. Extract scene descriptions, spatial relationships between objects, and environmental attributes for semantic search.
- Natural language scene descriptions
- Spatial relationship extraction
- Environmental and contextual attribute detection
Face Detection and Analysis
Detect faces in images and video with landmark localization. Extract face embeddings for similarity search across your visual data while respecting privacy configurations.
- Face detection with landmark localization
- Face embedding extraction for similarity search
- Configurable privacy controls and redaction
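Privacy behavior would similarly live in extractor configuration. A hypothetical sketch of what such a config could look like; the `redact_faces` and `store_embeddings` flag names are invented here to illustrate the configurable privacy controls described above and are not a documented schema:

```python
# Sketch: a face-analysis extractor with privacy controls. The flag names
# ("redact_faces", "store_embeddings") are hypothetical illustrations of
# the configurable privacy controls described above.

def face_extractor(redact: bool = True) -> dict:
    """Build a face-analysis extractor entry with privacy flags."""
    return {
        "type": "face_analysis",
        "model": "face-detection",
        "config": {
            "redact_faces": redact,
            # When faces are redacted, skip storing embeddings entirely.
            "store_embeddings": not redact,
        },
    }

extractor = face_extractor(redact=True)
```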
Optical Character Recognition
Extract text from images, screenshots, and scanned documents. OCR output is indexed alongside visual features, enabling search across both visual and textual content in a single query.
- Printed and handwritten text extraction
- Layout-aware document parsing
- Multi-language text recognition
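Since OCR output is indexed alongside visual features, a single hybrid search stage can cover both. A sketch of such a stage as plain data, following the `feature_search` shape from the retriever example on this page; including `"text"` in the modalities list is an assumption about how OCR'd content is targeted:

```python
# Sketch: one hybrid search stage querying OCR'd text alongside visual
# features. The stage shape follows the retriever example on this page;
# the "modalities" values are assumptions.

def ocr_text_search(query: str, limit: int = 20) -> dict:
    """Build a hybrid search stage spanning image and extracted text."""
    return {
        "type": "feature_search",
        "method": "hybrid",
        "query": {"text": query, "modalities": ["image", "text"]},
        "limit": limit,
    }

stage = ocr_text_search("invoice total due")
```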
Visual Similarity Search
Generate dense visual embeddings from images and search by visual similarity. Find visually similar products, scenes, or objects across millions of images in milliseconds.
- Dense visual embedding generation
- Cross-modal search (text query to image results)
- Sub-100ms retrieval at scale with Qdrant
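The text-to-image example later on this page shows cross-modal search; querying with another image follows the same stage structure. A sketch under stated assumptions: the `image_url` query key and `"vector"` method name are hypothetical illustrations of image-to-image similarity, not confirmed parameters:

```python
# Sketch: an image-to-image similarity stage. Passing an image reference
# in the query is an assumption based on the cross-modal search described
# above; the "image_url" key and "vector" method name are hypothetical.

def image_similarity_stage(image_url: str, limit: int = 10) -> dict:
    """Build a search stage that finds visually similar images."""
    return {
        "type": "feature_search",
        "method": "vector",
        "query": {"image_url": image_url, "modalities": ["image"]},
        "limit": limit,
    }

stage = image_similarity_stage("https://example.com/query.jpg")
```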
Supported Models
Run state-of-the-art computer vision models as feature extractors, or bring your own via the Docker plugin system.
Object Detection
Detect and localize objects with bounding boxes
Image Embeddings
Generate dense visual embeddings for similarity search
OCR
Extract text from images and documents
Scene Understanding
Generate natural language descriptions of visual content
Face Analysis
Detect faces and generate face embeddings
Classification
Classify images into categories and labels
Industry Applications
Computer vision search infrastructure powers critical workflows across industries where visual data drives decisions.
E-Commerce and Retail
Power visual product search, automated catalog tagging, and similar product recommendations. Detect products in user-uploaded images and match them to your catalog using visual similarity. Automate quality control with defect detection on product images.
Manufacturing and Quality Control
Inspect products on assembly lines using object detection and defect classification. Build searchable archives of inspection images. Flag anomalies automatically and enable engineers to search historical defect patterns across production runs.
Security and Surveillance
Analyze surveillance footage at scale. Detect objects, track movement patterns, and search across camera feeds by visual attributes. Build retrievable indexes of security footage with object and event metadata.
Healthcare and Life Sciences
Index medical images for research and clinical workflows. Enable similarity search across radiology scans, pathology slides, and clinical photographs. Extract measurements and annotations from medical imaging data.
Mixpeek vs. Computer Vision Alternatives
See how Mixpeek compares to dedicated computer vision APIs for visual search and analysis use cases.
| Feature | Mixpeek | Clarifai | Google Vision | AWS Rekognition |
|---|---|---|---|---|
| Modality Support | Vision + video + audio + text + PDF (unified) | Vision + video + text (separate APIs) | Vision only (separate video API) | Vision + video (separate APIs) |
| Search Infrastructure | Built-in hybrid search (vector + keyword) | Visual search (limited text search) | Label-based only (BYO search) | Face search only (BYO general search) |
| Custom Models | Docker-based plugin system on Ray GPU clusters | Platform-hosted training | AutoML Vision training | Custom Labels (limited) |
| Retriever Pipelines | Composable multi-stage (filter, search, rerank) | Workflows (sequential processing) | Not available | Not available |
| Embedding Generation | 50+ extractors (CLIP, SigLIP, DINOv2, custom) | Platform-specific embeddings | Not exposed | Face vectors only |
| Deployment Options | Managed, Dedicated, BYO Cloud | Managed SaaS or on-prem | Google Cloud only | AWS only |
Build Visual Search in Minutes
A simple Python API to detect objects, classify images, and search visual content with composable retriever pipelines.
from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

# Create a collection with computer vision extractors
collection = client.collections.create(
    name="product-catalog",
    namespace="products",
    extractors=[
        {
            "type": "image_embedding",
            "model": "clip-vit-large",
            "config": {
                "resolution": 336
            }
        },
        {
            "type": "object_detection",
            "model": "yolov8-x",
            "config": {
                "confidence_threshold": 0.5,
                "classes": ["product", "label", "barcode"]
            }
        },
        {
            "type": "image_classification",
            "model": "vit-large",
            "config": {
                "taxonomy_id": "product-categories"
            }
        }
    ]
)

# Upload images to trigger feature extraction
client.buckets.upload(
    bucket="my-bucket",
    files=["product_001.jpg", "product_002.jpg"],
    collection=collection.id
)

# Search by text description (cross-modal)
results = client.retrievers.execute(
    namespace="products",
    stages=[
        {
            "type": "feature_search",
            "method": "hybrid",
            "query": {
                "text": "red running shoes with white sole",
                "modalities": ["image"]
            },
            "limit": 20
        },
        {
            "type": "filter",
            "conditions": {
                "metadata.objects": {"$contains": "product"},
                "metadata.category": "footwear"
            }
        },
        {
            "type": "rerank",
            "model": "cross-encoder",
            "limit": 10
        }
    ]
)

for result in results:
    print(f"Product: {result.metadata['filename']}")
    print(f"Category: {result.metadata['category']}")
    print(f"Objects: {result.metadata['objects']}")
    print(f"Score: {result.score}")

Frequently Asked Questions
What is a computer vision API?
A computer vision API provides programmatic access to visual analysis capabilities -- object detection, image classification, face detection, OCR, and visual similarity search. Mixpeek goes beyond single-image analysis by providing a complete computer vision infrastructure that extracts features, indexes them into searchable namespaces, and enables composable retriever pipelines for querying visual content at scale.
How does object detection work in Mixpeek?
Mixpeek runs object detection models (YOLO, DETR, Faster R-CNN) as feature extractors on Ray GPU clusters. When images or video frames are ingested through a collection, the object detection extractor identifies objects with bounding boxes and confidence scores. These detections are stored as structured metadata on documents in your Qdrant namespace, making them filterable and searchable through retriever pipelines.
Can I use my own computer vision models with Mixpeek?
Yes. Mixpeek supports custom models via its Docker-based plugin system. Package your model in a container, register it as a custom feature extractor, and deploy it on Mixpeek's Ray GPU clusters. Your custom model runs alongside built-in extractors in the same collection pipeline, with the same auto-scaling and monitoring infrastructure.
How does visual similarity search work?
Mixpeek generates dense visual embeddings from images using models like CLIP, SigLIP, or DINOv2. These embeddings are indexed in Qdrant namespaces. When you search, you can query with text (cross-modal search) or with another image (image-to-image similarity). The retriever pipeline calculates vector similarity and returns the most visually similar results, optionally combined with metadata filtering and reranking.
What image formats does Mixpeek support?
Mixpeek supports JPEG, PNG, TIFF, WebP, BMP, GIF, and SVG image formats. For video, it supports MP4, MOV, AVI, MKV, and WebM, extracting frames for visual analysis. Images are uploaded to Mixpeek buckets and processed by feature extractors automatically. The platform handles format conversion, resizing, and normalization during ingestion.
How is Mixpeek different from Google Vision API or AWS Rekognition?
Google Vision and AWS Rekognition are single-image analysis APIs -- you send an image and get labels or detections back. Mixpeek is a search infrastructure platform: it processes images through feature extraction pipelines, indexes results into vector namespaces, and provides composable retriever pipelines for querying. You also get multimodal support, so visual features can be searched alongside text, audio, and video content.
Does Mixpeek support real-time computer vision?
Mixpeek supports both real-time and batch processing. Collection triggers automatically process new images as they arrive in your bucket, enabling near-real-time feature extraction. For batch use cases, Mixpeek's Ray clusters scale horizontally to process millions of images in parallel. Retriever queries execute in sub-100ms for production search applications.
Can I combine computer vision with text and audio search?
Yes. This is a core strength of Mixpeek. All modalities -- images, video, audio, text, and documents -- are indexed into the same namespace. A single retriever pipeline can search across visual features, text content, and audio simultaneously. For example, you can search for video content matching a text description and filter by detected objects and transcribed speech in one query.
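The combined query described above can be expressed as one multi-stage pipeline. A sketch built as plain data, reusing the stage shapes from the example on this page; the `metadata.transcript` field path and the `"video"` modality value are hypothetical illustrations:

```python
# Sketch: a single pipeline combining a text query over visual features
# with filters on detected objects and transcribed speech. Field paths
# ("metadata.transcript") and the "video" modality are assumptions.

def multimodal_pipeline(description: str, obj: str, spoken_phrase: str) -> list:
    """Build a search → filter → rerank pipeline spanning modalities."""
    return [
        {
            "type": "feature_search",
            "method": "hybrid",
            "query": {"text": description, "modalities": ["image", "video"]},
            "limit": 50,
        },
        {
            "type": "filter",
            "conditions": {
                "metadata.objects": {"$contains": obj},
                "metadata.transcript": {"$contains": spoken_phrase},
            },
        },
        {"type": "rerank", "model": "cross-encoder", "limit": 10},
    ]

stages = multimodal_pipeline("factory floor walkthrough", "forklift", "safety check")
```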
