Computer Vision API for Object Detection, Recognition, and Analysis
Mixpeek extracts computer vision features in its warehouse's Decompose layer and makes them queryable through multi-stage retrieval pipelines. Detect objects, classify images, extract text, and build visual similarity search across millions of images with managed GPU infrastructure.
What is Computer Vision?
Computer vision enables machines to interpret and analyze visual data -- images, video frames, and documents. Mixpeek transforms computer vision from a per-image API call into a searchable infrastructure layer where every visual feature becomes queryable.
From Analysis to Search
Traditional computer vision APIs analyze one image at a time and return labels. Mixpeek processes images through feature extraction pipelines, generates embeddings, and indexes everything into Qdrant namespaces. The result is not just labels -- it is a searchable visual knowledge base you can query with text, images, or metadata filters.
Multi-Model Pipelines
Run multiple vision models on every image in a single collection pipeline. Combine object detection with classification, OCR, and embedding generation. Each extractor adds a layer of understanding to your visual data, all indexed together in the same namespace for multi-faceted search.
Managed GPU Infrastructure
Computer vision models require GPU compute. Mixpeek runs all feature extractors on auto-scaling Ray GPU clusters, so you never manage GPU instances, model serving, or batch job infrastructure. Process millions of images with the same API you use for one.
Computer Vision Capabilities
Every visual analysis capability you need, from object detection to visual similarity search, running on managed GPU infrastructure.
Object Detection
Detect and localize objects within images and video frames with bounding boxes and confidence scores. Index detected objects as searchable metadata across your entire visual dataset.
- Multi-class object detection with bounding boxes
- Confidence scoring and threshold filtering
- Searchable object metadata in Qdrant namespaces
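Because detections land as structured metadata, they can be used as filter conditions in a retriever stage. A minimal sketch of building such a stage as plain data, reusing the `filter` stage shape and `metadata.objects` field path shown in the example later on this page; the `metadata.detection_confidence` field name is a hypothetical illustration, not a verified schema:

```python
# Sketch: build a retriever "filter" stage that keeps only documents
# where the object detector found a given class above a confidence floor.
# Field paths and operators ("$contains", "$gte") are assumptions modeled
# on the retriever example on this page, not a guaranteed schema.

def detection_filter_stage(required_object: str, min_confidence: float) -> dict:
    """Return a filter stage payload for object-detection metadata."""
    return {
        "type": "filter",
        "conditions": {
            "metadata.objects": {"$contains": required_object},
            "metadata.detection_confidence": {"$gte": min_confidence},
        },
    }

stage = detection_filter_stage("barcode", 0.5)
```

Building stages as plain dictionaries like this makes pipelines easy to compose and unit-test before passing them to a retriever execution call.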
Image Classification
Classify images into categories using pre-trained or custom models. Automatically tag images with labels, categories, and hierarchical taxonomies during ingestion.
- Multi-label classification with confidence scores
- Custom taxonomy support via Mixpeek taxonomies
- Hierarchical category tagging
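Attaching a custom taxonomy to classification happens in the extractor configuration. A small sketch, assuming the `taxonomy_id` config key used in the collection example on this page; the default model identifier is illustrative:

```python
# Sketch: an image-classification extractor entry bound to a custom
# taxonomy. The "taxonomy_id" key mirrors the collection example on this
# page; the model name is an illustrative placeholder.

def classification_extractor(taxonomy_id: str, model: str = "vit-large") -> dict:
    """Build an extractor entry that tags images against a taxonomy."""
    return {
        "type": "image_classification",
        "model": model,
        "config": {"taxonomy_id": taxonomy_id},
    }

extractor = classification_extractor("product-categories")
```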
Scene Understanding
Analyze the overall context of images and video frames. Extract scene descriptions, spatial relationships between objects, and environmental attributes for semantic search.
- Natural language scene descriptions
- Spatial relationship extraction
- Environmental and contextual attribute detection
Face Detection and Analysis
Detect faces in images and video with landmark localization. Extract face embeddings for similarity search across your visual data while respecting privacy configurations.
- Face detection with landmark localization
- Face embedding extraction for similarity search
- Configurable privacy controls and redaction
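Privacy behavior would similarly live in extractor configuration. A hypothetical sketch of what such a config could look like; the `redact_faces` and `store_embeddings` flag names are invented here to illustrate the configurable privacy controls described above and are not a documented schema:

```python
# Sketch: a face-analysis extractor with privacy controls. The flag names
# ("redact_faces", "store_embeddings") are hypothetical illustrations of
# the configurable privacy controls described above.

def face_extractor(redact: bool = True) -> dict:
    """Build a face-analysis extractor entry with privacy flags."""
    return {
        "type": "face_analysis",
        "model": "face-detection",
        "config": {
            "redact_faces": redact,
            # When faces are redacted, skip storing embeddings entirely.
            "store_embeddings": not redact,
        },
    }

extractor = face_extractor(redact=True)
```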
Optical Character Recognition
Extract text from images, screenshots, and scanned documents. OCR output is indexed alongside visual features, enabling search across both visual and textual content in a single query.
- Printed and handwritten text extraction
- Layout-aware document parsing
- Multi-language text recognition
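Since OCR output is indexed alongside visual features, a single hybrid search stage can cover both. A sketch of such a stage as plain data, following the `feature_search` shape from the retriever example on this page; including `"text"` in the modalities list is an assumption about how OCR'd content is targeted:

```python
# Sketch: one hybrid search stage querying OCR'd text alongside visual
# features. The stage shape follows the retriever example on this page;
# the "modalities" values are assumptions.

def ocr_text_search(query: str, limit: int = 20) -> dict:
    """Build a hybrid search stage spanning image and extracted text."""
    return {
        "type": "feature_search",
        "method": "hybrid",
        "query": {"text": query, "modalities": ["image", "text"]},
        "limit": limit,
    }

stage = ocr_text_search("invoice total due")
```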
Visual Similarity Search
Generate dense visual embeddings from images and search by visual similarity. Find visually similar products, scenes, or objects across millions of images in milliseconds.
- Dense visual embedding generation
- Cross-modal search (text query to image results)
- Sub-100ms retrieval at scale with Qdrant
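The text-to-image example later on this page shows cross-modal search; querying with another image follows the same stage structure. A sketch under stated assumptions: the `image_url` query key and `"vector"` method name are hypothetical illustrations of image-to-image similarity, not confirmed parameters:

```python
# Sketch: an image-to-image similarity stage. Passing an image reference
# in the query is an assumption based on the cross-modal search described
# above; the "image_url" key and "vector" method name are hypothetical.

def image_similarity_stage(image_url: str, limit: int = 10) -> dict:
    """Build a search stage that finds visually similar images."""
    return {
        "type": "feature_search",
        "method": "vector",
        "query": {"image_url": image_url, "modalities": ["image"]},
        "limit": limit,
    }

stage = image_similarity_stage("https://example.com/query.jpg")
```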
Supported Models
Run state-of-the-art computer vision models as feature extractors, or bring your own via the Docker plugin system.
Object Detection
Detect and localize objects with bounding boxes
Image Embeddings
Generate dense visual embeddings for similarity search
OCR
Extract text from images and documents
Scene Understanding
Generate natural language descriptions of visual content
Face Analysis
Detect faces and generate face embeddings
Classification
Classify images into categories and labels
Industry Applications
Computer vision search infrastructure powers critical workflows across industries where visual data drives decisions.
E-Commerce and Retail
Power visual product search, automated catalog tagging, and similar product recommendations. Detect products in user-uploaded images and match them to your catalog using visual similarity. Automate quality control with defect detection on product images.
Manufacturing and Quality Control
Inspect products on assembly lines using object detection and defect classification. Build searchable archives of inspection images. Flag anomalies automatically and enable engineers to search historical defect patterns across production runs.
Security and Surveillance
Analyze surveillance footage at scale. Detect objects, track movement patterns, and search across camera feeds by visual attributes. Build retrievable indexes of security footage with object and event metadata.
Healthcare and Life Sciences
Index medical images for research and clinical workflows. Enable similarity search across radiology scans, pathology slides, and clinical photographs. Extract measurements and annotations from medical imaging data.
Mixpeek vs. Computer Vision Alternatives
See how Mixpeek compares to dedicated computer vision APIs for visual search and analysis use cases.
| Feature | Mixpeek | Clarifai | Google Vision | AWS Rekognition |
|---|---|---|---|---|
| Modality Support | Vision + video + audio + text + PDF (unified) | Vision + video + text (separate APIs) | Vision only (separate video API) | Vision + video (separate APIs) |
| Search Infrastructure | Built-in hybrid search (vector + keyword) | Visual search (limited text search) | Label-based only (BYO search) | Face search only (BYO general search) |
| Custom Models | Docker-based plugin system on Ray GPU clusters | Platform-hosted training | AutoML Vision training | Custom Labels (limited) |
| Retriever Pipelines | Composable multi-stage (filter, search, rerank) | Workflows (sequential processing) | Not available | Not available |
| Embedding Generation | 50+ extractors (CLIP, SigLIP, DINOv2, custom) | Platform-specific embeddings | Not exposed | Face vectors only |
| Deployment Options | Managed, Dedicated, BYO Cloud | Managed SaaS or on-prem | Google Cloud only | AWS only |
Build Visual Search in Minutes
A simple Python API to detect objects, classify images, and search visual content with composable retriever pipelines.
from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

# Create a collection with computer vision extractors
collection = client.collections.create(
    name="product-catalog",
    namespace="products",
    extractors=[
        {
            "type": "image_embedding",
            "model": "clip-vit-large",
            "config": {
                "resolution": 336
            }
        },
        {
            "type": "object_detection",
            "model": "yolov8-x",
            "config": {
                "confidence_threshold": 0.5,
                "classes": ["product", "label", "barcode"]
            }
        },
        {
            "type": "image_classification",
            "model": "vit-large",
            "config": {
                "taxonomy_id": "product-categories"
            }
        }
    ]
)

# Upload images to trigger feature extraction
client.buckets.upload(
    bucket="my-bucket",
    files=["product_001.jpg", "product_002.jpg"],
    collection=collection.id
)

# Search by text description (cross-modal)
results = client.retrievers.execute(
    namespace="products",
    stages=[
        {
            "type": "feature_search",
            "method": "hybrid",
            "query": {
                "text": "red running shoes with white sole",
                "modalities": ["image"]
            },
            "limit": 20
        },
        {
            "type": "filter",
            "conditions": {
                "metadata.objects": {"$contains": "product"},
                "metadata.category": "footwear"
            }
        },
        {
            "type": "rerank",
            "model": "cross-encoder",
            "limit": 10
        }
    ]
)

for result in results:
    print(f"Product: {result.metadata['filename']}")
    print(f"Category: {result.metadata['category']}")
    print(f"Objects: {result.metadata['objects']}")
    print(f"Score: {result.score}")

Frequently Asked Questions
What is a computer vision API?
A computer vision API provides programmatic access to visual analysis capabilities -- object detection, image classification, face detection, OCR, and visual similarity search. Mixpeek goes beyond single-image analysis by providing a complete computer vision infrastructure that extracts features, indexes them into searchable namespaces, and enables composable retriever pipelines for querying visual content at scale.
How does object detection work in Mixpeek?
Mixpeek runs object detection models (YOLO, DETR, Faster R-CNN) as feature extractors on Ray GPU clusters. When images or video frames are ingested through a collection, the object detection extractor identifies objects with bounding boxes and confidence scores. These detections are stored as structured metadata on documents in your Qdrant namespace, making them filterable and searchable through retriever pipelines.
Can I use my own computer vision models with Mixpeek?
Yes. Mixpeek supports custom models via its Docker-based plugin system. Package your model in a container, register it as a custom feature extractor, and deploy it on Mixpeek's Ray GPU clusters. Your custom model runs alongside built-in extractors in the same collection pipeline, with the same auto-scaling and monitoring infrastructure.
How does visual similarity search work?
Mixpeek generates dense visual embeddings from images using models like CLIP, SigLIP, or DINOv2. These embeddings are indexed in Qdrant namespaces. When you search, you can query with text (cross-modal search) or with another image (image-to-image similarity). The retriever pipeline calculates vector similarity and returns the most visually similar results, optionally combined with metadata filtering and reranking.
What image formats does Mixpeek support?
Mixpeek supports JPEG, PNG, TIFF, WebP, BMP, GIF, and SVG image formats. For video, it supports MP4, MOV, AVI, MKV, and WebM, extracting frames for visual analysis. Images are uploaded to Mixpeek buckets and processed by feature extractors automatically. The platform handles format conversion, resizing, and normalization during ingestion.
How is Mixpeek different from Google Vision API or AWS Rekognition?
Google Vision and AWS Rekognition are single-image analysis APIs -- you send an image and get labels or detections back. Mixpeek is a search infrastructure platform: it processes images through feature extraction pipelines, indexes results into vector namespaces, and provides composable retriever pipelines for querying. You also get multimodal support, so visual features can be searched alongside text, audio, and video content.
Does Mixpeek support real-time computer vision?
Mixpeek supports both real-time and batch processing. Collection triggers automatically process new images as they arrive in your bucket, enabling near-real-time feature extraction. For batch use cases, Mixpeek's Ray clusters scale horizontally to process millions of images in parallel. Retriever queries execute in sub-100ms for production search applications.
Can I combine computer vision with text and audio search?
Yes. This is a core strength of Mixpeek. All modalities -- images, video, audio, text, and documents -- are indexed into the same namespace. A single retriever pipeline can search across visual features, text content, and audio simultaneously. For example, you can search for video content matching a text description and filter by detected objects and transcribed speech in one query.
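The combined query described above can be expressed as one multi-stage pipeline. A sketch built as plain data, reusing the stage shapes from the example on this page; the `metadata.transcript` field path and the `"video"` modality value are hypothetical illustrations:

```python
# Sketch: a single pipeline combining a text query over visual features
# with filters on detected objects and transcribed speech. Field paths
# ("metadata.transcript") and the "video" modality are assumptions.

def multimodal_pipeline(description: str, obj: str, spoken_phrase: str) -> list:
    """Build a search → filter → rerank pipeline spanning modalities."""
    return [
        {
            "type": "feature_search",
            "method": "hybrid",
            "query": {"text": description, "modalities": ["image", "video"]},
            "limit": 50,
        },
        {
            "type": "filter",
            "conditions": {
                "metadata.objects": {"$contains": obj},
                "metadata.transcript": {"$contains": spoken_phrase},
            },
        },
        {"type": "rerank", "model": "cross-encoder", "limit": 10},
    ]

stages = multimodal_pipeline("factory floor walkthrough", "forklift", "safety check")
```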
