AI Model Hub

    Explore curated vision and multimodal AI models (embeddings, detection, segmentation, anomaly detection), plus thousands of Hugging Face models for your pipeline.

    Curated Models

    37 models
    Visual Embeddings

    openai/clip-vit-large-patch14

    Contrastive Language-Image Pre-Training for zero-shot visual understanding

    428M
    Visual Embeddings

    google/siglip-base-patch16-224

    Sigmoid Loss for Language Image Pre-Training, efficient contrastive learning

    203M
    Visual Embeddings

    google/siglip2-giant-opt-patch16-384

    Multilingual vision-language encoder with dense features and localization

    1B
    Visual Embeddings

    facebook/dinov2-large

    Self-supervised vision foundation model producing all-purpose visual features

    300M
    Visual Embeddings

    facebook/dinov3-large

    Next-generation self-supervised vision model with Gram anchoring, scaling up to 6.7B parameters

    300M (Large), 6.7B (ViT-7B)
    Visual Embeddings

    laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

    Open-source CLIP trained on 2B image-text pairs at giant scale

    1.8B
    Object Detection

    facebook/detr-resnet-50

    End-to-end object detection with Transformers, no anchor boxes needed

    42M
    Object Detection

    hustvl/yolos-tiny

    You Only Look at One Sequence, ViT-based real-time object detection

    6M
    Object Detection

    ultralytics/yolov8n

    Real-time object detection, YOLOv8 Nano variant

    3.2M
    Object Detection

    AILab-CVC/YOLO-World-L

    Real-time open-vocabulary object detection with text prompts

    ~100M
    Object Detection

    IDEA-Research/grounding-dino-base

    Open-set detection using natural language descriptions

    172M (Swin-B)
    Object Detection

    google/owlvit-large-patch14

    Simple open-vocabulary object detection with Vision Transformers

    ~300M
    Face Detection

    deepinsight/retinaface-r50

    Single-stage face detection with landmark localization

    27M
    Face Detection

    timesler/facenet-pytorch

    Deep face recognition with triplet loss embeddings

    23M
    Scene Captioning

    Salesforce/blip2-opt-2.7b

    Bootstrapping Language-Image Pre-training with frozen LLMs

    3.7B
    Scene Captioning

    microsoft/Florence-2-large

    Foundation model for unified vision tasks with sequence-to-sequence architecture

    777M
    OCR

    microsoft/trocr-large-printed

    Transformer-based OCR for printed text recognition

    608M
    OCR

    PaddlePaddle/paddleocr

    Ultra-lightweight, production-ready multilingual OCR system

    12M
    Transcription

    openai/whisper-large-v3

    Robust speech recognition trained on about 5M hours of weakly labeled and pseudo-labeled multilingual audio

    1.5B
    Transcription

    facebook/wav2vec2-large-960h

    Self-supervised speech representations for automatic speech recognition

    317M
    Speaker Diarization

    pyannote/speaker-diarization-3.1

    Who spoke when, end-to-end neural speaker diarization

    18M
    Audio Embeddings

    laion/clap-htsat-fused

    Contrastive Language-Audio Pretraining for audio-text retrieval

    154M
    Audio Embeddings

    facebook/encodec_24khz

    High-fidelity neural audio codec for compression and embeddings

    23M
    Text Embeddings

    BAAI/bge-large-en-v1.5

    BAAI General Embedding, state-of-the-art text retrieval

    335M
    Text Embeddings

    sentence-transformers/all-MiniLM-L6-v2

    Fast, lightweight sentence embeddings for semantic similarity

    23M
    Document Structure

    microsoft/layoutlmv3-base

    Pre-trained multimodal transformer for document AI

    125M
    Document Structure

    naver-clova-ix/donut-base

    Document understanding transformer, OCR-free document parsing

    210M
    Table Extraction

    microsoft/table-transformer-detection

    Detect and extract tables from document images

    29M
    Code Extraction

    microsoft/codebert-base

    Pre-trained model for code understanding and generation

    125M
    Code Extraction

    Salesforce/codet5p-110m-embedding

    Unified code understanding and generation with T5 architecture

    110M
    Segmentation

    facebook/sam-vit-huge

    Promptable foundation model for image segmentation

    632M
    Segmentation

    facebook/sam2.1-hiera-large

    Unified promptable segmentation for images and video with streaming memory

    224.4M
    Segmentation

    facebook/sam3

    Concept-level segmentation with open-vocabulary detection and video tracking

    848M
    Segmentation

    netflix/void-model

    Video object removal that also removes the physical interactions the object caused

    5B
    Anomaly Detection

    amazon/patchcore-resnet50

    Memory-bank anomaly detection reporting up to 99.6% image-level AUROC on the MVTec AD industrial defect benchmark

    25M (ResNet-50 backbone)
    Vector Indexing

    facebook/faiss

    GPU-accelerated billion-scale vector similarity search and clustering

    N/A (library)
    Vector Indexing

    google/scann

    Anisotropic vector quantization for efficient similarity search

    N/A (library)