AI Model Hub
Explore curated vision and multimodal AI models — embeddings, detection, segmentation, anomaly detection — plus thousands of HuggingFace models for your pipeline.
Curated Models
37 models
openai/clip-vit-large-patch14
Contrastive Language-Image Pre-Training for zero-shot visual understanding
google/siglip-base-patch16-224
Sigmoid Loss for Language Image Pre-Training, efficient contrastive learning
google/siglip2-giant-opt-patch16-384
Multilingual vision-language encoder with dense features and localization
facebook/dinov2-large
Self-supervised vision foundation model producing all-purpose visual features
facebook/dinov3-large
Next-generation self-supervised vision model with Gram anchoring and 6.7B scaling
laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
Open-source CLIP trained on 2B image-text pairs at giant scale
facebook/detr-resnet-50
End-to-end object detection with Transformers, no anchor boxes needed
hustvl/yolos-tiny
You Only Look at One Sequence, ViT-based real-time object detection
ultralytics/yolov8n
State-of-the-art real-time object detection, YOLO v8 Nano
AILab-CVC/YOLO-World-L
Real-time open-vocabulary object detection with text prompts
IDEA-Research/grounding-dino-base
Open-set detection using natural language descriptions
google/owlvit-large-patch14
Simple open-vocabulary object detection with Vision Transformers
deepinsight/retinaface-r50
Single-stage face detection with landmark localization
timesformer/facenet-pytorch
Deep face recognition with triplet loss embeddings
Salesforce/blip2-opt-2.7b
Bootstrapping Language-Image Pre-training with frozen LLMs
microsoft/Florence-2-large
Foundation model for unified vision tasks with sequence-to-sequence architecture
microsoft/trocr-large-printed
Transformer-based OCR for printed text recognition
PaddlePaddle/paddleocr
Ultra-lightweight, production-ready multilingual OCR system
openai/whisper-large-v3
Robust speech recognition trained on 680K hours of multilingual audio
facebook/wav2vec2-large-960h
Self-supervised speech representations for automatic speech recognition
pyannote/speaker-diarization-3.1
Who spoke when, end-to-end neural speaker diarization
laion/clap-htsat-fused
Contrastive Language-Audio Pretraining for audio-text retrieval
facebook/encodec_24khz
High-fidelity neural audio codec for compression and embeddings
BAAI/bge-large-en-v1.5
BAAI General Embedding, state-of-the-art text retrieval
sentence-transformers/all-MiniLM-L6-v2
Fast, lightweight sentence embeddings for semantic similarity
microsoft/layoutlmv3-base
Pre-trained multimodal transformer for document AI
naver-clova-ix/donut-base
Document understanding transformer, OCR-free document parsing
microsoft/table-transformer-detection
Detect and extract tables from document images
microsoft/codebert-base
Pre-trained model for code understanding and generation
Salesforce/codet5p-110m-embedding
Unified code understanding and generation with T5 architecture
facebook/sam-vit-huge
Promptable foundation model for image segmentation
facebook/sam2.1-hiera-large
Unified promptable segmentation for images and video with streaming memory
facebook/sam3
Concept-level segmentation with open-vocabulary detection and video tracking
netflix/void-model
Video object removal that preserves the physical interactions the object caused
amazon/patchcore-resnet50
Memory-bank anomaly detection achieving 99.6% AUROC on manufacturing defects
facebook/faiss
GPU-accelerated billion-scale vector similarity search and clustering
google/scann
Anisotropic vector quantization for efficient similarity search
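
Many of the models listed above produce embeddings, and the last two entries are libraries for searching over them. As a rough illustration of what that search computes, here is a minimal sketch of exact cosine-similarity retrieval in plain Python. The vectors are made-up toy values rather than real model outputs, and the names `cosine_similarity`, `top_k`, and `toy_index` are hypothetical helpers for this example, not part of any listed model or library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Brute-force search: score every stored vector, return the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for real embeddings (assumed values).
toy_index = {
    "cat_photo": [0.9, 0.1, 0.0],
    "dog_photo": [0.8, 0.3, 0.1],
    "invoice_scan": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]

results = top_k(toy_query, toy_index, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

This exhaustive scan compares the query against every stored vector, which is only practical for small collections. Libraries such as the vector search entries above exist to approximate the same ranking at much larger scale.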
