    AI Model Hub

    Browse AI models for multimodal decomposition and recomposition pipelines — plug any model into your extractors.

    9,588 models available

    Showing 8209-8232 of 9,588 models

    Voice Activity Detection

    pyannote/speaker-diarization-community-1-cloud

    68
    pyannote-audio
    Visual Question Answering

    AXERA-TECH/Janus-Pro-1B

    68
    2
    Image Feature Extraction

    Tooony133/dinov3-vitl16-pretrain-lvd1689m

    68
    transformers
    Image Feature Extraction

    timm/fastvit_mci4.apple_mclip2_dfndr2b

    68
    timm
    Depth Estimation

    AXERA-TECH/IGEV-plusplus

    68
    Video Classification

    archit11/videomae-base-finetuned-fight-nofight

    68
    transformers
    Text To Audio

    Beehzod/speechT5_tts_uzbek

    68
    2
    transformers
    Zero Shot Classification

    MoritzLaurer/ernie-m-large-mnli-xnli

    68
    18
    transformers
    Visual Question Answering

    mradermacher/MemOCR-7B-GGUF

    68
    1
    transformers
    Image Feature Extraction

    hf-tiny-model-private/tiny-random-ResNetModel

    68
    transformers
    Text To Audio

    bmiller22000/xyntrai-csm-sesame-tts-nsfw

    68
    4
    transformers
    Visual Question Answering

    mPLUG/mPLUG-Owl3-7B-241101

    67
    10
    Image Feature Extraction

    facebook/PE-Lang-L14-448

    67
    7
    perception-encoder
    Video Classification

    adeelhasan/videomae-base-finetuned-kinetics-finetuned-RTFeed

    67
    transformers
    Video Classification

    Jeyseb/videomae-base-finetuned-rwf2000-subset___v4

    67
    transformers
    Text To Audio

    alakxender/mms-tts-div-ft-spk01-f01

    67
    1
    transformers
    Text To Audio

    piyazon/TTS-CV-Unique-Ug-2

    67
    4
    transformers
    Zero Shot Classification

    Mel-Iza0/zero-shot

    67
    2
    transformers
    Depth Estimation

    AXERA-TECH/RAFT-stereo

    66
    Zero Shot Classification

    mjwong/multilingual-e5-large-xnli

    66
    6
    transformers
    Image Feature Extraction

    facebook/vit-msn-base

    66
    transformers
    Image Feature Extraction

    timm/vit_base_patch32_siglip_256.v2_webli

    66
    1
    timm
    Video Classification

    archit11/videomae-base-finetuned-ucfcrime-full2

    66
    transformers
    Text To Audio

    hamza-amin/mms-tts-urd-fine-tuned

    66
    transformers
    Page 343 of 400