The ability of AI models to classify inputs into arbitrary categories defined at inference time, without requiring labeled training data for those specific categories.
Zero-shot classification leverages models pretrained on broad datasets to classify inputs into categories the model has never been explicitly trained on. For text, models like BART (via natural language inference) or GPT (via prompting) can assign candidate labels without task-specific training, while embedding-based approaches encode both the input and the candidate labels into a shared representation space and measure similarity to determine the best match. For images, vision-language models like CLIP and SigLIP encode the image and text labels into a joint embedding space and select the label with the highest similarity score. This approach enables immediate classification without collecting and labeling training data for each new category.
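The embedding-similarity route can be sketched as follows. The bag-of-words `embed` function here is a toy stand-in for a real encoder (such as CLIP's text tower or a sentence-embedding model), which would produce learned dense vectors instead; the mechanics of "encode input and labels, pick the closest" are the same.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "encoder"; a real system would use a pretrained
    # model that maps text (or images) to dense vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def zero_shot_classify(text: str, labels: list[str]) -> str:
    # Encode the input and each candidate label, then return the label
    # whose embedding is most similar to the input's embedding.
    input_vec = embed(text)
    return max(labels, key=lambda label: cosine(input_vec, embed(label)))

print(zero_shot_classify(
    "the goalkeeper saved a penalty in the final match",
    ["sports match report", "financial news", "cooking recipe"],
))  # → sports match report
```

Note that the candidate labels are supplied only at inference time; swapping in a new label set requires no retraining, which is the core appeal of the approach.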
Zero-shot classification works through two main mechanisms: natural language inference (NLI), where the model scores whether the input entails a hypothesis sentence constructed from each candidate label (e.g. "This text is about {label}."), and embedding similarity, where both input and labels are encoded into vectors and compared. Vision-language approaches use contrastive learning to align image and text embeddings. At inference time, candidate labels are provided as text prompts, and the model computes similarity scores against the input. Mixpeek supports zero-shot classification through its taxonomy feature, which applies label sets to content during feature extraction without requiring per-label training data.
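The NLI mechanism can be sketched as below. Here `entailment_score` is a hypothetical keyword-overlap stand-in for a real entailment model (such as BART fine-tuned on MNLI), and the hypothesis template is an illustrative assumption; a production system would use the model's predicted entailment probability as each label's score.

```python
def _tokens(text: str) -> set[str]:
    # Crude tokenizer: lowercase and strip surrounding punctuation.
    return {w.strip(".,!?") for w in text.lower().split()}

def entailment_score(premise: str, hypothesis: str) -> float:
    # Placeholder scorer: fraction of hypothesis words found in the premise.
    # A real NLI model would return P(entailment) from its classification head.
    p, h = _tokens(premise), _tokens(hypothesis)
    return len(p & h) / len(h) if h else 0.0

def nli_zero_shot(text: str, labels: list[str],
                  template: str = "This text is about {}.") -> dict[str, float]:
    # Turn each candidate label into a hypothesis and score it against the input.
    return {label: entailment_score(text, template.format(label)) for label in labels}

scores = nli_zero_shot("The senate debate was pure politics", ["politics", "sports"])
print(max(scores, key=scores.get))  # → politics
```

This is the pattern behind NLI-based zero-shot pipelines generally: each label is wrapped in a template, the entailment score becomes the label score, and the highest-scoring label wins.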