A foundational computer vision task in which a model predicts one or more class labels for a given image. Image classification underpins content organization, filtering, and routing in multimodal data processing pipelines.
Image classification models take an image as input and output a probability distribution over predefined classes. The image passes through a feature extraction backbone (CNN or Vision Transformer) that produces a representation vector, which is then mapped to class probabilities via a classification head. The class with the highest probability is selected as the prediction.
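The final step described above, mapping a representation vector to class probabilities and picking the most probable class, can be sketched in a few lines. This is a minimal illustration, not a real model: it assumes a single linear layer as the classification head, and the feature vector, weights, biases, and class names are made-up values standing in for the output of a trained backbone.

```python
import math

def softmax(logits):
    """Convert raw class scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_head(features, weights, biases):
    """Assumed classification head: one logit per class, logits = W @ f + b."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]

def predict(features, weights, biases, class_names):
    """Return the most probable class name and the full distribution."""
    probs = softmax(linear_head(features, weights, biases))
    best = max(range(len(probs)), key=lambda i: probs[i])
    return class_names[best], probs

# Hypothetical values: in practice the features come from a CNN or ViT
# backbone and the weights are learned during training.
features = [0.5, -1.0, 2.0]
weights = [
    [1.0, 0.0, 0.5],   # "cat"
    [0.0, 1.0, 0.0],   # "dog"
    [-0.5, 0.5, 1.0],  # "bird"
]
biases = [0.0, 0.1, -0.2]
class_names = ["cat", "dog", "bird"]

label, probs = predict(features, weights, biases, class_names)
print(label, [round(p, 3) for p in probs])
```

With these illustrative numbers the logits are 1.5, -0.9, and 1.05, so the first class receives the highest probability and is returned as the prediction.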
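Modern classifiers use Vision Transformers (ViT, DeiT) or efficient ConvNets (EfficientNet, ConvNeXt) pretrained on ImageNet-21K or larger datasets. Transfer learning is standard practice: either only the classification head or the full model is fine-tuned on domain data. For multi-label classification, where an image can belong to several categories at once, the softmax is replaced by independent per-class sigmoid outputs so that each class is scored separately. For single-label classification, the standard evaluation metrics are top-1 accuracy (the highest-scoring class matches the true label) and top-5 accuracy (the true label is among the five highest-scoring classes).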
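Two of the ideas in this paragraph are easy to show concretely: per-class sigmoid scoring for multi-label prediction, and top-k accuracy for evaluation. The sketch below rests on stated assumptions. The 0.5 decision threshold is a common default rather than something the text specifies, the function names are hypothetical, and all logits and scores are invented. Because the toy example has only three classes, it reports top-2 rather than top-5 accuracy; the computation is the same for any k.

```python
import math

def sigmoid(x):
    """Squash a logit into (0, 1), independently of the other classes."""
    return 1.0 / (1.0 + math.exp(-x))

def multi_label_predict(logits, class_names, threshold=0.5):
    """Keep every class whose sigmoid score reaches the threshold.

    The threshold of 0.5 is an assumed default; in practice it is
    usually tuned per class on validation data.
    """
    return [name for name, z in zip(class_names, logits) if sigmoid(z) >= threshold]

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Multi-label: an image can receive several tags at once (made-up logits).
tags = multi_label_predict([2.0, -1.0, 0.3], ["beach", "indoor", "people"])
print(tags)

# Single-label evaluation on four made-up samples with three classes.
scores = [
    [0.1, 0.7, 0.2],  # true class 1
    [0.5, 0.3, 0.2],  # true class 1
    [0.2, 0.3, 0.5],  # true class 0
    [0.6, 0.1, 0.3],  # true class 0
]
labels = [1, 1, 0, 0]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```

Unlike softmax, the sigmoid scores here do not have to sum to 1, which is what allows more than one tag to pass the threshold for the same image.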