Image classification is a foundational computer vision task in which a model predicts one or more class labels for a given image. It underpins content organization, filtering, and routing in multimodal data processing pipelines.
Image classification models take an image as input and output a probability distribution over predefined classes. The image passes through a feature extraction backbone (CNN or Vision Transformer) that produces a representation vector, which is then mapped to class probabilities via a classification head. The class with the highest probability is selected as the prediction.
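The backbone-to-prediction flow described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real model: the feature vector, weights, and class names are hypothetical stand-ins for what a trained backbone and classification head would provide.

```python
import math

def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classification_head(features, weights, biases):
    """Linear layer: produce one logit per class from the feature vector."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]

def predict(features, weights, biases, class_names):
    """Return the highest-probability class and the full distribution."""
    probs = softmax(classification_head(features, weights, biases))
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs

# Hypothetical 3-dimensional feature vector standing in for backbone output.
features = [0.5, -1.0, 2.0]
# Hypothetical head parameters: one weight row and one bias per class.
weights = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0], [-1.0, 0.5, 0.0]]
biases = [0.0, 0.0, 0.0]
class_names = ["cat", "dog", "car"]

label, probs = predict(features, weights, biases, class_names)
print(label, [round(p, 3) for p in probs])
```

In practice the feature vector has hundreds or thousands of dimensions and the head is trained jointly with, or on top of, the backbone; the arithmetic of the final step is the same.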
Modern classifiers use Vision Transformers (ViT, DeiT) or efficient ConvNets (EfficientNet, ConvNeXt) pretrained on ImageNet-21K or larger datasets. Transfer learning, by fine-tuning either the classifier head alone or the full model on domain data, is standard practice. For multi-label classification, where an image can belong to several categories at once, the head uses independent sigmoid outputs instead of a softmax. Top-1 and top-5 accuracy, which measure whether the true label is the top prediction or among the five highest-scoring predictions, are the standard evaluation metrics.
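As a rough sketch of the last two points, the snippet below shows sigmoid-based multi-label prediction and a top-k accuracy calculation. The function names, the 0.5 threshold, and the sample scores are illustrative assumptions, not part of any particular library.

```python
import math

def sigmoid(x):
    """Map a logit to an independent per-class probability."""
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_predict(logits, class_names, threshold=0.5):
    """Score each class independently; keep those at or above the threshold."""
    return [name for name, z in zip(class_names, logits)
            if sigmoid(z) >= threshold]

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label index is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)

# Multi-label: an image can receive several tags at once (hypothetical logits).
tags = multilabel_predict([2.0, -1.5, 0.3], ["beach", "indoor", "people"])

# Single-label evaluation: per-class scores for three samples, with true labels.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(tags, top1, top2)
```

Unlike softmax, the sigmoid outputs need not sum to 1, which is what allows several classes to pass the threshold for the same image. The threshold is usually tuned per class or per application rather than fixed at 0.5.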