A branch of artificial intelligence that trains models to extract meaningful information from images, video, and other visual inputs, enabling tasks like object detection, classification, segmentation, and scene understanding.
Computer vision systems process visual data through neural network architectures trained on large datasets of labeled images. The models learn a hierarchy of visual features: edges and textures at lower layers, shapes and parts at middle layers, and complete objects and scenes at higher layers. Given a new image, the model applies these learned features to perform tasks such as classification (what is in the image), detection (where specific objects are), segmentation (pixel-level labeling), and embedding (producing a vector representation for similarity search). These capabilities form the foundation of visual search, content moderation, and video understanding systems.
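The "edges at lower layers" idea can be made concrete with a single hand-written convolution. The sketch below is illustrative only (a fixed Sobel kernel in plain NumPy, not a trained network): it shows how a small filter sliding over an image produces strong responses exactly where an edge sits, which is the kind of feature early CNN layers learn automatically.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid-mode 2D convolution (cross-correlation) of a grayscale image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel filter: responds to vertical edges, similar to filters
# that emerge in the first layers of a trained CNN.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Toy 8x8 image: dark left half, bright right half -> one vertical edge.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

response = conv2d(image, sobel_x)
# The filter is silent over flat regions and fires along the boundary.
edge_columns = np.argmax(np.abs(response), axis=1)
print(response.shape)   # (6, 6)
print(edge_columns)     # every row peaks at the edge location
```

A real model stacks many such filters, learns their weights from data, and composes their outputs layer by layer into the shapes, parts, and objects described above.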
Modern computer vision is dominated by two architecture families: Vision Transformers (ViTs) and convolutional neural networks (CNNs such as ResNet and EfficientNet). Vision transformers split images into fixed-size patches and process them as a sequence of tokens, enabling attention across spatial regions. For multimodal applications, vision-language models such as CLIP and SigLIP learn a joint embedding space for images and text, enabling cross-modal retrieval. Mixpeek leverages these models in its image and video feature extractors, applying them at scale through its distributed processing pipeline.
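Two of the mechanisms above can be sketched in a few lines of NumPy. The first function shows how a ViT turns an image into a patch sequence (using the standard 224px / 16px-patch configuration); the second shows CLIP-style cross-modal retrieval as cosine similarity in a shared embedding space. The embedding vectors here are hand-made stand-ins, not outputs of a real CLIP model.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches,
    as a Vision Transformer does before adding position embeddings."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    return (image
            .reshape(h // patch, patch, w // patch, patch, c)
            .transpose(0, 2, 1, 3, 4)       # group pixels by patch
            .reshape(-1, patch * patch * c))  # (num_patches, patch*patch*c)

# Standard ViT input: a 224x224 RGB image cut into 16x16 patches.
image = np.random.rand(224, 224, 3)
tokens = patchify(image, 16)
print(tokens.shape)  # (196, 768): 14*14 patches, each 16*16*3 values

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# In a joint embedding space, matching image/text pairs land close
# together. These 3-d vectors are illustrative stand-ins.
text_emb = np.array([[1.0, 0.0, 0.0]])       # query, e.g. "a photo of a dog"
image_embs = np.array([[0.9, 0.1, 0.0],      # dog photo (closest to query)
                       [0.0, 1.0, 0.0],      # unrelated image
                       [0.0, 0.0, 1.0]])     # unrelated image
best = int(np.argmax(cosine_sim(text_emb, image_embs)))
print(best)  # 0: the dog photo is retrieved
```

Production systems work the same way at scale: embed once, index the vectors, and answer text queries with a nearest-neighbor search over image embeddings.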