A computer vision task that assigns a label to every pixel in an image, delineating object boundaries precisely. Segmentation enables fine-grained visual understanding in multimodal systems beyond what bounding boxes provide.
Image segmentation models classify each pixel in an image into a category. Semantic segmentation assigns a class label to every pixel without separating objects of the same class, instance segmentation produces a distinct mask for each object instance, and panoptic segmentation combines both. Most models use an encoder-decoder architecture: the encoder extracts progressively downsampled features, and the decoder upsamples them back to the input resolution to produce per-pixel predictions.
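As a minimal sketch of the encoder-decoder pattern (not any specific published architecture; the layer sizes, class count, and image size below are arbitrary assumptions), the following PyTorch model downsamples with strided convolutions and upsamples with transposed convolutions to emit one class-logit vector per pixel:

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Toy encoder-decoder for semantic segmentation (illustrative only)."""

    def __init__(self, num_classes: int = 21):
        super().__init__()
        # Encoder: strided convolutions reduce spatial resolution by 4x
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Decoder: transposed convolutions restore the original resolution
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, num_classes, kernel_size=2, stride=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output shape: (batch, num_classes, H, W) -- one logit vector per pixel
        return self.decoder(self.encoder(x))

model = TinySegNet(num_classes=21)
image = torch.randn(1, 3, 128, 128)   # dummy RGB image
logits = model(image)                 # (1, 21, 128, 128)
pred = logits.argmax(dim=1)           # per-pixel class labels, shape (1, 128, 128)
print(pred.shape)
```

Taking the argmax over the class dimension converts the dense logits into the final label map, which is the step that turns an encoder-decoder network into a segmenter rather than a classifier.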
Modern architectures include Mask R-CNN for instance segmentation, the Segment Anything Model (SAM) for promptable segmentation, and SegFormer for efficient semantic segmentation. SAM introduced a foundation-model approach in which a single model segments arbitrary objects from point, box, or mask prompts (text prompting was explored in the paper but not released). Output masks are typically stored as run-length encoded (RLE) binary arrays, which is far more compact than saving a full pixel grid per mask.
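A sketch of point-prompted inference, assuming Meta's `segment_anything` package and a locally downloaded ViT-B checkpoint (the checkpoint path, image file, and point coordinates below are placeholders):

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Assumes the ViT-B checkpoint has already been downloaded (path is a placeholder).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# SamPredictor expects an HxWx3 uint8 RGB array.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground point prompt (label 1 = foreground, 0 = background).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks at different granularities
)
best_mask = masks[np.argmax(scores)]  # HxW boolean array
```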
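Run-length encoding stores a binary mask as alternating run counts of 0s and 1s instead of a full pixel grid. Production code usually relies on `pycocotools.mask` for this, but a minimal NumPy sketch (assuming the COCO convention of column-major order with the first run counting zeros) illustrates the idea:

```python
import numpy as np

def rle_encode(mask: np.ndarray) -> list[int]:
    """Encode a binary mask as run lengths, starting with the count of leading 0s."""
    flat = mask.flatten(order="F").astype(np.uint8)  # column-major, as in the COCO format
    change = np.flatnonzero(np.diff(flat)) + 1       # positions where the value flips
    runs = np.diff(np.concatenate(([0], change, [flat.size]))).tolist()
    if flat[0] == 1:          # by convention the first run counts zeros
        runs = [0] + runs
    return runs

def rle_decode(runs: list[int], shape: tuple[int, int]) -> np.ndarray:
    """Reconstruct the binary mask from its run lengths."""
    flat = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    pos, value = 0, 0
    for run in runs:
        flat[pos:pos + run] = value
        pos += run
        value ^= 1            # alternate between 0-runs and 1-runs
    return flat.reshape(shape, order="F")

mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
runs = rle_encode(mask)
assert np.array_equal(rle_decode(runs, mask.shape), mask)
print(runs)  # [5, 2, 2, 2, 5]
```

For a mostly empty high-resolution mask, the run list is orders of magnitude smaller than the dense array, which is why RLE is the standard storage format for segmentation annotations.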