A computer vision task that identifies the type of scene or environment depicted in an image, such as beach, office, or highway. Scene recognition adds valuable contextual metadata to visual content in multimodal search and organization systems.
Scene recognition models analyze the global structure and composition of an image to classify it into a scene category. Unlike object detection, which focuses on individual items, scene recognition captures the overall environment and spatial layout. Models trained on scene datasets learn holistic features, including spatial arrangement, texture patterns, and typical object configurations.
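The final classification step can be illustrated with a minimal sketch. It assumes a model has already produced one raw score (logit) per scene category. The label list, the logit values, and the function names `softmax` and `top_k_scenes` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_scenes(logits, labels, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical labels and logits for a single image. A real model would
# emit one logit per category (for example, 365 for Places365).
labels = ["beach", "office", "highway", "forest"]
logits = [2.0, 0.5, 4.1, -1.0]
predictions = top_k_scenes(logits, labels, k=2)
```

Here `predictions` holds the two highest-scoring scene labels with their probabilities. Keeping several candidates rather than only the top one can be useful when the labels are stored as search metadata, since many images plausibly fit more than one scene category.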
Models are typically pretrained on the Places365 (365 scene categories) or Places205 (205 scene categories) datasets. Common architectures include ResNet, DenseNet, and Vision Transformers, fine-tuned for scene classification. Scene features are complementary to object features and are often extracted from intermediate network layers, which retain spatial layout. Multi-scale feature aggregation helps capture both local textures and global scene structure.
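One common way to aggregate features at multiple scales is spatial pyramid pooling: average an activation map over grids of increasing resolution and concatenate the results. The sketch below assumes this technique, which is not necessarily the one used by any particular scene recognition model. The toy 4x4 map stands in for one channel of an intermediate layer's output, and the names `average_pool_grid` and `multi_scale_descriptor` are hypothetical.

```python
def average_pool_grid(feature_map, grid):
    """Average-pool a 2D map (list of rows) into grid x grid cells, row-major."""
    height, width = len(feature_map), len(feature_map[0])
    if grid > height or grid > width:
        raise ValueError("grid must not exceed the feature map dimensions")
    pooled = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            cell = [feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            pooled.append(sum(cell) / len(cell))
    return pooled


def multi_scale_descriptor(channels, grids=(1, 2)):
    """Concatenate pooled values for every grid size and every channel."""
    descriptor = []
    for grid in grids:
        for channel in channels:
            descriptor.extend(average_pool_grid(channel, grid))
    return descriptor


# Toy single-channel 4x4 activation map (assumed values, for illustration).
toy_channel = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 4.0, 4.0],
    [2.0, 2.0, 4.0, 4.0],
]
descriptor = multi_scale_descriptor([toy_channel])
```

The 1x1 grid contributes a single global average, while the 2x2 grid keeps a coarse record of where activations occur, so the concatenated descriptor carries both global and regional information. Finer grids can be added through the `grids` parameter at the cost of a longer descriptor.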