A task that assigns category labels to audio recordings based on their content, such as environmental sounds, music genres, or speech types. Audio classification enables automatic tagging and filtering of audio content in multimodal processing pipelines.
Audio classification models first convert the raw waveform into a time-frequency representation (a mel spectrogram), then feed it through a convolutional or transformer-based classifier that outputs probabilities over a predefined set of categories.
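The waveform-to-spectrogram step above can be sketched from first principles. This is a minimal NumPy-only illustration (not any particular library's implementation): frame the signal, window it, take the FFT power, and project onto triangular mel filters; the parameter values (16 kHz sample rate, 512-point FFT, 160-sample hop) are illustrative assumptions.

```python
import numpy as np

def mel_filterbank(sr, n_fft, n_mels, fmin=0.0, fmax=None):
    # Triangular filters mapping linear FFT bins to mel-spaced bands.
    fmax = fmax or sr / 2
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising slope
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling slope
    return fb

def mel_spectrogram(wave, sr=16000, n_fft=512, hop=160, n_mels=128):
    # Frame, window, FFT power, mel projection, log compression.
    n_frames = 1 + (len(wave) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (frames, n_fft//2+1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T  # (frames, n_mels)
    return np.log(mel + 1e-10)

sr = 16000
wave = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s of a 440 Hz tone
spec = mel_spectrogram(wave)
print(spec.shape)  # (97, 128): the time-frequency grid fed to the classifier
```

In practice a library routine (e.g. from librosa or torchaudio) would be used instead, but the structure is the same: the classifier never sees the raw waveform, only this 2-D log-mel grid, which is why image-style CNN and ViT architectures transfer so directly to audio.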
Models such as AST (Audio Spectrogram Transformer), PANNs, and BEATs achieve strong results on AudioSet (527 event classes). Inputs are typically 10-second audio segments converted to 128-band mel spectrograms. Transfer learning from models pretrained on AudioSet is standard for domain-specific classification tasks. Multi-label classification handles audio that contains multiple simultaneous events, and performance is measured with mAP (mean Average Precision).
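The mAP metric mentioned above can be computed directly: for each class, rank clips by predicted score, average the precision at each true-positive rank, then average across classes. A small self-contained sketch with toy data (the clip scores and the two class names are invented for illustration):

```python
import numpy as np

def average_precision(y_true, y_score):
    # AP for one class: mean of precision at each rank where a positive occurs.
    order = np.argsort(-y_score)          # sort clips by descending score
    hits = y_true[order]
    if hits.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

def mean_average_precision(Y_true, Y_score):
    # mAP: average the per-class APs over all label columns.
    return float(np.mean([average_precision(Y_true[:, c], Y_score[:, c])
                          for c in range(Y_true.shape[1])]))

# Toy multi-label setup: 4 clips, 2 classes (say, "speech" and "music");
# a clip may carry both labels at once.
Y_true = np.array([[1, 0],
                   [1, 1],
                   [0, 1],
                   [0, 0]])
Y_score = np.array([[0.9, 0.2],
                    [0.6, 0.8],
                    [0.3, 0.7],
                    [0.7, 0.4]])
print(round(mean_average_precision(Y_true, Y_score), 3))  # 0.917
```

Class 1 is ranked perfectly (AP = 1.0), while class 0 places a negative clip above a positive one (AP = 5/6), giving mAP = 11/12 ≈ 0.917. Because each class is scored independently, this metric needs no decision threshold and handles overlapping events naturally; `sklearn.metrics.average_precision_score` computes the same quantity.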