Fixed-size dense vector representations that encode the semantic and acoustic properties of audio signals. Audio embeddings enable similarity search, classification, and cross-modal retrieval in multimodal systems that process sound.
Audio embedding models convert raw audio waveforms or spectrograms into fixed-size vectors that capture acoustic features, speaker characteristics, musical attributes, or semantic content. The audio is passed through a pretrained neural network encoder that maps variable-length input into a fixed-dimensional embedding space where similar sounds lie close together.
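A minimal sketch of this pipeline in Python, assuming torchaudio for the mel-spectrogram step. The `ToyAudioEncoder` below is an untrained placeholder standing in for a real pretrained encoder, and the clip lengths and 512-dimensional output are illustrative rather than tied to any particular model.

```python
import torch
import torch.nn as nn
import torchaudio

class ToyAudioEncoder(nn.Module):
    """Untrained stand-in for a pretrained audio encoder (illustrative only)."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Adaptive pooling collapses the variable time axis, so clips of any
        # length map to an embedding of the same dimensionality.
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, mel):  # mel: (batch, 1, n_mels, time)
        h = self.pool(self.conv(mel)).flatten(1)
        return nn.functional.normalize(self.proj(h), dim=-1)

mel_transform = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)
encoder = ToyAudioEncoder()

# Two synthetic "clips" of different lengths stand in for real audio files.
clip_a = torch.randn(1, 16000 * 3)  # 3 seconds at 16 kHz
clip_b = torch.randn(1, 16000 * 5)  # 5 seconds at 16 kHz

emb_a = encoder(mel_transform(clip_a).unsqueeze(1))
emb_b = encoder(mel_transform(clip_b).unsqueeze(1))

# Embeddings are L2-normalized, so the dot product equals cosine similarity.
similarity = (emb_a * emb_b).sum(dim=-1)
print(emb_a.shape, similarity.item())  # torch.Size([1, 512]) and a scalar score
```

Note that both clips, despite their different durations, produce embeddings of identical shape, which is what makes downstream similarity search and classification straightforward.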
Models like CLAP (Contrastive Language-Audio Pretraining), VGGish, and OpenL3 produce audio embeddings of 128-512 dimensions. CLAP aligns audio and text in a shared embedding space, analogous to CLIP for images. Audio is typically preprocessed into mel spectrograms (e.g., 128 mel bands at a 16 kHz sample rate, though exact settings vary by model) before encoding. Self-supervised models such as Audio-MAE and SSAST learn representations from unlabeled audio via masked spectrogram modeling.
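A sketch of that preprocessing step using librosa, which is one common choice for computing mel spectrograms. The file path, FFT size, and hop length here are illustrative assumptions; each model documents its own expected parameters.

```python
import numpy as np
import librosa

# Resample to 16 kHz and compute a 128-band mel spectrogram, roughly the
# input format many audio encoders expect. "clip.wav" is a placeholder path.
waveform, sr = librosa.load("clip.wav", sr=16000, mono=True)
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_fft=1024,
                                     hop_length=160, n_mels=128)
# Log (dB) scaling is standard before feeding the spectrogram to an encoder.
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (128, n_frames) -- n_frames grows with clip length
```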