A technique for generating compact, comparable representations of audio content using spectrogram analysis and deep embedding models, enabling identification of copyrighted music, sound trademarks, and other audio signatures in video and audio files.
Audio fingerprinting converts sound into a mel spectrogram, an image-like time-frequency representation that maps energy across perceptually scaled frequency bands over time. Deep learning models (AST, CLAP, PANNs) then generate embedding vectors from these spectrograms. The embeddings are compared against a reference corpus using approximate nearest neighbor (ANN) search to identify matches. The approach is robust to noise, lossy compression, volume changes, and partial clips.
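As a minimal sketch of the matching step, assume the embeddings have already been computed. FAISS is one common ANN library (an assumed choice, not specified in the original); L2-normalizing the vectors makes inner-product search equivalent to cosine similarity. The 768-dimensional size and the random stand-in corpus are illustrative only.

```python
import numpy as np
import faiss  # pip install faiss-cpu

DIM = 768  # illustrative: matches AST's embedding size

# Stand-in reference corpus: one embedding per known track segment.
rng = np.random.default_rng(0)
reference = rng.standard_normal((10_000, DIM)).astype("float32")
faiss.normalize_L2(reference)  # unit vectors: inner product == cosine similarity

index = faiss.IndexFlatIP(DIM)  # exact search; swap in IndexHNSWFlat at scale
index.add(reference)

# Query with the embedding computed from the unknown clip.
query = rng.standard_normal((1, DIM)).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 candidates with cosine scores
print(ids[0], scores[0])
```

A real system would additionally apply a similarity threshold and aggregate matches across consecutive clip segments before declaring a hit.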
The pipeline starts with audio extraction (FFmpeg for video sources), followed by preprocessing (resampling to 16 kHz, mono conversion, normalization). Mel spectrograms are computed with standard parameters (128 mel bands, 25 ms window, 10 ms hop). Multiple model architectures are viable: AST (Audio Spectrogram Transformer, 768-d), CLAP (Contrastive Language-Audio Pretraining, 512-d), PANNs (Pre-trained Audio Neural Networks, 2048-d). Embeddings are indexed in a vector database for low-latency approximate retrieval, often at sub-millisecond query times.
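A sketch of the extraction and preprocessing stages under the parameters above: FFmpeg is invoked through subprocess, and librosa is an assumed library choice (torchaudio would serve equally well). The peak-normalization step and file paths are illustrative.

```python
import subprocess
import numpy as np
import librosa  # pip install librosa

def extract_audio(video_path: str, wav_path: str, sr: int = 16_000) -> None:
    """Pull a mono 16 kHz WAV track out of a video file with FFmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",           # drop the video stream
         "-ac", "1",      # mono
         "-ar", str(sr),  # resample to 16 kHz
         wav_path],
        check=True,
    )

def log_mel_spectrogram(wav_path: str, sr: int = 16_000) -> np.ndarray:
    """128-band log-mel spectrogram with a 25 ms window and 10 ms hop."""
    y, _ = librosa.load(wav_path, sr=sr, mono=True)
    y = y / (np.abs(y).max() + 1e-9)  # peak normalization (one common choice)
    mel = librosa.feature.melspectrogram(
        y=y,
        sr=sr,
        n_fft=int(0.025 * sr),       # 25 ms window -> 400 samples
        hop_length=int(0.010 * sr),  # 10 ms hop   -> 160 samples
        n_mels=128,
    )
    return librosa.power_to_db(mel, ref=np.max)  # log scale for model input
```

The resulting (128 x frames) array is the input the embedding models consume; in practice, pretrained checkpoints ship their own feature extractors (e.g., Hugging Face's ASTFeatureExtractor) that bundle equivalent preprocessing, so parameters should follow whatever the chosen model expects.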