A visual representation of audio that maps frequency content over time using the mel scale, which approximates human auditory perception. Mel spectrograms are the standard input representation for most neural audio processing models in multimodal AI systems.
A mel spectrogram is computed by first applying a Short-Time Fourier Transform (STFT) to convert audio from the time domain to the frequency domain, then mapping the linear frequency axis to the mel scale using triangular filter banks. The mel scale is approximately linear below about 1 kHz and logarithmic above, matching how humans perceive pitch intervals: it compresses higher frequencies, where the ear discriminates less finely. The result is a 2D image-like representation with time on one axis and mel-frequency bands on the other.
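The STFT-then-filter-bank pipeline can be sketched in plain NumPy. This is a minimal illustration, not a production implementation (libraries such as librosa or torchaudio handle padding, windowing, and normalization options more carefully); the parameter values (n_fft=512, hop=160, n_mels=64) and the example 440 Hz tone are illustrative choices, not requirements.

```python
import numpy as np

def hz_to_mel(f):
    # Common mel formula: ~linear below 1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(sr, n_fft, n_mels):
    # n_mels triangular filters, spaced evenly on the mel scale
    # between 0 Hz and the Nyquist frequency
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):           # rising slope
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling slope
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(y, sr, n_fft=512, hop=160, n_mels=64):
    # 1) Frame the signal and apply a Hann window
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    # 2) STFT magnitude squared -> power spectrogram
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 3) Collapse linear-frequency bins into mel bands
    return mel_filter_bank(sr, n_fft, n_mels) @ power.T  # (n_mels, n_frames)

# Example: one second of a 440 Hz sine tone at 16 kHz
sr = 16000
t = np.arange(sr) / sr
S = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr)
print(S.shape)  # (64, 97): 64 mel bands x 97 time frames
```

With a hop of 160 samples at 16 kHz, each frame advances 10 ms, so the time axis has roughly 100 frames per second of audio.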
Typical parameters are 64-128 mel bands (80 is common in speech models), a window length of 25 ms, a hop length of 10 ms, and a sample rate of 16-22.05 kHz. The output is usually converted to a log scale (decibels) and normalized, since perceived loudness is roughly logarithmic in power. MFCCs (Mel-Frequency Cepstral Coefficients) apply a discrete cosine transform (DCT) to the log-mel spectrogram for a more compact, decorrelated representation. Modern deep learning models generally take log-mel spectrograms rather than MFCCs as input, letting the network learn its own features from the richer representation.
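The log-scaling and MFCC steps can be sketched as follows. This is a simplified NumPy illustration under assumed conventions: the decibel conversion uses a small floor to avoid log(0), the MFCC step is a plain (unnormalized) DCT-II along the mel-band axis, and the 64-band random input stands in for a real power mel spectrogram.

```python
import numpy as np

def log_mel(S, eps=1e-10):
    # Convert a power mel spectrogram to decibels (log-mel)
    return 10.0 * np.log10(np.maximum(S, eps))

def mfcc(log_mel_spec, n_mfcc=13):
    # DCT-II along the mel-band axis decorrelates the bands;
    # keeping only the first n_mfcc coefficients gives a compact summary
    n_mels = log_mel_spec.shape[0]
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1)
                   / (2 * n_mels))
    return basis @ log_mel_spec  # (n_mfcc, n_frames)

# Toy input: 64 mel bands x 10 frames of positive power values
S = np.random.rand(64, 10) + 1e-3
L = log_mel(S)
# Per-spectrogram mean/variance normalization, a common preprocessing step
L_norm = (L - L.mean()) / (L.std() + 1e-8)
M = mfcc(L)
print(L.shape, M.shape)  # (64, 10) (13, 10)
```

Note the order: the DCT is taken after the log, which is what makes MFCCs a cepstral representation.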