A visual representation of audio that maps frequency content over time using the mel scale, which approximates human auditory perception. Mel spectrograms are the standard input representation for most neural audio processing models in multimodal AI systems.
A mel spectrogram is computed by first applying a Short-Time Fourier Transform (STFT) to overlapping windows of audio, converting each window from the time domain to the frequency domain, and taking the magnitude or power of each frame. The linear frequency axis is then mapped to the mel scale using triangular filter banks. The mel scale is roughly linear below about 1 kHz and logarithmic above it, so higher frequencies are compressed to match how humans perceive pitch. The result is a 2D image-like representation with time on one axis and mel-frequency bands on the other.
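The two steps described above (STFT, then triangular mel filters) can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the Hz-to-mel conversion uses the common HTK-style formula (other variants exist), and the defaults assume 16 kHz audio with a 25 ms Hann window and a 10 ms hop.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style conversion (an assumption; Slaney-style scales differ slightly).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # n_mels + 2 edge points, equally spaced on the mel scale from 0 Hz to Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        # Triangle: rises from left to center, falls from center to right, zero elsewhere.
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def mel_spectrogram(signal, sr, n_fft=1024, win_length=400, hop_length=160, n_mels=80):
    # Slice the signal into overlapping, Hann-windowed frames (no edge padding).
    window = np.hanning(win_length)
    n_frames = 1 + (len(signal) - win_length) // hop_length
    frames = np.stack([
        signal[t * hop_length : t * hop_length + win_length] * window
        for t in range(n_frames)
    ])
    # Power spectrum per frame; rfft zero-pads each frame to n_fft samples.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    # Project linear-frequency bins onto mel bands: shape (n_mels, n_frames).
    return mel_filterbank(sr, n_fft, n_mels) @ power.T

# Example: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
mel = mel_spectrogram(tone, sr)
```

For the test tone, the energy concentrates in the mel bands whose triangles cover 440 Hz. Production libraries add details this sketch omits, such as edge padding, filter normalization, and alternative mel formulas.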
Typical parameters include 80 or 128 mel bands, a window size of 25 ms, a hop length of 10 ms, and a sample rate of 16 to 22.05 kHz. The output is often converted to a log-mel scale (decibels) and normalized. MFCCs (Mel-Frequency Cepstral Coefficients) apply a DCT to the log-mel spectrogram to produce a more compact, decorrelated representation. Modern deep learning models generally prefer log-mel spectrograms over MFCCs as input, because the network can learn its own features from the richer representation.
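The post-processing steps mentioned here (decibel conversion, normalization, and the DCT that yields MFCCs) might look like the following NumPy sketch. The function names, the 80 dB dynamic-range clamp, the per-utterance mean/variance normalization, and the choice of 13 coefficients are illustrative assumptions rather than fixed conventions.

```python
import numpy as np

def power_to_db(mel_power, floor=1e-10, top_db=80.0):
    # Decibels relative to a power of 1.0; the floor avoids log(0).
    log_mel = 10.0 * np.log10(np.maximum(mel_power, floor))
    # Clamp the dynamic range to top_db below the loudest value (assumed choice).
    return np.maximum(log_mel, log_mel.max() - top_db)

def normalize(log_mel):
    # Zero mean, unit variance over the whole spectrogram (one common option).
    return (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)

def dct_ii_matrix(n_out, n_in):
    # First n_out rows of the orthonormal DCT-II basis for length-n_in inputs.
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_in)) * np.sqrt(2.0 / n_in)
    basis[0] /= np.sqrt(2.0)
    return basis

def mfcc(log_mel, n_mfcc=13):
    # DCT along the mel axis; keeping only the lowest coefficients gives
    # the compact representation.
    return dct_ii_matrix(n_mfcc, log_mel.shape[0]) @ log_mel

# Example with a synthetic (128 bands x 50 frames) power spectrogram.
rng = np.random.default_rng(0)
mel_power = rng.random((128, 50)) + 1e-3
log_mel = power_to_db(mel_power)
features = normalize(log_mel)   # typical neural-network input
coeffs = mfcc(log_mel)          # shape (13, 50)
```

Keeping 13 of 128 coefficients shows why MFCCs are compact, and also why they discard detail that a network could otherwise exploit from the full log-mel input.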