The process of separating desired audio content from background noise and interference using signal processing or neural network methods. Audio denoising improves the quality of audio inputs for downstream processing in multimodal AI pipelines.
Audio denoising methods separate clean speech or other target audio from noise. Traditional methods such as spectral subtraction and Wiener filtering estimate the noise spectrum and attenuate it, without requiring training data. Modern neural approaches pass the noisy spectrogram through a neural network that predicts either a clean spectrogram or a multiplicative mask that attenuates noise in each time-frequency bin. The network learns the mapping from noisy to clean audio from paired clean/noisy training examples.
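The multiplicative-mask idea can be illustrated with a minimal sketch of a spectral-subtraction-style gain, assuming NumPy is available. The helper names (`stft`, `suppression_mask`), the frame sizes, the gain floor, and the synthetic tone-plus-noise signal are all illustrative assumptions. The noise estimate here is an oracle computed from the known noise signal; a real system would estimate it from noise-only segments or predict the mask with a trained network.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Hann-windowed short-time Fourier transform, shape (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def suppression_mask(noisy_spec, noise_power, floor=0.05):
    """Per-bin gain max(1 - N / |Y|^2, floor), always within [floor, 1]."""
    noisy_power = np.maximum(np.abs(noisy_spec) ** 2, 1e-12)
    return np.maximum(1.0 - noise_power / noisy_power, floor)

# Synthetic demo: a 440 Hz tone buried in white noise (illustrative values).
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t)
noise = 0.5 * rng.standard_normal(sr)
noisy = clean + noise

clean_spec = stft(clean)
noisy_spec = stft(noisy)
# Oracle noise estimate: average noise power per frequency bin.
noise_power = np.mean(np.abs(stft(noise)) ** 2, axis=0)

mask = suppression_mask(noisy_spec, noise_power)
denoised_spec = mask * noisy_spec  # the mask scales magnitude; noisy phase is kept

# Mean squared spectral error against the clean signal, before and after masking.
err_before = np.mean(np.abs(noisy_spec - clean_spec) ** 2)
err_after = np.mean(np.abs(denoised_spec - clean_spec) ** 2)
```

A neural mask-based denoiser replaces `suppression_mask` with a network that predicts the gain for each bin directly from the noisy spectrogram; the final multiplication step stays the same.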
Widely used systems include RNNoise, Meta's Demucs-based denoiser, and models built for Microsoft's Deep Noise Suppression (DNS) Challenge. Architectures range from lightweight RNN models that run in real time on a CPU to large U-Net models that trade speed for quality. Processing can operate in the time domain (on the waveform) or in the frequency domain (on the spectrogram). Common metrics include PESQ (Perceptual Evaluation of Speech Quality), SI-SDR (scale-invariant signal-to-distortion ratio), and STOI (Short-Time Objective Intelligibility).
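Of these metrics, SI-SDR is simple enough to compute directly. The following is a minimal sketch, assuming NumPy and time-aligned one-dimensional signals of equal length. The function name `si_sdr`, the `eps` stabilizer, and the mean-removal step are choices made for this illustration rather than a reference implementation. PESQ and STOI are considerably more involved and are usually computed with dedicated packages.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    # Remove DC offset so the projection is not affected by a constant bias.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the optimally scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    # Everything not explained by the scaled reference counts as distortion.
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)

# Illustrative comparison: a lightly corrupted and a heavily corrupted estimate.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
reference = np.sin(2 * np.pi * 440 * t)
noise = rng.standard_normal(t.size)
light = reference + 0.1 * noise
heavy = reference + 0.5 * noise

score_light = si_sdr(light, reference)
score_heavy = si_sdr(heavy, reference)
```

Because the target is rescaled to best match the estimate, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is the property that distinguishes SI-SDR from plain SDR.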