Voice activity detection (VAD) is a signal processing task that determines which segments of an audio recording contain human speech versus silence, noise, or music. VAD is a critical preprocessing step for speech-related tasks in multimodal audio processing pipelines.
Voice activity detection analyzes audio frames (typically 20-30ms windows) to classify each frame as speech or non-speech. Traditional methods use energy levels, spectral features, and statistical models. Modern neural VAD models process longer context windows and learn discriminative features that distinguish speech from various noise types, producing binary speech/non-speech labels with timestamps.
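The traditional energy-based approach described above can be sketched in a few lines. This is a minimal illustration, not a production method: the function name `energy_vad`, the fixed threshold, and the 25 ms frame size are assumptions chosen for the example, and real systems combine energy with spectral and statistical cues.

```python
import numpy as np

def energy_vad(audio, sample_rate=16000, frame_ms=25, threshold_db=-35.0):
    """Classify each frame as speech (True) or non-speech (False)
    by comparing its log RMS energy against a fixed threshold.
    Hypothetical helper for illustration only."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    decisions = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        # Root-mean-square energy in dB relative to full scale
        rms = np.sqrt(np.mean(frame ** 2))
        energy_db = 20 * np.log10(rms + 1e-10)
        decisions.append(energy_db > threshold_db)
    return decisions

# Toy signal: 0.5 s of near-silence followed by 0.5 s of a loud tone
sr = 16000
t = np.arange(sr // 2) / sr
silence = np.random.randn(sr // 2) * 1e-4   # ~ -80 dB, below threshold
tone = 0.5 * np.sin(2 * np.pi * 220 * t)    # ~ -9 dB, above threshold
flags = energy_vad(np.concatenate([silence, tone]), sr)
```

A fixed threshold like this breaks down under non-stationary noise, which is exactly the gap that the statistical and neural methods mentioned above address.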
Widely used models include Silero VAD (lightweight, fast), pyannote VAD (neural, accurate), and WebRTC VAD (rule-based, ultra-fast). Output is a list of speech segments with start and end timestamps. Most models operate on 16 kHz mono audio and produce decisions at 10-30 ms resolution. Metrics include false alarm rate, missed detection rate, and overall accuracy. Processing speed ranges from roughly 50x to 500x real-time depending on model complexity.
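Converting per-frame decisions into the timestamped segment list described above is a simple run-length merge. The function name `frames_to_segments` is an assumption for this sketch; real toolkits additionally apply smoothing such as minimum-duration and hangover rules before emitting segments.

```python
def frames_to_segments(flags, frame_ms=25):
    """Merge consecutive speech frames into (start_s, end_s) tuples.
    `flags` is a list of per-frame booleans; times are in seconds.
    Hypothetical helper for illustration only."""
    segments = []
    start = None
    for i, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = i                      # speech run begins
        elif not is_speech and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None                   # speech run ends
    if start is not None:                  # audio ended mid-speech
        segments.append((start * frame_ms / 1000, len(flags) * frame_ms / 1000))
    return segments

# Frames 2-4 are speech at 30 ms resolution -> one segment 0.06 s to 0.15 s
segs = frames_to_segments([False, False, True, True, True, False], frame_ms=30)
```

Smoothing matters in practice: without a minimum segment duration, a single noisy frame can split one utterance into many short segments.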