Technology that converts written text into spoken audio with natural intonation and rhythm. TTS enables audio generation from text content, accessibility features, and voice interfaces in multimodal applications.
Many modern TTS systems use a two-stage pipeline: a text-to-spectrogram (acoustic) model that converts text into a mel spectrogram capturing speech patterns, and a vocoder that converts the spectrogram into an audio waveform. The first stage handles the linguistic aspects (pronunciation, prosody, pacing), while the vocoder is responsible for producing high-fidelity audio. End-to-end models combine both stages into a single network that maps text directly to a waveform.
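To make the data flow between the two stages concrete, here is a minimal sketch in Python. It is not a real TTS model: `toy_acoustic_model` and `toy_vocoder` are hypothetical stand-ins that only reproduce the shapes involved, and the constants (22,050 Hz sample rate, hop length of 256, 80 mel bands, 5 frames per character) are assumed values chosen for illustration.

```python
import math
from typing import List

SAMPLE_RATE = 22050   # assumed output sample rate in Hz
HOP_LENGTH = 256      # assumed number of audio samples per spectrogram frame
N_MELS = 80           # assumed number of mel bands per frame
FRAMES_PER_CHAR = 5   # toy fixed duration; a real model predicts durations

Spectrogram = List[List[float]]  # shape: [frames][N_MELS]


def toy_acoustic_model(text: str) -> Spectrogram:
    """Stage 1 stand-in: map text to a mel-spectrogram-shaped array."""
    frames: Spectrogram = []
    for ch in text:
        # Fake "pronunciation": pick one active band from the character code.
        band = ord(ch) % N_MELS
        for _ in range(FRAMES_PER_CHAR):
            frame = [0.0] * N_MELS
            frame[band] = 1.0
            frames.append(frame)
    return frames


def toy_vocoder(mel: Spectrogram) -> List[float]:
    """Stage 2 stand-in: emit HOP_LENGTH waveform samples per frame."""
    samples: List[float] = []
    for i, frame in enumerate(mel):
        band = frame.index(max(frame))
        freq = 100.0 + 50.0 * band  # arbitrary band-to-frequency mapping
        for n in range(HOP_LENGTH):
            t = (i * HOP_LENGTH + n) / SAMPLE_RATE
            samples.append(math.sin(2 * math.pi * freq * t))
    return samples


def synthesize(text: str) -> List[float]:
    """Chain the two stages: text -> mel spectrogram -> waveform."""
    return toy_vocoder(toy_acoustic_model(text))


audio = synthesize("hi")
print(len(audio) / SAMPLE_RATE, "seconds of audio")
```

The point of the sketch is the interface: the spectrogram is the only thing passed between the stages, which is why acoustic models and vocoders can be trained and swapped independently. The waveform length is always the frame count times the hop length.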
Leading architectures include Tacotron 2 (autoregressive), FastSpeech 2 (non-autoregressive, faster), VITS (end-to-end), and XTTS (multi-speaker, multilingual). Neural vocoders like HiFi-GAN generate audio at sample rates such as 22.05 or 24 kHz at real-time or faster speeds. Zero-shot voice cloning models can mimic a speaker's voice from a short reference clip, without any fine-tuning on that speaker. Quality is evaluated using the Mean Opinion Score (MOS), the average of ratings given by human listeners, typically on a 1-to-5 scale.
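The two measurements mentioned above, listener-rated quality and synthesis speed, reduce to simple arithmetic. The following Python sketch is one way to compute them under stated assumptions: ratings are integers on a 1-to-5 scale, the confidence interval uses a normal approximation, and speed is expressed as a real-time factor (synthesis time divided by audio duration). The function names and sample ratings are illustrative, not taken from any particular toolkit.

```python
import math
import statistics
from typing import List, Tuple


def mean_opinion_score(ratings: List[int]) -> Tuple[float, float]:
    """Return (MOS, 95% confidence half-width) for 1-to-5 listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on a 1-to-5 scale")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 standard errors on each side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time per second of audio; values below 1.0 beat real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


ratings = [4, 5, 4, 3, 4, 5, 4, 4]      # made-up ratings for illustration
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.3f} +/- {ci:.3f}")

# Example: 0.5 s of compute to produce 2.0 s of audio.
print("RTF =", real_time_factor(0.5, 2.0))  # RTF = 0.25
```

Reporting the confidence interval alongside the MOS matters because listener panels are usually small, so two systems whose intervals overlap cannot be reliably ranked from the mean alone.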