A class of generative models that generate data by iteratively denoising random noise via a learned reverse diffusion process. Diffusion models power state-of-the-art image, video, and audio generation in multimodal AI systems.
Diffusion models operate in two phases: forward diffusion gradually adds Gaussian noise to data over many steps until only pure noise remains, and reverse diffusion removes that noise step by step to recover data. A neural network is trained to predict the noise that was added at each step; the sampler then uses that prediction to subtract it. At generation time, the model starts from random noise and iteratively denoises it to produce a realistic sample, as sketched below.
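A minimal training-step sketch in PyTorch, assuming a DDPM-style linear beta schedule and a noise-prediction (epsilon-prediction) network; `model`, `T`, and the schedule endpoints here are illustrative choices, not tied to any particular implementation:

```python
import torch
import torch.nn.functional as F

T = 1000  # number of diffusion steps (DDPM-style; illustrative)
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule (assumption)
alphas_cumprod = torch.cumprod(1.0 - betas, 0)  # cumulative product, \bar{alpha}_t

def training_loss(model, x0):
    """One diffusion training step: noise a clean batch, predict that noise."""
    t = torch.randint(0, T, (x0.shape[0],))             # random timestep per sample
    eps = torch.randn_like(x0)                          # Gaussian noise to add
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps  # closed-form forward diffusion
    return F.mse_loss(model(x_t, t), eps)               # train to predict the noise
```

The closed-form expression for `x_t` is why training is cheap: any timestep can be sampled directly from the clean data, without simulating the noising chain step by step.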
Common backbones include the U-Net (Stable Diffusion, DALL-E 2) and Transformer-based architectures (DiT, used in Sora). Training uses a denoising score-matching objective, in practice noise prediction, over roughly 1000 discretized diffusion steps. Inference uses accelerated samplers (DDIM, DPM-Solver) to cut the step count to 20-50. Conditioning on text prompts is done via cross-attention over CLIP or T5 text embeddings, and a classifier-free guidance scale controls how strongly the output follows the prompt (see the sketch below).
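To make sampling and guidance concrete, here is a sketch of one deterministic DDIM update combined with classifier-free guidance; the `model(x, t, cond)` signature, the use of `None` for the unconditional branch, and the default scale of 7.5 are assumptions for illustration rather than any library's actual API:

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alphas_cumprod, cond, guidance_scale=7.5):
    """One deterministic DDIM step with classifier-free guidance (illustrative)."""
    # Classifier-free guidance: blend conditional and unconditional predictions.
    eps_cond = model(x_t, t, cond)
    eps_uncond = model(x_t, t, None)  # None stands in for an empty-prompt embedding
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # predicted clean sample
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps  # jump to step t_prev
```

Because each DDIM update is deterministic and can jump between non-adjacent timesteps, a sampler can walk a short subsequence of 20-50 steps instead of all 1000, which is the source of the inference speedup mentioned above.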