A class of generative models that learn to create data by iteratively denoising random noise through a learned reverse diffusion process. Diffusion models power state-of-the-art image, video, and audio generation in multimodal AI systems.
Diffusion models have two phases: forward diffusion gradually adds Gaussian noise to data over many steps until it becomes pure noise, and reverse diffusion learns to remove noise step by step to reconstruct data. A neural network is trained to predict and remove the noise at each step. At generation time, the model starts from random noise and iteratively denoises to produce a realistic sample.
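The forward phase and the training target described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the linear beta schedule, its endpoints, and the helper names (`make_alpha_bar`, `forward_diffuse`, `noise_prediction_loss`) are assumptions chosen for the example. It relies on the standard closed form that lets you jump from clean data directly to any noise level in one step.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear beta schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Noise clean data x0 to step t in one shot:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def noise_prediction_loss(predicted_eps, true_eps):
    """Mean squared error between the network's noise estimate and the noise actually added."""
    return float(np.mean((predicted_eps - true_eps) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((4, 8))  # stand-in for a batch of clean data

# At the last step almost no signal remains, so x_t is close to pure Gaussian noise.
x_t, eps = forward_diffuse(x0, t=len(alpha_bar) - 1, alpha_bar=alpha_bar, rng=rng)
```

During training, a real model would receive `x_t` and `t`, output a noise estimate, and be optimized with `noise_prediction_loss` against `eps`; the network itself is omitted here.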
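Common denoiser architectures are the U-Net (Stable Diffusion, DALL-E 2) and the Transformer (DiT, the design OpenAI describes for Sora). Training minimizes a denoising score-matching objective, in practice usually a noise-prediction loss, over a noise schedule of typically about 1,000 timesteps. At inference, accelerated samplers such as DDIM and DPM-Solver cut this to roughly 20-50 steps. Text conditioning is injected through cross-attention over embeddings from a text encoder such as CLIP or T5. The classifier-free guidance scale controls how strongly the sample follows the prompt: higher values improve prompt adherence at the cost of sample diversity.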
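To make the sampling side concrete, here is a hedged sketch of classifier-free guidance combined with a deterministic DDIM update (eta = 0) over a reduced set of timesteps. The `denoiser(x, t, cond)` callable, the `toy_denoiser` stand-in, the linear beta schedule, and the default guidance scale are all assumptions for illustration; a real system would call a trained U-Net or DiT with text embeddings instead.

```python
import numpy as np

# Assumed linear beta schedule with 1000 training timesteps.
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Push the noise estimate away from the unconditional one, toward the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update from noise level alpha_bar_t to alpha_bar_prev."""
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    # Re-noise that estimate to the (lower) target noise level.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps

def sample(denoiser, shape, alpha_bar, cond=None, num_inference_steps=50,
           guidance_scale=7.5, rng=None):
    """Start from Gaussian noise and denoise over a strided subset of timesteps."""
    rng = rng if rng is not None else np.random.default_rng()
    timesteps = np.linspace(len(alpha_bar) - 1, 0, num_inference_steps).round().astype(int)
    x = rng.standard_normal(shape)
    for i, t in enumerate(timesteps):
        eps_uncond = denoiser(x, t, None)   # prompt dropped
        eps_cond = denoiser(x, t, cond)     # prompt supplied
        eps = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale)
        # After the final step the target is fully clean data (alpha_bar = 1).
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = ddim_step(x, eps, alpha_bar[t], ab_prev)
    return x

def toy_denoiser(x, t, cond):
    """Hypothetical stand-in for a trained network: treats all data as zeros,
    so the noise estimate is simply x rescaled by the current noise level."""
    return x / np.sqrt(1.0 - alpha_bar[t])

result = sample(toy_denoiser, (2, 3), alpha_bar, cond="a prompt",
                num_inference_steps=20, rng=np.random.default_rng(0))
```

Each step costs two denoiser evaluations (with and without the prompt), which is why guided sampling is often implemented by batching both passes together.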