A technique for transferring knowledge from a large teacher model to a smaller student model, producing a compact model that approximates the teacher's performance. Model distillation is key to deploying multimodal AI models in production with low latency and cost.
The teacher model (large, accurate) generates soft predictions (probability distributions over classes or continuous outputs) for training data. The student model (small, fast) is trained to match these soft predictions rather than hard labels. Soft predictions contain richer information about inter-class relationships than hard labels, enabling the student to learn the teacher's reasoning patterns in a compressed form.
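As an illustration, the sketch below (plain NumPy, with made-up logits and class names, not taken from any particular model) contrasts a one-hot hard label with the teacher's temperature-softened predictions. The weight the teacher places on related classes is the extra signal the student learns from.

```python
# A minimal sketch with hypothetical logits: how soft predictions carry
# inter-class information that a hard label discards.
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution, softened by temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()            # shift for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Hypothetical teacher logits for one image over classes [cat, dog, truck].
teacher_logits = [4.0, 3.2, -1.5]

hard_label = np.array([1.0, 0.0, 0.0])       # one-hot: "cat", nothing else
soft_t1 = softmax(teacher_logits, 1.0)       # ~[0.69, 0.31, 0.003]
soft_t4 = softmax(teacher_logits, 4.0)       # ~[0.48, 0.40, 0.12]: softer, but "dog" still ranks far above "truck"

print(hard_label, soft_t1.round(3), soft_t4.round(3))
```

The soft targets tell the student not only that the image is a cat, but also that it looks far more like a dog than a truck, which is exactly the relational knowledge a hard label cannot express.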
Knowledge distillation uses a temperature-scaled softmax, where higher temperatures produce softer probability distributions. The student loss combines a distillation loss (KL divergence between the student's and the teacher's softened outputs) with the standard task loss (cross-entropy against the true labels). Feature distillation goes further, transferring intermediate representations rather than only final outputs. In practice, distillation typically yields a 4-10x size reduction with less than 5% accuracy loss.
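A minimal sketch of this combined loss, assuming PyTorch and illustrative values for the temperature and the mixing weight `alpha` (neither is fixed by the description above):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of the KL-divergence distillation loss and the standard task loss."""
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    # KL divergence from the teacher's soft targets to the student's predictions.
    # Scaling by T^2 keeps its gradient magnitude comparable to the hard-label term.
    distill = F.kl_div(student_log_probs, teacher_probs,
                       reduction="batchmean") * temperature ** 2

    # Standard task loss against the true (hard) labels.
    task = F.cross_entropy(student_logits, labels)

    return alpha * distill + (1.0 - alpha) * task

# Usage with random stand-in logits: a batch of 8 examples over 10 classes.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

In training, the teacher's logits would come from a frozen forward pass of the large model over the same batch, and only the student's parameters receive gradients.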