A technique for transferring knowledge from a large teacher model to a smaller student model, producing a compact model that approximates the teacher's performance. Model distillation is key to deploying multimodal AI models in production with low latency and cost.
The teacher model (large, accurate) generates soft predictions (probability distributions over classes or continuous outputs) for training data. The student model (small, fast) is trained to match these soft predictions rather than hard labels. Soft predictions contain richer information about inter-class relationships than hard labels, enabling the student to learn the teacher's reasoning patterns in a compressed form.
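To make the difference between soft predictions and hard labels concrete, here is a minimal sketch in plain Python. The class names and teacher logits are hypothetical values chosen for illustration, not outputs of any real model.

```python
import math

def softmax(logits):
    """Convert raw scores to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label(probs):
    """Collapse a distribution to a one-hot vector that keeps only the top class."""
    top = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == top else 0.0 for i in range(len(probs))]

# Hypothetical teacher logits for the classes: cat, dog, truck
teacher_logits = [4.0, 3.0, -2.0]

soft = softmax(teacher_logits)   # keeps the ranking among the non-top classes
hard = hard_label(soft)          # discards everything except the argmax

print("soft:", [round(p, 4) for p in soft])
print("hard:", hard)
```

In this sketch the soft distribution still records that "dog" is far more plausible than "truck" for this input, while the one-hot label treats both as equally wrong. That ordering is the extra signal the student receives.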
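Knowledge distillation uses a temperature-scaled softmax: the logits are divided by a temperature T before the softmax is applied, so higher temperatures produce softer probability distributions. The student loss combines the distillation loss (KL divergence between the teacher's and the student's softened outputs) with the standard task loss (cross-entropy with true labels), usually as a weighted sum. Feature distillation transfers intermediate representations rather than just final outputs. Common compressions achieve 4-10x size reduction with less than 5% accuracy loss, though the actual trade-off depends on the task and on the teacher and student architectures.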
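The combined loss described above can be sketched for a single example in plain Python. This is an illustration under stated assumptions, not a reference implementation: the function names, the default `temperature=4.0`, and the weighting `alpha=0.5` are hypothetical choices. Multiplying the soft term by T² is the common convention for keeping its gradient scale comparable across temperatures.

```python
import math

def softmax_t(logits, temperature=1.0):
    """Temperature-scaled softmax: divide logits by T, then normalize."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cross_entropy(probs, true_index, eps=1e-12):
    """Cross-entropy against a hard label given as a class index."""
    return -math.log(probs[true_index] + eps)

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of the soft (teacher-matching) and hard (label) losses.

    temperature and alpha are assumed hyperparameters, tuned per task.
    """
    p_teacher = softmax_t(teacher_logits, temperature)
    p_student = softmax_t(student_logits, temperature)
    # Soft term: KL divergence at temperature T, rescaled by T^2.
    soft_loss = kl_divergence(p_teacher, p_student) * temperature ** 2
    # Hard term: ordinary cross-entropy at T = 1 against the true label.
    hard_loss = cross_entropy(softmax_t(student_logits, 1.0), true_index)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Hypothetical logits for one training example with three classes.
teacher = [4.0, 3.0, -2.0]
student = [2.0, 2.5, 0.0]
print(distillation_loss(student, teacher, true_index=0))
```

In a real training loop the same computation would run over batches in a framework with automatic differentiation, and the teacher's outputs would be treated as constants so that only the student receives gradients.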