A model compression technique that removes unnecessary weights or structures from neural networks to reduce size and computation without significantly degrading performance. Pruning helps deploy multimodal AI models efficiently in resource-constrained environments.
Pruning identifies and removes neural network parameters that contribute least to the model's output. Unstructured pruning zeros out individual weights based on magnitude or gradient importance. Structured pruning removes entire neurons, channels, or attention heads, producing models that are genuinely smaller and faster on standard hardware. After pruning, the model is typically fine-tuned to recover any lost accuracy.
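As a concrete illustration, here is a minimal PyTorch sketch of both styles using the `torch.nn.utils.prune` utilities; the toy two-layer model and the 50% / 30% pruning amounts are arbitrary placeholders, not recommended settings:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model standing in for one block of a larger network.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
fc = model[0]

# Unstructured pruning: zero out the 50% of weights with the smallest
# L1 magnitude. The tensor shape is unchanged; a binary mask
# (fc.weight_mask) is re-applied on every forward pass.
prune.l1_unstructured(fc, name="weight", amount=0.5)
print(f"sparsity: {(fc.weight == 0).float().mean().item():.2%}")

# Structured pruning: remove the 30% of output neurons (rows, dim=0)
# with the smallest L2 norm, i.e. entire units at once.
prune.ln_structured(model[2], name="weight", amount=0.3, n=2, dim=0)

# Fold the masks into the weight tensors and drop the pruning hooks.
# The model would then typically be fine-tuned to recover accuracy.
prune.remove(fc, "weight")
prune.remove(model[2], "weight")
```

Note that unstructured pruning as shown only produces zeros in a dense tensor; realizing speedups from it requires sparse kernels or hardware support, whereas the structured variant shrinks actual compute on standard hardware.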
Magnitude pruning removes weights with the smallest absolute values. Movement pruning instead ranks weights by how they change during fine-tuning. The lottery ticket hypothesis suggests that dense networks contain sparse subnetworks which, trained in isolation from their original initialization, can match the full network's accuracy. Sparsity levels of 50-90% are common, with structured pruning typically achieving 2-4x speedup. The iterative prune-retrain cycle, sketched below, gradually increases sparsity while maintaining accuracy.
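A schematic sketch of that iterative cycle using global magnitude pruning in PyTorch; the `fine_tune` routine, the data loader, and the five-step schedule toward 90% sparsity are illustrative assumptions rather than a fixed recipe:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def fine_tune(model, loader, epochs=1, lr=1e-4):
    # Short recovery phase (placeholder training loop); pruned weights
    # stay zero because the mask is re-applied on every forward pass.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

def global_sparsity(params):
    total = sum(getattr(m, n).numel() for m, n in params)
    zeros = sum((getattr(m, n) == 0).sum().item() for m, n in params)
    return zeros / total

def iterative_magnitude_prune(model, loader, target=0.9, steps=5):
    params = [(m, "weight") for m in model.modules()
              if isinstance(m, (nn.Linear, nn.Conv2d))]
    # Each call prunes a fraction of the currently unpruned weights,
    # so choose the per-step fraction p so the compounded sparsity
    # lands on the target: (1 - p) ** steps == 1 - target.
    p = 1.0 - (1.0 - target) ** (1.0 / steps)
    for step in range(steps):
        prune.global_unstructured(
            params, pruning_method=prune.L1Unstructured, amount=p)
        fine_tune(model, loader)  # recover accuracy before pruning more
        print(f"step {step + 1}: sparsity {global_sparsity(params):.2%}")
    for m, n in params:           # bake masks into the weight tensors
        prune.remove(m, n)
```

Pruning gradually and retraining between steps, rather than cutting to the target sparsity in one shot, gives the surviving weights a chance to compensate at each stage, which is why the iterative cycle tends to preserve accuracy at high sparsity.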