The process of combining information from multiple data modalities to create a unified representation or make better predictions.
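Multimodal fusion combines signals from different data types (text, image, audio, etc.) so that the resulting representation captures information that no single modality provides on its own. Fusion can happen early (on raw inputs or low-level features), at an intermediate stage (on learned representations), or late (on each modality's predictions).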
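As a rough illustration of the difference between early and late fusion, the sketch below uses plain Python lists in place of real feature extractors. The function names, the linear scorer, and the example numbers are all hypothetical, chosen only to show where in the pipeline the modalities are combined.

```python
def early_fusion(text_feat, image_feat, weights, bias=0.0):
    """Early (feature-level) fusion: concatenate per-modality features,
    then score the joint vector with a single linear model."""
    fused = text_feat + image_feat  # list concatenation, not element-wise addition
    if len(weights) != len(fused):
        raise ValueError("weights must match the length of the fused vector")
    return sum(w * x for w, x in zip(weights, fused)) + bias


def late_fusion(scores, modality_weights):
    """Late (decision-level) fusion: each modality is scored separately,
    and the scores are combined with a weighted average."""
    if len(scores) != len(modality_weights):
        raise ValueError("need exactly one weight per modality score")
    total = sum(modality_weights)
    return sum(w * s for w, s in zip(modality_weights, scores)) / total


# Hypothetical features for one sample (values are made up for illustration).
text_feat = [0.2, 0.7]
image_feat = [0.1, 0.9, 0.4]

# Early fusion: one scorer sees all five features at once.
early_score = early_fusion(text_feat, image_feat, weights=[1.0, -0.5, 0.3, 0.8, 0.0])

# Late fusion: two separate (hypothetical) model outputs, text trusted slightly more.
late_score = late_fusion(scores=[0.8, 0.4], modality_weights=[0.6, 0.4])
```

In the early variant the scorer can learn interactions between text and image features, whereas the late variant only sees each modality's final score. Intermediate fusion sits between the two and typically merges learned hidden representations rather than raw features or final predictions.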
Common techniques include attention mechanisms, cross-modal transformers, and other neural network architectures that align and combine information across modalities. Fusion can be implemented at the feature level, at the decision level, or as a hybrid of the two.
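To make the attention-based approach more concrete, here is a minimal sketch of scaled dot-product cross-attention, the core operation in many cross-modal transformers. It assumes queries come from one modality (say, text tokens) and keys and values from another (say, image regions). It omits the learned projection matrices and multiple heads a real model would use, and the helper names are illustrative only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: vectors from modality A (e.g. text tokens)
    keys, values: vectors from modality B (e.g. image regions)
    Returns one fused vector per query: a weighted mix of modality-B values.
    """
    d = len(keys[0])  # key dimension, used for scaling
    fused = []
    for q in queries:
        # Similarity between this query and every key in the other modality.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Two hypothetical text-token queries attending over three image-region vectors.
text_queries = [[1.0, 0.0], [0.0, 1.0]]
image_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_values = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.7]]

fused_text = cross_attention(text_queries, image_keys, image_values)
```

Each output vector is an image-informed representation of the corresponding text token. In a full model these outputs would usually be added back to the text stream and passed through further layers.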