A machine learning paradigm in which the model identifies the most informative unlabeled examples for human annotation, maximizing the value of each label. Active learning reduces labeling costs when building multimodal AI training datasets.
Active learning operates in a loop: the model predicts on unlabeled data, a selection strategy identifies the most informative examples, a human annotator labels those examples, and the model is retrained on the expanded labeled set. By focusing labeling effort on examples the model is most uncertain about, or that would most improve performance, active learning reaches a given accuracy with far fewer labels than random sampling would require.
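A minimal sketch of this loop, assuming pool-based least-confidence uncertainty sampling with a scikit-learn classifier on synthetic data (the dataset, model choice, and round count here are illustrative, not a prescribed setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: a small labeled seed set plus a large unlabeled pool.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labeled = list(range(20))                    # indices with known labels
unlabeled = list(range(20, len(X)))          # indices awaiting annotation

model = LogisticRegression(max_iter=1000)

for _ in range(10):                          # 10 annotation rounds
    model.fit(X[labeled], y[labeled])

    # Least-confidence uncertainty sampling: score each pooled example by
    # 1 - (max predicted class probability), then query the most uncertain.
    proba = model.predict_proba(X[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)
    pick = unlabeled[int(np.argmax(uncertainty))]

    # "Annotation" step: here we just look up the true label; in practice
    # a human annotator labels the selected example at this point.
    labeled.append(pick)
    unlabeled.remove(pick)

print(f"labels used: {len(labeled)}, "
      f"pool accuracy: {model.score(X[unlabeled], y[unlabeled]):.3f}")
```

In a real pipeline the `y[pick]` lookup is replaced by a call out to an annotation tool, and each round would typically query a batch of examples rather than one.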
Selection strategies include uncertainty sampling (choose examples the model is least confident about), query-by-committee (choose examples where an ensemble of models disagrees), diversity sampling (choose examples that cover different regions of the data space), and expected model change (choose examples that would alter the model's parameters most). Pool-based active learning selects from a fixed pool of unlabeled data, while stream-based active learning decides on each example as it arrives. Batch active learning selects multiple examples per round for annotation efficiency, as in the sketch below.
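As a rough illustration of query-by-committee combined with batch selection, the sketch below scores disagreement with vote entropy, a standard formulation; the function name and the toy vote matrix are our own:

```python
import numpy as np

def vote_entropy(votes: np.ndarray, n_classes: int) -> np.ndarray:
    """Per-example disagreement score for query-by-committee.

    votes: array of shape (n_models, n_examples) holding each committee
    member's predicted class label. Higher entropy = more disagreement.
    """
    entropy = np.zeros(votes.shape[1])
    for c in range(n_classes):
        frac = (votes == c).mean(axis=0)   # fraction of models voting c
        mask = frac > 0                    # skip zero terms to avoid log(0)
        entropy[mask] -= frac[mask] * np.log(frac[mask])
    return entropy

# Batch selection: query the k examples the committee disagrees on most.
votes = np.array([[0, 1, 1, 0],            # model A's predictions
                  [0, 1, 0, 1],            # model B's predictions
                  [0, 0, 1, 1]])           # model C's predictions
scores = vote_entropy(votes, n_classes=2)
batch = np.argsort(-scores)[:2]            # top-2 most contested examples
print(batch, scores)
```

Selecting a whole batch by disagreement alone can pick near-duplicates; production systems often mix in a diversity criterion so the batch spans different regions of the data.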