An unsupervised technique that automatically discovers latent thematic structures in large collections of documents. Topic modeling organizes and surfaces content themes in multimodal systems where manual categorization is impractical.
Topic models analyze word co-occurrence patterns across a document collection to identify groups of words that frequently appear together, representing latent topics. Each document is represented as a mixture of topics, and each topic is a distribution over words. This enables automatic discovery of thematic structure without requiring predefined categories or labeled data.
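To make the mixture-of-topics idea concrete, the minimal sketch below fits a small LDA model with scikit-learn and prints the top words per topic. The toy corpus, topic count, and other parameters are illustrative assumptions, not drawn from the text above.

```python
# Minimal LDA sketch with scikit-learn (corpus and parameters are illustrative).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are common household pets",
    "stock markets fell as interest rates rose",
    "investors watch inflation and interest rates",
]

# Bag-of-words counts: topic models work from word co-occurrence, not word order.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Each document becomes a mixture over n_components topics;
# each topic is a distribution over the vocabulary.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # shape: (n_docs, n_topics)

vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [vocab[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {k}: {top_words}")
```

The `doc_topic` matrix holds each document's topic proportions, and `lda.components_` holds each topic's word weights, mirroring the document-as-mixture and topic-as-word-distribution view described above.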
Classical approaches include Latent Dirichlet Allocation (LDA), a probabilistic generative model, and Non-negative Matrix Factorization (NMF). Modern neural topic models use variational autoencoders (ProdLDA, ETM) or leverage pretrained embeddings. BERTopic combines sentence embeddings, UMAP dimensionality reduction, and HDBSCAN clustering to produce highly coherent topics. Model quality is typically evaluated with topic coherence and topic diversity metrics.
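For the embedding-based approach, a minimal BERTopic sketch is shown below. The sample documents are invented for illustration, and in practice BERTopic expects a much larger corpus than this; by default it uses a sentence-transformer embedding model with UMAP and HDBSCAN internally.

```python
# Minimal BERTopic sketch (documents are illustrative and far too few for
# real use; defaults apply sentence embeddings + UMAP + HDBSCAN internally).
from bertopic import BERTopic

docs = [
    "The goalkeeper made a stunning save in the final minute.",
    "Midfielders pressed high to win the ball back quickly.",
    "The central bank raised interest rates to curb inflation.",
    "Bond yields climbed after the inflation report.",
]

topic_model = BERTopic(min_topic_size=2)
topics, probs = topic_model.fit_transform(docs)

# Inspect the discovered topics and their top words.
print(topic_model.get_topic_info())
```

Because clustering replaces the probabilistic generative model, each document is assigned to a single topic (or to an outlier cluster), which is one practical difference from LDA's soft topic mixtures.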