Foundation models trained on large corpora, often extended to multimodal use with image or audio inputs.
Large Language Models (LLMs) are trained on vast amounts of text data to model and generate human-like language. They can be extended to multimodal tasks by incorporating image or audio inputs, enabling cross-modal understanding and generation.
LLMs use transformer architectures to process and generate text. They can be fine-tuned for specific tasks or extended with additional modalities using techniques like cross-attention and multimodal embeddings.
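A common way to attach a new modality is a cross-attention layer in which text tokens act as queries and projected image (or audio) features act as keys and values. The following is a minimal sketch of that idea in PyTorch; the class name, dimensions, and the simple linear projector are illustrative assumptions, not any specific model's implementation.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Illustrative fusion block: text tokens attend over image patch embeddings."""

    def __init__(self, text_dim=768, image_dim=1024, num_heads=8):
        super().__init__()
        # Project image features into the text embedding space (an assumption;
        # real systems may instead use a learned adapter or resampler module).
        self.image_proj = nn.Linear(image_dim, text_dim)
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_tokens, image_patches):
        # text_tokens:   (batch, num_text_tokens, text_dim)
        # image_patches: (batch, num_patches, image_dim), e.g. from a vision encoder
        img = self.image_proj(image_patches)
        # Queries come from the text stream; keys/values come from the image modality.
        attended, _ = self.cross_attn(query=text_tokens, key=img, value=img)
        # Residual connection plus layer norm, as in a standard transformer sublayer.
        return self.norm(text_tokens + attended)

# Toy usage with random tensors standing in for encoder outputs.
block = CrossModalBlock()
text = torch.randn(2, 16, 768)     # token embeddings from the language model
image = torch.randn(2, 196, 1024)  # patch embeddings from a vision encoder
fused = block(text, image)
print(fused.shape)  # torch.Size([2, 16, 768])
```

In this sketch the fused text representations keep their original shape, so blocks like this can be interleaved with ordinary self-attention layers of an existing LLM; alternative designs instead map the other modality into "soft tokens" that are simply concatenated with the text sequence.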