A technique that enhances language model generation by first retrieving relevant documents from a knowledge base and including them in the model's context. RAG reduces hallucination and enables up-to-date, domain-specific responses without retraining the model.
RAG combines a retrieval system with a generative language model in a two-step process. First, the user's query is used to retrieve relevant documents from a knowledge base via semantic search or keyword matching. Then the retrieved documents are provided as context to the language model, which generates a response grounded in the retrieved information rather than relying solely on knowledge absorbed during training.
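A minimal sketch of this two-step flow is below, using a toy bag-of-words similarity in place of a learned embedding model; the names `embed`, `retrieve`, and `build_prompt` are illustrative assumptions, not any particular library's API:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": raw term counts. A real system would use a
    # learned sentence encoder and a vector index.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Step 1: rank documents by similarity to the query, keep the top k.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, contexts: list[str]) -> str:
    # Step 2: pass the retrieved passages to the generator as context.
    joined = "\n".join(f"- {c}" for c in contexts)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{joined}\n\nQuestion: {query}")

docs = [
    "RAG retrieves documents and feeds them to the model as context.",
    "BM25 is a sparse, term-frequency-based retrieval function.",
    "Chunking splits documents into passages before indexing.",
]
print(build_prompt("What does RAG do?", retrieve("What does RAG do?", docs)))
```

The resulting prompt would then be sent to the language model; the grounding comes from instructing the model to answer only from the supplied context.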
The retrieval component uses dense retrieval (embedding similarity), sparse retrieval (e.g., BM25), or hybrid approaches that combine the two. Retrieved documents are inserted into the LLM's context window alongside the query. Chunking strategies (fixed-size, semantic, recursive) determine how documents are split for indexing, and reranking improves the relevance of the top retrieved contexts. Advanced RAG pipelines add query rewriting, multi-hop retrieval, and self-reflection on retrieval quality.
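For instance, a sketch of fixed-size chunking with overlap; `chunk_fixed` and its character-based sizing are assumptions for illustration, since production systems usually measure chunks in tokens:

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Fixed-size chunking with overlap, measured here in characters.
    # Overlap keeps content that straddles a boundary retrievable from
    # at least one chunk; semantic or recursive strategies would split
    # on sentence or section boundaries instead.
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]

doc = "RAG systems index documents in overlapping chunks. " * 20
chunks = chunk_fixed(doc, size=120, overlap=30)
print(len(chunks), repr(chunks[0]))
```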
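One common way to merge sparse and dense results into a hybrid ranking is Reciprocal Rank Fusion (RRF); the sketch below assumes each retriever has already produced a ranked list of document IDs:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # RRF: each document scores sum(1 / (k + rank)) over every ranked
    # list it appears in; k dampens the influence of top ranks.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d3", "d1", "d2"]   # e.g., a BM25 ranking
dense = ["d1", "d4", "d3"]    # e.g., an embedding-similarity ranking
print(reciprocal_rank_fusion([sparse, dense]))  # d1 and d3 rise to the top
```

RRF is a popular fusion choice because it uses only ranks, so BM25 scores and cosine similarities never need to be calibrated onto a common scale.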