Advanced·5 min read

RAG (Retrieval-Augmented Generation)

Definition

RAG (Retrieval-Augmented Generation) is an architecture pattern that combines information retrieval with LLM generation. Instead of relying solely on the model's parametric (baked-in) knowledge, RAG dynamically retrieves relevant documents from an external knowledge base at query time and injects them into the prompt before generation.

The Core Insight

LLMs know a lot, but their knowledge is frozen at the training cutoff, can be wrong, and cannot include your private data. RAG addresses this by giving the model a "retrieval tool" to look things up at runtime.

```
Without RAG: Model answers from memory → may hallucinate, outdated
With RAG:    Model answers from retrieved documents → grounded, current
```

RAG Architecture

Standard RAG Pipeline

```
1. INDEXING (offline, re-run when documents change):
   Documents → Chunking → Embedding → Vector Store

2. RETRIEVAL (online, per query):
   User Query → Query Embedding → Vector Search → Top-K Chunks

3. GENERATION (online, per query):
   System Prompt + Retrieved Chunks + User Query → LLM → Grounded Answer
```

Phase 1: Indexing

Document Loading

  • Load source documents: PDFs, Word docs, websites, databases, code files
  • Tools: LangChain loaders, LlamaIndex readers, Unstructured.io

Chunking

  • Split documents into smaller pieces (chunks) that fit in the context window
  • Strategies:
    - Fixed-size: every N tokens with M tokens of overlap
    - Recursive character splitting: split on paragraphs, then sentences, then words
    - Semantic chunking: split at semantic boundaries (topic shifts)
    - Document-structure aware: respect headers, sections, code blocks
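
As a rough illustration of the fixed-size strategy, the sketch below slides a window of `chunk_size` tokens forward by `chunk_size - overlap` tokens at a time. The function name `chunk_fixed_size` and the whitespace "tokenizer" are assumptions made for this example; a real pipeline would normally count tokens with the tokenizer of its embedding model.

```python
def chunk_fixed_size(tokens, chunk_size=200, overlap=50):
    """Split a token list into windows of chunk_size tokens.

    Consecutive windows share `overlap` tokens.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace splitting stands in for a real tokenizer (assumption).
words = "a b c d e f g h i j".split()
chunks = chunk_fixed_size(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some tokens twice.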

Embedding

  • Convert each chunk to a dense vector using an embedding model
  • The vector captures the semantic meaning of the chunk
  • Common models: OpenAI text-embedding-3-large, Cohere embed-v3, BGE, E5

Storage in Vector Database

  • Store (chunk_text, embedding_vector, metadata) in a vector DB
  • Popular vector DBs: Pinecone, Weaviate, Chroma, Qdrant, pgvector, FAISS
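
To make the stored record concrete, here is a minimal in-memory sketch of the (chunk_text, embedding_vector, metadata) triple. `ChunkRecord` and `InMemoryVectorStore` are hypothetical names for this example and do not mirror the API of any of the databases listed above; the two-dimensional vectors are made-up values.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    chunk_text: str
    embedding_vector: list
    metadata: dict = field(default_factory=dict)


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector DB: keeps records in a plain list."""

    def __init__(self):
        self.records = []

    def add(self, chunk_text, embedding_vector, **metadata):
        self.records.append(ChunkRecord(chunk_text, list(embedding_vector), metadata))

    def filter(self, **conditions):
        """Return records whose metadata matches every key/value given."""
        return [
            record for record in self.records
            if all(record.metadata.get(k) == v for k, v in conditions.items())
        ]


store = InMemoryVectorStore()
store.add("Refunds are issued within 14 days.", [0.1, 0.9], source="policy.pdf", page=3)
store.add("Install with pip.", [0.8, 0.2], source="README.md")
policy_chunks = store.filter(source="policy.pdf")
```

Keeping metadata next to each vector is what allows a search to be restricted, for example to one source file, before or after the similarity step.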

Phase 2: Retrieval

Query Embedding

  • Embed the user's question using the same embedding model used for indexing
  • Query and chunk embeddings must be in the same vector space

Similarity Search

  • Find the top-K chunks most similar to the query embedding
  • Default metric: cosine similarity
  • K is typically 3–10 chunks
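
The similarity search step can be sketched as a brute-force scan: compute cosine similarity between the query vector and every chunk vector, then keep the K highest scores. The helper names and the three-dimensional toy vectors below are assumptions for illustration; real embeddings have hundreds or thousands of dimensions, and vector databases usually replace the full scan with an approximate nearest-neighbor index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=3):
    """indexed: list of (chunk_text, vector). Returns the k best (score, chunk_text) pairs."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors chosen by hand (assumption), not produced by an embedding model.
indexed = [
    ("Refund policy", [0.9, 0.1, 0.0]),
    ("Shipping times", [0.1, 0.9, 0.0]),
    ("Return window", [0.8, 0.2, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], indexed, k=2)
```

With these toy vectors, "Refund policy" and "Return window" point in nearly the same direction as the query and are returned ahead of "Shipping times".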

Retrieval Strategies

| Strategy | Description | Best For |
|----------|-------------|---------|
| Semantic (dense) | Embedding-based similarity | Conceptual questions |
| Keyword (sparse, BM25) | Lexical term matching with TF-IDF-style scoring | Exact term lookup |
| Hybrid | Combine dense + sparse results, e.g. with Reciprocal Rank Fusion (RRF) | General purpose |
| Contextual compression | Re-rank + compress retrieved chunks | Precision |
| Parent-child | Retrieve child, return parent chunk | Better coherence |
| Multi-query | Generate N query variants, retrieve for each | Recall |
| HyDE | Generate a hypothetical answer, retrieve chunks similar to it | Complex queries |
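
For the hybrid row, Reciprocal Rank Fusion merges ranked lists using only rank positions: each document receives the sum of 1 / (k + rank) over the lists in which it appears. The sketch below assumes the dense and sparse retrievers have already returned ranked document IDs; the IDs are invented, and `k=60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists containing it,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_ranking = ["doc_a", "doc_b", "doc_c"]   # e.g. from embedding search
sparse_ranking = ["doc_b", "doc_d", "doc_a"]  # e.g. from BM25
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Because only ranks are used, the dense and sparse scores never need to be normalized onto a common scale, which is one reason this fusion method is popular for hybrid retrieval.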

Phase 3: Generation (Augmented)

```
System Prompt: "Answer based only on the context below. If not found, say so."

Context:
[Chunk 1: relevant passage from doc A]
[Chunk 2: relevant passage from doc B]
[Chunk 3: relevant passage from doc C]

Question: [user's query]

Answer:
```
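
Assembling that template in code might look like the following sketch. `build_prompt` and `SYSTEM_PROMPT` are illustrative names, and the example chunks are invented; many chat APIs accept the system prompt as a separate message, in which case only the context and question would be concatenated.

```python
SYSTEM_PROMPT = "Answer based only on the context below. If not found, say so."


def build_prompt(chunks, question):
    """Join retrieved chunks and the user question into one grounded prompt."""
    context = "\n\n".join(
        f"[Chunk {i}: {text}]" for i, text in enumerate(chunks, start=1)
    )
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )


prompt = build_prompt(
    ["Refunds are issued within 14 days.", "Refunds go to the original payment method."],
    "How long do refunds take?",
)
print(prompt)
```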

Advanced RAG Techniques

Re-ranking

  • After initial retrieval, use a cross-encoder to re-rank chunks by relevance
  • Cross-encoders compare query + chunk together (slower but more accurate)
  • Models: Cohere Rerank, BAAI/bge-reranker; ColBERT (a late-interaction model rather than a cross-encoder) is a common alternative
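
The two-stage shape of re-ranking can be sketched without a model: take the candidates from the first retrieval pass, score each (query, chunk) pair jointly, and keep the best few. In the sketch below, `toy_overlap_score` is a deliberately crude placeholder that counts shared words. It is an assumption standing in for a real cross-encoder and shows only where such a model would plug in.

```python
def rerank(query, chunks, score_fn, top_n=3):
    """Score each (query, chunk) pair jointly and keep the top_n chunks."""
    ordered = sorted(chunks, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return ordered[:top_n]


def toy_overlap_score(query, chunk):
    """Placeholder scorer (assumption): fraction of query words found in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / max(len(query_words), 1)


candidates = [
    "shipping takes five days",
    "refund requests take ten days",
    "how refunds work and how long refunds take",
]
best = rerank("how long do refunds take", candidates, toy_overlap_score, top_n=2)
```

Because the scorer sees the query and the chunk together, it runs once per candidate at query time, which is why re-ranking is usually applied to a short list rather than to the whole index.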

Query Transformation

  • Query rewriting: rephrase the query for better retrieval
  • HyDE: generate a hypothetical answer document, embed it, and retrieve chunks similar to it
  • Multi-query: generate multiple query variants to increase recall

Self-RAG

  • The model decides whether to retrieve (retrieval is not always necessary)
  • After retrieval, it critiques the relevance of the retrieved documents
  • More efficient for mixed workloads (some queries need retrieval, some don't)

Agentic RAG

  • Agent iteratively retrieves and reasons
  • "Read this chunk → need more info → retrieve again → synthesize"
  • Better for complex multi-hop questions
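
The retrieve, reason, retrieve-again loop can be sketched as below. The three callables (`retrieve`, `needs_more`, `synthesize`) are hypothetical; in a real system each would wrap a search call or an LLM call. The corpus and names in the demo are made-up toy data so the loop can run without a model.

```python
def agentic_answer(question, retrieve, needs_more, synthesize, max_steps=3):
    """Retrieve, check whether the evidence suffices, and retrieve again if not.

    retrieve(query) -> list of chunks
    needs_more(question, evidence) -> follow-up query, or None when done
    synthesize(question, evidence) -> final answer
    """
    evidence = []
    query = question
    for _ in range(max_steps):
        for chunk in retrieve(query):
            if chunk not in evidence:
                evidence.append(chunk)
        query = needs_more(question, evidence)
        if query is None:
            break  # the agent judges the evidence sufficient
    return synthesize(question, evidence)


# Toy stand-ins (assumptions) for the search index and the reasoning model.
corpus = {
    "Who founded Acme?": ["Acme was founded by Dana Lee."],
    "Where was Dana Lee born?": ["Dana Lee was born in Oslo."],
}
answer = agentic_answer(
    "Where was the founder of Acme born?",
    retrieve=lambda q: corpus.get(q, []),
    needs_more=lambda q, ev: (
        "Who founded Acme?" if not ev
        else "Where was Dana Lee born?" if len(ev) == 1
        else None
    ),
    synthesize=lambda q, ev: " ".join(ev),
)
```

The `max_steps` cap is a design choice worth keeping in any real variant: without it, an agent that never judges its evidence sufficient would loop indefinitely.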

RAG Evaluation Metrics

| Metric | Measures | Tool |
|--------|---------|------|
| Context Precision | Are retrieved chunks relevant? | RAGAS |
| Context Recall | Were all relevant chunks retrieved? | RAGAS |
| Faithfulness | Does answer match retrieved context? | RAGAS, TruLens |
| Answer Relevance | Does answer address the question? | RAGAS |
| End-to-End Accuracy | Is the final answer correct? | Human eval |
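
To give a feel for the first two rows, the sketch below computes simplified, set-based versions of context precision and recall from hand-labelled chunk IDs. This is an assumption-laden simplification and not how RAGAS computes its metrics (evaluation tools of that kind typically rely on LLM judgments); the chunk IDs are invented.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Share of relevant chunks that were retrieved."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for chunk_id in relevant_ids if chunk_id in retrieved_ids)
    return hits / len(relevant_ids)


retrieved = ["c1", "c4", "c7", "c9"]  # what the retriever returned
relevant = {"c1", "c2", "c7"}         # hand-labelled ground truth
precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
```

Here two of the four retrieved chunks are relevant (precision 0.5) and two of the three relevant chunks were found (recall of about 0.67). Raising K tends to raise recall and lower precision.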

RAG vs. Fine-Tuning

| Aspect | RAG | Fine-Tuning |
|--------|-----|-------------|
| Knowledge updates | Easy (add to vector DB) | Requires retraining |
| Custom facts | Excellent | Good |
| Private data | Excellent | Possible but risky |
| Cost | Retrieval infra | GPU compute |
| Hallucination | Lower (grounded) | Lower (domain knowledge) |
| Format/style | Limited | Strong |

Rule of thumb: RAG for knowledge, fine-tuning for behavior/style.

RAG Frameworks and Tools

| Tool | Notes |
|------|-------|
| LangChain | Full RAG pipeline, many integrations |
| LlamaIndex | Document-focused RAG, complex query engines |
| Haystack | Enterprise RAG |
| AWS Bedrock Knowledge Bases | Managed RAG on AWS |
| Azure AI Search | Enterprise hybrid retrieval |
| RAGAS | RAG evaluation framework |

Common RAG Failure Modes

| Failure | Cause | Fix |
|---------|-------|-----|
| Missing relevant chunks | Poor chunking or embedding | Better chunking strategy, hybrid retrieval |
| Model ignores retrieved context | Weak grounding instructions | Stronger system prompt constraints |
| Retrieved wrong chunks | Query-document mismatch | Query rewriting, re-ranking |
| Long chunks overwhelm context | K too large or chunks too big | Smaller chunks, contextual compression |
| Stale knowledge base | Index not updated | Regular re-indexing pipeline |

Related Concepts

  • Embeddings, Vector Database, Grounding, Hallucination, Context Window, Chunking, Retrieval, LLM

Go Deeper With Live Instruction

This topic is covered in depth in our LLM engineering program (Session 7).