RAG Explained (ML Interview)

How Retrieval-Augmented Generation grounds an LLM in external knowledge — the retrieve-augment-generate pipeline, chunking, retrieval quality, and re-rankers — with code.

1. Big Picture

Retrieval-Augmented Generation (RAG) gives a language model access to external knowledge at inference time. Instead of relying only on what's baked into its weights, the system retrieves relevant documents for a query, augments the prompt with them, and lets the model generate an answer grounded in that context. This is how you add private, fresh, or domain-specific knowledge without retraining, and it reduces hallucination because the model can answer from, and cite, the retrieved text instead of relying on parametric memory alone.

RAG is the dominant pattern for production LLM apps over proprietary data. The frame to hold: retrieve relevant context by semantic similarity, stuff it into the prompt, then generate. Interviewers probe why RAG over fine-tuning, how chunking and retrieval quality drive results, and how you evaluate it.

2. Intuition + Visual

Offline, you chunk your documents and embed each chunk into a vector, stored in a vector database. At query time, you embed the user's question into the same space, find the nearest chunks (the ones most semantically similar), and paste them into the prompt as context. The model answers using that retrieved evidence rather than its parametric memory alone.
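The offline chunking step can be sketched as a sliding window over words. This is only an illustration under stated assumptions: the function name `chunk_words`, the word-based splitting, and the `size` and `overlap` values are hypothetical choices for the example, not something the pipeline above prescribes. Real systems often split on sentences, headings, or tokens instead.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap                  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break                          # this window already reached the end of the text
    return chunks

# Toy input: 12 "words" w0..w11, windows of 5 words with 2 words of overlap.
sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, size=5, overlap=2)
# chunks == ["w0 w1 w2 w3 w4", "w3 w4 w5 w6 w7", "w6 w7 w8 w9 w10", "w9 w10 w11"]
```

The overlap is one simple way to keep a sentence that straddles a boundary visible in at least one chunk; each chunk would then be embedded and stored in the vector database.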

The retriever's job is to surface the right evidence; the generator's job is to synthesize it. If retrieval misses, generation can't recover — "garbage in, garbage out."

3. The Math

Each document chunk $d_i$ and the query $q$ are mapped to vectors $e_{d_i}$ and $e_q$ by an embedding model. Relevance is similarity in that space, typically cosine or dot product:

$$\text{score}(q, d_i) = \frac{e_q \cdot e_{d_i}}{\lVert e_q \rVert\, \lVert e_{d_i} \rVert}$$

Retrieve the top-$k$ chunks by score:

$$\text{Retrieved} = \operatorname{top\text{-}k}_i \; \text{score}(q, d_i)$$

then build the prompt as $[\text{instructions};\, d_{(1)}, \dots, d_{(k)};\, q]$ and decode. Two design levers dominate quality: chunking (chunks must be small enough to be specific but large enough to be self-contained) and retrieval quality (often improved with hybrid keyword + vector search and a re-ranker that re-scores the top candidates with a stronger model). Approximate nearest-neighbor indexes (HNSW, IVF) make top-$k$ search fast over millions of vectors.
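The text does not say how keyword and vector results are combined in hybrid search. One common choice, swapped in here purely as an illustration, is reciprocal rank fusion (RRF): each ranked list contributes $1/(c + \text{rank})$ to a document's fused score. The document ids, the two input rankings, and the constant `c = 60` are assumptions made for this sketch.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Fuse ranked lists of document ids; each list adds 1 / (c + rank) to a document's score."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (c + rank)
    # Highest fused score first; ties are broken by doc id so the order is stable.
    return sorted(fused, key=lambda d: (-fused[d], d))

keyword_hits = ["d3", "d1", "d7"]   # hypothetical keyword-search ranking
vector_hits = ["d1", "d5", "d3"]    # hypothetical embedding-similarity ranking
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
# fused == ["d1", "d3", "d5", "d7"]: documents found by both retrievers rise to the top
```

A re-ranker would typically sit after this step, re-scoring only the first few fused candidates with a stronger and slower model before they go into the prompt.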

4. Implementation

```python
import zlib

import numpy as np

def embed(texts):
    """Stand-in for a real embedding model: one deterministic pseudo-random vector per text."""
    # crc32 is stable across runs, unlike the built-in hash() on strings,
    # so the same text always maps to the same vector.
    return np.stack([
        np.random.default_rng(zlib.crc32(t.encode("utf-8"))).normal(size=384)
        for t in texts
    ])

def cosine(a, b):
    """Cosine similarity between every row of a and every row of b."""
    return (a @ b.T) / (np.linalg.norm(a, axis=1)[:, None] * np.linalg.norm(b, axis=1))

docs = ["Batch norm normalizes per feature.", "Attention scales by sqrt(d_k).", "RAG retrieves context."]
doc_vecs = embed(docs)                          # offline: index the chunks

def retrieve(query: str, k: int = 2) -> list[str]:
    q_vec = embed([query])
    scores = cosine(q_vec, doc_vecs)[0]         # similarity to every chunk
    top = np.argsort(scores)[::-1][:k]          # indices of the k highest scores
    return [docs[i] for i in top]

# The stand-in vectors are random, so which chunks come back is not meaningful;
# with a real embedding model the attention chunk should rank first.
query = "how does attention scale scores?"
context = retrieve(query)
prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer using only the context."
```
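Because the stand-in embeddings are random, the similarity scores in the code above cannot be checked by eye. A small hand-checkable example, using made-up 2-D vectors in place of real embeddings, shows the same cosine-and-top-$k$ step with numbers you can verify on paper. The vectors and the choice of $k = 2$ are assumptions for illustration only.

```python
import numpy as np

e_q = np.array([1.0, 0.0])                 # toy query embedding
e_d = np.array([[1.0, 0.0],                # chunk 0: same direction as the query
                [0.0, 1.0],                # chunk 1: orthogonal to the query
                [1.0, 1.0]])               # chunk 2: 45 degrees from the query

# Cosine similarity of the query against every chunk.
scores = (e_d @ e_q) / (np.linalg.norm(e_d, axis=1) * np.linalg.norm(e_q))
top_k = np.argsort(scores)[::-1][:2]       # indices of the 2 highest scores

# scores is approximately [1.0, 0.0, 0.7071] (the last value is 1 / sqrt(2)),
# so top_k selects chunk 0 first and chunk 2 second.
```

Note that cosine ignores vector length: chunk 2 is longer than chunk 0 but scores lower because only its direction matters.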

5. Interview Questions

Conceptual: When would you use RAG instead of fine-tuning? (When knowledge is large, changing, or proprietary, and you need grounding/citations — RAG updates by changing the index, not the weights.)
Implementation: Why does chunking strategy matter so much? (Chunks too large dilute relevance and waste context; too small lose the surrounding meaning needed to answer — both hurt retrieval and generation.)
Applied: How does RAG reduce hallucination? (It supplies the model with the actual source text to answer from, so it's not forced to invent facts from parametric memory.)
Systems-level: What's a re-ranker and why add one? (A stronger model that re-scores the top-k retrieved candidates for relevance — cheap vector search casts a wide net, the re-ranker sharpens precision.)
Failure modes: What happens when retrieval fails, and how do you catch it? (Generation is grounded in wrong or irrelevant context, so the model can give a confident but wrong answer; you catch it by evaluating retrieval, e.g. recall@k, and generation, e.g. faithfulness/groundedness, separately.)
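The retrieval half of that evaluation can be sketched with recall@k: of the chunks labeled relevant for a query, what fraction appears in the top $k$ retrieved? The chunk ids and gold labels below are hypothetical; in practice they would come from a labeled evaluation set.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear among the top-k retrieved ids."""
    if not relevant:
        raise ValueError("relevant set must be non-empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

retrieved = ["c4", "c1", "c9", "c2"]        # hypothetical ranked retriever output
relevant = {"c1", "c2"}                     # hypothetical gold labels for this query

r2 = recall_at_k(retrieved, relevant, k=2)  # 0.5: only c1 is in the top 2
r4 = recall_at_k(retrieved, relevant, k=4)  # 1.0: both c1 and c2 are in the top 4
```

Averaging this over many labeled queries gives a retrieval score that is independent of the generator, which is what lets you tell a retrieval miss apart from a generation error.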

6. Retrieval Check

Without looking: draw the retrieve→augment→generate pipeline, write the similarity-and-top-k retrieval step, and name the two levers (chunking, retrieval quality) that most affect RAG quality. Check against Stages 2–3.
