Retrieval-Augmented Generation (RAG)
How Retrieval-Augmented Generation grounds an LLM in external knowledge — the retrieve-augment-generate pipeline, chunking, retrieval quality, and re-rankers — with code.
Retrieval-Augmented Generation (RAG) gives a language model access to external knowledge at inference time. Instead of relying only on what's baked into its weights, the system retrieves relevant documents for a query, augments the prompt with them, and lets the model generate an answer grounded in that context. This is how you add private, fresh, or domain-specific knowledge without retraining — and it sharply reduces hallucination by giving the model the facts to cite.
RAG is the dominant pattern for production LLM apps over proprietary data. The frame to hold: retrieve relevant context by semantic similarity, stuff it into the prompt, then generate. Interviewers probe why RAG over fine-tuning, how chunking and retrieval quality drive results, and how you evaluate it.
Offline, you chunk your documents and embed each chunk into a vector, stored in a vector database. At query time, you embed the user's question into the same space, find the nearest chunks (the ones most semantically similar), and paste them into the prompt as context. The model answers using that retrieved evidence rather than its parametric memory alone.
flowchart LR
Q["User query"] --> EQ["Embed query"]
EQ --> R["Vector DB: top-k nearest chunks"]
D["Documents -> chunk -> embed -> index"] --> R
R --> P["Prompt = query + retrieved chunks"]
P --> G["LLM generates grounded answer"]
The retriever's job is to surface the right evidence; the generator's job is to synthesize it. If retrieval misses, generation can't recover — "garbage in, garbage out."
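The offline chunking step can be sketched as follows. This is a minimal illustration, assuming fixed-size word windows with a small overlap; the function name `chunk_words` and the `size` and `overlap` parameters are hypothetical, and real systems often split on sentences, headings, or tokens instead.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Toy input: twelve placeholder words, windows of 5 with 2 words of overlap.
sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, size=5, overlap=2)
for c in chunks:
    print(c)
```

The overlap is one simple way to keep a sentence that straddles a boundary visible in both neighboring chunks, at the cost of indexing some text twice.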
Each document chunk $d_i$ and the query $q$ are mapped to vectors by an embedding model $f$. Relevance is similarity in that space, typically cosine or dot product:
$$\mathrm{sim}(q, d_i) = \frac{f(q) \cdot f(d_i)}{\lVert f(q) \rVert \, \lVert f(d_i) \rVert}$$
Retrieve the top-$k$ chunks by score:
$$\mathcal{R}_k(q) = \operatorname*{arg\,top\text{-}k}_{i} \; \mathrm{sim}(q, d_i)$$
then build the prompt as $[\mathcal{R}_k(q);\, q]$ and decode. Two design levers dominate quality: chunking (chunks must be small enough to be specific but large enough to be self-contained) and retrieval quality (often improved with hybrid keyword + vector search and a re-ranker that re-scores the top candidates with a stronger model). Approximate nearest-neighbor indexes (HNSW, IVF) make top-$k$ search fast over millions of vectors.
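Hybrid search and re-ranking can be sketched in a few lines. The example below is an illustration under stated assumptions, not a prescribed method: it fuses a keyword ranking and a vector ranking with reciprocal rank fusion (one common fusion choice, not named in the text above), and it uses a word-overlap function as a stand-in for the stronger re-ranking model (commonly a cross-encoder in practice). The rankings, `overlap_score`, and the constant `c = 60` are hypothetical values chosen for the demo.

```python
import re


def reciprocal_rank_fusion(rankings: list[list[int]], c: int = 60) -> list[int]:
    """Fuse ranked lists of chunk ids: each list contributes 1 / (c + rank) per id."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (c + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def overlap_score(query: str, doc: str) -> int:
    """Stand-in re-ranker score: number of distinct words shared by query and doc."""
    tokenize = lambda s: set(re.findall(r"\w+", s.lower()))
    return len(tokenize(query) & tokenize(doc))


def rerank(query: str, candidates: list[int], docs: list[str], k: int) -> list[int]:
    """Re-score the fused candidates with the (stand-in) stronger scorer and keep k."""
    ordered = sorted(candidates, key=lambda cid: overlap_score(query, docs[cid]), reverse=True)
    return ordered[:k]


docs = [
    "Batch norm normalizes per feature.",
    "Attention scales by sqrt(d_k).",
    "RAG retrieves context.",
]
keyword_ranking = [1, 0, 2]  # hypothetical output of a keyword retriever
vector_ranking = [2, 1, 0]   # hypothetical output of a vector retriever

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
final = rerank("how does attention scale scores?", fused, docs, k=2)
print(fused, final)
```

Fusing on ranks rather than raw scores sidesteps the fact that keyword and vector scores live on different scales; the re-ranker then spends its heavier per-pair cost on only the short fused list.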
import re
import zlib

import numpy as np

DIM = 384

def embed(texts):
    # Stand-in for a real embedding model: a hashed bag-of-words, so texts that
    # share words land near each other. A trained model would replace this.
    vecs = np.zeros((len(texts), DIM))
    for row, text in enumerate(texts):
        for word in re.findall(r"\w+", text.lower()):
            vecs[row, zlib.crc32(word.encode()) % DIM] += 1.0
    return vecs

def cosine(a, b):
    # Pairwise cosine similarity between rows of a and rows of b; the small
    # epsilon avoids dividing by zero for an all-zero vector.
    norms = np.linalg.norm(a, axis=1)[:, None] * np.linalg.norm(b, axis=1)[None, :]
    return (a @ b.T) / (norms + 1e-12)

docs = ["Batch norm normalizes per feature.", "Attention scales by sqrt(d_k).", "RAG retrieves context."]
doc_vecs = embed(docs)  # offline: index the chunks

def retrieve(query: str, k: int = 2) -> list[str]:
    q_vec = embed([query])
    scores = cosine(q_vec, doc_vecs)[0]  # similarity to every chunk
    top = np.argsort(scores)[::-1][:k]   # top-k indices, best first
    return [docs[i] for i in top]

query = "how does attention scale scores?"
context = retrieve(query)
prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer using only the context."
- Conceptual: When would you use RAG instead of fine-tuning? (When knowledge is large, changing, or proprietary, and you need grounding/citations — RAG updates by changing the index, not the weights.)
- Implementation: Why does chunking strategy matter so much? (Chunks too large dilute relevance and waste context; too small lose the surrounding meaning needed to answer — both hurt retrieval and generation.)
- Applied: How does RAG reduce hallucination? (It supplies the model with the actual source text to answer from, so it's not forced to invent facts from parametric memory.)
- Systems-level: What's a re-ranker and why add one? (A stronger model that re-scores the top-k retrieved candidates for relevance — cheap vector search casts a wide net, the re-ranker sharpens precision.)
- Failure modes: What happens when retrieval fails, and how do you catch it? (Generation is grounded in wrong/irrelevant context and answers confidently wrong; you evaluate retrieval (recall@k) and generation (faithfulness/groundedness) separately.)
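The retrieval half of that evaluation can be sketched as a recall@k computation. This is a minimal sketch, assuming you have hand-labeled sets of relevant chunk ids per query; the function name `recall_at_k` and the toy data are hypothetical, and generation-side faithfulness needs a separate check that is not shown here.

```python
def recall_at_k(retrieved: list[list[int]], relevant: list[set[int]], k: int) -> float:
    """Mean fraction of each query's labeled relevant chunks found in its top-k results."""
    per_query = []
    for ranked, gold in zip(retrieved, relevant):
        if not gold:
            continue  # skip queries with no labeled relevant chunk
        hits = len(set(ranked[:k]) & gold)
        per_query.append(hits / len(gold))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy labels: two queries, ranked chunk ids from the retriever and the gold sets.
retrieved = [[1, 2, 0], [0, 2, 1]]
relevant = [{1}, {1, 2}]

r1 = recall_at_k(retrieved, relevant, k=1)
r2 = recall_at_k(retrieved, relevant, k=2)
print(r1, r2)
```

A low recall@k points at chunking or the retriever, since the generator never saw the evidence; a high recall@k with wrong answers points at the generation step instead.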
Without looking: draw the retrieve→augment→generate pipeline, write the similarity-and-top-k retrieval step, and name the two levers (chunking, retrieval quality) that most affect RAG quality. Check against Stages 2–3.