Concepts

ML & SWE interview concepts, explained at depth.

Each page is a full 6-stage walkthrough — intuition, the math, runnable code, the questions interviewers ask, and a retrieval check. Free to read. Drill follow-ups in a live session when you want to go further.

Deep Learning

Attention Mechanisms

How scaled dot-product and multi-head attention work — the soft key-value lookup at the heart of every Transformer — with the math, runnable PyTorch, and calibrated interview questions.

9 min read

Batch Normalization

What Batch Norm normalizes and why, the critical train-vs-inference distinction, BN vs. Layer Norm, with the math and a from-scratch PyTorch implementation.

8 min read

Transformer Architecture

The Transformer block from the ground up — self-attention plus a position-wise feed-forward network, residuals and LayerNorm, and the encoder/decoder configurations — with the math, PyTorch, and calibrated interview questions.

9 min read

Backpropagation

Backpropagation as reverse-mode autodiff — the chain rule over the computational graph, the gradients for a linear layer and ReLU, and why gradients vanish — with a runnable manual backward pass.

9 min read

Positional Encoding

Why self-attention needs position information and how to give it — sinusoidal, learned, and rotary (RoPE) encodings — with the math, PyTorch, and interview questions.

7 min read

Classical ML

Bias-Variance Tradeoff

The exact decomposition of expected squared error into squared bias, variance, and irreducible noise, and how to diagnose under- vs. overfitting, with intuition, math, and a runnable demo.

8 min read

Logistic Regression

The workhorse linear classifier: a sigmoid of a linear score, trained with cross-entropy. Covers why MSE is a poor fit for the loss, the decision boundary, and the softmax extension, with from-scratch code.

8 min read

Principal Component Analysis (PCA)

Principal Component Analysis for dimensionality reduction — the directions of maximal variance via eigenvectors/SVD, choosing k by explained variance, and why scaling matters — with code.

8 min read

Gradient Boosting & XGBoost

How gradient boosting builds a strong model from sequential weak trees, each fit to the negative gradient of the loss: boosting vs. bagging, the learning rate, and what makes XGBoost so effective in practice, with code.

8 min read

Optimization

Gradient Descent (SGD, Momentum, Adam)

SGD, momentum, and Adam explained: the update rules, why mini-batches are preferred over full-batch and single-example updates, Adam's bias correction, and when plain SGD generalizes better, with from-scratch implementations.

8 min read

LLMs

LoRA and PEFT

LoRA (Low-Rank Adaptation) explained: freeze the base model and train a tiny low-rank update ΔW = BA. Covers why it works, the parameter savings, QLoRA, and merging the update into the base weights so inference adds no latency, with code.

8 min read

RLHF (Reinforcement Learning from Human Feedback)

Reinforcement Learning from Human Feedback explained — SFT, a reward model from preference comparisons, and PPO with a KL penalty — plus reward hacking and how DPO simplifies the pipeline.

9 min read

KV Cache

How the KV cache makes autoregressive LLM decoding affordable — what it stores and why reuse is valid, the memory cost, why decoding is memory-bandwidth-bound, and how MQA/GQA shrink it — with code.

8 min read

Tokenization & BPE

How text becomes tokens: Byte-Pair Encoding, why subword tokenization beats word- and character-level schemes, the training algorithm, and why token count drives context length and cost, with code.

7 min read

Embeddings

How discrete tokens become dense vectors that capture meaning — static (word2vec) vs. contextual embeddings, cosine similarity, and negative sampling — with code.

7 min read

Retrieval-Augmented Generation (RAG)

How Retrieval-Augmented Generation grounds an LLM in external knowledge — the retrieve-augment-generate pipeline, chunking, retrieval quality, and re-rankers — with code.

8 min read

Temperature, Top-k & Top-p Sampling

How LLMs choose the next token — temperature, top-k, top-p (nucleus), and beam search — the math, the tradeoffs, and a runnable PyTorch sampler.

7 min read