Concepts
ML & SWE interview concepts, explained at depth.
Each page is a full 6-stage walkthrough — intuition, the math, runnable code, the questions interviewers ask, and a retrieval check. Free to read. Drill follow-ups in a live session when you want to go further.
Deep Learning
Attention Mechanisms
How scaled dot-product and multi-head attention work — the soft key-value lookup at the heart of every Transformer — with the math, runnable PyTorch, and calibrated interview questions.
9 min read
Batch Normalization
What Batch Norm normalizes and why, the critical train-vs-inference distinction, BN vs. Layer Norm, with the math and a from-scratch PyTorch implementation.
8 min read
Transformer Architecture
The Transformer block from the ground up — self-attention plus a position-wise feed-forward network, residuals and LayerNorm, and the encoder/decoder configurations — with the math, PyTorch, and calibrated interview questions.
9 min read
Backpropagation
Backpropagation as reverse-mode autodiff — the chain rule over the computational graph, the gradients for a linear layer and ReLU, and why gradients vanish — with a runnable manual backward pass.
9 min read
Positional Encoding
Why self-attention needs position information and how to give it — sinusoidal, learned, and rotary (RoPE) encodings — with the math, PyTorch, and interview questions.
7 min read
Classical ML
Bias-Variance Tradeoff
The exact decomposition of expected squared error into squared bias, variance, and irreducible noise, plus how to diagnose under- vs. overfitting, with intuition, math, and a runnable demo.
8 min read
Logistic Regression
The workhorse linear classifier: a sigmoid of a linear score, trained with cross-entropy. Covers why cross-entropy is used rather than MSE, the decision boundary, and the softmax extension, with from-scratch code.
8 min read
Principal Component Analysis (PCA)
Principal Component Analysis for dimensionality reduction — the directions of maximal variance via eigenvectors/SVD, choosing k by explained variance, and why scaling matters — with code.
8 min read
Gradient Boosting & XGBoost
How gradient boosting builds a strong model from sequential weak trees fit to the negative gradient: boosting vs. bagging, the learning rate, and what makes XGBoost effective in practice, with code.
8 min read
Optimization
LLMs
LoRA and PEFT
LoRA (Low-Rank Adaptation) explained: freeze the base model and train a small low-rank update ΔW = BA. Covers why it works, the parameter savings, QLoRA, and merging the adapter into the base weights for zero added inference latency, with code.
8 min read
RLHF (Reinforcement Learning from Human Feedback)
Reinforcement Learning from Human Feedback explained — SFT, a reward model from preference comparisons, and PPO with a KL penalty — plus reward hacking and how DPO simplifies the pipeline.
9 min read
KV Cache
How the KV cache makes autoregressive LLM decoding affordable — what it stores and why reuse is valid, the memory cost, why decoding is memory-bandwidth-bound, and how MQA/GQA shrink it — with code.
8 min read
Tokenization & BPE
How text becomes tokens — Byte-Pair Encoding, why subword beats word- and character-level, the training algorithm, and why token count drives context and cost — with code.
7 min read
Embeddings
How discrete tokens become dense vectors that capture meaning — static (word2vec) vs. contextual embeddings, cosine similarity, and negative sampling — with code.
7 min read
Retrieval-Augmented Generation (RAG)
How Retrieval-Augmented Generation grounds an LLM in external knowledge — the retrieve-augment-generate pipeline, chunking, retrieval quality, and re-rankers — with code.
8 min read
Temperature, Top-k & Top-p Sampling
How LLMs choose the next token — temperature, top-k, top-p (nucleus), and beam search — the math, the tradeoffs, and a runnable PyTorch sampler.
7 min read