Backpropagation
Backpropagation as reverse-mode autodiff — the chain rule over the computational graph, the gradients for a linear layer and ReLU, and why gradients vanish — with a runnable manual backward pass.
Backpropagation is the algorithm that computes the gradient of the loss with respect to every parameter, efficiently, by applying the chain rule backwards through the network's computational graph. It's reverse-mode automatic differentiation: one forward pass to compute the output and cache intermediate values, one backward pass to propagate gradients from the loss back to the inputs and weights.
The reason it matters: a modern network can have billions of parameters, and backprop computes all of their gradients for a small constant multiple of the cost of one forward pass, rather than one pass per parameter. Every gradient-based optimizer (SGD, Adam) consumes the gradients backprop produces. The frame to hold: backprop is just the chain rule, organized to reuse shared sub-computations by walking the graph in reverse. Interviewers probe whether you understand why reverse-mode, what gets cached, and how this connects to vanishing/exploding gradients.
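To make the pass-count argument concrete, here is a minimal sketch comparing the two modes on a toy function. The function `f(x) = sum(tanh(W x))` and all sizes are assumptions chosen for illustration; they do not come from any particular network.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6                                  # number of inputs (illustrative)
W = rng.normal(size=(4, n))
x = rng.normal(size=n)

# Forward evaluation of f(x) = sum(tanh(W x)), caching t for both modes.
t = np.tanh(W @ x)

# Forward mode: push one tangent direction e_i through the graph per input.
grad_forward = np.zeros(n)
passes_forward = 0
for i in range(n):
    e = np.zeros(n)
    e[i] = 1.0
    dh = W @ e                         # tangent of h = W x
    dt = (1 - t ** 2) * dh             # tangent of tanh(h)
    grad_forward[i] = dt.sum()         # tangent of the scalar output
    passes_forward += 1

# Reverse mode: pull a single upstream gradient back through the graph.
dt_bar = np.ones_like(t)               # dL/dt for L = sum(t)
dh_bar = dt_bar * (1 - t ** 2)         # through tanh
grad_reverse = W.T @ dh_bar            # through the matrix multiply
passes_reverse = 1

print(passes_forward, passes_reverse)
```

Both modes produce the same gradient; the difference is that forward mode needs one sweep per input dimension while reverse mode needs one sweep per scalar output.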
Every operation in the network is a node in a graph with known local derivatives. The forward pass computes outputs and stores the activations needed later. The backward pass starts with $\partial L / \partial L = 1$ at the loss and multiplies by each node's local Jacobian as it moves toward the inputs, accumulating gradients. Because many parameters share downstream paths, working backward from the output computes each shared factor once, whereas forward mode would recompute it for every input.
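The graph walk described above can be sketched with a tiny scalar autodiff engine. The `Value` class and its method names are hypothetical, written only for this illustration; real frameworks work on tensors and are far more elaborate.

```python
class Value:
    """Hypothetical scalar graph node: a value, its gradient, and a local backward rule."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None   # replaced by the op that creates the node

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad               # d(a+b)/da = 1
            other.grad += out.grad              # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # d(ab)/da = b
            other.grad += self.data * out.grad  # d(ab)/db = a
        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,))
        def _backward():
            self.grad += (1.0 if self.data > 0 else 0.0) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph, then apply local rules in reverse.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0                         # dL/dL = 1 at the loss
        for node in reversed(order):
            node._backward()


w, x, b = Value(2.0), Value(3.0), Value(-1.0)
loss = (w * x + b).relu() * w    # w feeds two paths, so its gradients accumulate
loss.backward()
print(w.grad, x.grad, b.grad)    # 11.0 4.0 2.0
```

Here `loss = w * (w*x + b)` on the active ReLU branch, so the analytic gradients are `2wx + b = 11`, `w² = 4`, and `w = 2`. The `+=` in every local rule is what accumulates gradient from multiple downstream paths.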
```mermaid
flowchart LR
    X["x"] --> Z["z = Wx + b"]
    Z --> A["a = ReLU(z)"]
    A --> Lhat["ŷ = Va"]
    Lhat --> L["Loss"]
    L -. "dL/dŷ" .-> Lhat
    Lhat -. "dL/da, dL/dV" .-> A
    A -. "dL/dz" .-> Z
    Z -. "dL/dW, dL/dx" .-> X
```
The dashed arrows are the backward pass: each carries a gradient computed from the one downstream of it times a local derivative.
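The chain rule is the whole engine. For a scalar loss $L$ and a layer computing $y = f(x)$, given the upstream gradient $\partial L / \partial y$:

$$\frac{\partial L}{\partial x} = \left(\frac{\partial y}{\partial x}\right)^{\top} \frac{\partial L}{\partial y}$$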
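In practice the Jacobian is rarely materialized; each op implements the Jacobian-transpose-times-vector product directly. A minimal sketch, using softmax as an assumed example layer (it is not part of the network discussed here) because its Jacobian has a well-known closed form:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=5)
g = rng.normal(size=5)               # made-up upstream gradient dL/dy

s = softmax(x)

# Explicit route: build the full Jacobian dy/dx, then apply the chain rule.
J = np.diag(s) - np.outer(s, s)      # J[i, j] = s_i * (delta_ij - s_j)
dx_full = J.T @ g

# Backprop route: the same product without forming J (O(n) instead of O(n^2)).
dx_vjp = s * (g - g @ s)

print(np.allclose(dx_full, dx_vjp))
```

The two results agree; the second form is what a backward rule for this op would typically compute.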
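For a linear layer $z = Wx + b$ with upstream gradient $\delta = \partial L / \partial z$:

$$\frac{\partial L}{\partial W} = \delta\, x^{\top}, \qquad \frac{\partial L}{\partial b} = \delta, \qquad \frac{\partial L}{\partial x} = W^{\top} \delta$$

With a batch stored as rows ($Z = XW + b$, as is common in NumPy code), the same rules read $X^{\top}\delta$, the column sums of $\delta$, and $\delta W^{\top}$.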
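These three linear-layer gradients can be checked numerically. The sketch below assumes the column-vector convention $z = Wx + b$ and an illustrative loss $L = \tfrac{1}{2}\lVert z - t \rVert^2$ (so $\delta = z - t$); the target `t` and the helper `numeric_grad` are made up for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
x = rng.normal(size=4)
t = rng.normal(size=3)               # illustrative target

def loss_fn():
    z = W @ x + b
    return 0.5 * np.sum((z - t) ** 2)

# Analytic gradients from the linear-layer rules.
delta = (W @ x + b) - t              # dL/dz for this particular loss
dW = np.outer(delta, x)              # dL/dW = delta x^T
db = delta                           # dL/db = delta
dx = W.T @ delta                     # dL/dx = W^T delta

def numeric_grad(p, eps=1e-6):
    """Central-difference gradient of loss_fn w.r.t. array p (perturbed in place)."""
    grad = np.zeros_like(p)
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + eps
        hi = loss_fn()
        p[idx] = old - eps
        lo = loss_fn()
        p[idx] = old                 # restore the original value
        grad[idx] = (hi - lo) / (2 * eps)
    return grad

print(np.allclose(dW, numeric_grad(W), atol=1e-6),
      np.allclose(db, numeric_grad(b), atol=1e-6),
      np.allclose(dx, numeric_grad(x), atol=1e-6))
```

Note that each gradient has the same shape as the quantity it differentiates, which is a quick sanity check when deriving these by hand.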
For an element-wise non-linearity $a = \sigma(z)$, the local Jacobian is diagonal, so gradients flow through as an element-wise product:

$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \odot \sigma'(z)$$
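A short sketch of the element-wise rule for sigmoid and ReLU, with made-up values for `z` and the upstream gradient `da`. It also confirms that multiplying by the diagonal Jacobian is the same as an element-wise product.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 2.0])       # cached pre-activations (made up)
da = np.array([0.1, -0.2, 0.3, 0.4])       # upstream gradient dL/da (made up)

# Sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z)), which never exceeds 0.25.
s = sigmoid(z)
sig_prime = s * (1 - s)
dz_sigmoid = da * sig_prime
same_as_diag = np.allclose(np.diag(sig_prime) @ da, dz_sigmoid)

# ReLU: the derivative is 1 where z > 0 and 0 elsewhere, so it acts as a mask.
dz_relu = da * (z > 0)

print(same_as_diag, sig_prime.max(), dz_relu)
```

Only the last entry of `dz_relu` is non-zero, because it is the only position with `z > 0`.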
Stacking layers multiplies these terms. That product is exactly why gradients vanish (repeated multiplication by factors $< 1$, e.g. saturated sigmoids) or explode (factors $> 1$) in deep nets, which motivates ReLU, residual connections, and normalization.
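The effect of that repeated product is easy to observe. The sketch below is an assumed toy setup (30 bias-free layers of width 64, He-style initialization, an all-ones gradient at the top), not a benchmark; it backpropagates through the same random stack with sigmoid and with ReLU and reports the gradient norm that reaches the input.

```python
import numpy as np

def input_grad_norm(activation, depth=30, width=64, seed=0):
    """Norm of dL/d(input) for a random deep stack; 'activation' is 'relu' or 'sigmoid'."""
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(size=(width, width)) * np.sqrt(2.0 / width) for _ in range(depth)]
    a = rng.normal(size=width)

    zs = []
    for W in Ws:                               # forward pass, caching each z
        z = W @ a
        zs.append(z)
        a = np.maximum(0, z) if activation == "relu" else 1 / (1 + np.exp(-z))

    g = np.ones(width)                         # pretend dL/da at the top is all ones
    for W, z in zip(reversed(Ws), reversed(zs)):
        if activation == "relu":
            local = (z > 0).astype(float)      # ReLU'(z)
        else:
            s = 1 / (1 + np.exp(-z))
            local = s * (1 - s)                # sigmoid'(z), at most 0.25
        g = W.T @ (g * local)                  # one more Jacobian in the product
    return np.linalg.norm(g)

sigmoid_norm = input_grad_norm("sigmoid")
relu_norm = input_grad_norm("relu")
print(f"sigmoid: {sigmoid_norm:.3e}   relu: {relu_norm:.3e}")
```

With this setup the sigmoid gradient reaching the input is many orders of magnitude smaller than the ReLU one, since every sigmoid layer contributes a factor of at most 0.25.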
A minimal manual backward pass for Linear → ReLU → MSE:
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))          # batch 8, in-dim 4
y = rng.normal(size=(8, 2))          # targets, out-dim 2
W = rng.normal(size=(4, 2)) * 0.1
b = np.zeros(2)

# ---- forward (cache activations) ----
z = x @ W + b
a = np.maximum(0, z)                 # ReLU
loss = np.mean((a - y) ** 2)

# ---- backward (chain rule) ----
da = (2 / a.size) * (a - y)          # dL/da from MSE
dz = da * (z > 0)                    # ReLU': passes grad where z > 0
dW = x.T @ dz                        # dL/dW = xᵀ δ
db = dz.sum(axis=0)                  # dL/db = sum δ
dx = dz @ W.T                        # dL/dx = δ Wᵀ

# gradient check against finite differences
eps = 1e-5
W_pert = W.copy(); W_pert[0, 0] += eps
loss_pert = np.mean((np.maximum(0, x @ W_pert + b) - y) ** 2)
assert abs((loss_pert - loss) / eps - dW[0, 0]) < 1e-3
```

- Conceptual: What does backpropagation actually compute, and how does it relate to the chain rule? (The gradient of the loss w.r.t. all parameters, by applying the chain rule backward through the computational graph.)
- Implementation: Why reverse-mode autodiff rather than forward-mode for neural nets? (One scalar loss, many parameters — reverse-mode computes all gradients in ~one backward pass; forward-mode costs one pass per input dimension.)
- Applied: What must the forward pass store for the backward pass to work? (The intermediate activations / inputs to each op — e.g., z for ReLU', x for the weight gradient.)
- Systems-level: What is gradient checkpointing and what does it trade? (Recompute some activations during the backward pass instead of storing them — saves memory at the cost of extra compute.)
- Failure modes: Why do gradients vanish or explode in deep networks, in terms of backprop? (The backward pass multiplies many local Jacobians; products of factors with magnitude below 1 shrink toward zero, and above 1 blow up. ReLU, residual connections, and normalization mitigate this.)
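The checkpointing trade-off from the systems-level question can be sketched in NumPy. This is a minimal illustration under assumed settings (a chain of bias-free tanh layers, a checkpoint every `stride` layers, loss equal to the sum of the final activation); the helper names are hypothetical and this is not how any particular framework implements it.

```python
import numpy as np

def forward_segment(Ws, a):
    """Run a block of tanh layers; return every activation, starting with the block input."""
    acts = [a]
    for W in Ws:
        acts.append(np.tanh(W @ acts[-1]))
    return acts

def backward_segment(Ws, acts, g):
    """Backprop g = dL/d(block output); return weight grads and dL/d(block input)."""
    dWs = [None] * len(Ws)
    for k in reversed(range(len(Ws))):
        dz = g * (1 - acts[k + 1] ** 2)      # tanh'(z) = 1 - tanh(z)^2
        dWs[k] = np.outer(dz, acts[k])
        g = Ws[k].T @ dz
    return dWs, g

rng = np.random.default_rng(0)
depth, width, stride = 8, 5, 4
Ws = [rng.normal(size=(width, width)) * 0.5 for _ in range(depth)]
x = rng.normal(size=width)

# Baseline: cache every activation during the forward pass.
acts = forward_segment(Ws, x)
dWs_full, dx_full = backward_segment(Ws, acts, np.ones(width))

# Checkpointed forward: keep only the input to each block of `stride` layers.
checkpoints = {}
a = x
for start in range(0, depth, stride):
    checkpoints[start] = a
    a = forward_segment(Ws[start:start + stride], a)[-1]   # block internals discarded

# Checkpointed backward: recompute one block at a time, then backprop through it.
dWs_ckpt = [None] * depth
g = np.ones(width)
for start in reversed(range(0, depth, stride)):
    block = Ws[start:start + stride]
    block_acts = forward_segment(block, checkpoints[start])  # the extra compute
    block_dWs, g = backward_segment(block, block_acts, g)
    dWs_ckpt[start:start + len(block)] = block_dWs

print(len(acts), len(checkpoints))   # activations held across the forward pass
```

The gradients match the fully cached baseline, while the checkpointed forward pass holds 2 activations instead of 9; the price is roughly one extra forward pass of compute during the backward pass.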
From memory: write the three gradients for a linear layer (dW, db, dx) given upstream δ, and the ReLU backward rule. Explain in one sentence why reverse-mode is the right choice and what causes vanishing gradients. Check your answers against the math and code sections above.