What is backpropagation?

An algorithm that computes the gradient of the loss with respect to every parameter by applying the chain rule backward through the network's computational graph — reverse-mode automatic differentiation.

Why use reverse-mode autodiff instead of forward-mode for neural networks?

There is one scalar loss but many parameters. Reverse-mode computes all gradients in roughly one backward pass; forward-mode would need one pass per input dimension, which here means one pass per parameter.

What does the forward pass need to store for backpropagation?

The intermediate activations and inputs to each operation — for example z for the ReLU derivative and x for the weight gradient — which the backward pass reuses.

Why do gradients vanish or explode in deep networks?

The backward pass multiplies many local Jacobians together; products of factors below 1 shrink toward zero (vanishing) and above 1 blow up (exploding). ReLU, residuals, and normalization mitigate this.

Backpropagation Explained (ML Interview)

Backpropagation as reverse-mode autodiff — the chain rule over the computational graph, the gradients for a linear layer and ReLU, and why gradients vanish — with a runnable manual backward pass.

1. Big Picture

Backpropagation is the algorithm that computes the gradient of the loss with respect to every parameter, efficiently, by applying the chain rule backwards through the network's computational graph. It's reverse-mode automatic differentiation: one forward pass to compute the output and cache intermediate values, one backward pass to propagate gradients from the loss back to the inputs and weights.

The reason it matters: a modern network can have billions of parameters, and backprop computes all their gradients for a small constant multiple of the cost of a single forward pass, not one pass per parameter. Every gradient-based optimizer (SGD, Adam) consumes the gradients backprop produces. The frame to hold: backprop is just the chain rule, organized to reuse shared sub-computations by walking the graph in reverse. Interviewers probe whether you understand why reverse-mode, what gets cached, and how this connects to vanishing/exploding gradients.
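To make the cost argument concrete, here is a minimal sketch for a toy loss $L = \tfrac{1}{2}\lVert A w \rVert^2$. It is not a real autodiff engine: the matrix `A`, the vector `w`, and both helper functions are assumptions made up for illustration. Forward-mode pushes one basis tangent through the graph per input dimension, while reverse-mode pulls a single gradient back from the scalar loss.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))   # hypothetical fixed matrix
w = rng.normal(size=5)        # the 5 "parameters" we differentiate against

def forward_mode_grad(A, w):
    """One pass per input dimension: push the basis tangent e_i through the graph."""
    r = A @ w
    grad = np.zeros_like(w)
    passes = 0
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = 1.0
        dr = A @ e            # tangent of r = A w in direction e_i
        grad[i] = r @ dr      # tangent of L = 0.5 * r.r
        passes += 1
    return grad, passes

def reverse_mode_grad(A, w):
    """One backward pass: pull dL/dr back through r = A w."""
    r = A @ w
    dL_dr = r                 # derivative of 0.5 * r.r with respect to r
    return A.T @ dL_dr, 1

g_fwd, n_fwd = forward_mode_grad(A, w)
g_rev, n_rev = reverse_mode_grad(A, w)
print(n_fwd, n_rev, np.allclose(g_fwd, g_rev))
```

Both routes give the same gradient; the difference is that the forward-mode loop grows with the number of parameters while the reverse-mode pass does not.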

2. Intuition + Visual

Every operation in the network is a node in a graph with known local derivatives. The forward pass computes outputs and stores the activations needed later. The backward pass starts with $\partial L / \partial L = 1$ at the loss and multiplies by each node's local Jacobian as it moves toward the inputs, accumulating gradients. Because many parameters share downstream paths, computing gradients from the output backward reuses that shared work, whereas forward-mode, run once per input, would recompute it each time.

In the computational-graph diagram, the dashed arrows are the backward pass: each carries a gradient computed as the downstream gradient times a local derivative.
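The same walk can be traced by hand on a tiny scalar graph. The sketch below assumes a made-up loss $L = (w x + b - t)^2$ with arbitrary values; the variable names are illustrative only. Note that `dz` is computed once and then reused for `db`, `dw`, and `dx`, which is the shared-work argument in miniature.

```python
# Toy scalar graph: L = (w * x + b - t) ** 2, with arbitrary illustrative values.
x, w, b, t = 2.0, 3.0, 1.0, 5.0

# forward pass: compute and cache every intermediate value
u = w * x        # 6.0
z = u + b        # 7.0
r = z - t        # 2.0
L = r ** 2       # 4.0

# backward pass: start from dL/dL = 1 and multiply by local derivatives
dL = 1.0
dr = dL * 2 * r  # d(r^2)/dr = 2r
dz = dr * 1.0    # d(z - t)/dz = 1
du = dz * 1.0    # d(u + b)/du = 1
db = dz * 1.0    # d(u + b)/db = 1
dw = du * x      # d(w x)/dw = x
dx = du * w      # d(w x)/dx = w

print(dw, db, dx)  # 8.0 4.0 12.0
```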

3. The Math

The chain rule is the whole engine. For a scalar loss $L$ and a layer computing $y = f(x)$, given the upstream gradient $\frac{\partial L}{\partial y}$:

$$
\frac{\partial L}{\partial x} = \left(\frac{\partial y}{\partial x}\right)^{\!\top} \frac{\partial L}{\partial y}
$$
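As a quick sanity check of this rule, the sketch below uses a made-up function $f:\mathbb{R}^2 \to \mathbb{R}^3$ with a hand-written Jacobian and an arbitrary upstream gradient (all of these are assumptions for illustration). It compares the transposed-Jacobian product with a central finite difference of the surrogate scalar $\frac{\partial L}{\partial y} \cdot f(x)$.

```python
import numpy as np

def f(x):
    # hypothetical layer: 2 inputs -> 3 outputs
    return np.array([x[0] * x[1], x[0] + x[1], x[1] ** 2])

def jacobian(x):
    # dy/dx written out by hand, shape (3, 2)
    return np.array([[x[1], x[0]],
                     [1.0, 1.0],
                     [0.0, 2.0 * x[1]]])

x = np.array([2.0, 3.0])
dL_dy = np.array([1.0, -1.0, 0.5])   # assumed upstream gradient

# chain rule: transpose of the local Jacobian times the upstream gradient
dL_dx = jacobian(x).T @ dL_dy

# numerical check via central differences on the scalar dL_dy . f(x)
eps = 1e-6
numeric = np.zeros_like(x)
for i in range(x.size):
    step = np.zeros_like(x)
    step[i] = eps
    numeric[i] = (dL_dy @ f(x + step) - dL_dy @ f(x - step)) / (2 * eps)

print(dL_dx)  # [2. 4.]
```

In practice frameworks never build the full Jacobian; each operation implements this product directly.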

For a linear layer $z = Wx + b$ with upstream gradient $\delta = \frac{\partial L}{\partial z}$:

$$
\frac{\partial L}{\partial W} = \delta\, x^\top, \qquad \frac{\partial L}{\partial b} = \delta, \qquad \frac{\partial L}{\partial x} = W^\top \delta
$$
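These three formulas use the column-vector convention (a single example `x`, weights of shape out-dim by in-dim). A minimal check, assuming a hypothetical squared-error loss $L = \tfrac{1}{2}\lVert z - t \rVert^2$ so that $\delta = z - t$; the target `t` and the helper `numeric_grad` are not from the original text:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))   # out-dim 2, in-dim 3
b = rng.normal(size=2)
x = rng.normal(size=3)
t = rng.normal(size=2)        # hypothetical target

def loss():
    z = W @ x + b
    return 0.5 * np.sum((z - t) ** 2)

# analytic gradients from the linear-layer formulas
delta = (W @ x + b) - t       # dL/dz for this particular loss
dW = np.outer(delta, x)       # delta x^T, same shape as W
db = delta.copy()             # dL/db = delta
dx = W.T @ delta              # dL/dx = W^T delta

def numeric_grad(arr, eps=1e-6):
    """Central-difference gradient of loss() w.r.t. arr, perturbed in place."""
    g = np.zeros_like(arr)
    for idx in np.ndindex(*arr.shape):
        old = arr[idx]
        arr[idx] = old + eps
        up = loss()
        arr[idx] = old - eps
        down = loss()
        arr[idx] = old          # restore the original entry
        g[idx] = (up - down) / (2 * eps)
    return g

max_err = max(
    np.abs(numeric_grad(W) - dW).max(),
    np.abs(numeric_grad(b) - db).max(),
    np.abs(numeric_grad(x) - dx).max(),
)
print(max_err < 1e-6)
```

With a batch stored as rows (`z = x @ W + b`), the same gradients appear transposed: `x.T @ dz`, `dz.sum(axis=0)`, and `dz @ W.T`.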

For an element-wise non-linearity $a = \phi(z)$, the local Jacobian is diagonal, so gradients flow through as an element-wise product:

$$
\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \odot \phi'(z)
$$
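A small sketch of why the element-wise product and the diagonal Jacobian agree, using a sigmoid as the example non-linearity (the choice of sigmoid and the input values are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])
dL_da = np.array([0.5, -1.0, 2.0])   # assumed upstream gradient

s = sigmoid(z)
phi_prime = s * (1.0 - s)            # sigmoid'(z), one value per element

# the cheap form used in practice
dz_elementwise = dL_da * phi_prime

# the same thing written as an explicit (diagonal) Jacobian-transpose product
dz_jacobian = np.diag(phi_prime).T @ dL_da

print(np.allclose(dz_elementwise, dz_jacobian))
```

The element-wise form avoids materializing an n-by-n matrix that is almost entirely zeros.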

Stacking layers multiplies these terms. That product is exactly why gradients vanish (repeated multiplication by factors of magnitude $<1$, e.g. saturated sigmoids, whose derivative never exceeds $0.25$) or explode (factors of magnitude $>1$) in deep nets, which motivates ReLU, residual connections, and normalization.
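The effect is easy to see on a one-unit-wide chain. The sketch below is a toy, not a realistic network: `chain_gradient`, the depth of 20, and the weights 1.0 and 1.5 are all assumptions chosen to make the two regimes visible. It backpropagates through `depth` scalar layers and returns the gradient of the final activation with respect to the first input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(depth, w, activation):
    """d a_depth / d a_0 for a scalar chain a_{k+1} = act(w * a_k)."""
    a = 0.5
    cache = []
    for _ in range(depth):            # forward: cache each pre-activation
        z = w * a
        cache.append(z)
        a = sigmoid(z) if activation == "sigmoid" else z
    grad = 1.0
    for z in reversed(cache):         # backward: multiply the local derivatives
        if activation == "sigmoid":
            local = sigmoid(z) * (1.0 - sigmoid(z))   # at most 0.25
        else:
            local = 1.0                               # identity activation
        grad *= local * w
    return grad

vanishing = chain_gradient(20, 1.0, "sigmoid")   # 20 factors, each at most 0.25
exploding = chain_gradient(20, 1.5, "linear")    # 20 factors, each equal to 1.5
print(vanishing, exploding)
```

In a real network the scalar factors become Jacobian matrices, and the same argument applies to their norms.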

4. Implementation

A minimal manual backward pass for Linear → ReLU → MSE:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))           # batch 8, in-dim 4
y = rng.normal(size=(8, 2))           # targets, out-dim 2
W = rng.normal(size=(4, 2)) * 0.1
b = np.zeros(2)

# ---- forward (cache activations) ----
z = x @ W + b
a = np.maximum(0, z)                   # ReLU
loss = np.mean((a - y) ** 2)

# ---- backward (chain rule) ----
da = (2 / a.size) * (a - y)           # dL/da from MSE
dz = da * (z > 0)                     # ReLU': passes grad where z > 0
dW = x.T @ dz                         # dL/dW = xᵀ δ
db = dz.sum(axis=0)                   # dL/db = sum δ
dx = dz @ W.T                         # dL/dx = δ Wᵀ

# gradient check against finite differences
eps = 1e-5
W_pert = W.copy(); W_pert[0, 0] += eps
loss_pert = np.mean((np.maximum(0, x @ W_pert + b) - y) ** 2)
assert abs((loss_pert - loss) / eps - dW[0, 0]) < 1e-3
```
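As a usage sketch, the same forward and backward steps can be dropped into a plain gradient-descent loop. The learning rate of 0.1 and the 100 steps are arbitrary assumptions, and the setup is repeated here only so the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
y = rng.normal(size=(8, 2))
W = rng.normal(size=(4, 2)) * 0.1
b = np.zeros(2)

lr = 0.1                              # assumed step size
losses = []
for step in range(100):
    # forward (cache z and a for the backward pass)
    z = x @ W + b
    a = np.maximum(0, z)
    losses.append(np.mean((a - y) ** 2))

    # backward (same rules as the manual pass)
    da = (2 / a.size) * (a - y)
    dz = da * (z > 0)
    dW = x.T @ dz
    db = dz.sum(axis=0)

    # plain gradient-descent update
    W -= lr * dW
    b -= lr * db

print(losses[0], losses[-1])
```

The loss at the end should be lower than at the start; how much lower depends on the random data and the step size.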

5. Interview Questions

Conceptual: What does backpropagation actually compute, and how does it relate to the chain rule? (The gradient of the loss w.r.t. all parameters, by applying the chain rule backward through the computational graph.)
Implementation: Why reverse-mode autodiff rather than forward-mode for neural nets? (One scalar loss, many parameters — reverse-mode computes all gradients in ~one backward pass; forward-mode costs one pass per input dimension.)
Applied: What must the forward pass store for the backward pass to work? (The intermediate activations / inputs to each op — e.g., z for ReLU', x for the weight gradient.)
Systems-level: What is gradient checkpointing and what does it trade? (Recompute some activations during the backward pass instead of storing them — saves memory at the cost of extra compute.)
Failure modes: Why do gradients vanish or explode in deep networks, in terms of backprop? (The backward pass multiplies many local Jacobians; products of factors <1 shrink toward zero, >1 blow up. Mitigated, not eliminated, by ReLU, residuals, normalization.)
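The checkpointing trade-off from the systems-level question can be sketched on a small chain of `tanh` layers. This is a hand-rolled illustration under assumed names (`layer`, `layer_backward`, `grad_checkpointed`, the `every` spacing), not how any particular framework implements it. The full-cache version stores every layer input; the checkpointed version stores only every fourth one and recomputes the rest, one segment at a time, during the backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, dim = 8, 4
Ws = [rng.normal(size=(dim, dim)) * 0.5 for _ in range(depth)]
x0 = rng.normal(size=dim)

def layer(W, a):
    return np.tanh(W @ a)

def layer_backward(W, a_in, grad_out):
    """Gradient w.r.t. the layer input, given that input and the upstream gradient."""
    a_out = np.tanh(W @ a_in)
    return W.T @ (grad_out * (1.0 - a_out ** 2))

def grad_full_cache(Ws, x0):
    acts = [x0]                           # stores the input of every layer
    for W in Ws:
        acts.append(layer(W, acts[-1]))
    grad = acts[-1]                       # dL/da for L = 0.5 * ||a_final||^2
    for k in reversed(range(len(Ws))):
        grad = layer_backward(Ws[k], acts[k], grad)
    return grad

def grad_checkpointed(Ws, x0, every=4):
    checkpoints = {}                      # stores only every `every`-th layer input
    a = x0
    for k, W in enumerate(Ws):
        if k % every == 0:
            checkpoints[k] = a
        a = layer(W, a)
    grad = a                              # same loss as above
    for start in sorted(checkpoints, reverse=True):
        end = min(start + every, len(Ws))
        seg = [checkpoints[start]]        # recompute this segment's layer inputs
        for k in range(start, end - 1):
            seg.append(layer(Ws[k], seg[-1]))
        for k in reversed(range(start, end)):
            grad = layer_backward(Ws[k], seg[k - start], grad)
    return grad

g_full = grad_full_cache(Ws, x0)
g_ckpt = grad_checkpointed(Ws, x0, every=4)
print(np.allclose(g_full, g_ckpt))
```

The gradients match; the checkpointed version keeps fewer activations alive at once and pays for it with roughly one extra forward pass.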

6. Retrieval Check

From memory: write the three gradients for a linear layer (dW, db, dx) given upstream δ, and the ReLU backward rule. Explain in one sentence why reverse-mode is the right choice and what causes vanishing gradients. Check against Stages 3–5.
