
Deep Learning

Backpropagation

Backpropagation as reverse-mode autodiff — the chain rule over the computational graph, the gradients for a linear layer and ReLU, and why gradients vanish — with a runnable manual backward pass.

9 min read · Reviewed May 2026

1. Big Picture

Backpropagation is the algorithm that computes the gradient of the loss with respect to every parameter, efficiently, by applying the chain rule backwards through the network's computational graph. It's reverse-mode automatic differentiation: one forward pass to compute the output and cache intermediate values, one backward pass to propagate gradients from the loss back to the inputs and weights.

The reason it matters: a modern network can have billions of parameters, and backprop computes all of their gradients for a small constant multiple of the cost of one forward pass (for dense layers the backward pass typically costs about twice the forward pass), rather than one pass per parameter. Every gradient-based optimizer (SGD, Adam) consumes the gradients backprop produces. The frame to hold: backprop is just the chain rule, organized to reuse shared sub-computations by walking the graph in reverse. Interviewers probe whether you understand why reverse mode is used, what gets cached, and how this connects to vanishing and exploding gradients.
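To make the cost claim concrete, here is a minimal sketch that contrasts the naive per-parameter route (central finite differences) with a single forward and backward pass. The one-layer linear model, the shapes, and the `calls` counter are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 5))
y = rng.normal(size=(16, 3))
W = rng.normal(size=(5, 3))

calls = {"forward": 0}  # hypothetical counter of loss evaluations


def loss_fn(W):
    calls["forward"] += 1
    return np.mean((x @ W - y) ** 2)


# Naive route: central differences need two forward passes per parameter.
eps = 1e-6
grad_fd = np.zeros_like(W)
for idx in np.ndindex(*W.shape):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[idx] += eps
    W_minus[idx] -= eps
    grad_fd[idx] = (loss_fn(W_plus) - loss_fn(W_minus)) / (2 * eps)
fd_passes = calls["forward"]

# Backprop route: one forward pass (cache the residual), one backward pass.
residual = x @ W - y
grad_bp = x.T @ ((2 / residual.size) * residual)

print(f"finite differences: {fd_passes} forward passes for {W.size} parameters")
print("backprop: 1 forward pass + 1 backward pass")
print("gradients agree:", np.allclose(grad_fd, grad_bp, atol=1e-6))
```

The finite-difference loop scales with the number of parameters, while the backprop route does not, which is the whole argument for reverse mode at scale.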

2. Intuition + Visual

Every operation in the network is a node in a graph with known local derivatives. The forward pass computes outputs and stores the activations needed later. The backward pass starts with $\partial L / \partial L = 1$ at the loss and multiplies by each node's transposed local Jacobian as it moves toward the inputs, accumulating gradients where paths merge. Because many parameters share downstream paths, computing gradients from the output backward reuses that work, whereas differentiating forward one input at a time would recompute it.

flowchart LR
    X["x"] --> Z["z = Wx + b"]
    Z --> A["a = ReLU(z)"]
    A --> Lhat["ŷ = Va"]
    Lhat --> L["Loss"]
    L -. "dL/dŷ" .-> Lhat
    Lhat -. "dL/da, dL/dV" .-> A
    A -. "dL/dz" .-> Z
    Z -. "dL/dW, dL/dx" .-> X

The dashed arrows are the backward pass: each carries a gradient computed from the one downstream of it times a local derivative.
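The graph walk can be sketched with a tiny scalar reverse-mode engine applied to the same chain as the diagram (`z = wx + b`, `a = ReLU(z)`, `ŷ = va`, squared-error loss). The `Value` class and the specific numbers are hypothetical, written for illustration; this is not how any particular framework implements autodiff.

```python
class Value:
    """Hypothetical scalar graph node: holds data, grad, and a local backward rule."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self.parents = parents
        self._backward = lambda: None  # leaves have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad   # d(out)/d(self) = 1
            other.grad += out.grad  # d(out)/d(other) = 1

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,))

        def _backward():
            self.grad += (self.data > 0) * out.grad  # passes grad only where input > 0

        out._backward = _backward
        return out

    def backward(self):
        # Topological order via depth-first search, then walk it in reverse.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # dL/dL = 1
        for node in reversed(order):
            node._backward()


x, w, b, v = Value(2.0), Value(0.5), Value(-0.25), Value(-1.5)
neg_target = Value(-1.0)  # target t = 1.0, stored negated to avoid defining subtraction

z = w * x + b
a = z.relu()
y_hat = v * a
err = y_hat + neg_target
loss = err * err

loss.backward()
print("loss:", loss.data)
print("dL/dw:", w.grad, "dL/db:", b.grad, "dL/dv:", v.grad, "dL/dx:", x.grad)
```

Each `_backward` closure uses values captured during the forward pass, which is the caching the prose describes, and `+=` is what accumulates gradients when a node feeds more than one consumer.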

3. The Math

The chain rule is the whole engine. For a scalar loss $L$ and a layer computing $y = f(x)$, given the upstream gradient $\frac{\partial L}{\partial y}$:

$$\frac{\partial L}{\partial x} = \left(\frac{\partial y}{\partial x}\right)^{\!\top} \frac{\partial L}{\partial y}$$

For a linear layer $z = Wx + b$ with upstream gradient $\delta = \frac{\partial L}{\partial z}$:

$$\frac{\partial L}{\partial W} = \delta\, x^\top, \qquad \frac{\partial L}{\partial b} = \delta, \qquad \frac{\partial L}{\partial x} = W^\top \delta$$
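These formulas are written for a single column vector $x$. Batched code usually stores examples as rows, which transposes everything. The sketch below, with assumed shapes and random stand-in values for $\delta$, checks that summing the per-example formulas over a batch matches the batched matrix products, and that a weight matrix stored as `(in, out)` receives the transposed gradient `X.T @ D`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 6, 4, 3
X = rng.normal(size=(n, d_in))      # row i is example x_i
D = rng.normal(size=(n, d_out))     # row i is the upstream gradient delta_i = dL/dz_i
W = rng.normal(size=(d_out, d_in))  # math convention: z = W x + b

# Per-example column-vector formulas, summed over the batch.
dW = sum(np.outer(D[i], X[i]) for i in range(n))  # sum of delta_i x_i^T
db = sum(D[i] for i in range(n))                  # sum of delta_i
dX = np.stack([W.T @ D[i] for i in range(n)])     # row i is W^T delta_i

# Batched equivalents as single matrix products.
dW_batched = D.T @ X
db_batched = D.sum(axis=0)
dX_batched = D @ W

print(np.allclose(dW, dW_batched), np.allclose(db, db_batched), np.allclose(dX, dX_batched))

# If the code stores the weights as W_code = W.T with shape (in, out) and computes
# Z = X @ W_code, the weight gradient is the transpose: X.T @ D.
print(np.allclose(X.T @ D, dW.T))
```

This assumes the loss is a sum over examples, so any `1/n` averaging factor is already folded into each `delta_i`.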

For an element-wise non-linearity $a = \phi(z)$, the local Jacobian is diagonal, so gradients flow through as an element-wise product:

$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \odot \phi'(z)$$
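A quick numerical check of the diagonal-Jacobian claim, using a sigmoid as an assumed example of $\phi$: building the full Jacobian and applying its transpose gives the same result as the element-wise product, which is why implementations never materialize the matrix.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
z = rng.normal(size=5)
upstream = rng.normal(size=5)  # stand-in for dL/da

s = sigmoid(z)
phi_prime = s * (1.0 - s)      # sigmoid'(z)

J = np.diag(phi_prime)         # full Jacobian da/dz, diagonal for element-wise ops
dz_full = J.T @ upstream       # general chain rule: J^T times upstream gradient
dz_fast = upstream * phi_prime # element-wise shortcut

print(np.allclose(dz_full, dz_fast))
```

For a vector of length $n$ the shortcut costs $O(n)$ instead of the $O(n^2)$ needed to form and apply the full Jacobian.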

Stacking layers multiplies these terms. That product is exactly why gradients vanish (repeated multiplication by factors of magnitude $< 1$, e.g. sigmoids, whose derivative is at most $0.25$ and close to zero when saturated) or explode (factors of magnitude $> 1$) in deep nets, which motivates ReLU, residual connections, and normalization.
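The effect of that repeated product can be observed directly. The sketch below pushes a unit-norm gradient backward through a stack of random layers and reports its norm at the input. The depth, width, and initialization scales (variance `1/width` for sigmoid, `2/width` for ReLU) are assumptions chosen for the demonstration, and the helper name `input_grad_norm` is hypothetical.

```python
import numpy as np


def input_grad_norm(activation, depth=30, width=64, seed=0):
    """Norm of dL/d(input) after backpropagating a unit-norm gradient through `depth` layers."""
    rng = np.random.default_rng(seed)
    gain = 2.0 if activation == "relu" else 1.0  # assumed init scaling per activation
    h = rng.normal(size=width)
    cache = []
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * np.sqrt(gain / width)
        z = W @ h
        if activation == "relu":
            h, local = np.maximum(0.0, z), (z > 0).astype(float)
        else:
            s = 1.0 / (1.0 + np.exp(-z))
            h, local = s, s * (1.0 - s)
        cache.append((W, local))  # what the backward pass needs: W and phi'(z)

    g = rng.normal(size=width)
    g /= np.linalg.norm(g)        # unit-norm upstream gradient at the output
    for W, local in reversed(cache):
        g = W.T @ (g * local)     # dL/dh_in = W^T (dL/dh_out * phi'(z))
    return np.linalg.norm(g)


grad_norms = {name: input_grad_norm(name) for name in ("sigmoid", "relu")}
for name, norm in grad_norms.items():
    print(f"{name:8s} gradient norm at the input: {norm:.3e}")
```

Under these assumptions the sigmoid stack shrinks the gradient by many orders of magnitude, while the ReLU stack keeps it at a usable scale. Different initializations or depths change the numbers but not the mechanism.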

4. Implementation

A minimal manual backward pass for Linear → ReLU → MSE:

```python
import numpy as np

# Batch convention: rows are examples, so W has shape (in, out) and z = x @ W + b.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))           # batch 8, in-dim 4
y = rng.normal(size=(8, 2))           # targets, out-dim 2
W = rng.normal(size=(4, 2)) * 0.1
b = np.zeros(2)

# ---- forward (cache activations) ----
z = x @ W + b
a = np.maximum(0, z)                  # ReLU
loss = np.mean((a - y) ** 2)

# ---- backward (chain rule) ----
da = (2 / a.size) * (a - y)           # dL/da from MSE
dz = da * (z > 0)                     # ReLU': passes grad where z > 0
dW = x.T @ dz                         # dL/dW = xᵀ δ
db = dz.sum(axis=0)                   # dL/db = sum of δ over the batch
dx = dz @ W.T                         # dL/dx = δ Wᵀ

# gradient check against finite differences
eps = 1e-5
W_pert = W.copy(); W_pert[0, 0] += eps
loss_pert = np.mean((np.maximum(0, x @ W_pert + b) - y) ** 2)
assert abs((loss_pert - loss) / eps - dW[0, 0]) < 1e-3
```
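
As a possible extension, the same pattern covers the full chain from the diagram in section 2 (Linear → ReLU → Linear → MSE), with every parameter checked by central differences rather than a single entry. The hidden width of 5, the `params` dictionary, and the helper names `forward`, `backward`, and `numeric_grad` are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
y = rng.normal(size=(8, 2))
params = {
    "W": rng.normal(size=(4, 5)) * 0.5,  # first linear layer, hidden width 5 (assumed)
    "b": rng.normal(size=5) * 0.1,
    "V": rng.normal(size=(5, 2)) * 0.5,  # second linear layer, y_hat = a @ V
}


def forward(p):
    z = x @ p["W"] + p["b"]
    a = np.maximum(0, z)
    y_hat = a @ p["V"]
    loss = np.mean((y_hat - y) ** 2)
    return loss, (z, a, y_hat)            # cache what the backward pass needs


def backward(p, cache):
    z, a, y_hat = cache
    dy_hat = (2 / y_hat.size) * (y_hat - y)  # dL/dy_hat from MSE
    dV = a.T @ dy_hat                        # needs cached a
    da = dy_hat @ p["V"].T
    dz = da * (z > 0)                        # needs cached z
    return {"W": x.T @ dz, "b": dz.sum(axis=0), "V": dV}


def numeric_grad(p, name, eps=1e-6):
    """Central-difference gradient for one parameter array, one entry at a time."""
    grad = np.zeros_like(p[name])
    for idx in np.ndindex(*p[name].shape):
        old = p[name][idx]
        p[name][idx] = old + eps
        loss_plus, _ = forward(p)
        p[name][idx] = old - eps
        loss_minus, _ = forward(p)
        p[name][idx] = old                   # restore the original value
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad


loss, cache = forward(params)
grads = backward(params, cache)
max_err = {k: np.max(np.abs(grads[k] - numeric_grad(params, k))) for k in grads}
print(max_err)
```

Central differences have error of order `eps**2` rather than `eps`, so the tolerance can be much tighter than with a one-sided check. The check can fail spuriously if some pre-activation lies within `eps` of the ReLU kink at zero, which this sketch assumes does not happen.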

5. Interview Questions
  1. Conceptual: What does backpropagation actually compute, and how does it relate to the chain rule? (The gradient of the loss w.r.t. all parameters, by applying the chain rule backward through the computational graph.)
  2. Implementation: Why reverse-mode autodiff rather than forward-mode for neural nets? (One scalar loss, many parameters — reverse-mode computes all gradients in ~one backward pass; forward-mode costs one pass per input dimension.)
  3. Applied: What must the forward pass store for the backward pass to work? (The intermediate activations / inputs to each op — e.g., z for ReLU', x for the weight gradient.)
  4. Systems-level: What is gradient checkpointing and what does it trade? (Recompute some activations during the backward pass instead of storing them — saves memory at the cost of extra compute.)
  5. Failure modes: Why do gradients vanish or explode in deep networks, in terms of backprop? (The backward pass multiplies many local Jacobians; products of factors <1 shrink toward zero, and products of factors >1 blow up. Common mitigations are ReLU, residual connections, and normalization.)
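
The checkpointing trade-off from question 4 can be sketched in NumPy. The chain of `tanh` layers, the segment length `every`, and all helper names are hypothetical; real frameworks expose this through their own APIs, which are not shown here. The sketch keeps only the input of each segment during the forward pass, recomputes the activations inside a segment when the backward pass reaches it, and checks that the gradients match the store-everything baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width, every = 12, 8, 4
Ws = [rng.normal(size=(width, width)) / np.sqrt(width) for _ in range(depth)]
h0 = rng.normal(size=width)
g_out = rng.normal(size=width)  # stand-in for dL/d(final activation)


def layer(W, h):
    return np.tanh(W @ h)


def layer_backward(W, h_in, h_out, g):
    dz = g * (1.0 - h_out ** 2)          # tanh'(z) = 1 - tanh(z)^2
    return np.outer(dz, h_in), W.T @ dz  # dL/dW, dL/dh_in


def run_segment(h, start, stop):
    """Forward through layers [start, stop); returns all activations, input included."""
    hs = [h]
    for k in range(start, stop):
        hs.append(layer(Ws[k], hs[-1]))
    return hs


def backprop_segment(hs, start, stop, g):
    grads = {}
    for k in reversed(range(start, stop)):
        grads[k], g = layer_backward(Ws[k], hs[k - start], hs[k - start + 1], g)
    return grads, g


# Baseline: store every activation (depth + 1 vectors).
hs_all = run_segment(h0, 0, depth)
grads_full, _ = backprop_segment(hs_all, 0, depth, g_out)

# Checkpointed forward: keep only the input to each segment.
checkpoints, h = {}, h0
for start in range(0, depth, every):
    checkpoints[start] = h
    h = run_segment(h, start, min(start + every, depth))[-1]  # intermediates discarded

# Checkpointed backward: recompute one segment at a time, then backprop through it.
grads_ckpt, g = {}, g_out
for start in reversed(range(0, depth, every)):
    stop = min(start + every, depth)
    hs = run_segment(checkpoints[start], start, stop)  # the extra compute
    seg_grads, g = backprop_segment(hs, start, stop, g)
    grads_ckpt.update(seg_grads)

same = all(np.allclose(grads_full[k], grads_ckpt[k]) for k in range(depth))
print(f"stored activations: baseline {len(hs_all)}, checkpoints {len(checkpoints)}")
print("gradients match:", same)
```

The checkpointed run holds the saved segment inputs plus one segment's activations at a time, and pays for it with roughly one additional forward pass.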
6. Retrieval Check

From memory: write the three gradients for a linear layer (dW, db, dx) given upstream δ, and the ReLU backward rule. Explain in one sentence why reverse-mode is the right choice and what causes vanishing gradients. Check against Stages 3–5.
