Classical ML
Bias-Variance Tradeoff
The exact decomposition of expected error into bias, variance, and irreducible noise — how to diagnose under- vs. overfitting, with intuition, math, and a runnable demo.
The bias-variance tradeoff explains why a model generalizes well or badly. Total expected error on unseen data decomposes into three parts: bias (error from wrong assumptions — the model is too simple to capture the signal), variance (error from sensitivity to the particular training set — the model memorizes noise), and irreducible noise (the floor you can't beat). Push model complexity up and bias falls but variance rises; push it down and the reverse happens. Generalization is the art of trading them off.
This is the conceptual backbone behind regularization, model selection, cross-validation, and the whole under/overfitting conversation. Interviewers use it to test whether you can diagnose a model: high training error means bias; a large train-test gap means variance. The frame to hold: you are minimizing total error, not bias or variance alone — and the two pull in opposite directions as complexity changes.
The dartboard analogy: bias is how far your cluster of shots sits from the bullseye; variance is how spread out the shots are. Four regimes:
- Low bias, low variance — tight cluster on target (the goal).
- High bias, low variance — tight cluster, wrong spot (underfit).
- Low bias, high variance — centered on average but scattered (overfit).
- High bias, high variance — scattered and off-target (worst case).
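The four regimes above can be made concrete with a small simulation. This is a minimal sketch, not part of the original walkthrough: the `regimes` table, its offset and spread values, and the `score` helper are all hypothetical choices made for illustration. Bias is measured as the distance from the cluster center to the bullseye, and variance as the mean squared spread of shots around that center.

```python
import numpy as np

rng = np.random.default_rng(0)
bullseye = np.array([0.0, 0.0])

# Hypothetical (offset, spread) settings, one per regime in the list above.
regimes = {
    "low bias, low variance": (0.0, 0.1),
    "high bias, low variance": (1.0, 0.1),
    "low bias, high variance": (0.0, 1.0),
    "high bias, high variance": (1.0, 1.0),
}

def score(shots, target):
    """Return (bias, variance) for a cluster of 2-D shots."""
    center = shots.mean(axis=0)
    bias = np.linalg.norm(center - target)  # how far the cluster center is from the target
    variance = np.mean(np.sum((shots - center) ** 2, axis=1))  # spread around the center
    return bias, variance

results = {}
for name, (offset, spread) in regimes.items():
    # Shots are aimed at a point shifted by `offset` and scattered with std `spread`.
    shots = bullseye + np.array([offset, 0.0]) + rng.normal(0.0, spread, size=(2000, 2))
    results[name] = score(shots, bullseye)
    print(f"{name:<26} bias={results[name][0]:.2f} variance={results[name][1]:.2f}")
```

In the modeling setting, each "shot" corresponds to the prediction of a model trained on a different random training set.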
```mermaid
flowchart LR
    C["Model complexity rises"] --> B["Bias decreases"]
    C --> V["Variance increases"]
    B --> T["Total error = Bias² + Variance + Noise"]
    V --> T
    T --> S["Sweet spot: minimize total error"]
```
As complexity rises, total error traces a U-shape: it falls while shrinking bias dominates, hits a minimum, then climbs as growing variance takes over.
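For a true function $f$ with additive noise, $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, and a model $\hat{f}$ trained on a random dataset, the expected squared error at a point $x$ decomposes exactly as:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Irreducible noise}}
$$

Reading each term:
- Bias: how far the average prediction (over many training sets) is from the truth. Simple models (linear fit to a curve) have high bias.
- Variance: how much the prediction wobbles as the training set changes. Flexible models (a degree-15 polynomial on 20 points) have high variance.
- Irreducible: $\sigma^2$, the noise in the data itself; no model can beat it.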
The expectation is over random draws of the training set. The tradeoff is structural: you cannot drive both the first two terms to zero with finite data.
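One way to check the decomposition numerically is to draw many training sets, refit the model each time, and estimate each term at a single point. The sketch below does this under assumptions that are mine rather than the text's: a sine target, noise standard deviation `sigma = 0.2`, a query point `x0 = 0.3`, and 2,000 simulated training sets of 20 points each. The names `decompose` and `estimates` are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
sigma = 0.2      # assumed noise standard deviation
x0 = 0.3         # assumed query point
n_train, n_sets = 20, 2000

def true_f(x):
    return np.sin(2 * np.pi * x)

def decompose(degree):
    """Monte Carlo estimates of bias^2, variance, and expected squared error at x0."""
    preds = np.empty(n_sets)
    for i in range(n_sets):
        # Fresh training set for every repetition: this is the "random dataset".
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, sigma, n_train)
        preds[i] = P.polyval(x0, P.polyfit(x, y, degree))
    y0 = true_f(x0) + rng.normal(0, sigma, n_sets)  # fresh noisy targets at x0
    bias_sq = (preds.mean() - true_f(x0)) ** 2      # (average prediction - truth)^2
    variance = preds.var()                          # spread of predictions across datasets
    total = np.mean((y0 - preds) ** 2)              # direct estimate of expected squared error
    return bias_sq, variance, total

estimates = {}
for degree in (1, 3, 5):
    b, v, t = decompose(degree)
    estimates[degree] = (b, v, t)
    print(f"degree {degree}: bias^2={b:.4f} var={v:.4f} noise={sigma**2:.4f} "
          f"sum={b + v + sigma**2:.4f} direct_mse={t:.4f}")
```

Up to Monte Carlo error, the `sum` column should track `direct_mse`, which is the decomposition restated as a numerical check.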
A direct demonstration: fit polynomials of increasing degree and watch train error fall while test error makes a U.
```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

def make_data(n):
    x = rng.uniform(0, 1, n)
    y = true_f(x) + rng.normal(0, 0.2, n)  # signal + irreducible noise
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(500)

for degree in (1, 3, 9, 15):
    coeffs = P.polyfit(x_train, y_train, degree)
    train_mse = np.mean((P.polyval(x_train, coeffs) - y_train) ** 2)
    test_mse = np.mean((P.polyval(x_test, coeffs) - y_test) ** 2)
    print(f"degree {degree:>2}: train={train_mse:.3f} test={test_mse:.3f}")

# degree 1: high train + high test -> underfit (high bias)
# degree 3: low train + low test -> sweet spot
# degree 15: ~0 train + high test -> overfit (high variance)
```

- Conceptual: Define bias and variance in one sentence each, in terms of training-set randomness. (Bias: error of the average prediction vs. truth. Variance: how much the prediction changes across different training sets.)
- Implementation: Given training error and validation error, how do you diagnose bias vs. variance? (High training error → high bias/underfit. Low training error but large train-val gap → high variance/overfit.)
- Applied: You're overfitting. Name three levers and which side of the tradeoff each moves. (More data reduces variance without raising bias; regularization or a simpler model reduces variance at the cost of slightly higher bias.)
- Systems-level: How does cross-validation relate to this decomposition? (It estimates expected test error — averaging over folds approximates the expectation, exposing high-variance models that look great on one split.)
- Failure modes: Does the classic U-curve always hold? (Not always — deep, overparameterized nets can show "double descent," where test error falls again past the interpolation threshold. Know the exception exists.)
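The diagnosis and cross-validation answers above can be sketched together in a few lines. This is an illustrative sketch only: `kfold_errors` and `diagnose` are hypothetical helpers, and the `noise_floor` and `tol` thresholds are assumed rules of thumb, not standard values. In practice the noise level is rarely known and thresholds depend on the problem.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 40)

# Fix the folds once so every degree is scored on the same splits.
folds = np.array_split(rng.permutation(len(x)), 5)

def kfold_errors(x, y, degree, folds):
    """Return (mean train MSE, mean validation MSE) across the given folds."""
    train_mse, val_mse = [], []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coeffs = P.polyfit(x[train_idx], y[train_idx], degree)
        train_mse.append(np.mean((P.polyval(x[train_idx], coeffs) - y[train_idx]) ** 2))
        val_mse.append(np.mean((P.polyval(x[val_idx], coeffs) - y[val_idx]) ** 2))
    return float(np.mean(train_mse)), float(np.mean(val_mse))

def diagnose(train_err, val_err, noise_floor=0.04, tol=2.0):
    """Rule of thumb only; noise_floor and tol are hypothetical thresholds."""
    if train_err > tol * noise_floor:
        return "high bias (underfit)"      # cannot even fit the training data
    if val_err > tol * train_err:
        return "high variance (overfit)"   # large train-validation gap
    return "balanced"

report = {}
for degree in (1, 3, 12):
    tr, va = kfold_errors(x, y, degree, folds)
    report[degree] = (tr, va, diagnose(tr, va))
    print(f"degree {degree:>2}: train={tr:.3f} val={va:.3f} -> {report[degree][2]}")
```

Averaging over folds is what makes the validation number less dependent on one lucky or unlucky split; the labels themselves are only as good as the assumed thresholds.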
From memory: write the three-term error decomposition, define each term in words, and state which way bias and variance move as model complexity increases. Then name one real exception to the U-curve. Check against Stages 3 and 5.