Bias-Variance Tradeoff

The exact decomposition of expected error into bias, variance, and irreducible noise — how to diagnose under- vs. overfitting, with intuition, math, and a runnable demo.

8 min read · Reviewed May 2026

1. Big Picture

The bias-variance tradeoff explains why a model generalizes well or badly. Total expected error on unseen data decomposes into three parts: bias (error from wrong assumptions — the model is too simple to capture the signal), variance (error from sensitivity to the particular training set — the model memorizes noise), and irreducible noise (the floor you can't beat). Push model complexity up and bias falls but variance rises; push it down and the reverse happens. Generalization is the art of trading them off.

This is the conceptual backbone behind regularization, model selection, cross-validation, and the whole under/overfitting conversation. Interviewers use it to test whether you can diagnose a model: high training error means bias; a large train-test gap means variance. The frame to hold: you are minimizing total error, not bias or variance alone — and the two pull in opposite directions as complexity changes.
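The diagnostic rule above can be written down as a small heuristic. This is a minimal sketch, not a standard API: the function name `diagnose`, the `target_err` argument (the error level you would accept, an assumed stand-in for the noise floor), and the `gap_tol` threshold are all hypothetical choices made for illustration.

```python
def diagnose(train_err, val_err, target_err, gap_tol=0.05):
    """Rough heuristic: label a model from its train and validation error.

    target_err and gap_tol are assumed, problem-specific thresholds.
    """
    high_bias = train_err > target_err               # cannot even fit the training data
    high_variance = (val_err - train_err) > gap_tol  # fits train much better than validation
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias (underfit)"
    if high_variance:
        return "high variance (overfit)"
    return "reasonable fit"


print(diagnose(0.30, 0.32, target_err=0.10))  # high bias (underfit)
print(diagnose(0.02, 0.25, target_err=0.10))  # high variance (overfit)
```

In practice the thresholds depend on the metric and the problem, so treat the labels as a starting point for investigation rather than a verdict.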

2. Intuition + Visual

The dartboard analogy: bias is how far your cluster of shots sits from the bullseye; variance is how spread out the shots are. Four regimes:

  • Low bias, low variance — tight cluster on target (the goal).
  • High bias, low variance — tight cluster, wrong spot (underfit).
  • Low bias, high variance — centered on average but scattered (overfit).
  • High bias, high variance — scattered and off-target (worst case).
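The four regimes can be simulated directly. The sketch below is an illustration under assumed numbers: the cluster centers, spreads, and shot count are arbitrary, and the helper `throw` is hypothetical. There is no irreducible-noise term here because the bullseye itself is fixed, so mean squared distance splits into just squared bias plus variance.

```python
import numpy as np

rng = np.random.default_rng(0)
bullseye = np.zeros(2)

def throw(center, spread, n=5000):
    """Simulate n shots scattered around `center` with the given spread."""
    return rng.normal(loc=center, scale=spread, size=(n, 2))

# (cluster center, spread) for each regime; values are illustrative assumptions
regimes = {
    "low bias, low variance": ((0.0, 0.0), 0.1),
    "high bias, low variance": ((1.0, 1.0), 0.1),
    "low bias, high variance": ((0.0, 0.0), 1.0),
    "high bias, high variance": ((1.0, 1.0), 1.0),
}

results = {}
for name, (center, spread) in regimes.items():
    shots = throw(center, spread)
    mean_shot = shots.mean(axis=0)
    bias_sq = np.sum((mean_shot - bullseye) ** 2)                  # squared offset of the cluster center
    variance = np.mean(np.sum((shots - mean_shot) ** 2, axis=1))   # mean squared spread around that center
    mse = np.mean(np.sum((shots - bullseye) ** 2, axis=1))         # mean squared distance to the bullseye
    results[name] = (bias_sq, variance, mse)
    print(f"{name:<26} bias^2={bias_sq:.3f}  variance={variance:.3f}  mse={mse:.3f}")
```

For every regime the printed `mse` equals `bias^2 + variance` up to floating-point error, which is the dartboard version of the decomposition.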
```mermaid
flowchart LR
    C["Model complexity rises"] --> B["Bias decreases"]
    C --> V["Variance increases"]
    B --> T["Total error = Bias² + Variance + Noise"]
    V --> T
    T --> S["Sweet spot: minimize total error"]
```

As complexity rises, total error traces a U-shape: it falls while shrinking bias dominates, hits a minimum, then climbs as growing variance takes over.

3. The Math

For a true function $y = f(x) + \epsilon$ with noise $\mathbb{E}[\epsilon]=0$, $\text{Var}(\epsilon)=\sigma^2$, and a model $\hat{f}(x)$ trained on a random dataset, the expected squared error at a point $x$ decomposes exactly as:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible}}
$$

Reading each term:

  • Bias: how far the average prediction (over many training sets) is from the truth. Simple models (a linear fit to a curve) have high bias.
  • Variance: how much the prediction wobbles as the training set changes. Flexible models (a degree-15 polynomial on 20 points) have high variance.
  • Irreducible: $\sigma^2$, the noise in the data itself; no model can beat it.

The expectation is over random draws of the training set and over the noise in the test label. The tradeoff is structural: with finite data you generally cannot drive both of the first two terms to zero at once.
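For readers who want to see where the three terms come from, here is a sketch of the derivation at a fixed point $x$. It assumes the test noise $\epsilon$ is independent of the training set (and therefore of $\hat{f}$), and it introduces the shorthand $\bar{f} = \mathbb{E}[\hat{f}(x)]$, which is not notation from the text above. Arguments $(x)$ are dropped for brevity.

```latex
\begin{aligned}
\mathbb{E}\big[(y - \hat{f})^2\big]
  &= \mathbb{E}\big[(f + \epsilon - \hat{f})^2\big] \\
  &= \mathbb{E}\big[(f - \hat{f})^2\big]
     + 2\,\mathbb{E}[\epsilon]\,\mathbb{E}[f - \hat{f}]
     + \mathbb{E}[\epsilon^2] \\
  &= \mathbb{E}\big[(f - \hat{f})^2\big] + \sigma^2
     \qquad \text{(since } \mathbb{E}[\epsilon] = 0,\ \mathbb{E}[\epsilon^2] = \sigma^2\text{)} \\[6pt]
\mathbb{E}\big[(f - \hat{f})^2\big]
  &= \mathbb{E}\big[(f - \bar{f} + \bar{f} - \hat{f})^2\big] \\
  &= (f - \bar{f})^2
     + 2\,(f - \bar{f})\,\mathbb{E}[\bar{f} - \hat{f}]
     + \mathbb{E}\big[(\bar{f} - \hat{f})^2\big] \\
  &= \underbrace{(\bar{f} - f)^2}_{\text{Bias}^2}
     + \underbrace{\mathbb{E}\big[(\hat{f} - \bar{f})^2\big]}_{\text{Variance}}
     \qquad \text{(since } \mathbb{E}[\bar{f} - \hat{f}] = 0\text{)}
\end{aligned}
```

Both cross terms vanish for the same reason: one factor has zero mean. Substituting the second result into the first gives the three-term decomposition.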

4. Implementation

A direct demonstration: fit polynomials of increasing degree and watch train error fall while test error makes a U.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

def make_data(n):
    x = rng.uniform(0, 1, n)
    y = true_f(x) + rng.normal(0, 0.2, n)   # signal + irreducible noise
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(500)

for degree in (1, 3, 9, 15):
    coeffs = P.polyfit(x_train, y_train, degree)
    train_mse = np.mean((P.polyval(x_train, coeffs) - y_train) ** 2)
    test_mse = np.mean((P.polyval(x_test, coeffs) - y_test) ** 2)
    print(f"degree {degree:>2}: train={train_mse:.3f}  test={test_mse:.3f}")

# degree  1: high train + high test  -> underfit (high bias)
# degree  3: low train  + low test   -> sweet spot
# degree 15: ~0 train   + high test   -> overfit (high variance)
```
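The demo above uses a single training set, so it shows the symptoms of bias and variance but not the terms themselves. The sketch below estimates them by Monte Carlo: it redraws the training set many times, refits, and measures how the predictions behave on a fixed grid. The trial count, training-set size, grid, and degrees are assumptions chosen for speed, not values from the demo.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(1)
noise_sd = 0.2

def true_f(x):
    return np.sin(2 * np.pi * x)

x_grid = np.linspace(0.05, 0.95, 50)   # fixed evaluation points
n_trials, n_train = 300, 30            # assumed sizes, chosen for speed

summary = {}
for degree in (1, 3, 9):
    preds = np.empty((n_trials, x_grid.size))
    for t in range(n_trials):
        # A fresh training set each trial: this is the randomness the expectation averages over.
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, noise_sd, n_train)
        preds[t] = P.polyval(x_grid, P.polyfit(x, y, degree))
    mean_pred = preds.mean(axis=0)                          # estimate of E[f_hat(x)] on the grid
    bias_sq = np.mean((mean_pred - true_f(x_grid)) ** 2)    # squared bias, averaged over the grid
    variance = np.mean(preds.var(axis=0))                   # variance across trials, averaged over the grid
    summary[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"bias^2+variance+noise={bias_sq + variance + noise_sd**2:.4f}")
```

You should typically see squared bias collapse when moving from degree 1 to degree 3, and variance grow from degree 3 to degree 9. The last printed column is the estimated expected test error implied by the decomposition.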
5. Interview Questions
  1. Conceptual: Define bias and variance in one sentence each, in terms of training-set randomness. (Bias: error of the average prediction vs. truth. Variance: how much the prediction changes across different training sets.)
  2. Implementation: Given training error and validation error, how do you diagnose bias vs. variance? (High training error → high bias/underfit. Low training error but large train-val gap → high variance/overfit.)
  3. Applied: You're overfitting. Name three levers and which side of the tradeoff each moves. (More data → lower variance with no added bias. Regularization or a simpler model → lower variance at the cost of slightly higher bias.)
  4. Systems-level: How does cross-validation relate to this decomposition? (It estimates expected test error — averaging over folds approximates the expectation, exposing high-variance models that look great on one split.)
  5. Failure modes: Does the classic U-curve always hold? (Not always — deep, overparameterized nets can show "double descent," where test error falls again past the interpolation threshold. Know the exception exists.)
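To make the cross-validation answer concrete, here is a minimal k-fold sketch that picks a polynomial degree using only NumPy. The dataset, the candidate degrees, the fold count, and the helper name `kfold_mse` are assumptions for illustration; a library routine such as scikit-learn's cross-validation utilities would normally replace the hand-rolled loop.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 60)

# Build the folds once so every degree is scored on the same splits.
folds = np.array_split(rng.permutation(len(x)), 5)

def kfold_mse(x, y, degree, folds):
    """Mean validation MSE of a polynomial fit across the given folds."""
    scores = []
    for i, val in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coeffs = P.polyfit(x[train], y[train], degree)
        scores.append(np.mean((P.polyval(x[val], coeffs) - y[val]) ** 2))
    return float(np.mean(scores))

cv_scores = {d: kfold_mse(x, y, d, folds) for d in (1, 3, 5, 9, 12)}
best_degree = min(cv_scores, key=cv_scores.get)

for d, s in cv_scores.items():
    print(f"degree {d:>2}: cv_mse={s:.3f}")
print("selected degree:", best_degree)
```

Averaging over folds smooths out the luck of any single split, which is why a high-variance model that looks good on one validation set tends to score worse here.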
6. Retrieval Check

From memory: write the three-term error decomposition, define each term in words, and state which way bias and variance move as model complexity increases. Then name one real exception to the U-curve. Check against Stages 3 and 5.
