What is the bias-variance tradeoff?

Expected test error decomposes into bias² (error from overly simple assumptions), variance (sensitivity to the training set), and irreducible noise. Increasing model complexity lowers bias but raises variance, so you trade one for the other.

How do you diagnose bias vs. variance from error metrics?

High training error means high bias (underfitting). Low training error but a large gap to validation error means high variance (overfitting).

How do you reduce variance (overfitting)?

More training data, regularization (L1/L2, dropout), or a simpler model. Regularization and a simpler model reduce variance at the cost of slightly higher bias; more data reduces variance without raising bias.
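As a rough illustration of the regularization lever, the sketch below fits the same degree-9 polynomial with and without an L2 (ridge) penalty. The target function, noise level, `DEGREE`, and `LAM` are all illustrative assumptions, and for simplicity the penalty also shrinks the intercept, which a production implementation would usually avoid.

```python
import numpy as np

rng = np.random.default_rng(0)
DEGREE = 9     # illustrative model complexity (assumption)
LAM = 1e-3     # illustrative ridge strength, not tuned (assumption)

def design(x):
    # Polynomial design matrix with columns 1, x, x^2, ..., x^DEGREE
    return np.vander(x, DEGREE + 1, increasing=True)

def mse(w, x, y):
    return float(np.mean((design(x) @ w - y) ** 2))

# Assumed data-generating process: sine signal plus Gaussian noise
x_train = rng.uniform(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
x_test = rng.uniform(0, 1, 500)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 500)

X = design(x_train)
# Unregularized least squares
w_ols = np.linalg.lstsq(X, y_train, rcond=None)[0]
# Ridge in closed form: solve (X^T X + LAM * I) w = X^T y
w_ridge = np.linalg.solve(X.T @ X + LAM * np.eye(DEGREE + 1), X.T @ y_train)

for name, w in (("ols", w_ols), ("ridge", w_ridge)):
    print(f"{name:>5}: train={mse(w, x_train, y_train):.3f}  "
          f"test={mse(w, x_test, y_test):.3f}  ||w||={np.linalg.norm(w):.1f}")
```

The penalty shrinks the coefficient vector and gives up a little training error. That is the bias side of the trade; the payoff is a fit that reacts less to the particular 20 points drawn.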

Does the bias-variance U-curve always hold?

Not always. Deeply overparameterized networks can show 'double descent,' where test error decreases again past the interpolation threshold — a known exception to the classic U-shape.

Bias-Variance Tradeoff Explained (ML Interview)

The exact decomposition of expected error into bias, variance, and irreducible noise — how to diagnose under- vs. overfitting, with intuition, math, and a runnable demo.

1. Big Picture

The bias-variance tradeoff explains why a model generalizes well or badly. Total expected error on unseen data decomposes into three parts: bias (error from wrong assumptions — the model is too simple to capture the signal), variance (error from sensitivity to the particular training set — the model memorizes noise), and irreducible noise (the floor you can't beat). Push model complexity up and bias falls but variance rises; push it down and the reverse happens. Generalization is the art of trading them off.

This is the conceptual backbone behind regularization, model selection, cross-validation, and the whole under/overfitting conversation. Interviewers use it to test whether you can diagnose a model: high training error means bias; a large train-test gap means variance. The frame to hold: you are minimizing total error, not bias or variance alone — and the two pull in opposite directions as complexity changes.
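The diagnostic rule above can be written down as a small helper. This is only a sketch: the function name `diagnose` and the thresholds `target_err` and `gap_tol` are hypothetical, and sensible values depend entirely on the task and metric.

```python
def diagnose(train_err, val_err, target_err, gap_tol=0.05):
    """Rule-of-thumb diagnosis from training and validation error.

    target_err: error level you would consider acceptable (assumption).
    gap_tol: largest train/validation gap you would tolerate (assumption).
    """
    high_bias = train_err > target_err
    high_variance = (val_err - train_err) > gap_tol
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias (underfitting)"
    if high_variance:
        return "high variance (overfitting)"
    return "reasonable fit"

# Poor on training data, small gap: the model is too simple.
print(diagnose(train_err=0.30, val_err=0.32, target_err=0.10))
# Good on training data, large gap: the model memorized the training set.
print(diagnose(train_err=0.02, val_err=0.25, target_err=0.10))
```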

2. Intuition + Visual

The dartboard analogy: bias is how far your cluster of shots sits from the bullseye; variance is how spread out the shots are. Four regimes:

Low bias, low variance — tight cluster on target (the goal).
High bias, low variance — tight cluster, wrong spot (underfit).
Low bias, high variance — centered on average but scattered (overfit).
High bias, high variance — scattered and off-target (worst case).

As complexity rises, total error traces a U-shape: it falls while shrinking bias dominates, hits a minimum, then climbs as growing variance takes over.
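One way to see the two curves behind the U-shape is to estimate bias² and variance directly by refitting on many simulated training sets. The sketch below assumes a sine target with Gaussian noise and polynomial models; the helper name `bias_variance` and all sizes are illustrative choices, not part of any library.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
NOISE_SD = 0.2  # assumed noise level

def true_f(x):
    # Assumed ground-truth signal
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_train=30, n_sets=200):
    """Monte Carlo estimate of bias^2 and variance, averaged over a grid of x."""
    x_grid = np.linspace(0, 1, 50)
    preds = np.empty((n_sets, x_grid.size))
    for i in range(n_sets):
        # Fresh training set each round: this is the randomness the
        # expectation in the decomposition averages over.
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, NOISE_SD, n_train)
        preds[i] = P.polyval(x_grid, P.polyfit(x, y, degree))
    mean_pred = preds.mean(axis=0)                      # average model
    bias_sq = float(np.mean((mean_pred - true_f(x_grid)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))        # spread around the average model
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 3, 12)}
for d, (b, v) in results.items():
    print(f"degree {d:>2}: bias^2={b:.4f}  variance={v:.4f}")
```

In this setup the straight line should show the largest bias² and the degree-12 fit the largest variance; exact values depend on the seed and the sizes chosen.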

3. The Math

For data generated as $y = f(x) + \epsilon$ with noise satisfying $\mathbb{E}[\epsilon]=0$ and $\text{Var}(\epsilon)=\sigma^2$, and a model $\hat{f}(x)$ trained on a random dataset, the expected squared error at a point $x$ decomposes exactly as:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible}}$$

Reading each term:

Bias — how far the average prediction (over many training sets) is from the truth. Simple models (linear fit to a curve) have high bias.
Variance — how much the prediction wobbles as the training set changes. Flexible models (a degree-15 polynomial on 20 points) have high variance.
Irreducible — $\sigma^2$ , the noise in the data itself; no model can beat it.

The expectation is over random draws of the training set and over the noise in the test label $y$. The tradeoff is structural: with finite data, you cannot drive both of the first two terms to zero.
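For readers who want to see why the decomposition has no cross terms, here is a short derivation sketch. The shorthand $\bar{f}(x) := \mathbb{E}[\hat{f}(x)]$ is notation introduced here, and the argument assumes the test-point noise $\epsilon$ is independent of the training set.

```latex
\begin{aligned}
y - \hat{f}(x) &= \big(f(x) - \bar{f}(x)\big) + \big(\bar{f}(x) - \hat{f}(x)\big) + \epsilon,
  \qquad \bar{f}(x) := \mathbb{E}[\hat{f}(x)] \\
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  &= \big(f(x) - \bar{f}(x)\big)^2
   + \mathbb{E}\big[(\bar{f}(x) - \hat{f}(x))^2\big]
   + \mathbb{E}[\epsilon^2] \\
  &\quad + 2\big(f(x) - \bar{f}(x)\big)\,\mathbb{E}\big[\bar{f}(x) - \hat{f}(x)\big]
         + 2\big(f(x) - \bar{f}(x)\big)\,\mathbb{E}[\epsilon]
         + 2\,\mathbb{E}\big[\epsilon\,(\bar{f}(x) - \hat{f}(x))\big] \\
  &= \underbrace{\big(\bar{f}(x) - f(x)\big)^2}_{\text{Bias}^2}
   + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \bar{f}(x))^2\big]}_{\text{Variance}}
   + \underbrace{\sigma^2}_{\text{Irreducible}}
\end{aligned}
```

The three cross terms vanish for separate reasons: the first because $\mathbb{E}[\bar{f}(x) - \hat{f}(x)] = 0$ by the definition of $\bar{f}$, the second because $\mathbb{E}[\epsilon] = 0$, and the third because $\epsilon$ is independent of $\hat{f}$ and has zero mean. Finally, $\mathbb{E}[\epsilon^2] = \sigma^2$ since the noise has zero mean.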

4. Implementation

A direct demonstration: fit polynomials of increasing degree and watch train error fall while test error makes a U.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

def true_f(x):
    """Noise-free signal the polynomials try to recover."""
    return np.sin(2 * np.pi * x)

def make_data(n):
    """Draw n points with x uniform on [0, 1] and y = true_f(x) + Gaussian noise."""
    x = rng.uniform(0, 1, n)
    y = true_f(x) + rng.normal(0, 0.2, n)   # signal + irreducible noise
    return x, y

x_train, y_train = make_data(20)    # small training set, so overfitting is visible
x_test, y_test = make_data(500)     # large test set, so test MSE is a stable estimate

for degree in (1, 3, 9, 15):
    coeffs = P.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse = np.mean((P.polyval(x_train, coeffs) - y_train) ** 2)
    test_mse = np.mean((P.polyval(x_test, coeffs) - y_test) ** 2)
    print(f"degree {degree:>2}: train={train_mse:.3f}  test={test_mse:.3f}")

# Expected pattern (exact numbers depend on the seed):
# degree  1: high train + high test  -> underfit (high bias)
# degree  3: low train  + low test   -> sweet spot
# degree 15: ~0 train   + high test   -> overfit (high variance)
```

5. Interview Questions

Conceptual: Define bias and variance in one sentence each, in terms of training-set randomness. (Bias: error of the average prediction vs. truth. Variance: how much the prediction changes across different training sets.)
Implementation: Given training error and validation error, how do you diagnose bias vs. variance? (High training error → high bias/underfit. Low training error but large train-val gap → high variance/overfit.)
Applied: You're overfitting. Name three levers and which side of the tradeoff each moves. (More data → reduces variance without raising bias. Regularization or a simpler model → reduces variance, slightly raises bias.)
Systems-level: How does cross-validation relate to this decomposition? (It estimates expected test error — averaging over folds approximates the expectation, exposing high-variance models that look great on one split.)
Failure modes: Does the classic U-curve always hold? (Not always — deep, overparameterized nets can show "double descent," where test error falls again past the interpolation threshold. Know the exception exists.)
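To make the cross-validation answer concrete, here is a minimal k-fold sketch that scores polynomial degrees by averaged held-out error. The data-generating process, the helper name `cv_mse`, and the fold count are assumptions made for illustration; in practice a library routine such as scikit-learn's cross-validation utilities would normally do this job.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

# Assumed data: sine signal plus Gaussian noise
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 40)

# Fix the folds once so every degree is scored on the same splits.
K = 5
folds = np.array_split(rng.permutation(len(x)), K)

def cv_mse(degree):
    """Mean held-out MSE over the K folds for one polynomial degree."""
    scores = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        coeffs = P.polyfit(x[train], y[train], degree)
        pred = P.polyval(x[held_out], coeffs)
        scores.append(np.mean((pred - y[held_out]) ** 2))
    return float(np.mean(scores))

scores = {d: cv_mse(d) for d in (1, 3, 9, 15)}
best = min(scores, key=scores.get)
for d, s in scores.items():
    print(f"degree {d:>2}: cv_mse={s:.3f}")
print("selected degree:", best)
```

Averaging over folds is what penalizes a high-variance model: it may do well on one lucky split, but its held-out error swings across the others.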

6. Retrieval Check

From memory: write the three-term error decomposition, define each term in words, and state which way bias and variance move as model complexity increases. Then name one real exception to the U-curve. Check against Stages 3 and 5.
