Skip to content

Classical ML

Gradient Boosting & XGBoost

How gradient boosting builds a strong model from sequential weak trees fit to the negative gradient — boosting vs. bagging, the learning rate, and why XGBoost wins — with code.

8 min readReviewed May 2026

1Big Picture

Gradient boosting builds a strong predictor out of many weak ones — usually shallow decision trees — added sequentially, where each new tree corrects the errors of the ensemble so far. It's the algorithm behind XGBoost, LightGBM, and CatBoost, and it dominates tabular-data competitions and production ML where the data isn't images or text.

The key idea: each tree is fit to the negative gradient of the loss with respect to the current predictions — for squared error, that's just the residuals. Add the new tree (scaled by a learning rate) and repeat. The frame to hold: start with a rough prediction, then repeatedly fit a small model to what the current ensemble gets wrong, nudging predictions toward the truth. Interviewers contrast it with bagging/random forests, probe the role of the learning rate, and check the gradient connection.

2Intuition + Visual

Boosting is sequential error-correction. The first model makes a crude prediction; you compute its residuals (what it missed); a second small tree predicts those residuals; you add a fraction of it to the prediction; recompute residuals; repeat. Each step chips away at the remaining error. Contrast with bagging (random forests), which trains many trees independently in parallel and averages them to reduce variance — boosting reduces bias by building dependent, corrective trees.

flowchart LR
    F0["F_0: initial guess (e.g. mean)"] --> R["residuals = y − F_m"]
    R --> H["fit small tree h to residuals"]
    H --> U["F_(m+1) = F_m + lr · h"]
    U --> R
3The Math

Build an additive model stage by stage. Starting from a constant F0F_0, at step mm:

Fm(x)=Fm1(x)+ηhm(x)F_m(x) = F_{m-1}(x) + \eta\, h_m(x)

where η\eta is the learning rate (shrinkage) and hmh_m is the new weak learner. Gradient boosting fits hmh_m to the negative gradient of the loss LL evaluated at the current predictions — the "pseudo-residuals":

rim=[L(yi,F(xi))F(xi)]F=Fm1r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}

For squared-error loss L=12(yF)2L = \tfrac{1}{2}(y - F)^2, this gradient is exactly rim=yiFm1(xi)r_{im} = y_i - F_{m-1}(x_i) — the ordinary residuals, which is why "fit the next tree to the residuals" is the intuition. Smaller η\eta means each tree contributes less, so you need more trees but generalize better. XGBoost extends this with a second-order (Newton) approximation using gradients and Hessians, plus explicit L1/L2L_1/L_2 regularization on leaf weights and tree complexity — which is why it's both accurate and resistant to overfitting. Regularize with learning rate, tree depth, subsampling of rows/columns, and the number of trees (early stopping).

4Implementation
python
1import numpy as np
2from sklearn.tree import DecisionTreeRegressor
3
4rng = np.random.default_rng(0)
5X = rng.uniform(-3, 3, size=(300, 1))
6y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)
7
8lr, n_trees = 0.1, 100
9pred = np.full(len(y), y.mean())            # F_0 = mean
10trees = []
11for _ in range(n_trees):
12    residuals = y - pred                     # = negative gradient for sq. loss
13    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # weak learner
14    pred += lr * tree.predict(X)             # F_(m+1) = F_m + lr·h
15    trees.append(tree)
16
17mse = np.mean((y - pred) ** 2)
18print(f"train MSE after {n_trees} trees: {mse:.3f}")
5Interview Questions
  1. Conceptual: How does boosting differ from bagging / random forests? (Boosting builds trees sequentially, each correcting the last (reduces bias); bagging trains trees independently in parallel and averages them (reduces variance).)
  2. Implementation: What does each new tree fit in gradient boosting? (The negative gradient of the loss at current predictions — the residuals, for squared-error loss.)
  3. Applied: What does the learning rate control, and what's the tradeoff? (Shrinkage — how much each tree contributes. Smaller means slower learning, more trees needed, but better generalization.)
  4. Systems-level: Why is XGBoost both accurate and fast? (Second-order (gradient + Hessian) optimization, regularization on leaf weights, and engineering: histogram-based splits, sparsity awareness, and parallelized tree construction.)
  5. Failure modes: Why can gradient boosting overfit, and how do you prevent it? (It keeps reducing training error tree by tree; control with learning rate, shallow trees, row/column subsampling, L1/L2 regularization, and early stopping on a validation set.)
6Retrieval Check

Without looking: write the additive update Fm=Fm1+ηhmF_m = F_{m-1} + \eta h_m, state what each tree fits (the negative gradient / residuals), and contrast boosting with bagging in one sentence. Check against Stage 3.

This is one static walkthrough. A live session goes further.

Ask follow-ups at interview depth, get the math and code rendered as you go, and run a retrieval drill until it sticks — then come back to the thread anytime.

Related concepts