

Logistic Regression

The workhorse linear classifier: the sigmoid of a linear score, trained with cross-entropy. Covers why cross-entropy rather than MSE, the decision boundary, and the softmax extension, with from-scratch code.

8 min read · Reviewed May 2026

1. Big Picture

Logistic regression is the workhorse linear classifier — and the conceptual seed of a neural network's output layer. It models the probability of a binary outcome as the sigmoid of a linear combination of the features, and it's trained by minimizing cross-entropy (log loss). Despite the name, it's classification, not regression.

It matters far beyond its own use: a single neuron with a sigmoid is logistic regression, and the softmax output layer of every classifier is its multiclass generalization. The frame to hold: take a linear score $w \cdot x + b$, squash it to a probability with the sigmoid, and train with cross-entropy so predicted probabilities match the labels. Interviewers use it to test whether you understand why cross-entropy (not MSE), what the decision boundary looks like, and how regularization fits in.

2. Intuition + Visual

The linear part $w \cdot x + b$ draws a hyperplane through feature space. Points on one side score positive, points on the other negative. The sigmoid turns that raw score into a probability between 0 and 1, changing steeply near the boundary and saturating far from it. Predict class 1 when $p \ge 0.5$ (i.e. when the linear score is non-negative).

flowchart LR
    X["Features x"] --> Z["z = w·x + b (linear score)"]
    Z --> S["sigmoid(z) -> probability in [0,1]"]
    S --> D["predict 1 if p >= 0.5"]
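To make the squashing concrete, the short sketch below evaluates the sigmoid at a handful of scores. The `sigmoid` helper and the sample scores are illustrative choices, not values from the walkthrough.

```python
import math

def sigmoid(z: float) -> float:
    """Map a raw linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative scores: far negative, near the boundary, far positive.
scores = [-6.0, -1.0, 0.0, 1.0, 6.0]
probs = [sigmoid(z) for z in scores]

for z, p in zip(scores, probs):
    label = int(p >= 0.5)  # thresholding p at 0.5 is the same as testing z >= 0
    print(f"z = {z:+.1f}  p = {p:.4f}  predicted class = {label}")
```

The probabilities change quickly around `z = 0` and barely move once `|z|` is large, which is the saturation described above.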

The decision boundary is linear — logistic regression can only separate classes a straight hyperplane can divide (feature engineering or kernels are needed for nonlinear boundaries).
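A small experiment can illustrate the limitation and the feature-engineering workaround. The sketch below assumes synthetic XOR-like data (label 1 when the two features share a sign) and hypothetical helpers `fit_logreg` and `accuracy`; it fits plain logistic regression on the raw features, then again with a hand-added product feature.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lr=0.5, steps=3000):
    """Plain batch gradient descent on binary cross-entropy."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def accuracy(X, y, w, b):
    return ((sigmoid(X @ w + b) >= 0.5) == y).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)  # XOR-like quadrant labels

# Raw features: no single line separates opposite quadrants.
w_raw, b_raw = fit_logreg(X, y)
acc_raw = accuracy(X, y, w_raw, b_raw)

# Engineered feature: the product x1 * x2 makes the classes linearly separable.
X_eng = np.column_stack([X, X[:, 0] * X[:, 1]])
w_eng, b_eng = fit_logreg(X_eng, y)
acc_eng = accuracy(X_eng, y, w_eng, b_eng)

print(f"raw features: {acc_raw:.2f}, with x1*x2: {acc_eng:.2f}")
```

The model is still linear in its inputs; only the inputs changed. That is the sense in which feature engineering buys a nonlinear boundary in the original space.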

3. The Math

The model predicts:

$$p = \sigma(w \cdot x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Training minimizes binary cross-entropy over $N$ examples with labels $y \in \{0,1\}$:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\Big]$$
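As a quick numeric illustration of this loss, the sketch below evaluates it on three hand-picked sets of predictions. The function name `binary_cross_entropy`, the `eps` clipping, and the example probabilities are assumptions added for illustration.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy; eps clipping (an added safeguard) avoids log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

labels = [1, 0, 1]
cases = {
    "confident and right": [0.95, 0.05, 0.90],
    "unsure":              [0.60, 0.40, 0.55],
    "confident and wrong": [0.05, 0.95, 0.10],
}

for name, preds in cases.items():
    print(f"{name:20s} loss = {binary_cross_entropy(labels, preds):.3f}")
```

The loss grows without bound as a prediction approaches the wrong extreme, so confident mistakes are penalized far more heavily than hesitant ones.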

A clean result falls out: the gradient with respect to the weights is

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{1}{N}\sum_{i=1}^{N} (p_i - y_i)\, x_i$$

That is prediction error times input, the same elegant form as a linear layer under cross-entropy. Why cross-entropy and not MSE? With the sigmoid, MSE is non-convex in the weights, and its gradient vanishes when predictions are confidently wrong because the sigmoid saturates. Cross-entropy is convex in the weights and keeps gradients strong, so it trains reliably. Add L2 regularization ($+\lambda \lVert w \rVert^2$) to shrink weights and curb overfitting.
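Both claims in this stage can be checked numerically. The sketch below is one way to do it, assuming random synthetic data: it compares the analytic weight gradient against central finite differences, then prints the per-example gradient with respect to the score $z$ for cross-entropy ($p - y$) and for squared error $(p - y)^2$, whose gradient is $2(p - y)\,p\,(1 - p)$. The helper `bce_loss` and all sample values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, X, y):
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Claim 1: dL/dw equals the mean of (p - y) * x.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50).astype(float)
w, b = rng.normal(size=3), 0.2

analytic = X.T @ (sigmoid(X @ w + b) - y) / len(y)

h = 1e-6  # step size for central finite differences
numeric = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = h
    numeric[j] = (bce_loss(w + step, b, X, y) - bce_loss(w - step, b, X, y)) / (2 * h)

max_diff = np.max(np.abs(numeric - analytic))
print(f"max |numeric - analytic| = {max_diff:.2e}")

# Claim 2: gradient w.r.t. the score z for a positive example (y = 1)
# as the prediction becomes more confidently wrong (z more negative).
y_true = 1.0
for z in [0.0, -2.0, -5.0, -10.0]:
    p = sigmoid(z)
    grad_ce = p - y_true                            # cross-entropy
    grad_mse = 2.0 * (p - y_true) * p * (1.0 - p)   # squared error
    print(f"z = {z:+5.1f}  p = {p:.5f}  CE grad = {grad_ce:+.5f}  MSE grad = {grad_mse:+.5f}")
```

The extra $p(1 - p)$ factor in the squared-error gradient is what shrinks toward zero when the sigmoid saturates; the cross-entropy gradient has no such factor.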

4. Implementation
```python
import numpy as np

# Synthetic data: 200 points, 3 features, labels from a noisy linear rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = (X @ true_w + rng.normal(0, 0.5, 200) > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Batch gradient descent on binary cross-entropy.
w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(2000):
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)   # mean of (p - y) * x, see Stage 3
    grad_b = (p - y).mean()
    w -= lr * grad_w
    b -= lr * grad_b

# Training accuracy at the 0.5 threshold.
acc = ((sigmoid(X @ w + b) >= 0.5) == y).mean()
print(f"accuracy {acc:.2f}, weights {w.round(2)}")
```
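The L2 penalty from Stage 3 slots into the same loop as one extra gradient term, $2\lambda w$. The sketch below is a minimal variant under a few assumptions: the wrapper `train`, the value `lam=0.1`, and the choice to leave the bias unpenalized are illustrative, not part of the walkthrough.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lam, lr=0.1, steps=2000):
    """Gradient descent on cross-entropy + lam * ||w||^2 (bias left unpenalized)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        grad_w = X.T @ (p - y) / len(y) + 2.0 * lam * w  # data term + L2 term
        grad_b = (p - y).mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Same synthetic data recipe as the walkthrough.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = (X @ true_w + rng.normal(0, 0.5, 200) > 0).astype(float)

w_plain, _ = train(X, y, lam=0.0)
w_l2, _ = train(X, y, lam=0.1)
print(f"||w|| without L2: {np.linalg.norm(w_plain):.2f}")
print(f"||w|| with L2:    {np.linalg.norm(w_l2):.2f}")
```

The regularized weights keep roughly the same direction but a smaller norm, which is the shrinkage the penalty is meant to produce.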
5. Interview Questions
  1. Conceptual: Why is it called "regression" if it does classification? (It regresses the log-odds — a linear model of log(p/(1−p)) — then thresholds the resulting probability.)
  2. Implementation: Why train with cross-entropy instead of MSE? (With a sigmoid, MSE is non-convex and its gradient vanishes on confident-wrong predictions; cross-entropy is convex and keeps gradients healthy.)
  3. Applied: What does the decision boundary look like, and what's the limitation? (A linear hyperplane — it can't separate classes that aren't linearly separable without feature engineering or a kernel.)
  4. Systems-level: How do you extend logistic regression to more than two classes? (Softmax (multinomial) regression — generalize the sigmoid to a softmax over class logits with categorical cross-entropy.)
  5. Failure modes: How can you interpret the learned coefficients? (Each weight is the change in log-odds per unit change in that feature; sign and magnitude indicate direction and strength of influence, assuming scaled features.)
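The softmax extension mentioned in question 4 can be sketched in a few lines. The example below uses made-up logits and hypothetical helper names (`softmax`, `categorical_cross_entropy`); it also checks that a two-class softmax with logits `[z, 0]` reproduces the sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    """Row-wise softmax; subtracting the row max is a standard stability trick."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(probs, labels):
    """Mean negative log-probability of the true class (labels are integer ids)."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# Two examples, three classes. In a real model the logits would come from
# X @ W + b with one weight column per class.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 2])
probs = softmax(logits)
print(probs.round(3))
print(f"loss = {categorical_cross_entropy(probs, labels):.3f}")

# Two-class special case: softmax over [z, 0] equals sigmoid(z).
z = np.array([-2.0, 0.0, 1.5])
two_class = softmax(np.column_stack([z, np.zeros_like(z)]))
print(np.allclose(two_class[:, 0], sigmoid(z)))
```

Only differences between logits matter to the softmax, which is why fixing one class's logit at zero loses nothing in the binary case.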
6. Retrieval Check

Without looking: write the sigmoid model, the binary cross-entropy loss, and the weight gradient $(p - y)\,x$. Explain in one sentence why cross-entropy beats MSE here. Check against Stage 3.
