Why is logistic regression called regression if it does classification?

It regresses the log-odds — a linear model of log(p/(1−p)) — then thresholds the resulting probability to make a class decision.

Why train logistic regression with cross-entropy instead of MSE?

With a sigmoid, MSE is non-convex and its gradient vanishes on confidently-wrong predictions; cross-entropy is convex in the weights and keeps gradients healthy.

What does the logistic regression decision boundary look like?

A linear hyperplane — it can only separate classes that are linearly separable, unless you add feature engineering or a kernel.

How do you extend logistic regression to multiple classes?

Use softmax (multinomial) regression — generalize the sigmoid to a softmax over per-class logits trained with categorical cross-entropy.

Logistic Regression Explained (ML Interview)

The workhorse linear classifier — sigmoid of a linear score, trained with cross-entropy — why not MSE, the decision boundary, and the softmax extension, with from-scratch code.

1. Big Picture

Logistic regression is the workhorse linear classifier — and the conceptual seed of a neural network's output layer. It models the probability of a binary outcome as the sigmoid of a linear combination of the features, and it's trained by minimizing cross-entropy (log loss). Despite the name, it's classification, not regression.

It matters far beyond its own use: a single neuron with a sigmoid activation is logistic regression, and the softmax output layer used by most neural-network classifiers is its multiclass generalization. The frame to hold: take a linear score $w \cdot x + b$, squash it to a probability with the sigmoid, and train with cross-entropy so predicted probabilities match the labels. Interviewers use it to test whether you understand why cross-entropy (not MSE), what the decision boundary looks like, and how regularization fits in.

2. Intuition + Visual

The linear part $w \cdot x + b$ draws a hyperplane through feature space. Points on one side score positive, points on the other negative. The sigmoid turns that raw score into a probability between 0 and 1, changing steeply near the boundary and saturating far from it. Predict class 1 when $p \ge 0.5$ (i.e. when the linear score is non-negative).
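A minimal sketch of the squashing step, assuming NumPy. The weights, bias, and sample points are hypothetical values picked only for illustration; the final assertion checks that thresholding the probability at 0.5 agrees with checking the sign of the score.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias, chosen only for illustration.
w = np.array([1.5, -2.0])
b = 0.25

X = np.array([
    [2.0, 0.5],    # well on the positive side of the hyperplane
    [0.2, 0.2],    # close to the boundary
    [-1.0, 1.0],   # well on the negative side
])

scores = X @ w + b         # raw linear scores, any real number
probs = sigmoid(scores)    # squashed into (0, 1)

for s, p in zip(scores, probs):
    print(f"score {s:+.2f} -> p = {p:.3f}")

# Thresholding p at 0.5 is the same decision as checking the sign of the score.
assert np.array_equal(probs >= 0.5, scores >= 0)
```

The point near the boundary gets a probability close to 0.5, while the two distant points are pushed toward 1 and 0.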

The decision boundary is linear — logistic regression can only separate classes a straight hyperplane can divide (feature engineering or kernels are needed for nonlinear boundaries).
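To illustrate the limitation, here is a rough sketch on synthetic data where class 1 lies inside a circle, so no straight line separates the classes. The data, the helper names `fit_logreg` and `accuracy`, and the choice of squared features are all assumptions made for this example. Adding $x_1^2$ and $x_2^2$ as features makes the circular boundary linear in the expanded feature space.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
# Class 1 lies inside a circle, so the true boundary is not a straight line.
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(float)

def fit_logreg(F, y, lr=0.5, steps=5000):
    """Plain batch gradient descent on the mean cross-entropy loss."""
    w, b = np.zeros(F.shape[1]), 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(F @ w + b)))
        w -= lr * F.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def accuracy(F, y, w, b):
    # A non-negative score corresponds to p >= 0.5.
    return ((F @ w + b >= 0) == y).mean()

# Raw features: the model can only draw a line through the square.
w_lin, b_lin = fit_logreg(X, y)
acc_linear = accuracy(X, y, w_lin, b_lin)

# Engineered features: x1^2 + x2^2 = 0.5 is a hyperplane in this space.
X_quad = np.hstack([X, X ** 2])
w_quad, b_quad = fit_logreg(X_quad, y)
acc_quad = accuracy(X_quad, y, w_quad, b_quad)

print(f"raw features:     accuracy {acc_linear:.2f}")
print(f"squared features: accuracy {acc_quad:.2f}")
```

The model itself is unchanged between the two runs; only the feature map differs.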

3. The Math

The model predicts:

$$p = \sigma(w \cdot x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$
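The "regression" in the name comes from inverting the sigmoid. A short derivation using only the model above, with $z$ as shorthand for the linear score:

```latex
\begin{aligned}
p = \frac{1}{1 + e^{-z}}, \quad z = w \cdot x + b
\;&\Longrightarrow\; 1 - p = \frac{e^{-z}}{1 + e^{-z}} \\
&\Longrightarrow\; \frac{p}{1 - p} = e^{z} \\
&\Longrightarrow\; \log\frac{p}{1 - p} = w \cdot x + b
\end{aligned}
```

So the log-odds are linear in the features, and raising a single feature $x_j$ by one unit (with the others held fixed) multiplies the odds by $e^{w_j}$.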

Training minimizes binary cross-entropy over $N$ examples with labels $y \in \{0,1\}$ :

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\Big]$$
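A small numeric sketch of how this loss treats individual predictions, assuming NumPy. The three probabilities are made-up values for a single true label $y = 1$, and the `eps` clipping is a common safeguard against `log(0)` added here, not part of the formula.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    # Clip so log(0) cannot occur; eps is a safeguard, not part of the formula.
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 1.0, 1.0])
p = np.array([0.99, 0.5, 0.01])   # confident and right, unsure, confident and wrong

# Per-example losses before averaging.
per_example = -(y * np.log(p) + (1 - y) * np.log(1 - p))
for pi, li in zip(p, per_example):
    print(f"p = {pi:.2f} -> loss {li:.4f}")

mean_loss = binary_cross_entropy(y, p)
print(f"mean loss {mean_loss:.4f}")
```

The loss grows without bound as the predicted probability of the true class approaches 0, which is what penalizes confident mistakes so heavily.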

A clean result falls out: the gradient with respect to the weights is

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{1}{N}\sum_{i=1}^{N} (p_i - y_i)\, x_i$$

That is prediction error times input, the same form (up to a constant factor) as the gradient of linear regression under squared error. Why cross-entropy and not MSE? With the sigmoid, MSE is non-convex in the weights, and its gradient carries an extra factor $p(1 - p)$ that vanishes when the sigmoid saturates, so confidently wrong predictions barely get corrected; cross-entropy is convex in the weights and its gradient stays proportional to the error $p - y$, so it trains reliably. Add L2 regularization ( $+\lambda \lVert w \rVert^2$ ) to shrink weights and curb overfitting.
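The vanishing-gradient claim can be checked on a single example. This sketch compares the gradient of each loss with respect to the linear score $z$ for a confidently wrong prediction; the score `z = -8.0` is a hypothetical value, and the squared-error loss is taken as $(p - y)^2$ for one example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0
z = -8.0          # hypothetical score: strongly predicts class 0, but the label is 1
p = sigmoid(z)

# d/dz of the cross-entropy -[y log p + (1 - y) log(1 - p)]
grad_ce = p - y

# d/dz of the squared error (p - y)^2; the factor p * (1 - p) is the sigmoid's derivative
grad_mse = 2 * (p - y) * p * (1 - p)

print(f"p = {p:.6f}")
print(f"cross-entropy gradient: {grad_ce:+.4f}")
print(f"MSE gradient:           {grad_mse:+.6f}")
```

The cross-entropy gradient is close to $-1$, so the update is large exactly when the model is badly wrong, while the MSE gradient is scaled down by the saturated sigmoid's near-zero slope.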

4. Implementation

```python
import numpy as np

# Synthetic binary data: labels come from a noisy linear rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = (X @ true_w + rng.normal(0, 0.5, 200) > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Batch gradient descent on the mean cross-entropy loss.
w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(2000):
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)   # mean of (p - y) * x, see Stage 3
    grad_b = (p - y).mean()
    w -= lr * grad_w
    b -= lr * grad_b

# Training accuracy at the 0.5 threshold.
acc = ((sigmoid(X @ w + b) >= 0.5) == y).mean()
print(f"accuracy {acc:.2f}, weights {w.round(2)}")
```
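The L2 penalty from Stage 3 is a one-line change to this loop. The sketch below is one possible variant, not a tuned implementation: the `train` helper and the value `lam=0.1` are assumptions for illustration, and leaving the bias unregularized is a common convention rather than a requirement. The penalty $\lambda \lVert w \rVert^2$ contributes $2\lambda w$ to the weight gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = (X @ true_w + rng.normal(0, 0.5, 200) > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(lam, lr=0.1, steps=2000):
    """Gradient descent on cross-entropy plus lam * ||w||^2 (bias left unpenalized)."""
    w, b = np.zeros(3), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        grad_w = X.T @ (p - y) / len(y) + 2 * lam * w   # penalty adds 2 * lam * w
        grad_b = (p - y).mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w_plain, b_plain = train(lam=0.0)
w_reg, b_reg = train(lam=0.1)

print(f"no penalty: ||w|| = {np.linalg.norm(w_plain):.2f}")
print(f"lam = 0.1:  ||w|| = {np.linalg.norm(w_reg):.2f}")
```

The regularized weights end up with a smaller norm, which is the shrinkage effect the penalty is meant to produce. In practice `lam` would be chosen on held-out data.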

5. Interview Questions

Conceptual: Why is it called "regression" if it does classification? (It regresses the log-odds — a linear model of log(p/(1−p)) — then thresholds the resulting probability.)
Implementation: Why train with cross-entropy instead of MSE? (With a sigmoid, MSE is non-convex and its gradient vanishes on confidently wrong predictions; cross-entropy is convex in the weights and keeps gradients healthy.)
Applied: What does the decision boundary look like, and what's the limitation? (A linear hyperplane — it can't separate classes that aren't linearly separable without feature engineering or a kernel.)
Systems-level: How do you extend logistic regression to more than two classes? (Softmax (multinomial) regression — generalize the sigmoid to a softmax over class logits with categorical cross-entropy.)
Failure modes: How can you interpret the learned coefficients? (Each weight is the change in log-odds per unit change in that feature; sign and magnitude indicate direction and strength of influence, assuming scaled features.)
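The softmax extension mentioned above can be sketched from scratch in the same style as the binary code. Everything here is illustrative: the three Gaussian blobs, their centers, the learning rate, and the step count are assumptions. The gradient has the same shape as in the binary case, with a matrix of class probabilities `P` and one-hot labels `Y` in place of `p` and `y`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-class data: one Gaussian blob per class (centers are made up).
centers = np.array([[0.0, 3.0], [-3.0, -2.0], [3.0, -2.0]])
labels = rng.integers(0, 3, size=300)
X = centers[labels] + rng.normal(size=(300, 2))
Y = np.eye(3)[labels]                      # one-hot labels, shape (300, 3)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

W, b, lr = np.zeros((2, 3)), np.zeros(3), 0.1
for _ in range(1000):
    P = softmax(X @ W + b)                 # per-class probabilities, rows sum to 1
    W -= lr * X.T @ (P - Y) / len(X)       # gradient of mean categorical cross-entropy
    b -= lr * (P - Y).mean(axis=0)

P = softmax(X @ W + b)
acc = (P.argmax(axis=1) == labels).mean()
print(f"training accuracy {acc:.2f}")
```

Each column of `W` holds the weights for one class's logit, and the predicted class is the one with the largest probability.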

6. Retrieval Check

Without looking: write the sigmoid model, the binary cross-entropy loss, and the weight gradient $(p - y)x$. Explain in one sentence why cross-entropy beats MSE here. Check against Stage 3.
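One way to check a recalled gradient without looking it up is a finite-difference test. This is a sketch under assumed random data and parameters (none of these values come from the article): it compares the analytic gradient $\frac{1}{N}\sum (p_i - y_i)\,x_i$ against central differences of the cross-entropy loss.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=3)     # arbitrary point at which to test the gradient
b = 0.1

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b):
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Analytic gradient: mean over examples of (p - y) * x.
p = sigmoid(X @ w + b)
analytic = X.T @ (p - y) / len(y)

# Numerical gradient by central differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    numeric[j] = (loss(w + step, b) - loss(w - step, b)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(f"max abs difference: {max_diff:.2e}")
```

If the recalled formula has a sign error or a missing $1/N$, the difference will be far from zero.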
