Classical ML
Logistic Regression
The workhorse linear classifier — sigmoid of a linear score, trained with cross-entropy — why not MSE, the decision boundary, and the softmax extension, with from-scratch code.
Logistic regression is the workhorse linear classifier — and the conceptual seed of a neural network's output layer. It models the probability of a binary outcome as the sigmoid of a linear combination of the features, and it's trained by minimizing cross-entropy (log loss). Despite the name, it's classification, not regression.
It matters far beyond its own use: a single neuron with a sigmoid is logistic regression, and the softmax output layer of a neural classifier is its multiclass generalization. The frame to hold: take a linear score $z = w \cdot x + b$, squash it to a probability with the sigmoid, and train with cross-entropy so predicted probabilities match the labels. Interviewers use it to test whether you understand why cross-entropy (not MSE) is the right loss, what the decision boundary looks like, and how regularization fits in.
The linear part draws a hyperplane through feature space. Points on one side get a positive score, points on the other a negative one. The sigmoid turns that raw score into a probability between 0 and 1, changing steeply near the boundary and saturating far from it. Predict class 1 when $\sigma(z) \ge 0.5$ (i.e. when the linear score $z$ is non-negative).
flowchart LR
X["Features x"] --> Z["z = w·x + b (linear score)"]
Z --> S["sigmoid(z) -> probability in [0,1]"]
S --> D["predict 1 if p >= 0.5"]
The decision boundary is linear — logistic regression can only separate classes a straight hyperplane can divide (feature engineering or kernels are needed for nonlinear boundaries).
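The score-to-probability-to-class pipeline above can be sketched in a few lines. This is a minimal illustration, not a trained model: the weights `w`, bias `b`, and the three sample points are hypothetical values picked so that one point falls on each side of the hyperplane and one sits exactly on it.

```python
import numpy as np

def sigmoid(z):
    """Map a raw linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias, chosen only for illustration.
w = np.array([2.0, -1.0])
b = 0.5

points = np.array([
    [1.0, 1.0],    # z =  2.0 - 1.0 + 0.5 =  1.5, positive side
    [-1.0, 0.5],   # z = -2.0 - 0.5 + 0.5 = -2.0, negative side
    [0.0, 0.5],    # z =  0.0 - 0.5 + 0.5 =  0.0, exactly on the boundary
])

z = points @ w + b              # linear score for each point
p = sigmoid(z)                  # probability of class 1
pred = (p >= 0.5).astype(int)   # thresholding p at 0.5 is thresholding z at 0

for zi, pi, ci in zip(z, p, pred):
    print(f"z = {zi:+.2f}  p = {pi:.3f}  class = {ci}")
```

Because the sigmoid is monotonic and equals 0.5 at $z = 0$, the predicted class depends only on the sign of the score, which is why the boundary is the hyperplane $w \cdot x + b = 0$.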
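The model predicts:

$$p = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w \cdot x + b$$

Training minimizes binary cross-entropy over $N$ examples with labels $y_i \in \{0, 1\}$:

$$L(w, b) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$

A clean result falls out: the gradient with respect to the weights is

$$\nabla_w L = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)\, x_i$$

That is prediction error times input, the same elegant form as a linear output layer trained with cross-entropy. Why cross-entropy and not MSE? With the sigmoid, MSE is non-convex in the weights, and its gradient carries an extra factor $\sigma'(z) = p(1 - p)$ that vanishes when predictions are confidently wrong (the sigmoid saturates). Cross-entropy is convex in the weights and its gradient stays proportional to the error, so it trains reliably. Add L2 regularization ($\lambda \lVert w \rVert^2$) to shrink weights and curb overfitting.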
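Both claims in the math above can be checked numerically. The sketch below is illustrative only: the data is random, the helper names `bce_loss` and `bce_grad_w` are made up for this example, and the penalty is assumed to take the form `lam * ||w||^2` (so it contributes `2 * lam * w` to the gradient). It first compares the "error times input" gradient against central finite differences, then compares the cross-entropy and squared-error gradients with respect to the score for one confidently wrong prediction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, X, y, lam=0.0):
    """Mean binary cross-entropy plus an assumed L2 penalty lam * ||w||^2."""
    p = sigmoid(X @ w + b)
    ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return ce + lam * np.sum(w ** 2)

def bce_grad_w(w, b, X, y, lam=0.0):
    """Analytic gradient: mean of (p - y) * x, plus 2 * lam * w from the penalty."""
    p = sigmoid(X @ w + b)
    return X.T @ (p - y) / len(y) + 2 * lam * w

# Small synthetic problem (arbitrary values, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w, b, lam = rng.normal(size=3), 0.1, 0.01

# 1) Central finite differences should agree with the analytic gradient.
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    numeric[j] = (bce_loss(w + step, b, X, y, lam)
                  - bce_loss(w - step, b, X, y, lam)) / (2 * eps)
analytic = bce_grad_w(w, b, X, y, lam)
print("max |analytic - numeric| =", np.abs(analytic - numeric).max())

# 2) Gradient w.r.t. the score z for one confidently wrong example (y = 1, z = -10).
z, label = -10.0, 1.0
p = sigmoid(z)
grad_ce = p - label                        # cross-entropy: stays close to -1
grad_mse = 2 * (p - label) * p * (1 - p)   # loss (p - y)^2: scaled by sigmoid'(z)
print(f"cross-entropy dL/dz = {grad_ce:.6f}")
print(f"MSE           dL/dz = {grad_mse:.6f}")
```

The second comparison is the saturation argument in miniature: the squared-error gradient is multiplied by $p(1 - p)$, which is nearly zero exactly when the model is most wrong, while the cross-entropy gradient is not.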
import numpy as np

# Synthetic data: 200 points, 3 features, labels from a noisy linear rule
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = (X @ true_w + rng.normal(0, 0.5, 200) > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Batch gradient descent on the mean binary cross-entropy
w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(2000):
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)  # mean of (p - y) * x; see Stage 3
    grad_b = (p - y).mean()
    w -= lr * grad_w
    b -= lr * grad_b

# Training accuracy at the 0.5 threshold
acc = ((sigmoid(X @ w + b) >= 0.5) == y).mean()
print(f"accuracy {acc:.2f}, weights {w.round(2)}")

- Conceptual: Why is it called "regression" if it does classification? (It regresses the log-odds, a linear model of log(p/(1−p)), then thresholds the resulting probability.)
- Implementation: Why train with cross-entropy instead of MSE? (With a sigmoid, MSE is non-convex and its gradient vanishes on confident-wrong predictions; cross-entropy is convex and keeps gradients healthy.)
- Applied: What does the decision boundary look like, and what's the limitation? (A linear hyperplane — it can't separate classes that aren't linearly separable without feature engineering or a kernel.)
- Systems-level: How do you extend logistic regression to more than two classes? (Softmax (multinomial) regression — generalize the sigmoid to a softmax over class logits with categorical cross-entropy.)
- Failure modes: How can you interpret the learned coefficients? (Each weight is the change in log-odds per unit change in that feature; sign and magnitude indicate direction and strength of influence, assuming scaled features.)
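Two of the answers above, the softmax extension and the log-odds reading of coefficients, can be made concrete with a short sketch. Everything numeric here is hypothetical: the weight matrix `W`, the two sample rows in `X`, the labels, and the example coefficient `0.7` are illustration values, not results from a fitted model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Two-class softmax with logits [0, z] reproduces the sigmoid of z.
z = np.array([-3.0, 0.0, 2.5])
two_class = softmax(np.column_stack([np.zeros_like(z), z]))
print("matches sigmoid:", np.allclose(two_class[:, 1], sigmoid(z)))

# Three-class example with hypothetical weights W (features x classes).
X = np.array([[1.0, 2.0], [0.5, -1.0]])
W = np.array([[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]])
b = np.zeros(3)
P = softmax(X @ W + b)                 # each row is a distribution over 3 classes
labels = np.array([2, 0])
Y = np.eye(3)[labels]                  # one-hot targets
loss = -np.mean(np.log(P[np.arange(len(labels)), labels]))  # categorical cross-entropy
grad_W = X.T @ (P - Y) / len(labels)   # same "error times input" form as the binary case
print("row sums:", P.sum(axis=1), "loss:", round(float(loss), 4))

# Coefficient reading: a weight w_j adds w_j to the log-odds per unit of feature j,
# so it multiplies the odds by exp(w_j). Hypothetical coefficient of 0.7:
print("odds multiplier:", np.exp(0.7))
```

The gradient keeps the same shape in the multiclass case, with the probability vector minus the one-hot label taking the place of the scalar error.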
Without looking: write the sigmoid model, the binary cross-entropy loss, and the weight gradient $\nabla_w L$. Explain in one sentence why cross-entropy beats MSE here. Check against Stage 3.