LoRA and PEFT
LoRA (Low-Rank Adaptation) explained — freeze the base model and train a tiny low-rank ΔW = BA, why it works, the parameter savings, QLoRA, and zero-latency merging — with code.
LoRA (Low-Rank Adaptation) is the dominant parameter-efficient fine-tuning (PEFT) method. Instead of updating all of a model's billions of weights — which needs huge memory for gradients and optimizer states — LoRA freezes the base model and trains a tiny pair of low-rank matrices that represent the weight change. You often end up training well under 1% of the parameters while matching full fine-tuning quality on many tasks.
Why it works: the weight update needed to adapt a big pretrained model to a narrow task is empirically low-rank. It lives in a small subspace, so a rank-$r$ approximation captures it. The practical wins: you can fine-tune a large model on a single GPU, keep one frozen base with many small swappable adapters (one per task), and merge the adapter into the weights at inference for zero added latency. The frame to hold: LoRA learns $\Delta W = BA$ with tiny $A$ and $B$, leaving $W$ untouched.
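To make "low-rank" concrete, here is a minimal sketch in plain Python (nested lists and exact fractions, so it runs without PyTorch). It only illustrates the structural fact that a product of a d×r matrix and an r×k matrix has rank at most r; it does not demonstrate the empirical claim that fine-tuning updates are low-rank. The helpers `matmul` and `rank` and the toy sizes are assumptions made up for this illustration.

```python
import random
from fractions import Fraction


def matmul(X, Y):
    """Multiply two matrices stored as nested lists."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def rank(M):
    """Exact matrix rank via Gaussian elimination over Fractions."""
    M = [[Fraction(v) for v in row] for row in M]
    rows, cols = len(M), len(M[0])
    rk = 0
    for c in range(cols):
        # find a row at or below position rk with a nonzero entry in column c
        pivot = next((i for i in range(rk, rows) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[rk], M[pivot] = M[pivot], M[rk]
        for i in range(rk + 1, rows):
            f = M[i][c] / M[rk][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[rk])]
        rk += 1
    return rk


rng = random.Random(0)
d, k, r = 6, 5, 2  # toy sizes, chosen only for illustration
B = [[rng.randint(-3, 3) for _ in range(r)] for _ in range(d)]  # d x r
A = [[rng.randint(-3, 3) for _ in range(k)] for _ in range(r)]  # r x k
delta_W = matmul(B, A)  # full d x k matrix, built from only r * (d + k) numbers

print(f"delta_W is {d}x{k} with rank {rank(delta_W)} (never more than r = {r})")
```

The update fills a full d×k matrix, yet every column is a combination of the r columns of B, which is why so few numbers are enough to describe it.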
A linear layer normally computes $h = Wx$. LoRA adds a parallel low-rank branch: $h = Wx + BAx$, where $A$ projects down to a small rank $r$ and $B$ projects back up. The base $W$ is frozen; only $A$ and $B$ get gradients. Because $r$ is small (often 8–64), $A$ and $B$ together are a rounding error in size next to $W$.
```mermaid
flowchart LR
    X["input x"] --> W["frozen W (no grad)"]
    X --> A["A: down-project to rank r"]
    A --> B["B: up-project back"]
    W --> S["sum"]
    B --> S
    S --> H["output h = Wx + BAx"]
```
At inference you can fold $BA$ into $W$ ($W' = W + \frac{\alpha}{r} BA$), so the deployed model is exactly the same shape and speed as the original.
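The merge can be checked on a toy example. The sketch below uses plain Python lists so it runs without PyTorch, and compares the two-branch forward pass against a single merged matrix. The helper names (`matvec`, `matmul`, `lora_forward`, `merge`) and all the numbers are assumptions made up for this illustration.

```python
def matvec(M, v):
    """Matrix-vector product for a nested-list matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(X, Y):
    """Matrix-matrix product for nested lists."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def lora_forward(W, A, B, scale, x):
    """Unmerged path: frozen W x plus the scaled low-rank branch B(A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def merge(W, A, B, scale):
    """Fold the adapter into the weights: W' = W + scale * (B A)."""
    BA = matmul(B, A)
    return [[w + scale * delta for w, delta in zip(w_row, ba_row)]
            for w_row, ba_row in zip(W, BA)]


# toy layer: d_out = 2, d_in = 3, rank r = 1 (illustrative values only)
W = [[1, 2, 0], [0, 1, -1]]
A = [[1, 0, 2]]    # r x d_in
B = [[3], [-1]]    # d_out x r
alpha, r = 2, 1
scale = alpha / r
x = [1, 1, 1]

h_adapter = lora_forward(W, A, B, scale, x)
W_merged = merge(W, A, B, scale)
h_merged = matvec(W_merged, x)

print(h_adapter, h_merged)
```

Both paths give the same output here, and `W_merged` has the same shape as `W`, so a merged model needs one matrix multiply per layer, as the base model does. With real floating-point weights the two results agree only up to rounding error.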
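For a frozen weight $W \in \mathbb{R}^{d \times k}$, LoRA reparameterizes the update as a low-rank product:
$$\Delta W = BA, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)$$
$\alpha$ is a scaling constant (the effective update is $\frac{\alpha}{r} BA$). Initialization matters: $A$ is random (e.g. Gaussian) and $B = 0$, so $\Delta W = BA = 0$ at the start and the model is exactly the pretrained base. Training begins from the known-good point and only departs as needed.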
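A short derivation shows why only one of the two matrices starts at zero. It assumes the scaled forward pass $h = Wx + \frac{\alpha}{r} BAx$ and introduces $g = \partial L / \partial h$ for the loss gradient at the layer output; $L$ and $g$ are notation added here for illustration, not symbols from the text above.

```latex
\[
\frac{\partial L}{\partial B} = \frac{\alpha}{r}\, g\,(Ax)^{\top},
\qquad
\frac{\partial L}{\partial A} = \frac{\alpha}{r}\, B^{\top} g\, x^{\top}
\]
\[
\text{At initialization } (B = 0,\ A \text{ random}):\quad
\frac{\partial L}{\partial A} = 0,
\qquad
\frac{\partial L}{\partial B} = \frac{\alpha}{r}\, g\,(Ax)^{\top} \neq 0 \ \text{in general}
\]
```

So the first step moves $B$ away from zero, after which $A$ receives gradient as well. If both matrices started at zero, both gradients would vanish and the adapter would never train.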
Parameter count: full fine-tuning trains $d \cdot k$ params per matrix; LoRA trains $r(d + k)$. For $d = k = 4096$, $r = 8$: that's 65,536 vs 16.8M, a ~256× reduction, per matrix. QLoRA pushes this further by quantizing the frozen base to 4-bit, so even the frozen weights cost little memory, enabling fine-tuning of very large models on a single GPU.
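The arithmetic is easy to reproduce. The sketch below computes the per-matrix counts from the formulas above, then gives a rough estimate of what 4-bit storage saves on the frozen weights. The 7-billion-parameter model size is a hypothetical figure chosen for illustration, and the memory estimate covers weight storage only: it ignores quantization constants, activations, the adapter, and optimizer state.

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Rough storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


d = k = 4096
r = 8
full = full_params(d, k)
lora = lora_params(d, k, r)
print(f"full: {full:,}  lora: {lora:,}  reduction: {full / lora:.0f}x")
# full: 16,777,216  lora: 65,536  reduction: 256x

n_base = 7_000_000_000  # hypothetical model size, for illustration only
fp16_gb = weight_memory_gb(n_base, 16)
int4_gb = weight_memory_gb(n_base, 4)
print(f"frozen base at 16-bit: {fp16_gb:.1f} GB, at 4-bit: {int4_gb:.1f} GB")
# frozen base at 16-bit: 14.0 GB, at 4-bit: 3.5 GB
```

The LoRA count grows linearly in r, so doubling the rank doubles the adapter size while the full-matrix count stays fixed.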
```python
import torch
from torch import Tensor, nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16) -> None:
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection, small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))  # B = 0 -> starts as base
        self.scale = alpha / r

    def forward(self, x: Tensor) -> Tensor:
        # frozen path plus the scaled low-rank branch (alpha / r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")
# trainable 65,536 / 16,846,848 (0.39%)  <- total includes the base layer's 4,096 bias terms
```
- Conceptual: Why can a low-rank update match full fine-tuning? (The task-adaptation update to a large pretrained model is empirically low-rank: it lives in a small subspace a rank-r product can capture.)
- Implementation: Why initialize B to zero? (So ΔW = BA = 0 at the start — training begins exactly from the pretrained model and only deviates as needed, which is stable.)
- Applied: Roughly how many parameters does LoRA train versus full fine-tuning for a d×k matrix? (r(d+k) vs d·k — often <1%, e.g. ~256× fewer for d=k=4096, r=8.)
- Systems-level: What does QLoRA add, and why does it matter? (It 4-bit-quantizes the frozen base so even the frozen weights are cheap to hold — enabling fine-tuning of very large models on one GPU.)
- Failure modes: Does LoRA add inference latency? (No — you can merge BA into W (W' = W + (α/r)BA), giving an identical-shape model with zero extra cost.)
Without looking: write the LoRA reparameterization with shapes of A and B, explain why B is initialized to zero, and give the parameter count versus full fine-tuning. Then state how to remove the inference overhead. Check against Stage 3.