What is LoRA (Low-Rank Adaptation)?

A parameter-efficient fine-tuning method that freezes the pretrained weights and trains a small low-rank update ΔW = (α/r)·BA, typically training well under 1% of the parameters while matching full fine-tuning on many tasks.

Why is the LoRA B matrix initialized to zero?

So ΔW = BA = 0 at the start. Training begins exactly from the pretrained model and only departs as needed, which is stable.

What does QLoRA add, and why does it matter?

QLoRA quantizes the frozen base model to 4-bit so even the frozen weights are cheap to hold in memory, enabling fine-tuning of very large models on a single GPU.

LoRA Explained: Low-Rank Adaptation (ML Interview)

Why can a low-rank update match full fine-tuning?

The weight change needed to adapt a large pretrained model to a narrow task is empirically low-rank — it lives in a small subspace that a rank-r product BA can capture.

LoRA (Low-Rank Adaptation) explained — freeze the base model and train a tiny low-rank ΔW = BA, why it works, the parameter savings, QLoRA, and zero-latency merging — with code.

1. Big Picture

LoRA (Low-Rank Adaptation) is the dominant parameter-efficient fine-tuning (PEFT) method. Instead of updating all of a model's billions of weights — which needs huge memory for gradients and optimizer states — LoRA freezes the base model and trains a tiny pair of low-rank matrices that represent the weight change. You often end up training well under 1% of the parameters while matching full fine-tuning quality on many tasks.

Why it works: the weight update needed to adapt a big pretrained model to a narrow task is empirically low-rank. It lives in a small subspace, so a rank-$r$ approximation captures it. The practical wins: you can fine-tune a large model on a single GPU, keep one frozen base with many small swappable adapters (one per task), and merge the adapter into the weights at inference for zero added latency. The frame to hold: LoRA learns $\Delta W \approx BA$ with $A,B$ tiny, leaving $W$ untouched.

2. Intuition + Visual

A linear layer normally computes $Wx$. LoRA adds a parallel low-rank branch: $h = Wx + BAx$ (ignoring a constant scale factor on the branch), where $A$ projects down to a small rank $r$ and $B$ projects back up. The base $W$ is frozen; only $A$ and $B$ get gradients. Because $r$ is small (often 8–64), $A$ and $B$ together are a rounding error in size next to $W$.

At inference you can fold the scaled update into $W$ ($W' = W + \frac{\alpha}{r} BA$), so the deployed model is exactly the same shape and speed as the original.
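The merge claim can be checked numerically. The sketch below is only an illustration under assumed toy sizes (`d_out`, `d_in`, `r`, `alpha` and the random tensors are made up for the demo, not taken from any real model). It compares the two-branch forward pass with a single matmul against the folded weight, including the $\frac{\alpha}{r}$ scale from the full reparameterization.

```python
import torch

torch.manual_seed(0)
dt = torch.float64                    # double precision keeps the comparison tight

# Toy sizes chosen only for illustration (assumptions, not from the text).
d_out, d_in, r, alpha = 6, 5, 2, 4
scale = alpha / r

W = torch.randn(d_out, d_in, dtype=dt)   # frozen base weight
A = torch.randn(r, d_in, dtype=dt)       # down-projection
B = torch.randn(d_out, r, dtype=dt)      # up-projection (nonzero, as if already trained)
x = torch.randn(3, d_in, dtype=dt)       # batch of 3 inputs, one per row

# Unmerged: base path plus the parallel low-rank branch.
h_branch = x @ W.T + scale * (x @ A.T @ B.T)

# Merged: fold the scaled update into one weight with the same shape as W.
W_merged = W + scale * (B @ A)
h_merged = x @ W_merged.T

print(W_merged.shape == W.shape)            # same shape as the original layer
print(torch.allclose(h_branch, h_merged))   # same outputs up to rounding
```

Because the merged weight has the same shape as $W$, serving it costs one matmul, exactly like the base layer. Subtracting the same scaled product from `W_merged` recovers $W$ again (up to rounding), which is what makes swapping adapters on a shared base practical.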

3. The Math

For a frozen weight $W \in \mathbb{R}^{d \times k}$, LoRA reparameterizes the update as a low-rank product:

$$W' = W + \Delta W = W + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)$$

$\alpha$ is a scaling constant, so the effective update is $\frac{\alpha}{r} BA$. Initialization matters: $A$ is random (e.g. Gaussian) and $B = 0$, so at the start $\Delta W = 0$ and the model is exactly the pretrained base. Training begins from the known-good point and only departs as needed.
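A small check of what the zero initialization implies, written as a sketch with made-up toy sizes and a placeholder loss (both are assumptions for illustration). At step 0 the output equals the base output exactly. It also follows from the chain rule that the gradient reaching $A$ passes through $B$, so with $B = 0$ only $B$ receives a nonzero gradient on the first step.

```python
import torch

torch.manual_seed(0)
dt = torch.float64

# Toy sizes chosen only for illustration (assumptions, not from the text).
d_out, d_in, r, alpha = 6, 5, 2, 4
scale = alpha / r

W = torch.randn(d_out, d_in, dtype=dt)                          # frozen base weight
A = (0.01 * torch.randn(r, d_in, dtype=dt)).requires_grad_()    # small random init
B = torch.zeros(d_out, r, dtype=dt, requires_grad=True)         # zero init
x = torch.randn(3, d_in, dtype=dt)

h = x @ W.T + scale * (x @ A.T @ B.T)
print(torch.equal(h, x @ W.T))      # the LoRA branch contributes exactly zero at step 0

loss = h.pow(2).sum()               # placeholder loss, just to obtain gradients
loss.backward()
print(B.grad.abs().sum().item() > 0)    # B gets a nonzero gradient
print(A.grad.abs().sum().item() == 0)   # A's gradient is zero while B is still zero
```

Once the first update moves $B$ away from zero, $A$ starts receiving gradient as well. If both matrices were zero, neither would ever receive a gradient, which is why one of them has to start out random.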

Parameter count: full fine-tuning trains $d \cdot k$ parameters per matrix; LoRA trains $r(d + k)$. For $d=k=4096$, $r=8$ that is $65{,}536$ versus about $16.8$M, a 256× reduction per matrix. QLoRA cuts memory further, not by shrinking the trainable set but by quantizing the frozen base to 4-bit, so even the frozen weights cost little memory, enabling fine-tuning of very large models on a single GPU.
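The arithmetic above can be reproduced with two small helpers. This is a rough sketch: `lora_param_counts` and `weight_memory_gib` are hypothetical names introduced here, the 7-billion-parameter model size is an assumption picked only for illustration, and the memory figure counts raw weight storage only (no quantization metadata, activations, gradients, or optimizer state).

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    full = d * k          # every entry of W is trainable
    lora = r * (d + k)    # B is d x r, A is r x k
    return full, lora


def weight_memory_gib(n_params: int, bits_per_param: int) -> float:
    """Raw storage for n_params weights at the given precision, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(f"full={full:,}  lora={lora:,}  reduction={full // lora}x")
# full=16,777,216  lora=65,536  reduction=256x

# Hypothetical 7-billion-parameter base model (size assumed for illustration).
n = 7_000_000_000
print(f"16-bit: {weight_memory_gib(n, 16):.1f} GiB   4-bit: {weight_memory_gib(n, 4):.1f} GiB")
# 16-bit: 13.0 GiB   4-bit: 3.3 GiB
```

The two numbers answer different questions: the first is how few parameters need gradients and optimizer state, the second is how much the frozen weights themselves cost to hold, which is the part 4-bit quantization addresses.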

4. Implementation

```python
import torch
from torch import Tensor, nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16) -> None:
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # freeze the pretrained weights
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))   # B = 0 -> starts as base
        self.scale = alpha / r

    def forward(self, x: Tensor) -> Tensor:
        # Frozen base path plus the scaled low-rank branch (alpha / r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())   # includes the base layer's 4,096 bias terms
print(f"trainable {trainable:,} / {total:,}  ({100*trainable/total:.2f}%)")
# trainable 65,536 / 16,846,848  (0.39%)
```

5. Interview Questions

Conceptual: Why can a low-rank update match full fine-tuning? (The task-adaptation update to a large pretrained model is empirically low-rank — it lives in a small subspace a rank-r product can capture.)
Implementation: Why initialize B to zero? (So ΔW = BA = 0 at the start — training begins exactly from the pretrained model and only deviates as needed, which is stable.)
Applied: Roughly how many parameters does LoRA train versus full fine-tuning for a d×k matrix? (r(d+k) vs d·k — often <1%, e.g. ~256× fewer for d=k=4096, r=8.)
Systems-level: What does QLoRA add, and why does it matter? (It 4-bit-quantizes the frozen base so even the frozen weights are cheap to hold — enabling fine-tuning of very large models on one GPU.)
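Failure modes: Does LoRA add inference latency? (Not once the adapter is merged: folding the update into the weights, W' = W + (α/r)BA, gives an identical-shape model with zero extra cost. Left unmerged, the extra low-rank branch adds a small amount of compute per layer.)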

6. Retrieval Check

Without looking: write the LoRA reparameterization with shapes of A and B, explain why B is initialized to zero, and give the parameter count versus full fine-tuning. Then state how to remove the inference overhead. Check against Stage 3.
