LoRA and PEFT

LoRA (Low-Rank Adaptation) explained — freeze the base model and train a tiny low-rank ΔW = BA, why it works, the parameter savings, QLoRA, and zero-latency merging — with code.

8 min read · Reviewed May 2026

1. Big Picture

LoRA (Low-Rank Adaptation) is the dominant parameter-efficient fine-tuning (PEFT) method. Instead of updating all of a model's billions of weights — which needs huge memory for gradients and optimizer states — LoRA freezes the base model and trains a tiny pair of low-rank matrices that represent the weight change. You often end up training well under 1% of the parameters while matching full fine-tuning quality on many tasks.

Why it works: the weight update needed to adapt a big pretrained model to a narrow task is empirically low-rank. It lives in a small subspace, so a rank-$r$ approximation captures it. The practical wins: you can fine-tune a large model on a single GPU, keep one frozen base with many small swappable adapters (one per task), and merge the adapter into the weights at inference for zero added latency. The frame to hold: LoRA learns $\Delta W \approx BA$ with $A, B$ tiny, leaving $W$ untouched.

2. Intuition + Visual

A linear layer normally computes $Wx$. LoRA adds a parallel low-rank branch: $h = Wx + BAx$, where $A$ projects down to a small rank $r$ and $B$ projects back up. The base $W$ is frozen; only $A$ and $B$ get gradients. Because $r$ is small (often 8–64), $A$ and $B$ together are a rounding error in size next to $W$.

```mermaid
flowchart LR
    X["input x"] --> W["frozen W (no grad)"]
    X --> A["A: down-project to rank r"]
    A --> B["B: up-project back"]
    W --> S["sum"]
    B --> S
    S --> H["output h = Wx + BAx"]
```

At inference you can fold $BA$ into $W$ ($W' = W + BA$, with the $\frac{\alpha}{r}$ scaling from Stage 3 applied to $BA$), so the deployed model is exactly the same shape and speed as the original.
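The folding claim can be checked numerically. The following is a minimal sketch, assuming small illustrative sizes and random stand-ins for a trained $A$ and $B$ (none of these values come from the text); it leaves out the scaling factor, as this section does.

```python
import torch

torch.manual_seed(0)
d, k, r = 64, 48, 4  # illustrative sizes, chosen only for this sketch

W = torch.randn(d, k, dtype=torch.float64)  # frozen base weight
A = torch.randn(r, k, dtype=torch.float64)  # down-projection to rank r
B = torch.randn(d, r, dtype=torch.float64)  # up-projection (nonzero, as after training)
x = torch.randn(k, dtype=torch.float64)

h_branch = W @ x + B @ (A @ x)  # two-branch form: Wx + BAx
W_merged = W + B @ A            # fold the low-rank branch into one matrix
h_merged = W_merged @ x         # single matmul, same shape as the original layer

max_diff = (h_branch - h_merged).abs().max().item()
print(W_merged.shape == W.shape, max_diff < 1e-9)
# True True
```

The two forms agree up to floating-point rounding, and the merged matrix has the same shape as $W$, which is why the deployed layer costs nothing extra.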

3. The Math

For a frozen weight $W \in \mathbb{R}^{d \times k}$, LoRA reparameterizes the update as a low-rank product:

$$W' = W + \Delta W = W + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)$$

$\alpha$ is a scaling constant (the effective update is $\frac{\alpha}{r} BA$). Initialization matters: $A$ is random (e.g. Gaussian) and $B = 0$, so at the start $\Delta W = 0$ and the model is exactly the pretrained base. Training begins from the known-good point and only departs as needed.
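One consequence of this initialization is easy to miss, and a short autograd check makes it visible. This is a sketch under assumed toy sizes, a mean-squared-error loss, and random data, all hypothetical: with $B = 0$ the update $\Delta W$ is zero, and on the very first backward pass the gradient reaching $A$ is also zero, because it is multiplied by $B^\top$. Only $B$ moves on the first step.

```python
import torch

torch.manual_seed(0)
d, k, r, alpha = 32, 24, 4, 8.0  # illustrative values, not from the text

W = torch.randn(d, k)                            # frozen base, no gradient tracked
A = (0.01 * torch.randn(r, k)).requires_grad_()  # small Gaussian init
B = torch.zeros(d, r, requires_grad=True)        # zero init

x = torch.randn(5, k)       # hypothetical batch of inputs
target = torch.randn(5, d)  # hypothetical regression targets

delta_W = (alpha / r) * (B @ A)  # low-rank update, zero at initialization
h = x @ (W + delta_W).T
loss = ((h - target) ** 2).mean()
loss.backward()

update_is_zero = torch.count_nonzero(delta_W).item() == 0
grad_A_is_zero = torch.count_nonzero(A.grad).item() == 0  # blocked by B = 0
grad_B_nonzero = B.grad.abs().max().item() > 0            # B learns first
print(update_is_zero, grad_A_is_zero, grad_B_nonzero)
```

If both $A$ and $B$ started at zero, neither would receive a gradient, which is why one of the two factors has to be random.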

Parameter count: full fine-tuning trains $d \cdot k$ params per matrix; LoRA trains $r(d + k)$. For $d = k = 4096$, $r = 8$: that's $65{,}536$ vs about $16.8$M ($16{,}777{,}216$), a 256× reduction per matrix. QLoRA cuts a different cost: it quantizes the frozen base to 4-bit, so even the frozen weights cost little memory, enabling fine-tuning of very large models on a single GPU.
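The arithmetic is quick to reproduce. The sketch below uses two hypothetical helpers, `lora_param_counts` and `weight_gib`, and an assumed 7B-parameter base model for the memory estimate. The estimate counts weight storage only and ignores activations, optimizer state, and quantization metadata, so treat it as a rough lower bound.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full fine-tuning params, LoRA params) for one d x k matrix."""
    return d * k, r * (d + k)


def weight_gib(n_params: int, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GiB (no activations or metadata)."""
    return n_params * bits_per_weight / 8 / 2**30


full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(f"{full:,} vs {lora:,} -> {full // lora}x fewer")
# 16,777,216 vs 65,536 -> 256x fewer

n = 7_000_000_000  # hypothetical 7B-parameter base model (an assumption)
print(f"16-bit: {weight_gib(n, 16):.1f} GiB, 4-bit: {weight_gib(n, 4):.1f} GiB")
# 16-bit: 13.0 GiB, 4-bit: 3.3 GiB
```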

4. Implementation
```python
import torch
from torch import Tensor, nn


class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear and add a trainable low-rank branch."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16) -> None:
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # freeze the pretrained weights
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))   # B = 0 -> starts as base
        self.scale = alpha / r

    def forward(self, x: Tensor) -> Tensor:
        # base(x) + (alpha / r) * B A x, written for row-vector inputs
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())   # includes the 4,096 bias terms
print(f"trainable {trainable:,} / {total:,}  ({100*trainable/total:.2f}%)")
# trainable 65,536 / 16,846,848  (0.39%)
```

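
To see the layer in use, here is a hedged sketch of a few training steps followed by a merge. It repeats the `LoRALinear` class so it runs on its own. The `merge` helper, the SGD settings, the tiny layer sizes, and the random data are all assumptions made for illustration, not part of the original walkthrough.

```python
import torch
from torch import Tensor, nn


class LoRALinear(nn.Module):
    """Same layer as above, repeated so this sketch runs on its own."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16) -> None:
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r

    def forward(self, x: Tensor) -> Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


@torch.no_grad()
def merge(layer: LoRALinear) -> nn.Linear:
    """Hypothetical helper: fold (alpha / r) * B @ A into a plain nn.Linear."""
    out_f, in_f = layer.base.weight.shape
    fused = nn.Linear(in_f, out_f, bias=layer.base.bias is not None)
    fused.weight.copy_(layer.base.weight + layer.scale * (layer.B @ layer.A))
    if layer.base.bias is not None:
        fused.bias.copy_(layer.base.bias)
    return fused


torch.manual_seed(0)
layer = LoRALinear(nn.Linear(16, 12), r=4)  # tiny sizes, for illustration only
w_before = layer.base.weight.clone()

# Hand the optimizer only the trainable tensors (A and B).
opt = torch.optim.SGD([p for p in layer.parameters() if p.requires_grad], lr=0.1)
x, y = torch.randn(8, 16), torch.randn(8, 12)  # hypothetical data
for _ in range(3):
    opt.zero_grad()
    loss = nn.functional.mse_loss(layer(x), y)
    loss.backward()
    opt.step()

fused = merge(layer)
base_unchanged = torch.equal(layer.base.weight, w_before)       # W never moved
same_output = torch.allclose(fused(x), layer(x), atol=1e-4)     # merge is exact up to rounding
print(base_unchanged, same_output)
# True True
```

Passing only the parameters with `requires_grad=True` to the optimizer keeps optimizer state proportional to the adapter size rather than the base model size, which is where much of the memory saving comes from.
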
5. Interview Questions
  1. Conceptual: Why can a low-rank update match full fine-tuning? (The task-adaptation update to a large pretrained model is empirically low-rank — it lives in a small subspace a rank-r product can capture.)
  2. Implementation: Why initialize B to zero? (So ΔW = BA = 0 at the start — training begins exactly from the pretrained model and only deviates as needed, which is stable.)
  3. Applied: Roughly how many parameters does LoRA train versus full fine-tuning for a d×k matrix? (r(d+k) vs d·k — often <1%, e.g. ~256× fewer for d=k=4096, r=8.)
  4. Systems-level: What does QLoRA add, and why does it matter? (It 4-bit-quantizes the frozen base so even the frozen weights are cheap to hold — enabling fine-tuning of very large models on one GPU.)
  5. Failure modes: Does LoRA add inference latency? (Not once merged: fold BA into W (W' = W + (α/r)BA) and you get an identical-shape model with zero extra cost. Left unmerged, the extra branch adds two small matrix multiplies per adapted layer.)
6. Retrieval Check

Without looking: write the LoRA reparameterization with shapes of A and B, explain why B is initialized to zero, and give the parameter count versus full fine-tuning. Then state how to remove the inference overhead. Check against Stage 3.
