Deep Learning
Transformer Architecture
The Transformer block from the ground up — self-attention plus a position-wise feed-forward network, residuals and LayerNorm, and the encoder/decoder configurations — with the math, PyTorch, and calibrated interview questions.
The Transformer is the architecture underneath modern NLP and most large models. It threw out recurrence entirely and built sequence modeling from self-attention and feed-forward blocks, stacked with residual connections and normalization. The payoff: during training every position is processed in parallel (no sequential RNN unrolling), and any token can attend to any other in a single step, so training is fast and long-range dependencies are easy to model.
Three configurations matter for interviews: encoder-only (BERT — bidirectional, for understanding/classification), decoder-only (GPT — causal/autoregressive, for generation), and encoder-decoder (T5, original Transformer — for translation/seq2seq). The frame to hold: a Transformer is N identical blocks stacked, each block doing "mix information across positions (attention), then transform each position independently (FFN)," wrapped in residuals and LayerNorm. Master one block and you understand the whole stack.
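At the level of a single attention layer, the bidirectional/causal distinction comes down to the attention mask. The sketch below is an illustration under assumed toy sizes (the names `run`, `x_edit`, and the dimensions are hypothetical): it perturbs the last token and checks whether earlier positions notice.

```python
import torch
from torch import nn

torch.manual_seed(0)
seq_len, d_model = 6, 32  # toy sizes, chosen only for the demo

# Dropout defaults to 0.0, so outputs are deterministic.
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

x = torch.randn(1, seq_len, d_model)
x_edit = x.clone()
x_edit[:, -1] += 1.0  # perturb only the last token

# Float mask: -inf above the diagonal blocks attention to future positions.
causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)


def run(inp, mask=None):
    """Self-attention over inp, optionally masked."""
    out, _ = attn(inp, inp, inp, attn_mask=mask, need_weights=False)
    return out.detach()


# Decoder-style (causal): earlier positions never see the edited last token.
causal_unchanged = torch.allclose(
    run(x, causal)[:, :-1], run(x_edit, causal)[:, :-1], atol=1e-6
)
# Encoder-style (bidirectional): every position can see the edit.
bidir_unchanged = torch.allclose(run(x)[:, :-1], run(x_edit)[:, :-1], atol=1e-6)
print(causal_unchanged, bidir_unchanged)
```

With the causal mask the first five outputs are unaffected by the edit; without it they change. An encoder-decoder model uses both kinds of layer, plus cross-attention from decoder to encoder, which this sketch does not cover.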
One block does two things in sequence. Attention lets every token gather context from every other token — it's the only place information moves between positions. The feed-forward network (FFN) then processes each position independently, adding capacity and non-linearity. Residual connections (x + sublayer(x)) give gradients a clean path and let the network learn refinements rather than full remappings; LayerNorm keeps activations well-scaled.
flowchart TB
X["Input + positional encoding"] --> N1["LayerNorm"]
N1 --> A["Multi-Head Self-Attention"]
A --> R1["Add residual"]
X --> R1
R1 --> N2["LayerNorm"]
N2 --> F["Feed-Forward (MLP)"]
F --> R2["Add residual"]
R1 --> R2
R2 --> O["To next block"]
Because self-attention is permutation-equivariant (shuffle the input tokens and the outputs shuffle the same way, otherwise unchanged), a Transformer has no inherent notion of order. Positional encoding is injected at the input so the model knows token positions.
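The order-blindness of self-attention can be checked directly. The sketch below uses the fixed sinusoidal encoding from the original Transformer paper as one possible positional scheme (learned and rotary encodings are common alternatives); the helper names and toy sizes are assumptions for illustration.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)
seq_len, d_model = 5, 16  # toy sizes, chosen only for the demo
attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)


def self_attend(inp):
    """Unmasked self-attention over inp."""
    out, _ = attn(inp, inp, inp, need_weights=False)
    return out.detach()


def sinusoidal_encoding(n, d):
    """Fixed sin/cos position vectors, shape (n, d); d must be even."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    freq = torch.exp(
        torch.arange(0, d, 2, dtype=torch.float32) * (-math.log(10000.0) / d)
    )
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe


x = torch.randn(1, seq_len, d_model)
perm = torch.tensor([4, 2, 0, 3, 1])  # an arbitrary shuffle of the 5 tokens

# No positions: attending to shuffled tokens equals shuffling the outputs.
no_pos_equivariant = torch.allclose(
    self_attend(x[:, perm]), self_attend(x)[:, perm], atol=1e-5
)

# With positions: a shuffled token lands on a different position vector,
# so the outputs are no longer just a shuffle of the originals.
pe = sinusoidal_encoding(seq_len, d_model)
with_pos_equivariant = torch.allclose(
    self_attend(x[:, perm] + pe), self_attend(x + pe)[:, perm], atol=1e-5
)
print(no_pos_equivariant, with_pos_equivariant)
```

The first check holds and the second fails, which is the point: once position vectors are added, reordering the tokens changes what the layer computes.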
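A modern pre-norm Transformer block, for input $x \in \mathbb{R}^{n \times d}$:

$$h = x + \mathrm{MHA}(\mathrm{LN}(x))$$
$$y = h + \mathrm{FFN}(\mathrm{LN}(h))$$

The feed-forward network is a two-layer MLP applied position-wise, usually expanding to $d_{ff} = 4d$:

$$\mathrm{FFN}(x) = \sigma(x W_1 + b_1)\, W_2 + b_2$$

where $\sigma$ is a non-linearity (ReLU, or GELU in most LLMs), $W_1 \in \mathbb{R}^{d \times d_{ff}}$, $W_2 \in \mathbb{R}^{d_{ff} \times d}$.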
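"Position-wise" means the same two-layer MLP is applied to each token's vector separately, with no mixing across positions. A minimal check, assuming toy sizes and hypothetical variable names:

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model, d_ff = 8, 32  # 4x expansion, toy sizes
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 5, d_model)  # (batch, seq, d_model)
x_edit = x.clone()
x_edit[:, 3] += 1.0  # change only the token at position 3

with torch.no_grad():
    whole = ffn(x)  # applied to the full sequence at once
    # Applied one position at a time, then re-stacked along the sequence axis.
    per_token = torch.stack([ffn(x[:, t]) for t in range(x.shape[1])], dim=1)
    edited = ffn(x_edit)

same_as_per_token = torch.allclose(whole, per_token, atol=1e-6)
# Editing position 3 leaves every other position's output untouched.
others_untouched = torch.allclose(
    whole[:, :3], edited[:, :3], atol=1e-6
) and torch.allclose(whole[:, 4:], edited[:, 4:], atol=1e-6)
print(same_as_per_token, others_untouched)
```

Both checks hold, in contrast with attention, where changing one token can change every position's output.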
Pre-norm vs. post-norm: the original paper put LayerNorm after the residual ($\mathrm{LN}(x + \mathrm{sublayer}(x))$). Pre-norm (LN inside, before the sublayer) trains far more stably at depth because the residual path stays an identity, which is why nearly all large models use pre-norm. The FFN holds roughly $8d^2$ parameters per block (two $d \times 4d$ weight matrices) and dominates the parameter count; attention contributes $4d^2$ from the Q/K/V/O projections.
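One way to see what "the residual path stays an identity" means is to wrap a sublayer that outputs zero. The wrappers below are hypothetical illustrations (the class names and the `zero_linear` helper are not from any library), a sketch rather than a training-stability experiment.

```python
import torch
from torch import Tensor, nn


class PreNorm(nn.Module):
    """y = x + sublayer(LN(x)): the skip path bypasses the norm."""

    def __init__(self, d_model: int, sublayer: nn.Module) -> None:
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x: Tensor) -> Tensor:
        return x + self.sublayer(self.norm(x))


class PostNorm(nn.Module):
    """y = LN(x + sublayer(x)): the skip path passes through the norm."""

    def __init__(self, d_model: int, sublayer: nn.Module) -> None:
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x: Tensor) -> Tensor:
        return self.norm(x + self.sublayer(x))


def zero_linear(d_model: int) -> nn.Linear:
    """A stand-in sublayer that always outputs zero."""
    layer = nn.Linear(d_model, d_model)
    nn.init.zeros_(layer.weight)
    nn.init.zeros_(layer.bias)
    return layer


torch.manual_seed(0)
d_model = 16
x = 3.0 * torch.randn(2, 4, d_model) + 1.0  # deliberately not unit-scaled

with torch.no_grad():
    pre_out = PreNorm(d_model, zero_linear(d_model))(x)
    post_out = PostNorm(d_model, zero_linear(d_model))(x)

pre_is_identity = torch.equal(pre_out, x)  # x passes through unchanged
post_is_identity = torch.allclose(post_out, x, atol=1e-3)  # LN rescales x
print(pre_is_identity, post_is_identity)
```

In the pre-norm wrapper the input reaches the output untouched; in the post-norm wrapper it is renormalized at every layer, so a deep stack has no unmodified path from input to output.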
import torch
from torch import Tensor, nn


class TransformerBlock(nn.Module):
    """Pre-norm decoder block: causal self-attention + position-wise FFN."""

    def __init__(self, d_model: int, n_heads: int, mlp_ratio: int = 4, p: float = 0.1) -> None:
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=p, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
            nn.Dropout(p),
        )

    def forward(self, x: Tensor, attn_mask: Tensor | None = None) -> Tensor:
        h = self.ln1(x)
        # causal mask makes this a decoder block (each token sees only the past)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out  # residual 1
        x = x + self.ffn(self.ln2(x))  # residual 2 (pre-norm)
        return x


block = TransformerBlock(d_model=512, n_heads=8)
seq = torch.randn(2, 16, 512)  # (batch, seq_len, d_model)
# -inf above the diagonal: position i cannot attend to positions j > i
causal = torch.triu(torch.full((16, 16), float("-inf")), diagonal=1)
assert block(seq, attn_mask=causal).shape == (2, 16, 512)

- Conceptual: What is the difference between encoder-only, decoder-only, and encoder-decoder Transformers, and what is one model of each? (BERT = encoder/bidirectional; GPT = decoder/causal; T5 = encoder-decoder/seq2seq.)
- Implementation: Why are residual connections essential in a deep Transformer? (They give gradients an identity path, preventing vanishing gradients and letting each block learn a refinement.)
- Applied: Why pre-norm over post-norm in large models? (Pre-norm keeps the residual stream an identity, making deep stacks trainable without careful warmup; post-norm is unstable at depth.)
- Systems-level: Where do most of a Transformer's parameters and FLOPs live? (The FFN — ~8d² params per block — dominates parameters; attention's quadratic cost in sequence length dominates compute for long contexts.)
- Failure modes: Why does a Transformer need positional encoding at all? (Self-attention is permutation-equivariant: without positional information it can't distinguish token order.)
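The parameter and compute figures in the systems-level answer can be checked with a short script. The parameter counts come straight from PyTorch modules; the FLOP functions are a rough matmul-only estimate (assuming 2 FLOPs per multiply-add and ignoring softmax, norms, and biases), and their names are hypothetical.

```python
from torch import nn


def n_params(module: nn.Module) -> int:
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())


d = 512
attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

attn_params = n_params(attn)  # Q/K/V/O weights (4d^2) plus biases (4d)
ffn_params = n_params(ffn)  # two d x 4d weights (8d^2) plus biases (5d)
print(attn_params, ffn_params)


def linear_flops(n: int, d: int) -> int:
    """Rough per-block cost of the weight matmuls: projections + FFN."""
    return 2 * n * (4 * d * d + 8 * d * d)


def quadratic_flops(n: int, d: int) -> int:
    """Rough per-block cost of QK^T scores plus the weighted sum over V."""
    return 2 * n * n * d + 2 * n * n * d


for n in (1024, 4096):
    print(n, linear_flops(n, d), quadratic_flops(n, d))
```

Under these assumptions the quadratic attention term overtakes the weight matmuls once the sequence length exceeds about $6d$ (around 3,000 tokens for $d = 512$), so at short contexts the linear layers dominate compute as well as parameters.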
From memory: draw one pre-norm Transformer block, write the two residual equations, state what the FFN does that attention doesn't, and name the three encoder/decoder configurations with an example each. Check against Stages 2–3.