Transformer Architecture

The Transformer block from the ground up — self-attention plus a position-wise feed-forward network, residuals and LayerNorm, and the encoder/decoder configurations — with the math, PyTorch, and calibrated interview questions.

9 min read · Reviewed May 2026

1. Big Picture

The Transformer is the architecture underneath modern NLP and most large models. It drops recurrence entirely and builds sequence modeling from self-attention and feed-forward blocks, stacked with residual connections and normalization. The payoff: during training every position is processed in parallel (no sequential RNN unrolling), and any token can attend to any other within a single layer, so training parallelizes well and a long-range dependency is one attention hop away rather than many recurrent steps.

Three configurations matter for interviews: encoder-only (BERT — bidirectional, for understanding/classification), decoder-only (GPT — causal/autoregressive, for generation), and encoder-decoder (T5, original Transformer — for translation/seq2seq). The frame to hold: a Transformer is N identical blocks stacked, each block doing "mix information across positions (attention), then transform each position independently (FFN)," wrapped in residuals and LayerNorm. Master one block and you understand the whole stack.
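The practical difference between the configurations shows up in the attention mask. The sketch below is a minimal illustration, not any particular model's code: `seq_len` and the all-zero `scores` are placeholder assumptions chosen so the effect of the mask is easy to read off.

```python
import torch

seq_len = 4
scores = torch.zeros(seq_len, seq_len)  # placeholder attention logits (all equal)

# Encoder-only (BERT-style): no masking, every token sees every other token.
bidirectional = torch.zeros(seq_len, seq_len)

# Decoder-only (GPT-style): -inf above the diagonal hides future tokens.
causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

enc_weights = torch.softmax(scores + bidirectional, dim=-1)  # each row spreads over all 4 tokens
dec_weights = torch.softmax(scores + causal, dim=-1)         # row i spreads over tokens 0..i only

# Encoder-decoder (T5-style) combines both: bidirectional self-attention in the
# encoder, causal self-attention in the decoder, plus unmasked cross-attention
# from decoder positions to the encoder outputs.
```

With equal logits, row 1 of `dec_weights` puts weight 0.5 on tokens 0 and 1 and zero on the rest, while every row of `enc_weights` is uniform.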

2. Intuition + Visual

One block does two things in sequence. Attention lets every token gather context from every other token — it's the only place information moves between positions. The feed-forward network (FFN) then processes each position independently, adding capacity and non-linearity. Residual connections (x + sublayer(x)) give gradients a clean path and let the network learn refinements rather than full remappings; LayerNorm keeps activations well-scaled.
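The split between "mixing" and "per-position" work can be checked empirically. The following sketch assumes small, arbitrary sizes (`d_model = 8`, five tokens, two heads) and randomly initialized layers: it perturbs only the first token and measures which output positions change.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model = 8
ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)

x = torch.randn(1, 5, d_model)
x_perturbed = x.clone()
x_perturbed[0, 0] += 1.0  # change only the first token

with torch.no_grad():
    # Per-position change in the FFN output.
    ffn_delta = (ffn(x_perturbed) - ffn(x)).abs().sum(dim=-1)[0]
    # Per-position change in the self-attention output.
    attn_a, _ = attn(x, x, x, need_weights=False)
    attn_b, _ = attn(x_perturbed, x_perturbed, x_perturbed, need_weights=False)
    attn_delta = (attn_b - attn_a).abs().sum(dim=-1)[0]

# FFN: positions 1..4 are untouched, because each position is processed alone.
ffn_isolated = torch.allclose(ffn_delta[1:], torch.zeros(4), atol=1e-6)
# Attention: positions 1..4 change, because they read from the perturbed token.
attn_mixes = bool(attn_delta[1:].min() > 1e-6)
```

Both flags should come out `True`: the FFN output moves only at position 0, while attention propagates the change to every position.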

```mermaid
flowchart TB
    X["Input + positional encoding"] --> N1["LayerNorm"]
    N1 --> A["Multi-Head Self-Attention"]
    A --> R1["Add residual"]
    X --> R1
    R1 --> N2["LayerNorm"]
    N2 --> F["Feed-Forward (MLP)"]
    F --> R2["Add residual"]
    R1 --> R2
    R2 --> O["To next block"]
```

Because self-attention is permutation-equivariant (shuffling the input tokens just shuffles the outputs the same way), a Transformer has no inherent notion of order. Positional encoding is injected at the input so the model knows token positions.
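This order-blindness is easy to verify: without positional information, permuting the input tokens permutes the output rows in exactly the same way and changes nothing else. A minimal check, assuming an arbitrary randomly initialized `nn.MultiheadAttention` layer and made-up sizes:

```python
import torch
from torch import nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True).eval()

x = torch.randn(1, 6, 16)   # token embeddings with NO positional signal added
perm = torch.randperm(6)    # a random reordering of the 6 tokens
x_perm = x[:, perm]

with torch.no_grad():
    out, _ = attn(x, x, x, need_weights=False)
    out_perm, _ = attn(x_perm, x_perm, x_perm, need_weights=False)

# Shuffling the inputs shuffles the outputs identically: each token gets the
# same representation no matter where it sits in the sequence.
order_blind = torch.allclose(out[:, perm], out_perm, atol=1e-5)
```

Adding a position-dependent vector to `x` before the attention call breaks this equality, which is the job positional encoding does.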

3. The Math

A modern pre-norm Transformer block, for input $x$:

$$h = x + \text{MHA}(\text{LN}(x))$$
$$y = h + \text{FFN}(\text{LN}(h))$$

The feed-forward network is a two-layer MLP applied position-wise, usually expanding to $4 d_{\text{model}}$:

$$\text{FFN}(z) = W_2\,\phi(W_1 z + b_1) + b_2$$

where $\phi$ is a non-linearity (ReLU, or GELU in most LLMs), $W_1 \in \mathbb{R}^{4d \times d}$, $W_2 \in \mathbb{R}^{d \times 4d}$.

Pre-norm vs. post-norm: the original paper put LayerNorm after the residual ($\text{LN}(x + \text{sublayer}(x))$). Pre-norm (LN inside, before the sublayer) trains far more stably at depth because the residual path stays an identity; this is why nearly all large models use pre-norm. The FFN holds roughly $8 d^2$ parameters per block and dominates the parameter count; attention contributes $\sim 4 d^2$ from the Q/K/V/O projections.
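The "residual path stays an identity" claim can be made concrete with a small experiment. In the sketch below, the helper names `pre_norm` and `post_norm` are illustrative, and the sublayer is a stand-in `nn.Linear` whose weights are zeroed so that only the skip path contributes to the output.

```python
import torch
from torch import Tensor, nn


def pre_norm(x: Tensor, sublayer: nn.Module, ln: nn.LayerNorm) -> Tensor:
    # x + sublayer(LN(x)): the skip path carries x through untouched.
    return x + sublayer(ln(x))


def post_norm(x: Tensor, sublayer: nn.Module, ln: nn.LayerNorm) -> Tensor:
    # LN(x + sublayer(x)): the sum, skip path included, gets renormalized.
    return ln(x + sublayer(x))


torch.manual_seed(0)
d = 16
ln = nn.LayerNorm(d)
sublayer = nn.Linear(d, d)
nn.init.zeros_(sublayer.weight)  # silence the sublayer so only the skip path remains
nn.init.zeros_(sublayer.bias)

x = 5.0 * torch.randn(2, 3, d)   # deliberately not unit-scale

with torch.no_grad():
    pre_is_identity = torch.allclose(pre_norm(x, sublayer, ln), x)
    post_is_identity = torch.allclose(post_norm(x, sublayer, ln), x, atol=1e-3)
```

Here `pre_is_identity` is `True` and `post_is_identity` is `False`: even with a silent sublayer, post-norm rescales the signal at every block, so a deep post-norm stack never offers a pure identity route from input to output.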

4. Implementation

```python
import torch
from torch import Tensor, nn


class TransformerBlock(nn.Module):
    """Pre-norm decoder block: causal self-attention + position-wise FFN."""

    def __init__(self, d_model: int, n_heads: int, mlp_ratio: int = 4, p: float = 0.1) -> None:
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=p, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
            nn.Dropout(p),
        )

    def forward(self, x: Tensor, attn_mask: Tensor | None = None) -> Tensor:
        h = self.ln1(x)
        # causal mask makes this a decoder block (each token sees only the past)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out                       # residual 1
        x = x + self.ffn(self.ln2(x))          # residual 2 (pre-norm)
        return x


block = TransformerBlock(d_model=512, n_heads=8)
seq = torch.randn(2, 16, 512)
causal = torch.triu(torch.full((16, 16), float("-inf")), diagonal=1)
assert block(seq, attn_mask=causal).shape == (2, 16, 512)
```
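To see how one block becomes a model, the sketch below stacks N pre-norm blocks into a small decoder-only network. It is an illustration under several assumptions: `TinyDecoder`, `vocab_size`, and `max_len` are hypothetical names, PyTorch's built-in `nn.TransformerEncoderLayer` with `norm_first=True` stands in for the block so the snippet is self-contained, and learned position embeddings are just one of several ways to inject position.

```python
import torch
from torch import Tensor, nn


class TinyDecoder(nn.Module):
    """Hypothetical decoder-only stack: embeddings -> N pre-norm blocks -> logits."""

    def __init__(self, vocab_size: int, d_model: int, n_heads: int, n_layers: int, max_len: int) -> None:
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # learned positions (an assumption)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=4 * d_model,
                activation="gelu", batch_first=True, norm_first=True,
            )
            for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)          # final LayerNorm, common in pre-norm stacks
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, ids: Tensor) -> Tensor:
        seq_len = ids.size(1)
        positions = torch.arange(seq_len, device=ids.device)
        x = self.tok(ids) + self.pos(positions)    # inject order at the input
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=ids.device), diagonal=1
        )
        for block in self.blocks:                  # N identical blocks, applied in sequence
            x = block(x, src_mask=causal)
        return self.head(self.ln_f(x))             # per-position next-token logits


model = TinyDecoder(vocab_size=100, d_model=64, n_heads=4, n_layers=2, max_len=32)
logits = model(torch.randint(0, 100, (2, 16)))     # shape: (batch, seq_len, vocab_size)
```

Because every block receives the same causal mask, the logits at position t depend only on tokens 0..t, which is what allows training on all positions of a sequence at once.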

5. Interview Questions

  1. Conceptual: What is the difference between encoder-only, decoder-only, and encoder-decoder Transformers, and give one model for each. (BERT = encoder/bidirectional; GPT = decoder/causal; T5 = encoder-decoder/seq2seq.)
  2. Implementation: Why are residual connections essential in a deep Transformer? (They give gradients an identity path, preventing vanishing gradients and letting each block learn a refinement.)
  3. Applied: Why pre-norm over post-norm in large models? (Pre-norm keeps the residual stream an identity, making deep stacks trainable without careful warmup; post-norm is unstable at depth.)
  4. Systems-level: Where do most of a Transformer's parameters and FLOPs live? (The FFN — ~8d² params per block — dominates parameters; attention's quadratic cost in sequence length dominates compute for long contexts.)
  5. Failure modes: Why does a Transformer need positional encoding at all? (Self-attention is permutation-equivariant: reordering the input tokens only reorders the outputs, so without positional information the model can't distinguish token order.)
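
The parameter and FLOP split in question 4 can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: `d = 512` and 8 heads are arbitrary, a multiply-add counts as 2 FLOPs, and softmax, LayerNorm, and biases are ignored in the FLOP estimate. The helper names `n_params` and `flops_per_block` are illustrative.

```python
from torch import nn

d = 512
attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))


def n_params(module: nn.Module) -> int:
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())


attn_params = n_params(attn)  # 4d^2 weights (Q/K/V/O) + 4d biases
ffn_params = n_params(ffn)    # 8d^2 weights (two 4x layers) + 5d biases


def flops_per_block(n: int, d: int) -> tuple[int, int]:
    """Rough forward FLOPs for one block on a length-n sequence."""
    linear = 2 * n * (4 * d * d + 8 * d * d)  # Q/K/V/O projections + FFN: 24 n d^2
    quadratic = 2 * (2 * n * n * d)           # QK^T scores + weighted sum of V: 4 n^2 d
    return linear, quadratic


short_linear, short_quadratic = flops_per_block(n=512, d=d)    # linear term dominates
long_linear, long_quadratic = flops_per_block(n=16384, d=d)    # quadratic term dominates
```

Under these assumptions the two terms are equal at n = 6d, so for `d = 512` the quadratic attention cost overtakes the projections and FFN at about 3,072 tokens.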

6. Retrieval Check

From memory: draw one pre-norm Transformer block, write the two residual equations, state what the FFN does that attention doesn't, and name the three encoder/decoder configurations with an example each. Check against Stages 2–3.
