Transformer Architecture Explained (ML Interview)

The Transformer block from the ground up — self-attention plus a position-wise feed-forward network, residuals and LayerNorm, and the encoder/decoder configurations — with the math, PyTorch, and calibrated interview questions.

1. Big Picture

The Transformer is the architecture underneath modern NLP and most large models. It threw out recurrence entirely and built sequence modeling from self-attention + feed-forward blocks, stacked with residual connections and normalization. The payoff: during training every position is processed in parallel (no sequential RNN unrolling), and any token can attend to any other in a single step, so training parallelizes well and long-range dependencies are one attention hop away instead of many recurrent steps.

Three configurations matter for interviews: encoder-only (BERT — bidirectional, for understanding/classification), decoder-only (GPT — causal/autoregressive, for generation), and encoder-decoder (T5, original Transformer — for translation/seq2seq). The frame to hold: a Transformer is N identical blocks stacked, each block doing "mix information across positions (attention), then transform each position independently (FFN)," wrapped in residuals and LayerNorm. Master one block and you understand the whole stack.
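As a rough illustration of how the three configurations differ in what each position may attend to, the sketch below builds boolean "allowed" matrices. The helper name `allowed_attention` and its `kind` strings are hypothetical labels made up for this example; they are not part of any library.

```python
from typing import Optional

import torch


def allowed_attention(kind: str, tgt_len: int, src_len: Optional[int] = None) -> torch.Tensor:
    """Return a boolean matrix where entry [i, j] is True if query i may attend to key j.

    Hypothetical helper for illustration only.
    """
    if kind == "encoder_self":   # BERT-style: every token sees every token
        return torch.ones(tgt_len, tgt_len).bool()
    if kind == "decoder_self":   # GPT-style: token i sees tokens 0..i only
        return torch.ones(tgt_len, tgt_len).tril().bool()
    if kind == "cross":          # encoder-decoder: each target token sees the whole source
        if src_len is None:
            raise ValueError("cross-attention needs src_len")
        return torch.ones(tgt_len, src_len).bool()
    raise ValueError(f"unknown kind: {kind}")


# Lower-triangular pattern: row i has ones in columns 0..i.
print(allowed_attention("decoder_self", 4).int())
```

Note that `nn.MultiheadAttention` uses the opposite convention for boolean `attn_mask` (True means blocked), so a matrix like this would need to be inverted with `~` before being passed in.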

2Intuition + Visual

One block does two things in sequence. Attention lets every token gather context from every other token — it's the only place information moves between positions. The feed-forward network (FFN) then processes each position independently, adding capacity and non-linearity. Residual connections (x + sublayer(x)) give gradients a clean path and let the network learn refinements rather than full remappings; LayerNorm keeps activations well-scaled.
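A quick way to convince yourself that the FFN is position-wise is to run it once on the whole sequence and once per position, then compare. This is a minimal sketch with small, arbitrary sizes chosen for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model = 8
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),
)

x = torch.randn(1, 5, d_model)  # (batch, seq_len, d_model)

with torch.no_grad():
    full = ffn(x)  # FFN applied to the whole sequence at once
    # FFN applied to one position at a time, then re-assembled
    per_pos = torch.stack([ffn(x[:, t]) for t in range(x.shape[1])], dim=1)
    ffn_is_position_wise = torch.allclose(full, per_pos, atol=1e-5)

    # Changing token 0 leaves the FFN output at every other position untouched.
    x2 = x.clone()
    x2[:, 0] = torch.randn(d_model)
    others_unchanged = torch.allclose(ffn(x2)[:, 1:], full[:, 1:], atol=1e-5)

print(ffn_is_position_wise, others_unchanged)
```

Both checks pass because the FFN never reads a neighbouring position; only attention can move information along the sequence axis.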

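Because self-attention is permutation-equivariant (shuffling the input tokens just shuffles the outputs the same way), a Transformer has no inherent notion of order. Positional information is therefore injected, typically as a positional encoding added at the input, so the model knows token positions.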
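The order-blindness of self-attention can be checked numerically. The sketch below uses `nn.MultiheadAttention` with no mask, and a random tensor `pos` as a stand-in for a real positional encoding (an assumption made purely for illustration).

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model, seq_len = 16, 6
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True).eval()

x = torch.randn(1, seq_len, d_model)
perm = torch.tensor([2, 0, 1, 5, 3, 4])  # a fixed, non-identity shuffle
x_shuf = x[:, perm]

with torch.no_grad():
    out, _ = attn(x, x, x, need_weights=False)
    out_shuf, _ = attn(x_shuf, x_shuf, x_shuf, need_weights=False)
    # Without positions: shuffling the tokens only shuffles the outputs.
    order_blind = torch.allclose(out[:, perm], out_shuf, atol=1e-5)

    # With a positional signal added at the input, token order changes the result.
    pos = torch.randn(seq_len, d_model)  # stand-in for a real positional encoding
    xp, xp_shuf = x + pos, x_shuf + pos
    out_pos, _ = attn(xp, xp, xp, need_weights=False)
    out_pos_shuf, _ = attn(xp_shuf, xp_shuf, xp_shuf, need_weights=False)
    order_aware = not torch.allclose(out_pos[:, perm], out_pos_shuf, atol=1e-5)

print(order_blind, order_aware)
```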

3The Math

A modern pre-norm Transformer block, for input $x$:

$$h = x + \text{MHA}(\text{LN}(x))$$

$$y = h + \text{FFN}(\text{LN}(h))$$
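Unrolling these two equations over a stack makes the identity path explicit. The short derivation below assumes block $\ell$ maps $x_{\ell-1}$ to $x_\ell$ using exactly the two equations above; the symbols $x_\ell$, $h_\ell$, and $N$ (number of blocks) are introduced here for illustration.

$$
h_\ell = x_{\ell-1} + \text{MHA}_\ell(\text{LN}(x_{\ell-1})), \qquad x_\ell = h_\ell + \text{FFN}_\ell(\text{LN}(h_\ell))
$$

$$
x_N = x_0 + \sum_{\ell=1}^{N} \Big[ \text{MHA}_\ell(\text{LN}(x_{\ell-1})) + \text{FFN}_\ell(\text{LN}(h_\ell)) \Big]
$$

Every sublayer output is added to the stream and nothing is ever applied on top of the stream itself, so the Jacobian $\partial x_N / \partial x_0$ always contains an identity term, whatever the depth $N$.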

The feed-forward network is a two-layer MLP applied position-wise, usually expanding to $4 d_{\text{model}}$:

$$\text{FFN}(z) = W_2\,\phi(W_1 z + b_1) + b_2$$

where $\phi$ is a non-linearity (ReLU, or GELU in many LLMs), $W_1 \in \mathbb{R}^{4d \times d}$, $W_2 \in \mathbb{R}^{d \times 4d}$, and $d = d_{\text{model}}$.

Pre-norm vs. post-norm: the original paper put LayerNorm after the residual ($\text{LN}(x + \text{sublayer}(x))$). Pre-norm (LN inside the branch, before the sublayer) trains far more stably at depth because the residual path stays an identity, which is why nearly all large models use it.

Parameter count: the FFN holds roughly $8 d^2$ parameters per block ($W_1$ and $W_2$ at $4 d^2$ each) and dominates the parameter count; attention contributes $\sim 4 d^2$ from the Q/K/V/O projections ($d^2$ each).
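The parameter estimates can be checked directly by counting parameters on PyTorch modules. This is a small sketch assuming $d = 512$ and the standard 4x FFN expansion; `n_params` is a helper defined here for the example.

```python
from torch import nn


def n_params(module: nn.Module) -> int:
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())


d = 512
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
attn = nn.MultiheadAttention(d, num_heads=8)

ffn_params = n_params(ffn)    # 8*d^2 weights + 5*d biases
attn_params = n_params(attn)  # 4*d^2 weights + 4*d biases (Q, K, V, O)

print(ffn_params, 8 * d * d)   # 2099712 vs 2097152
print(attn_params, 4 * d * d)  # 1050624 vs 1048576
```

The small gap between the exact counts and the $8d^2$ / $4d^2$ estimates is the bias terms, which are negligible at realistic widths.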

4. Implementation

```python
import torch
from torch import Tensor, nn


class TransformerBlock(nn.Module):
    """Pre-norm decoder block: causal self-attention + position-wise FFN."""

    def __init__(self, d_model: int, n_heads: int, mlp_ratio: int = 4, p: float = 0.1) -> None:
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=p, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
            nn.Dropout(p),
        )

    def forward(self, x: Tensor, attn_mask: Tensor | None = None) -> Tensor:
        h = self.ln1(x)
        # causal mask makes this a decoder block (each token sees only the past)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out                       # residual 1
        x = x + self.ffn(self.ln2(x))          # residual 2 (pre-norm)
        return x


block = TransformerBlock(d_model=512, n_heads=8)
seq = torch.randn(2, 16, 512)
causal = torch.triu(torch.full((16, 16), float("-inf")), diagonal=1)
assert block(seq, attn_mask=causal).shape == (2, 16, 512)
```
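To see the "N identical blocks stacked" picture in action, the sketch below stacks a few pre-norm layers and checks that a causal mask really does stop information flowing backward from later tokens. It uses PyTorch's built-in `nn.TransformerEncoderLayer` with `norm_first=True` as a stand-in for the hand-written block (an assumption made to keep the example short); the sizes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model, n_heads, n_layers, seq_len = 64, 4, 3, 10

# norm_first=True selects the pre-norm ordering; activation="gelu" matches a GELU FFN.
layer = nn.TransformerEncoderLayer(
    d_model, n_heads, dim_feedforward=4 * d_model, dropout=0.0,
    activation="gelu", batch_first=True, norm_first=True,
)
stack = nn.TransformerEncoder(layer, num_layers=n_layers, enable_nested_tensor=False).eval()

# -inf above the diagonal blocks attention to future positions.
causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

x = torch.randn(1, seq_len, d_model)
x_edited = x.clone()
x_edited[:, -1] = torch.randn(d_model)  # change only the last token

with torch.no_grad():
    y = stack(x, mask=causal)
    y_edited = stack(x_edited, mask=causal)

# Earlier positions cannot see the edited last token, so their outputs match.
past_unchanged = torch.allclose(y[:, :-1], y_edited[:, :-1], atol=1e-5)
last_changed = not torch.allclose(y[:, -1], y_edited[:, -1], atol=1e-5)
print(past_unchanged, last_changed)
```

Dropping the `mask` argument turns the same stack into a bidirectional (encoder-style) model, in which case editing the last token would change every position's output.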

5. Interview Questions

Conceptual: What is the difference between encoder-only, decoder-only, and encoder-decoder Transformers? Give one model for each. (BERT = encoder/bidirectional; GPT = decoder/causal; T5 = encoder-decoder/seq2seq.)
Implementation: Why are residual connections essential in a deep Transformer? (They give gradients an identity path, preventing vanishing gradients and letting each block learn a refinement.)
Applied: Why pre-norm over post-norm in large models? (Pre-norm keeps the residual stream an identity, making deep stacks trainable without careful warmup; post-norm is unstable at depth.)
Systems-level: Where do most of a Transformer's parameters and FLOPs live? (The FFN — ~8d² params per block — dominates parameters; attention's quadratic cost in sequence length dominates compute for long contexts.)
Failure modes: Why does a Transformer need positional encoding at all? (Self-attention is permutation-equivariant: shuffling the input tokens just shuffles the outputs, so without positional information the model can't distinguish token order.)
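For the systems-level question, a back-of-the-envelope FLOP count shows where the quadratic term takes over. The sketch below is a rough estimate under stated assumptions: one multiply-add counts as 2 FLOPs, the FFN uses the 4x expansion, and softmax, LayerNorm, and biases are ignored. The function name `block_flops` is made up for this example.

```python
def block_flops(seq_len: int, d_model: int) -> tuple[int, int]:
    """Approximate forward FLOPs for one block: (linear layers, attention matrices)."""
    # Q/K/V/O projections (4*d^2 weights) plus FFN (8*d^2 weights), applied to every token
    linear = 2 * seq_len * 12 * d_model**2
    # QK^T scores and the attention-weighted sum over V: each costs 2 * T^2 * d
    attention = 2 * 2 * seq_len**2 * d_model
    return linear, attention


d = 512
for T in (512, 3072, 16384):
    linear, attention = block_flops(T, d)
    print(T, f"attention share = {attention / (linear + attention):.2f}")
# 512 attention share = 0.14
# 3072 attention share = 0.50
# 16384 attention share = 0.84
```

Under these assumptions the two terms are equal at $T = 6d$: below that the linear layers dominate compute, and above it the quadratic attention term does.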

6. Retrieval Check

From memory: draw one pre-norm Transformer block, write the two residual equations, state what the FFN does that attention doesn't, and name the three encoder/decoder configurations with an example each. Check against Stages 1–3.
