Why does a Transformer need positional encoding?

Self-attention is permutation-invariant — it has no built-in sense of token order — so positional encoding injects where each token sits in the sequence.

Do you add or concatenate positional encodings?

Add them to the token embeddings. Adding keeps the dimensionality fixed and lets each dimension carry both content and position; concatenation would waste model width.

What is RoPE (rotary positional encoding)?

RoPE rotates query and key vectors by a position-dependent angle so that the attention dot product depends on position only through the relative offset between the two tokens, which in practice tends to generalize better to long contexts than absolute encodings.

Why can sinusoidal encoding handle longer sequences than seen in training?

It's a fixed function defined for any position, so it produces valid encodings beyond the training length, unlike learned absolute embeddings.

Positional Encoding Explained (ML Interview)

Why self-attention needs position information and how to give it — sinusoidal, learned, and rotary (RoPE) encodings — with the math, PyTorch, and interview questions.

1. Big Picture

Self-attention is permutation-equivariant (often loosely called permutation-invariant): shuffle the input tokens and the raw attention computation gives the same set of outputs, just reordered. That's a problem: "the dog bit the man" and "the man bit the dog" mean different things. Positional encoding injects information about where each token sits in the sequence, so the model can use order.
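The reordering claim can be checked numerically. The following is a minimal sketch, assuming a single attention head with identity query/key/value projections and no positional signal; the helper name `self_attention` is illustrative and not taken from any library.

```python
import torch

torch.manual_seed(0)


def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head attention with identity projections and no positional signal."""
    scores = x @ x.T / x.shape[-1] ** 0.5       # (seq, seq) scaled dot products
    return torch.softmax(scores, dim=-1) @ x    # (seq, d) weighted sum of values


x = torch.randn(5, 8)        # 5 tokens, width 8
perm = torch.randperm(5)     # a shuffled token order

out = self_attention(x)
out_shuffled = self_attention(x[perm])

# Shuffling the inputs only shuffles the outputs in the same way:
# each token's output vector is unchanged, so order carries no information.
same_up_to_order = torch.allclose(out[perm], out_shuffled, atol=1e-5)
print(same_up_to_order)
```

Learned projection matrices would not change this result, because they act on each token independently.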

There are three families interviewers expect you to know: sinusoidal (the original Transformer — fixed, parameter-free), learned absolute positions (a trainable embedding per position, used by BERT/GPT-2), and rotary (RoPE) (rotates query/key vectors by a position-dependent angle, now standard in LLaMA-style models). The frame to hold: attention has no built-in sense of order, so you add position information — either to the token embeddings (absolute) or into the attention dot product itself (relative/RoPE).

2. Intuition + Visual

Sinusoidal encoding gives each position a unique "fingerprint": a vector of sine and cosine values at many different frequencies. Low-frequency dimensions change slowly across positions (coarse location); high-frequency dimensions change fast (fine location). Because the fingerprint is built from sinusoids, the model can also represent relative offsets: for any fixed offset k, the encoding of position pos + k is a linear function of the encoding at pos, and that function depends only on k.

You simply add the positional vector to the token embedding before the first attention layer, so every token carries both "what I am" and "where I am."

3. The Math

The original sinusoidal encoding, for position $pos$ and dimension index $i$ in a model of width $d$:

$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

Each pair of dimensions $(2i, 2i+1)$ is a sinusoid with wavelength $2\pi \cdot 10000^{2i/d}$, so the wavelengths form a geometric progression from $2\pi$ toward $10000 \cdot 2\pi$. Two useful properties: it's defined for any position (so it yields valid encodings for sequences longer than those seen in training, although model quality at unseen lengths is not guaranteed), and for a fixed offset $k$, $PE(pos+k)$ is a linear transform of $PE(pos)$, letting attention learn to attend by relative distance.
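To see why the shift is linear, write $\omega_i = 1/10000^{2i/d}$ (a shorthand introduced here for convenience). The angle-addition identities then give, for each pair of dimensions:

$$\begin{pmatrix} PE(pos+k,\, 2i) \\ PE(pos+k,\, 2i+1) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} PE(pos,\, 2i) \\ PE(pos,\, 2i+1) \end{pmatrix}$$

The matrix depends on the offset $k$ and the frequency $\omega_i$ but not on $pos$, so the same linear map shifts every position by $k$.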

RoPE takes a different route: instead of adding to embeddings, it rotates the query and key vectors by an angle proportional to their position. The dot product $q_m \cdot k_n$ then depends on position only through the relative offset $m - n$, baking relative position directly into attention, which in practice tends to generalize better to long contexts.
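The relative-offset property can be illustrated with a small sketch. It assumes each consecutive pair of dimensions is rotated by `pos * theta_i`, with the same base of 10000 as the sinusoidal frequencies; the function name `rope_rotate` is hypothetical, and production implementations typically work on batched per-head tensors and may lay out the pairs differently.

```python
import torch

torch.manual_seed(0)


def rope_rotate(x: torch.Tensor, pos: int, base: float = 10000.0) -> torch.Tensor:
    """Rotate each pair (x[2i], x[2i+1]) of a 1-D vector by the angle pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float64) / d)  # (d/2,)
    angle = pos * theta
    cos, sin = torch.cos(angle), torch.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = torch.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin   # standard 2-D rotation, first coordinate
    out[1::2] = x_even * sin + x_odd * cos   # standard 2-D rotation, second coordinate
    return out


def score(q: torch.Tensor, k: torch.Tensor, m: int, n: int) -> float:
    """Attention logit between a query at position m and a key at position n."""
    return torch.dot(rope_rotate(q, m), rope_rotate(k, n)).item()


q = torch.randn(8, dtype=torch.float64)
k = torch.randn(8, dtype=torch.float64)

s_near = score(q, k, m=5, n=2)       # offset m - n = 3
s_far = score(q, k, m=105, n=102)    # same offset, much later in the sequence
print(abs(s_near - s_far) < 1e-9)
```

Both scores share the offset 3, so they agree up to floating-point error even though the absolute positions differ by 100.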

4. Implementation

```python
import torch
from torch import Tensor


def sinusoidal_encoding(seq_len: int, d_model: int) -> Tensor:
    pos = torch.arange(seq_len).unsqueeze(1)                 # (seq, 1)
    i = torch.arange(0, d_model, 2)                          # even indices
    div = torch.exp(i * (-torch.log(torch.tensor(10000.0)) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                       # even dims -> sin
    pe[:, 1::2] = torch.cos(pos * div)                       # odd dims  -> cos
    return pe


pe = sinusoidal_encoding(seq_len=50, d_model=128)
tokens = torch.randn(50, 128)
position_aware = tokens + pe                                 # add, don't concat
assert pe.shape == (50, 128)
```
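In a model, the table is usually computed once and stored with the module rather than rebuilt on every call. The following is a sketch of one way to wrap it, assuming an even `d_model`, batch-first inputs, and an illustrative `max_len` of 512; the class name `SinusoidalPositionalEncoding` is hypothetical.

```python
import math

import torch
from torch import Tensor, nn


class SinusoidalPositionalEncoding(nn.Module):
    """Adds a precomputed sinusoidal table to inputs of shape (batch, seq, d_model)."""

    def __init__(self, d_model: int, max_len: int = 512) -> None:
        super().__init__()
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
        i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even indices
        div = torch.exp(i * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        # A buffer moves with .to(device) and is saved in state_dict, but is not trained.
        self.register_buffer("pe", pe)

    def forward(self, x: Tensor) -> Tensor:
        # Slice the table to the current sequence length; broadcasting covers the batch.
        return x + self.pe[: x.size(1)]


layer = SinusoidalPositionalEncoding(d_model=128)
batch = torch.zeros(2, 50, 128)      # zeros make the added encoding easy to inspect
out = layer(batch)
print(out.shape)
```

Registering the table as a buffer keeps the encoding parameter-free, which matches the "fixed, parameter-free" description of the sinusoidal family.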

5. Interview Questions

Conceptual: Why does a Transformer need positional encoding at all? (Self-attention is permutation-invariant — without position information it can't distinguish token order.)
Implementation: Do you add or concatenate positional encodings, and why add? (Add — it keeps dimensionality fixed and lets each dimension carry both content and position; concatenation wastes width.)
Applied: Why is sinusoidal encoding able to handle sequences longer than those seen in training? (It's a fixed function defined for any position, so it produces valid encodings beyond the training length.)
Systems-level: What problem does RoPE solve over absolute learned positions? (It encodes relative position directly in the attention dot product, which tends to generalize better to long contexts than learned absolute embeddings.)
Failure modes: What's the limitation of learned absolute positional embeddings? (They're only defined up to the trained max length and don't extrapolate — sequences longer than training have no embedding.)
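The failure mode in the last question can be reproduced directly. This is a small sketch, assuming learned positions are stored in a plain `nn.Embedding` table; the sizes are hypothetical and chosen only for illustration.

```python
import torch
from torch import nn

max_len, d_model = 16, 8                       # hypothetical sizes for illustration
learned_pe = nn.Embedding(max_len, d_model)    # one trainable vector per position

in_range = learned_pe(torch.arange(max_len))   # positions 0..15 all have a row
print(in_range.shape)

try:
    learned_pe(torch.tensor([max_len]))        # position 16 was never allocated
    lookup_failed = False
except IndexError:
    lookup_failed = True                       # no embedding row exists for it

print(lookup_failed)
```

A sinusoidal table has no such hard limit, since any position can be computed on demand.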

6. Retrieval Check

From memory: write the sinusoidal PE formula, explain why you add rather than concatenate, and state one advantage of sinusoidal/RoPE over learned absolute positions. Check against Stage 3.
