Deep Learning
Positional Encoding
Why self-attention needs position information and how to give it — sinusoidal, learned, and rotary (RoPE) encodings — with the math, PyTorch, and interview questions.
Self-attention is permutation-equivariant (often loosely called permutation-invariant): shuffle the input tokens and the raw attention computation gives the same set of outputs, just reordered. That's a problem, because "the dog bit the man" and "the man bit the dog" mean different things. Positional encoding injects information about where each token sits in the sequence, so the model can use order.
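The reordering claim can be checked numerically. The snippet below is a minimal sketch, assuming a single attention head with identity query/key/value projections (a simplification; `self_attention` is a hypothetical helper, not a library function). It shuffles the tokens and compares the result against the unshuffled output reordered the same way.

```python
import torch

torch.manual_seed(0)


def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product self-attention with identity projections."""
    scores = x @ x.T / x.shape[-1] ** 0.5          # (seq, seq) similarity scores
    return torch.softmax(scores, dim=-1) @ x       # weighted sum of value vectors


x = torch.randn(5, 8)        # 5 tokens of width 8, no positional information
perm = torch.randperm(5)     # an arbitrary reordering of the tokens

out = self_attention(x)
out_shuffled = self_attention(x[perm])

# Shuffling the inputs only shuffles the outputs in the same way:
# each token's output vector is unchanged, so order carries no signal.
equivariant = torch.allclose(out_shuffled, out[perm], atol=1e-5)
```

With learned projection matrices the same argument holds, since the projections act on each token independently.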
There are three families interviewers expect you to know: sinusoidal (the original Transformer — fixed, parameter-free), learned absolute positions (a trainable embedding per position, used by BERT/GPT-2), and rotary (RoPE) (rotates query/key vectors by a position-dependent angle, now standard in LLaMA-style models). The frame to hold: attention has no built-in sense of order, so you add position information — either to the token embeddings (absolute) or into the attention dot product itself (relative/RoPE).
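As an illustration of the learned-absolute family, here is a minimal sketch built on `torch.nn.Embedding`. The class name `LearnedPositionalEmbedding` and the `max_len` parameter are hypothetical names chosen for this example, not taken from BERT or GPT-2 code.

```python
import torch
from torch import nn


class LearnedPositionalEmbedding(nn.Module):
    """Adds one trainable vector per position (hypothetical illustrative module)."""

    def __init__(self, max_len: int, d_model: int) -> None:
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)  # one row per position

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        positions = torch.arange(x.shape[1], device=x.device)  # 0 .. seq-1
        return x + self.pos_emb(positions)  # broadcasts over the batch dimension


layer = LearnedPositionalEmbedding(max_len=512, d_model=64)
tokens = torch.randn(2, 10, 64)
out = layer(tokens)  # (2, 10, 64)
```

Because the table has exactly `max_len` rows, a sequence longer than `max_len` has no row to look up and the embedding lookup fails.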
Sinusoidal encoding gives each position a unique "fingerprint": a vector of sine and cosine values at many different frequencies. Low-frequency dimensions change slowly across positions (coarse location); high-frequency dimensions change fast (fine location). Because the fingerprint is built from sinusoids, the model can also represent relative offsets — the encoding of position pos + k is a fixed linear function of the encoding at pos.
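To make the range of frequencies concrete, the sketch below lists the wavelength of each sine/cosine pair, assuming the base of 10000 from the original Transformer and an illustrative width of 128.

```python
import math

d_model = 128  # illustrative width, chosen for this example

# Pair i uses the angle pos / 10000^(2i/d_model), so one full cycle
# takes 2*pi * 10000^(2i/d_model) positions.
wavelengths = [2 * math.pi * 10000 ** (2 * i / d_model) for i in range(d_model // 2)]

shortest = wavelengths[0]    # fastest-changing pair: fine location
longest = wavelengths[-1]    # slowest-changing pair: coarse location
```

The shortest wavelength is exactly 2π positions, and the longest stays just below 10000 · 2π, so the pairs cover scales from a few tokens to tens of thousands of tokens.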
flowchart LR
T["Token embedding"] --> A["add"]
P["Positional encoding (per position)"] --> A
A --> O["Position-aware input to attention"]
You simply add the positional vector to the token embedding before the first attention layer, so every token carries both "what I am" and "where I am."
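The original sinusoidal encoding, for position $pos$ and dimension index $i$ in a model of width $d_{\text{model}}$:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$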
Each pair of dimensions $(2i, 2i+1)$ is a sinusoid with wavelength $2\pi \cdot 10000^{2i/d_{\text{model}}}$, geometrically increasing from $2\pi$ to $10000 \cdot 2\pi$. Two useful properties: it's defined for any position (so it extrapolates to longer sequences than seen in training), and for a fixed offset $k$, $PE_{pos+k}$ is a linear transform of $PE_{pos}$, letting attention learn to attend by relative distance.
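The linear-transform property follows from the angle-addition identities: for each sine/cosine pair with frequency $\omega_i = 10000^{-2i/d_{\text{model}}}$, shifting by $k$ is a 2×2 rotation by angle $\omega_i k$ that does not depend on $pos$. The sketch below checks this numerically. `sinusoidal_pe` and `offset_matrix` are hypothetical helper names written for this example, and float64 is used only to keep the comparison tight.

```python
import math

import torch


def pair_frequencies(d_model: int) -> torch.Tensor:
    """omega_i = 10000^(-2i/d_model) for each (sin, cos) pair."""
    two_i = torch.arange(0, d_model, 2, dtype=torch.float64)
    return torch.exp(two_i * (-math.log(10000.0) / d_model))


def sinusoidal_pe(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float64).unsqueeze(1)
    angles = pos * pair_frequencies(d_model)           # (seq, d_model / 2)
    pe = torch.zeros(seq_len, d_model, dtype=torch.float64)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe


def offset_matrix(k: int, d_model: int) -> torch.Tensor:
    """Block-diagonal matrix M_k with PE[pos + k] = M_k @ PE[pos] for every pos."""
    m = torch.zeros(d_model, d_model, dtype=torch.float64)
    for j, w in enumerate(pair_frequencies(d_model).tolist()):
        c, s = math.cos(w * k), math.sin(w * k)
        # sin(a + b) =  cos(b) sin(a) + sin(b) cos(a)
        m[2 * j, 2 * j] = c
        m[2 * j, 2 * j + 1] = s
        # cos(a + b) = -sin(b) sin(a) + cos(b) cos(a)
        m[2 * j + 1, 2 * j] = -s
        m[2 * j + 1, 2 * j + 1] = c
    return m


pe = sinusoidal_pe(seq_len=64, d_model=16)
k = 5
shifted = pe[:-k] @ offset_matrix(k, 16).T   # apply the same matrix to every position
matches = torch.allclose(shifted, pe[k:])
```

One matrix per offset works for every starting position, which is what lets a linear query/key projection pick up relative distance.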
RoPE takes a different route: instead of adding to embeddings, it rotates the query and key vectors by an angle proportional to their position. For a query at position $m$ and a key at position $n$, the dot product then depends on position only through the relative offset $m - n$. This bakes relative position directly into attention, which generalizes better to long contexts.
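A minimal sketch of the rotation, assuming the interleaved pairing of dimensions $(2i, 2i+1)$ and a base of 10000. The function name `rope` and its signature are hypothetical, and production implementations differ in layout and caching. The check at the end confirms that the query-key score is unchanged when both positions move by the same amount.

```python
import torch


def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each (even, odd) pair of x by angle position * theta_i.

    x: (seq, d) with d even; positions: (seq,) integer positions.
    """
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)   # (d/2,) per-pair frequency
    angles = positions.to(x.dtype).unsqueeze(-1) * theta          # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin   # standard 2D rotation, first coordinate
    out[..., 1::2] = x_even * sin + x_odd * cos   # standard 2D rotation, second coordinate
    return out


torch.manual_seed(0)
q = torch.randn(1, 16, dtype=torch.float64)
k = torch.randn(1, 16, dtype=torch.float64)


def score(m: int, n: int) -> torch.Tensor:
    """Dot product of q placed at position m with k placed at position n."""
    return (rope(q, torch.tensor([m])) * rope(k, torch.tensor([n]))).sum()


# Both pairs of positions have the same offset (4), so the scores agree.
same_offset = torch.allclose(score(3, 7), score(103, 107))
```

Rotations preserve vector norms, so RoPE changes only the angle between query and key, not their magnitudes.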
```python
import math

import torch
from torch import Tensor


def sinusoidal_encoding(seq_len: int, d_model: int) -> Tensor:
    """Return the (seq_len, d_model) sinusoidal positional encoding table."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so sin/cos dimensions pair up")
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even indices 2i
    div = torch.exp(i * (-math.log(10000.0) / d_model))            # 1 / 10000^(2i/d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dims -> sin
    pe[:, 1::2] = torch.cos(pos * div)  # odd dims -> cos
    return pe


pe = sinusoidal_encoding(seq_len=50, d_model=128)
tokens = torch.randn(50, 128)
position_aware = tokens + pe  # add, don't concat
assert pe.shape == (50, 128)
```

- Conceptual: Why does a Transformer need positional encoding at all? (Self-attention is permutation-equivariant: without position information it can't distinguish token order.)
- Implementation: Do you add or concatenate positional encodings, and why add? (Add — it keeps dimensionality fixed and lets each dimension carry both content and position; concatenation wastes width.)
- Applied: Why is sinusoidal encoding able to handle sequences longer than those seen in training? (It's a fixed function defined for any position, so it produces valid encodings beyond the training length.)
- Systems-level: What problem does RoPE solve over absolute learned positions? (It encodes relative position directly in the attention dot product, generalizing better to long contexts than fixed absolute embeddings.)
- Failure modes: What's the limitation of learned absolute positional embeddings? (They're only defined up to the trained max length and don't extrapolate — sequences longer than training have no embedding.)
From memory: write the sinusoidal PE formula, explain why you add rather than concatenate, and state one advantage of sinusoidal/RoPE over learned absolute positions. Check against Stage 3.