Deep Learning

Positional Encoding

Why self-attention needs position information and how to give it — sinusoidal, learned, and rotary (RoPE) encodings — with the math, PyTorch, and interview questions.

7 min read · Reviewed May 2026

1. Big Picture

Self-attention is permutation-equivariant: shuffle the input tokens and the raw attention computation returns the same set of outputs, just reordered to match. Nothing in the computation depends on where a token sits, which is a problem because "the dog bit the man" and "the man bit the dog" mean different things. Positional encoding injects information about each token's position in the sequence so the model can use order.
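The permutation property can be checked numerically. The sketch below is a minimal single-head attention with no positional signal; the function name `self_attention`, the random weights, and the specific permutation are illustrative assumptions, not part of any library.

```python
import torch

torch.manual_seed(0)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention, no positional signal."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v


seq_len, d = 5, 8
x = torch.randn(seq_len, d, dtype=torch.float64)            # stand-in token embeddings
w_q, w_k, w_v = (torch.randn(d, d, dtype=torch.float64) for _ in range(3))

perm = torch.tensor([3, 0, 4, 1, 2])                         # an arbitrary shuffle
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the inputs only shuffles the outputs in the same way:
# each token's output vector is unchanged, so order carries no information.
is_equivariant = torch.allclose(out_shuffled, out[perm])
```

If `is_equivariant` is true, the layer by itself cannot tell the two word orders apart, which is the gap positional encoding fills.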

There are three families interviewers expect you to know: sinusoidal (the original Transformer — fixed, parameter-free), learned absolute positions (a trainable embedding per position, used by BERT/GPT-2), and rotary (RoPE) (rotates query/key vectors by a position-dependent angle, now standard in LLaMA-style models). The frame to hold: attention has no built-in sense of order, so you add position information — either to the token embeddings (absolute) or into the attention dot product itself (relative/RoPE).
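As a rough illustration of the learned absolute family, here is a minimal sketch built on `torch.nn.Embedding`. The class name `LearnedPositionalEmbedding` and the explicit length check are assumptions made for this example, not how any particular model implements it.

```python
import torch
from torch import nn


class LearnedPositionalEmbedding(nn.Module):
    """Hypothetical module: one trainable vector per absolute position."""

    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        seq_len = x.shape[1]
        if seq_len > self.pos_emb.num_embeddings:
            # No row exists for positions past max_len, so fail loudly.
            raise ValueError("sequence is longer than the trained max_len")
        positions = torch.arange(seq_len, device=x.device)
        return x + self.pos_emb(positions)      # broadcasts over the batch


layer = LearnedPositionalEmbedding(max_len=16, d_model=32)
out = layer(torch.randn(2, 10, 32))             # fine: 10 <= 16

try:
    layer(torch.randn(2, 17, 32))               # 17 > 16: no embedding to look up
    too_long_failed = False
except ValueError:
    too_long_failed = True
```

The table has exactly `max_len` rows, which is why this family cannot represent positions beyond the length it was trained with.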

2. Intuition + Visual

Sinusoidal encoding gives each position a unique "fingerprint": a vector of sine and cosine values at many different frequencies. Low-frequency dimensions change slowly across positions (coarse location); high-frequency dimensions change fast (fine location). Because the fingerprint is built from sinusoids, the model can also represent relative offsets — the encoding of position pos + k is a fixed linear function of the encoding at pos.

flowchart LR
    T["Token embedding"] --> A["add"]
    P["Positional encoding (per position)"] --> A
    A --> O["Position-aware input to attention"]

You simply add the positional vector to the token embedding before the first attention layer, so every token carries both "what I am" and "where I am."

3. The Math

The original sinusoidal encoding, for position $pos$ and dimension index $i$ in a model of width $d$:

$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

Each pair of dimensions $(2i, 2i+1)$ is a sinusoid with wavelength $2\pi \cdot 10000^{2i/d}$, increasing geometrically from $2\pi$ toward $10000 \cdot 2\pi$. Two useful properties: it's defined for any position (so encodings exist for sequences longer than those seen in training, although the model's quality at unseen lengths is not guaranteed), and for a fixed offset $k$, $PE(pos+k)$ is a linear transform of $PE(pos)$, letting attention learn to attend by relative distance.
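The linear-transform property follows from the angle-addition identities: within each sin/cos pair, shifting by $k$ is a 2×2 rotation whose angle depends only on $k$ and that pair's frequency. The sketch below checks this numerically; the helper names `sinusoidal_pe` and `offset_matrix` are made up for this example.

```python
import math

import torch


def sinusoidal_pe(seq_len: int, d_model: int):
    """Return the PE table and the per-pair frequencies 1 / 10000^(2i/d)."""
    pos = torch.arange(seq_len, dtype=torch.float64).unsqueeze(1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float64)
    freq = torch.exp(-math.log(10000.0) * two_i / d_model)
    pe = torch.zeros(seq_len, d_model, dtype=torch.float64)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe, freq


def offset_matrix(k: int, freq: torch.Tensor) -> torch.Tensor:
    """Block-diagonal matrix M_k with PE(pos + k) = M_k @ PE(pos)."""
    d_model = 2 * freq.numel()
    m = torch.zeros(d_model, d_model, dtype=torch.float64)
    for j, w in enumerate(freq.tolist()):
        c, s = math.cos(w * k), math.sin(w * k)
        # sin(a + b) =  sin(a)cos(b) + cos(a)sin(b)
        # cos(a + b) = -sin(a)sin(b) + cos(a)cos(b)
        m[2 * j:2 * j + 2, 2 * j:2 * j + 2] = torch.tensor(
            [[c, s], [-s, c]], dtype=torch.float64
        )
    return m


pe, freq = sinusoidal_pe(seq_len=64, d_model=16)
k = 7
m_k = offset_matrix(k, freq)

# Apply the same M_k to every position; it never looks at pos itself.
shifted = pe[:-k] @ m_k.T
matches = torch.allclose(shifted, pe[k:])
```

The matrix is built from `k` and the frequencies alone, so a single linear map moves every position forward by the same offset.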

RoPE takes a different route: instead of adding to embeddings, it rotates the query and key vectors by an angle proportional to their position. The dot product $q_m \cdot k_n$ then depends on position only through the relative offset $m - n$, baking relative position directly into attention, which tends to generalize better to long contexts.
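A minimal sketch of the rotation idea, assuming each adjacent (even, odd) feature pair is rotated together and that the frequencies reuse the base of 10000 from the sinusoidal formula. Real implementations differ in how they pair dimensions and cache the angles, so treat the `rope` helper below as illustrative only.

```python
import math

import torch


def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each (even, odd) feature pair of x by position * frequency."""
    d = x.shape[-1]
    freq = torch.exp(-math.log(base) * torch.arange(0, d, 2, dtype=x.dtype) / d)
    angles = positions.to(x.dtype).unsqueeze(-1) * freq      # (seq, d/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin              # 2D rotation, first coord
    out[..., 1::2] = x_even * sin + x_odd * cos              # 2D rotation, second coord
    return out


torch.manual_seed(0)
q = torch.randn(1, 16, dtype=torch.float64)                  # one query vector
k = torch.randn(1, 16, dtype=torch.float64)                  # one key vector


def score(m: int, n: int) -> torch.Tensor:
    """Attention logit for the query placed at position m and the key at position n."""
    q_m = rope(q, torch.tensor([m]))
    k_n = rope(k, torch.tensor([n]))
    return (q_m * k_n).sum()


# Same offset (m - n = 3) at very different absolute positions gives the same logit.
same_offset = torch.allclose(score(5, 2), score(105, 102))
```

Because rotations compose, the query rotation and key rotation combine into a single rotation by the offset, and the absolute positions cancel out of the dot product.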

4. Implementation
```python
import math

import torch
from torch import Tensor


def sinusoidal_encoding(seq_len: int, d_model: int) -> Tensor:
    """Return a (seq_len, d_model) table of sinusoidal positional encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so sin/cos dimensions pair up")
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even indices 2i
    div = torch.exp(i * (-math.log(10000.0) / d_model))             # 1 / 10000^(2i/d)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                       # even dims -> sin
    pe[:, 1::2] = torch.cos(pos * div)                       # odd dims  -> cos
    return pe


pe = sinusoidal_encoding(seq_len=50, d_model=128)
tokens = torch.randn(50, 128)                                # stand-in for token embeddings
position_aware = tokens + pe                                 # add, don't concat
assert pe.shape == (50, 128)
assert position_aware.shape == tokens.shape
```
5. Interview Questions
  1. Conceptual: Why does a Transformer need positional encoding at all? (Self-attention is permutation-equivariant: permuting the inputs just permutes the outputs, so without position information it can't distinguish token order.)
  2. Implementation: Do you add or concatenate positional encodings, and why add? (Add — it keeps dimensionality fixed and lets each dimension carry both content and position; concatenation wastes width.)
  3. Applied: Why is sinusoidal encoding able to handle sequences longer than those seen in training? (It's a fixed function defined for any position, so it produces valid encodings beyond the training length; whether the model uses those unseen positions well is a separate question and is not guaranteed.)
  4. Systems-level: What problem does RoPE solve over absolute learned positions? (It encodes relative position directly in the attention dot product, generalizing better to long contexts than fixed absolute embeddings.)
  5. Failure modes: What's the limitation of learned absolute positional embeddings? (They're only defined up to the trained max length and don't extrapolate — sequences longer than training have no embedding.)
6. Retrieval Check

From memory: write the sinusoidal PE formula, explain why you add rather than concatenate, and state one advantage of sinusoidal/RoPE over learned absolute positions. Check against Stage 3.
