Embeddings

How discrete tokens become dense vectors that capture meaning — static (word2vec) vs. contextual embeddings, cosine similarity, and negative sampling — with code.

7 min read · Reviewed May 2026

1. Big Picture

An embedding maps a discrete symbol — a token, a word, a user, an item — to a dense vector of real numbers, where geometric closeness encodes semantic similarity. Instead of one-hot vectors (huge, sparse, and equidistant), embeddings put related things near each other in a continuous space the model can do math on.
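
To make the contrast with one-hot vectors concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the 2-d dense vectors are hand-picked for illustration and are not learned; `cosine` is a small helper defined here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vocab = ["cat", "dog", "car"]  # hypothetical 3-word vocabulary

# One-hot: one dimension per word, with a single 1 in that word's slot.
one_hot = {
    w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, w in enumerate(vocab)
}

# Dense: hand-picked 2-d vectors (illustrative assumption, not trained).
dense = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}

# Any two distinct one-hot vectors are orthogonal, so every pair looks equally unrelated.
one_hot_cat_dog = cosine(one_hot["cat"], one_hot["dog"])
one_hot_cat_car = cosine(one_hot["cat"], one_hot["car"])

# Dense vectors can place "cat" closer to "dog" than to "car".
dense_cat_dog = cosine(dense["cat"], dense["dog"])
dense_cat_car = cosine(dense["cat"], dense["car"])
```

The one-hot space cannot express that "cat" is more like "dog" than like "car"; the dense space can, and it does so in 2 dimensions instead of one per word.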

Two flavors matter for interviews. Static embeddings (word2vec, GloVe) assign one fixed vector per word, learned so that words appearing in similar contexts land nearby — but "bank" has a single vector regardless of meaning. Contextual embeddings (the hidden states inside a Transformer) give a word a different vector depending on its sentence, resolving polysemy. The frame to hold: an embedding layer is just a lookup table of learned vectors, trained so that similarity in vector space matches similarity in meaning — and that same idea powers semantic search, recommendations, and RAG.
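
The static-versus-contextual distinction can be sketched with a toy example. The `static` table and its values are hypothetical, and `toy_contextual_embed` is a deliberately crude stand-in (it mixes each word with its sentence mean). It is not a Transformer and not attention; it only shows that a context-dependent function can give "bank" two different vectors.

```python
# Hypothetical 2-d static table; values are hand-picked for illustration.
static = {
    "river": [0.0, 1.0],
    "bank":  [0.5, 0.5],
    "money": [1.0, 0.0],
}

def static_embed(sentence):
    """Static lookup: each word always gets its one fixed vector."""
    return [static[w] for w in sentence]

def toy_contextual_embed(sentence):
    """Crude stand-in for a contextual encoder (NOT a Transformer):
    average each word's static vector with the mean vector of its sentence."""
    vecs = static_embed(sentence)
    dim = len(vecs[0])
    mean = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return [[(v[i] + mean[i]) / 2 for i in range(dim)] for v in vecs]

s1 = ["river", "bank"]
s2 = ["money", "bank"]

# Static: "bank" is identical in both sentences.
static_bank_1 = static_embed(s1)[1]
static_bank_2 = static_embed(s2)[1]

# Toy contextual: "bank" shifts toward its neighbors.
ctx_bank_1 = toy_contextual_embed(s1)[1]
ctx_bank_2 = toy_contextual_embed(s2)[1]
```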

2. Intuition + Visual

Picture words as points in space. Training pulls words that share contexts together and pushes unrelated ones apart, until directions in the space become meaningful — the famous king − man + woman ≈ queen. Similarity is read off with a dot product or cosine.
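
The analogy arithmetic can be illustrated with a hand-built toy space. The vectors below are assumptions chosen so that one axis roughly means "royalty" and the other "gender"; real learned embeddings only satisfy such analogies approximately. `analogy` is a hypothetical helper, and excluding the query words from the candidates follows common practice.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hand-built 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender". Not learned.
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}

def analogy(a, b, c):
    """Return the word closest (by cosine) to vec(a) - vec(b) + vec(c),
    excluding the three query words themselves."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman")
```

In this constructed space the result is "queen" by design; the point is the mechanics (vector arithmetic followed by a nearest-neighbor lookup), not evidence that the analogy holds.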

```mermaid
flowchart LR
    ID["Token id (integer)"] --> E["Embedding table lookup (V × d)"]
    E --> V["Dense vector (d-dim)"]
    V --> S["Similarity = cosine / dot product"]
```

In a Transformer, this lookup is the very first layer; the vectors are then refined by attention so the same token ends up with a context-dependent representation deeper in the network.

3. The Math

An embedding layer is a matrix $E \in \mathbb{R}^{V \times d}$ ($V$ = vocabulary size, $d$ = embedding dimension). Token id $t$ maps to row $E_t$: a lookup, equivalent to multiplying a one-hot vector by $E$.
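
The equivalence between a row lookup and a one-hot multiplication can be checked directly. This is a minimal sketch in plain Python; the 4 × 3 table `E` and the helper names are hypothetical.

```python
# Hypothetical tiny embedding table: V = 4 tokens, d = 3 dimensions.
E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
V, d = len(E), len(E[0])

def lookup(t):
    """Row lookup: O(d) work, no multiplication."""
    return E[t]

def one_hot_matmul(t):
    """Same result computed as one_hot(t) @ E: O(V * d) work."""
    one_hot = [1.0 if i == t else 0.0 for i in range(V)]
    return [sum(one_hot[i] * E[i][j] for i in range(V)) for j in range(d)]

t = 2
by_lookup = lookup(t)
by_matmul = one_hot_matmul(t)
```

Both paths return row 2 of `E`. The lookup is simply the cheap way to compute the product, which is why frameworks implement embedding layers as indexing instead of as a dense matrix multiply.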

word2vec (skip-gram) learns $E$ by predicting context words from a center word. For center word $c$ and context word $o$, it maximizes:

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$

where $v_c$ is the center embedding and $u_o$ the context embedding. The full softmax over all $V$ vocabulary words is expensive, so practical training uses negative sampling: distinguish the true context word from a few random "negative" words instead of normalizing over the whole vocabulary.
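
A small numeric sketch may help compare the two objectives. All vectors below are hypothetical toy values. The negative-sampling loss uses the commonly cited form from the word2vec literature, $-\log \sigma(u_o^\top v_c) - \sum_k \log \sigma(-u_k^\top v_c)$, which the text above describes only informally. The negatives are fixed here for reproducibility, whereas in practice they are drawn at random.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical toy parameters: 4-word vocabulary, 2-d vectors.
v_c = [1.0, 0.5]          # center-word vector v_c
U = [                     # context vectors u_w, one row per vocabulary word
    [0.9, 0.4],
    [-0.5, 0.2],
    [0.1, -0.8],
    [-0.3, -0.3],
]
o = 0                     # index of the true context word

# Full softmax: the denominator touches every row of U, i.e. O(V) per example.
scores = [dot(u_w, v_c) for u_w in U]
z = sum(math.exp(s) for s in scores)
probs = [math.exp(s) / z for s in scores]
p_o_given_c = probs[o]
softmax_loss = -math.log(p_o_given_c)

# Negative sampling: score only the true pair plus k sampled negatives, O(k).
negatives = [2, 3]        # fixed for this sketch; normally sampled at random
neg_sampling_loss = -math.log(sigmoid(scores[o])) - sum(
    math.log(sigmoid(-scores[k])) for k in negatives
)
```

The softmax loss needs all four scores, while the negative-sampling loss needs only three of them here. With a vocabulary of hundreds of thousands of words and a handful of negatives, that gap is what makes training practical.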

Similarity between two embeddings uses cosine, which ignores magnitude and compares direction:

$$\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert\, \lVert b \rVert}$$
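
As a quick worked check of the formula, the sketch below computes cosine similarity by hand in plain Python and shows that rescaling a vector changes the dot product but not the cosine. The example vectors are arbitrary.

```python
import math

def cosine(a, b):
    """cos(a, b) = (a · b) / (‖a‖ ‖b‖)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [3.0, 4.0]                      # ‖a‖ = 5
b = [4.0, 3.0]                      # ‖b‖ = 5
a_scaled = [10.0 * x for x in a]    # same direction as a, 10x the length

cos_ab = cosine(a, b)               # (12 + 12) / (5 * 5)
cos_scaled = cosine(a_scaled, b)    # unchanged by the rescaling

dot_ab = sum(x * y for x, y in zip(a, b))
dot_scaled = sum(x * y for x, y in zip(a_scaled, b))   # 10x larger

cos_opposite = cosine(a, [-3.0, -4.0])    # opposite direction
cos_orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
```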

4. Implementation

```python
import torch
import torch.nn.functional as F
from torch import nn

vocab_size, dim = 10000, 64
embed = nn.Embedding(vocab_size, dim)        # the lookup table E (V × d), randomly initialized

ids = torch.tensor([42, 7, 1001])
vectors = embed(ids)                          # (3, 64): one row per id
assert vectors.shape == (3, 64)

# cosine similarity between two token vectors
# (reflects meaning only after E has been trained; here E is still random)
a, b = embed(torch.tensor(42)), embed(torch.tensor(7))
similarity = F.cosine_similarity(a, b, dim=0)  # scalar in [-1, 1]
```

5. Interview Questions

  1. Conceptual: What's the difference between static and contextual embeddings? (Static, e.g. word2vec, gives one fixed vector per word; contextual, e.g. Transformer hidden states, gives a different vector per occurrence based on surrounding context.)
  2. Implementation: Why use cosine similarity rather than raw dot product or Euclidean distance? (Cosine compares direction independent of magnitude, which suits embeddings where length can vary with token frequency.)
  3. Applied: Why does word2vec use negative sampling? (The full softmax over the vocabulary is too expensive; negative sampling replaces it with a cheaper binary task that separates the true context word from a few random negatives.)
  4. Systems-level: How do embeddings enable semantic search and RAG? (Embed documents and the query into the same space; retrieve by nearest-neighbor similarity rather than keyword match.)
  5. Failure modes: What does a single static embedding fail to capture, with an example? (Polysemy — "bank" (river) vs. "bank" (money) share one vector; contextual embeddings fix this.)
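
The retrieval idea in question 4 can be sketched as a brute-force nearest-neighbor search. The document titles, the 3-d vectors, and the query vector are all hypothetical stand-ins. A real system would obtain them from an embedding model and would typically use an approximate nearest-neighbor index in place of a full sort.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical pre-computed document embeddings (hand-picked, not from a model).
doc_vectors = {
    "How to open a savings account": [0.9, 0.1, 0.0],
    "Hiking trails along the river": [0.1, 0.9, 0.1],
    "Interest rates on loans":       [0.8, 0.0, 0.3],
}

def search(query_vector, k=2):
    """Return the k documents whose vectors are most similar to the query."""
    ranked = sorted(
        doc_vectors,
        key=lambda doc: cosine(query_vector, doc_vectors[doc]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embedding of a finance-related query, in the same space as the documents.
query = [1.0, 0.0, 0.1]
top = search(query)
```

The two finance documents rank above the hiking one even though no keyword matching takes place; only the geometry of the vectors decides the ranking.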

6. Retrieval Check

From memory: define an embedding layer and its shape, write the cosine-similarity formula, and explain static vs. contextual embeddings with one example. Check against Stage 3.
