What is the difference between static and contextual embeddings?

Static embeddings (word2vec, GloVe) give one fixed vector per word; contextual embeddings (Transformer hidden states) give a different vector per occurrence based on surrounding context, resolving polysemy.

Why use cosine similarity for embeddings?

Cosine compares direction independent of magnitude, which suits embeddings where vector length can vary with token frequency.

Why does word2vec use negative sampling?

The full softmax over the vocabulary is too expensive to compute at every step; negative sampling replaces it with a cheaper binary objective that separates the true context word from a few random negatives.

How do embeddings power semantic search and RAG?

Documents and queries are embedded into the same vector space, so relevant results are found by nearest-neighbor similarity rather than exact keyword match.

Embeddings Explained: word2vec & Contextual (ML Interview)

How discrete tokens become dense vectors that capture meaning — static (word2vec) vs. contextual embeddings, cosine similarity, and negative sampling — with code.

1. Big Picture

An embedding maps a discrete symbol — a token, a word, a user, an item — to a dense vector of real numbers, where geometric closeness encodes semantic similarity. Instead of one-hot vectors (huge, sparse, and equidistant), embeddings put related things near each other in a continuous space the model can do math on.
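To make the "equidistant" point concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the 2-d dense vectors are hand-picked for illustration (they are assumptions, not learned values); the only claim is that one-hot distances are all identical while dense distances can differ.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

vocab = ["cat", "dog", "car"]  # hypothetical toy vocabulary
hot = {w: one_hot(i, len(vocab)) for i, w in enumerate(vocab)}

# Any two distinct one-hot vectors are sqrt(2) apart, so the encoding
# carries no information about which words are related.
d_cat_dog = euclidean(hot["cat"], hot["dog"])
d_cat_car = euclidean(hot["cat"], hot["car"])

# Hand-picked dense vectors (illustrative, not trained).
dense = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}
dense_cat_dog = euclidean(dense["cat"], dense["dog"])
dense_cat_car = euclidean(dense["cat"], dense["car"])
```

With these toy values, "cat" sits much closer to "dog" than to "car" in the dense space, while the one-hot distances cannot tell the two pairs apart.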

Two flavors matter for interviews. Static embeddings (word2vec, GloVe) assign one fixed vector per word, learned so that words appearing in similar contexts land nearby — but "bank" has a single vector regardless of meaning. Contextual embeddings (the hidden states inside a Transformer) give a word a different vector depending on its sentence, resolving polysemy. The frame to hold: an embedding layer is just a lookup table of learned vectors, trained so that similarity in vector space matches similarity in meaning — and that same idea powers semantic search, recommendations, and RAG.

2. Intuition + Visual

Picture words as points in space. Training pulls words that share contexts together and pushes unrelated ones apart, until directions in the space become meaningful — the famous king − man + woman ≈ queen. Similarity is read off with a dot product or cosine.
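The analogy arithmetic can be sketched with a handful of hand-built vectors. Everything below is an assumption for illustration: the two axes are chosen to stand loosely for "royalty" and "gender", and the `analogy` helper is hypothetical. Trained embeddings only approximate this behavior.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-built 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0, 1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.9, 0.8],
    "apple":  [-0.5, 0.1],
}

def analogy(a, b, c):
    """Word closest to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman")
```

Excluding the input words from the candidates is the usual convention for this kind of query; without it, the nearest neighbor is often one of the inputs.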

In a Transformer, this lookup is the very first layer; the vectors are then refined by attention so the same token ends up with a context-dependent representation deeper in the network.
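A heavily simplified sketch of that refinement step: one attention pass with no learned weights (queries, keys, and values are all the static vectors themselves). A real Transformer adds learned projections, multiple heads, positional information, and many layers. The `static` table and its values are assumptions made up for this example.

```python
import math

# Hypothetical static table: one hand-picked 2-d vector per token.
static = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank":  [0.5, 0.5],
}

def self_attention(tokens):
    """One attention pass with Q = K = V = the static vectors."""
    x = [static[t] for t in tokens]
    outputs = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) for k in x]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Each output is a weighted average of every vector in the sentence.
        outputs.append([sum(w * v[i] for w, v in zip(weights, x))
                        for i in range(len(q))])
    return outputs

# Same token, same static vector, two different contexts.
bank_near_river = self_attention(["river", "bank"])[1]
bank_near_money = self_attention(["money", "bank"])[1]
```

The static vector for "bank" is identical in both sentences, but the attended output leans toward whichever neighbor it appears with.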

3. The Math

An embedding layer is a matrix $E \in \mathbb{R}^{V \times d}$ ($V$ = vocabulary size, $d$ = embedding dimension). Token id $t$ maps to row $E_t$: a lookup, equivalent to multiplying a one-hot vector by $E$.
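The lookup/one-hot equivalence can be checked directly. This is a small sketch with a made-up $4 \times 3$ matrix; the helper names `lookup` and `one_hot_matmul` are hypothetical.

```python
V, d = 4, 3  # toy vocabulary size and embedding dimension

# Hypothetical embedding matrix E with V rows and d columns.
E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def lookup(t):
    """Row t of E: what an embedding layer actually does."""
    return E[t]

def one_hot_matmul(t):
    """The same result computed as (one-hot row vector) x E."""
    one_hot = [1.0 if i == t else 0.0 for i in range(V)]
    return [sum(one_hot[i] * E[i][j] for i in range(V)) for j in range(d)]

t = 2
row_by_lookup = lookup(t)
row_by_matmul = one_hot_matmul(t)
```

Both paths return the same row; the lookup simply skips the $V \cdot d$ multiplications by zero, which is why frameworks implement embeddings as indexing rather than as a matrix product.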

word2vec (skip-gram) learns $E$ by predicting context words from a center word. For center word $c$ and context word $o$, it maximizes the log of:

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}$$

where $v_c$ is the center embedding and $u_o$ the context embedding. The full softmax over $V$ is expensive, so practical training uses negative sampling — distinguish the true context word from a few random "negative" words instead of normalizing over the whole vocabulary.
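A minimal sketch contrasting the two objectives, assuming a 5-word toy vocabulary with made-up 2-d vectors. The negative-sampling loss is written in its common form, $-\log\sigma(u_o^\top v_c) - \sum_k \log\sigma(-u_k^\top v_c)$. The negative ids are passed in explicitly to keep the example deterministic, whereas word2vec samples them from a smoothed unigram distribution.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical toy parameters: 5 words, 2 dimensions.
v = [[0.5, 0.1], [0.2, 0.4], [-0.3, 0.8], [0.9, -0.2], [0.0, 0.6]]  # center vectors v_w
u = [[0.4, 0.0], [0.1, 0.3], [-0.5, 0.7], [0.8, -0.1], [0.2, 0.5]]  # context vectors u_w

def softmax_prob(o, c):
    """P(o | c) with the full softmax: touches every word in the vocabulary."""
    scores = [dot(u_w, v[c]) for u_w in u]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    return exps[o] / sum(exps)

def negative_sampling_loss(o, c, negatives):
    """Binary logistic loss over one true pair plus len(negatives) sampled pairs."""
    loss = -math.log(sigmoid(dot(u[o], v[c])))
    for k in negatives:
        loss -= math.log(sigmoid(-dot(u[k], v[c])))
    return loss

probs = [softmax_prob(o, c=0) for o in range(len(u))]
loss = negative_sampling_loss(o=3, c=0, negatives=[2, 4])
```

The softmax needs one dot product per vocabulary word for every training pair, while the negative-sampling loss needs only one per sampled word, and that difference is the whole motivation.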

Similarity between two embeddings uses cosine, which ignores magnitude and compares direction:

$$\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert\, \lVert b \rVert}$$
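The magnitude-invariance claim is easy to verify numerically. A small sketch with arbitrary example vectors: scaling one vector changes the dot product but leaves the cosine unchanged.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
b = [2.0, 1.0, 0.5]
scaled_a = [10.0 * x for x in a]  # same direction, ten times the length

raw_dot = sum(x * y for x, y in zip(a, b))
scaled_dot = sum(x * y for x, y in zip(scaled_a, b))

cos_ab = cosine(a, b)
cos_scaled = cosine(scaled_a, b)
```

Here the dot product grows tenfold after scaling while the cosine stays the same. This function assumes both inputs are non-zero; a production version would guard against a zero norm.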

4. Implementation

```python
import torch
import torch.nn.functional as F
from torch import nn

vocab_size, dim = 10000, 64
embed = nn.Embedding(vocab_size, dim)        # the lookup table E (V × d)

ids = torch.tensor([42, 7, 1001])
vectors = embed(ids)                          # (3, 64): one row per id
assert vectors.shape == (3, 64)

# cosine similarity between two tokens; the weights are randomly initialized
# here, so the value only becomes meaningful after training
a, b = embed(torch.tensor(42)), embed(torch.tensor(7))
similarity = F.cosine_similarity(a, b, dim=0)  # scalar in [-1, 1]
```

5. Interview Questions

Conceptual: What's the difference between static and contextual embeddings? (Static, e.g. word2vec, gives one fixed vector per word; contextual, e.g. Transformer hidden states, gives a different vector per occurrence based on surrounding context.)
Implementation: Why use cosine similarity rather than raw dot product or Euclidean distance? (Cosine compares direction independent of magnitude, which suits embeddings where length can vary with token frequency.)
Applied: Why does word2vec use negative sampling? (The full softmax over the vocabulary is too expensive; negative sampling replaces it with a cheaper binary objective that separates the true context word from a few random negatives.)
Systems-level: How do embeddings enable semantic search and RAG? (Embed documents and the query into the same space; retrieve by nearest-neighbor similarity rather than keyword match.)
Failure modes: What does a single static embedding fail to capture, with an example? (Polysemy — "bank" (river) vs. "bank" (money) share one vector; contextual embeddings fix this.)
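
For the systems-level question, the retrieval step can be sketched as a brute-force nearest-neighbor search. The document titles, their 3-d vectors, and the query vector are all made up for illustration; in practice an embedding model produces the vectors and an approximate nearest-neighbor index typically replaces the full sort.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical precomputed document embeddings (hand-picked, not from a model).
doc_vectors = {
    "How to open a savings account": [0.9, 0.1, 0.0],
    "Fishing spots along the river": [0.1, 0.9, 0.1],
    "Training a neural network":     [0.0, 0.1, 0.9],
}

def search(query_vector, k=2):
    """Rank documents by cosine similarity to the query, highest first."""
    ranked = sorted(
        doc_vectors,
        key=lambda doc: cosine(doc_vectors[doc], query_vector),
        reverse=True,
    )
    return ranked[:k]

# Stand-in for the embedding of a query such as "where can I deposit money".
query_vector = [0.8, 0.2, 0.1]
top = search(query_vector, k=1)
```

The query shares no keywords with the top document; it matches only because the two vectors point in a similar direction, which is the behavior that RAG pipelines rely on when selecting context passages.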

6. Retrieval Check

From memory: define an embedding layer and its shape, write the cosine-similarity formula, and explain static vs. contextual embeddings with one example. Check against Stage 3.
