4. How it works — training word vectors with skip-gram
We’ll walk through the skip-gram training process end to end. If you understand this one flow, you understand Word2Vec.
Setup
You have a corpus — say, 100 million sentences of news and Wikipedia. You decide on:
- Vocabulary size V. Typically the top 100,000 or 1,000,000 most
common words. Everything rarer is replaced by a special
<UNK>token. - Embedding size d. Typically 100 to 300.
- Window size c. Typically 5. This means “neighbour” = within 5 words on each side.
You initialise two matrices randomly:
- W: shape (V, d). This will hold the final word embeddings. Each
row
W[i]is the embedding for word i. - W’: shape (V, d). This will hold “context vectors” used during
training. Each row
W'[j]is used when word j is acting as a predicted neighbour.
Yes, we need two sets of vectors per word — the “input” and “output” versions. Only W survives training.
Step 1 — scan the corpus into (target, neighbour) pairs
Take the sentence:
The students drink chai in the morning
With window size 2, around each word, we generate training pairs where the first element is the target and the second is a neighbour:
Around “chai”:
(chai, students)
(chai, drink)
(chai, in)
(chai, the)
Around “drink”:
(drink, the)
(drink, students)
(drink, chai)
(drink, in)
…and so on. Across a billion sentences, this produces tens of billions of training pairs. Each pair is a single training example.
Step 2 — the naive forward pass
For one training pair — say (chai, drink) — here is what the network
computes:
- Look up the target word’s embedding. Find row
W[chai]. That’s a d-dimensional vector. Call it v_chai. - Compute scores for all vocabulary words. For every word j in
the vocabulary, compute the dot product
v_chai · W'[j]. That gives you a vector of V scores, one per vocabulary word. - Apply softmax to get probabilities. Softmax turns the raw scores
into a probability distribution over V words:
P(j | chai) = exp(v_chai · W'[j]) / Σₖ exp(v_chai · W'[k]) - Compute loss. The true neighbour was “drink”. The loss is the
negative log-probability the network assigned to “drink”:
loss = −log P(drink | chai)
Gradient descent tweaks W and W’ to push up the probability of “drink” given “chai” — and implicitly pushes down the probability of every other word, because softmax normalises.
The fatal flaw in the naive version
Look at step 3 again. To compute softmax, we need to sum exp(v_chai · W'[k]) over every word in the vocabulary. If V = 1,000,000, that’s
a million dot products per training example.
Multiply by 10 billion training examples and you get 10 quadrillion dot products just for the forward pass. That is why, in 2013, Bengio- style neural language models trained for days.
We need a way to avoid the sum over the entire vocabulary. That’s what makes Word2Vec practical.
Step 2′ — negative sampling (the trick that made it fast)
Here is the clever twist from the second 2013 Word2Vec paper (Mikolov et al., Distributed Representations…):
Instead of asking “what’s the probability of ‘drink’ out of all V words?”, ask only two simpler yes/no questions: “Is ‘drink’ a real neighbour of ‘chai’? (yes)” and “Is ‘submarine’ a real neighbour of ‘chai’? (no)”.
Formally, for each real training pair (target, neighbour), we also
sample k negative examples — random words from the vocabulary,
chosen from a modified unigram distribution. For k = 5, one training
example becomes:
Positive: (chai, drink) → should predict yes
Negatives: (chai, wrench) → should predict no
(chai, marathon) → should predict no
(chai, volatile) → should predict no
(chai, submarine) → should predict no
(chai, glockenspiel) → should predict no
Now each training example is just 6 binary classification problems instead of a full softmax over 1,000,000 words. Sparse, fast, parallel.
For each binary classification, the loss uses the sigmoid function (σ):
real-pair loss = − log σ(v_chai · W'[drink])
negative-pair loss = − log σ(−v_chai · W'[submarine])
( the minus sign flips the sigmoid )
Summed across all 6 pairs, you get the total loss for this training
example. Gradient descent updates v_chai, W'[drink], and
W'[submarine] etc. so that real neighbours get pushed closer (high
dot product) and negative samples get pushed further (low dot product).
Scaled across a billion sentences, this produces the famous Word2Vec vectors.
Step 3 — subsampling of frequent words
One more engineering detail. Words like “the”, “of”, “and” appear so often that they swamp the training signal without carrying much meaning. The paper’s fix: subsample frequent words. Before generating pairs from a sentence, each word is kept with probability:
P(keep w) = min(1, √(threshold / frequency(w)))
With a typical threshold of 10⁻⁵, this drops most instances of very frequent words. The remaining sentences are denser with meaningful words, and training converges faster with better vectors.
Putting it together
The full skip-gram-with-negative-sampling pipeline is:
- Build the vocabulary from a large corpus.
- Compute each word’s frequency.
- For each sentence:
- Subsample to drop very frequent words.
- For each remaining word, pick a random window size 1 to max (this de-emphasises distant neighbours).
- For each (target, neighbour) pair:
- Sample k negative pairs.
- Compute the positive + k negative binary losses.
- Update W[target], W’[neighbour], and W’[negatives] via SGD.
- After many epochs, return W. Throw away W’ (it’s scratch space).
That’s the whole algorithm. The mathematical heart — including the objective function and how the analogy trick works geometrically — is in the math section.
An intuition for why skip-gram works
Skip-gram teaches each word’s vector to “recognise” the vectors of real neighbours via the dot product. Two consequences:
- Words with similar neighbours end up with similar vectors. If “chai” and “coffee” both need to recognise “milk”, “sugar”, “sip” as real neighbours, their vectors must both have large dot product with those context vectors. The two vectors end up pointing in nearly the same direction.
- The space organises itself by co-occurrence statistics. The direction of “king” minus “man” ends up encoding “royalty, not maleness” because those are the contexts that distinguish “king” from “man” statistically.
An analogy can be found in ration-shop language: in the ration-shop example, each customer’s purchase vector says a lot about who they are. Two customers with similar purchases are probably similar people. Skip-gram does the same for words — a word’s “purchases” are the neighbours it appears with.
Time and space cost
On Mikolov’s 2013 hardware:
- Corpus: ~6 billion words from Google News.
- Vocabulary: 1 million words.
- Embedding dim: 300.
- Negative samples per positive: 5.
- Training time: ~24 hours on one multi-core machine.
- Final matrix size: 1M × 300 = 300M parameters.
A modern laptop can train a 100,000-word Word2Vec model in under an hour. Which is why, even today, Word2Vec remains a common starting point for small-scale NLP projects, educational demos, and situations where you cannot afford a GPU.
Next: the math — the loss function and the analogy trick, with real numbers.