3. The core idea — meaning as the by-product of a fake task
Here is the whole Word2Vec idea in three sentences:
- Set up a very small neural network to do a dummy task: given a word, predict its neighbours in a sentence.
- Train it on a huge corpus: billions of words of ordinary running text.
- Throw away the network. Keep the hidden layer’s weights. Those weights, one row per word, are your word vectors.
That’s it. Read those three sentences again. The counter-intuitive step is the third one: the thing we trained was not the thing we want. We wanted word vectors. We trained a predictor. The predictor’s internal weights, the part normally ignored once a model works, are the actual prize.
An Indian-life analogy — learning who’s who at a wedding
Imagine you are at a big Indian wedding — 400 guests, five days of functions. You don’t know most people. But over five days you watch carefully who stands with whom, who sits at the same table, who shares chai, who dances together, who argues about cricket scores.
By the last day, without being told anything explicitly, you have a mental map:
- The groom’s college friends — one tight cluster. They stand together, joke together, call each other by old nicknames.
- The bride’s extended family aunties — another cluster.
- The “business contacts” uncle group — cluster three.
You’ve never been handed a membership list. You learned the social structure purely by watching who stood near whom. Someone asks you, “How is Raju related to Priya?” — and you answer instantly: “Raju is in the groom’s college-friends cluster, Priya is in the bride’s family cluster. They barely know each other.”
That is exactly what Word2Vec does for words. It watches who appears near whom across billions of sentences. Two words that consistently appear in similar company end up with similar vectors — they’re in the same “cluster” of meaning.
The two tiny architectures
The paper actually proposes two closely related training tasks. They’re variations on a theme.
Architecture A — CBOW (Continuous Bag of Words)
Task: Given the surrounding words, predict the word in the middle.
Example sentence: The students drink chai in the morning.
Pick “chai” as the target. Take a window of 2 words on each side: students, drink (left) and in, the (right). This is the context.
The network is asked: “Given this context, what word goes in the middle?”
If the network learns to answer “chai”, it must have internally learned that “students”, “drink”, “in”, “the” together point toward “chai” more than they point toward “submarine”.
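To make the CBOW task concrete, here is a minimal sketch of how (context, target) training examples could be cut out of the example sentence. The function name `cbow_pairs` and the lowercased, whitespace-split tokens are assumptions for illustration, not something the paper prescribes.

```python
def cbow_pairs(tokens, window=2):
    """Return one (context_words, target_word) pair per position in the sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        pairs.append((left + right, target))
    return pairs

sentence = "the students drink chai in the morning".split()
pairs = cbow_pairs(sentence, window=2)

context, target = pairs[3]                       # position 3 is "chai"
print(context, "->", target)                     # ['students', 'drink', 'in', 'the'] -> chai
```

Words near the start or end of the sentence simply get a shorter context, which is one common way to handle the edges.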
Architecture B — Skip-gram
Task: Given the middle word, predict each of the surrounding words.
Same example: the target is “chai”. The network is given “chai” and asked to predict each of the four context words individually: “students”, “drink”, “in”, “the”.
So skip-gram flips the direction. Instead of “context predicts word”, it’s “word predicts context”.
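The flipped direction is easy to see in code. This sketch (again with a hypothetical helper name, `skipgram_pairs`, and simple whitespace tokens) emits one (centre, neighbour) pair for every word inside the window, so a single occurrence of “chai” produces four separate training examples.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre_word, context_word) pairs for every position in the sentence."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                           # a word is not its own neighbour
                pairs.append((centre, tokens[j]))
    return pairs

sentence = "the students drink chai in the morning".split()
pairs = skipgram_pairs(sentence, window=2)

chai_pairs = [p for p in pairs if p[0] == "chai"]
print(chai_pairs)
# [('chai', 'students'), ('chai', 'drink'), ('chai', 'in'), ('chai', 'the')]
print(len(pairs))                                # 22 pairs from a 7-word sentence
```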
Which is better?
Both work. In the paper:
- CBOW is faster to train: it averages the context vectors and makes a single prediction per position.
- Skip-gram is slower but works better on rare words, because each occurrence of a word produces several training examples (one for each of its context words), so even a rare word gets a fair number of updates.
In practice, skip-gram with negative sampling is the workhorse. When people say “Word2Vec”, they usually mean skip-gram. We’ll focus on it from Section 4 onward.
What the network looks like (from 10,000 feet)
It is almost embarrassingly simple. Two layers. No nonlinearity in the middle. For skip-gram:
Input: one-hot vector for the target word (size V = vocab size)
│
▼
[ W ] ← embedding matrix, shape (V × d), d = 300 typically
│
▼
Hidden: the target word's d-dimensional embedding
│
▼
[ W' ] ← output matrix, shape (d × V)
│
▼
Output: softmax over the V scores, giving a probability for every word in the vocabulary
(how likely each one is to be a neighbour of the target)
- V is vocabulary size — often 100,000 to a million.
- d is embedding size — typically 100 to 300.
- W is a V × d matrix. Each row is the embedding for one word. Feeding in a one-hot vector for word i simply selects row i of W. Formally this is a matrix-vector multiplication (see the matrix multiplication tutorial), but in effect it is just a table lookup.
- W’ is a second weight matrix. After training, it is thrown away.
All the meaning we care about lives in W. That’s our word vector table. We train the whole thing, then keep only W.
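The forward pass described above fits in a few lines of plain Python. This is only a sketch under stated assumptions: the six-word vocabulary, d = 4, the random (untrained) weights, and the names `W_out`, `one_hot`, `vec_mat` and `softmax` are all made up for illustration. `W_out` plays the role of W’ in the diagram.

```python
import math
import random

vocab = ["the", "students", "drink", "chai", "in", "morning"]
V, d = len(vocab), 4                             # toy sizes; real models use far larger V and d
index = {w: i for i, w in enumerate(vocab)}

rng = random.Random(0)
W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]      # V x d embedding matrix
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(d)]  # d x V output matrix (W')

def one_hot(i, size):
    return [1.0 if k == i else 0.0 for k in range(size)]

def vec_mat(v, M):
    """Row vector v (length n) times matrix M (n x m), giving a vector of length m."""
    return [sum(v[r] * M[r][c] for r in range(len(M))) for c in range(len(M[0]))]

def softmax(scores):
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x = one_hot(index["chai"], V)
hidden = vec_mat(x, W)                           # same numbers as row index["chai"] of W
scores = vec_mat(hidden, W_out)                  # one raw score per vocabulary word
probs = softmax(scores)                          # probabilities that sum to 1

print(hidden == W[index["chai"]])
print(round(sum(probs), 6))
```

Because the weights are random, the probabilities here are meaningless; the point is only the shape of the computation: lookup, multiply, softmax.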
Why “predict the neighbours” captures meaning
The training signal works like this. Suppose we see “chai” appear next to “milk” a thousand times, and next to “sugar” another thousand times, and next to “submarine” never. Gradient descent will adjust the vector for “chai” so that it predicts “milk” and “sugar” well — and doesn’t waste probability on “submarine”.
Now separately, the word “coffee” also appears next to “milk” and “sugar” and never next to “submarine”. Gradient descent adjusts its vector the same way. Both “chai” and “coffee” get pushed toward configurations that produce similar neighbour-predictions, which means they get pushed toward similar vectors.
The upshot: chai and coffee end up with very similar vectors, not because anyone told the network they’re related, but because they predict the same neighbours.
Scale this up to billions of words of text and a vocabulary of millions, and a beautiful geometry emerges. Words cluster by topic. Near-synonyms end up very close together. Related words are close. Unrelated words are far. And, this is the magic, many analogical relationships show up as roughly constant directions in the vector space.
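The chai/coffee argument can be tried at toy scale. The sketch below trains skip-gram with a plain full softmax (not negative sampling) on six made-up (word, neighbour) pairs. The pairs, d = 3, the learning rate, the epoch count and the random seed are all arbitrary assumptions, and the exact numbers printed depend on them.

```python
import math
import random

pairs = [("chai", "milk"), ("chai", "sugar"),
         ("coffee", "milk"), ("coffee", "sugar"),
         ("submarine", "ocean"), ("submarine", "navy")]
vocab = sorted({w for pair in pairs for w in pair})
index = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 3

rng = random.Random(0)
W = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(V)]      # input vectors, V x d
# Output vectors, stored one row per word (the transpose of the d x V matrix W').
W_out = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(V)]

def predict(centre):
    """Softmax probability of every vocabulary word being a neighbour of `centre`."""
    h = W[index[centre]]
    scores = [sum(h[k] * W_out[o][k] for k in range(d)) for o in range(V)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def average_loss():
    return -sum(math.log(predict(c)[index[o]]) for c, o in pairs) / len(pairs)

def sgd_step(centre, context, lr=0.1):
    c, t = index[centre], index[context]
    probs = predict(centre)
    h = W[c][:]                                  # hidden vector before this update
    grad_h = [0.0] * d
    for o in range(V):
        err = probs[o] - (1.0 if o == t else 0.0)    # gradient of the loss w.r.t. score o
        for k in range(d):
            grad_h[k] += err * W_out[o][k]       # uses the old output weight
            W_out[o][k] -= lr * err * h[k]
    for k in range(d):
        W[c][k] -= lr * grad_h[k]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

loss_before = average_loss()
for epoch in range(500):
    for centre, context in pairs:
        sgd_step(centre, context)
loss_after = average_loss()

sim_chai_coffee = cosine(W[index["chai"]], W[index["coffee"]])
sim_chai_submarine = cosine(W[index["chai"]], W[index["submarine"]])
print(f"loss: {loss_before:.3f} -> {loss_after:.3f}")
print(f"chai vs coffee:    {sim_chai_coffee:.3f}")
print(f"chai vs submarine: {sim_chai_submarine:.3f}")
```

Nothing in the code says chai and coffee are related. They receive identical training signals, so their rows of W are pulled the same way, while “submarine” is pulled toward a different set of neighbours.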
The analogy magic, previewed
Here is the part people remember. If Word2Vec is trained well:
vec("king") − vec("man") + vec("woman") ≈ vec("queen")
The vector vec("king") − vec("man") captures something like the abstract concept of “royalty, with the maleness removed”. Adding vec("woman") puts a gender back in, this time female, landing you near “queen”.
Nobody designed this. It’s a side-effect of training on billions of sentences. The geometry just… works.
And it’s not just royalty. The same thing works with geography, pairing countries with their capitals:
vec("Delhi") − vec("India") + vec("Japan") ≈ vec("Tokyo")
Or gender:
vec("actor") − vec("man") + vec("woman") ≈ vec("actress")
Or verb tense:
vec("walked") − vec("walk") + vec("run") ≈ vec("ran")
We’ll verify these with real numbers in the math section and run the actual Python code in Section 6.
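Until then, here is a toy sketch of the arithmetic itself. The 2-d vectors are hand-made for illustration (axis 0 loosely “royalty”, axis 1 loosely “gender”); they are not trained Word2Vec output, and the `analogy` helper is a hypothetical name. Leaving the three query words out of the candidate list is a common convention when evaluating analogies.

```python
import math

# Hand-made toy vectors, NOT real embeddings.
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "chai":  [-0.7, 0.05],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c):
    """Return the word whose vector is closest (by cosine) to vec(a) - vec(b) + vec(c)."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]   # skip the query words themselves
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy("king", "man", "woman"))           # queen
```

With real embeddings the query vector rarely lands exactly on a word, which is why the answer is defined as the nearest neighbour rather than an exact match.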
What you should be holding in your head
- A tiny network with a single linear hidden layer, trained on “predict neighbours”, learns a surprisingly rich representation.
- The “representation” is just the weight matrix W. Each row is a word’s vector.
- Similar words end up with similar vectors, and analogies end up as directions in vector space.
- We haven’t explained how the network is made fast enough to train on a billion words — that’s negative sampling, covered in the next section.
Next: how it works, including the training trick that made it practical.