Efficient Estimation of Word Representations in Vector Space (Word2Vec) 2013

5. The math — loss, gradients, and the analogy magic with real numbers

This section has three parts:

  1. The loss function that Word2Vec optimises.
  2. The gradient update rules.
  3. A worked numeric example of the famous analogy king − man + woman ≈ queen, plus an Indian variant Delhi − India + Japan ≈ Tokyo.

Prerequisites to have fresh in mind: dot product, matrix multiplication, probability basics.

5.1 The full-softmax objective (the slow version)

For a single training example — target word w and neighbour word c — the naive skip-gram objective is to maximise:

P(c | w) = exp(v_w · v'_c) / Σ_{j=1}^V exp(v_w · v'_j)
  • v_w is the target’s embedding (row w of matrix W).
  • v'_c is the context vector for word c (row c of matrix W').
  • The sum in the denominator runs over the entire vocabulary — this is softmax.

Summed across all training pairs, the total log-likelihood is:

J = Σ_(w,c) log P(c | w)

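We want to maximise J or, equivalently, minimise −J. The math is clean but, as noted in Section 4, the softmax denominator needs one dot product and one exponential for every word in the vocabulary on every training pair, which makes it impractical for big vocabularies.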
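To make the cost of the denominator concrete, here is a minimal sketch of P(c | w) and J in plain Python. The matrices W and W_ctx (standing in for W'), the vocabulary size of 4, and the training pairs are all made-up values for illustration, not anything from the paper.

```python
import math

# Toy setup with made-up numbers: V = 4 words, 3-dimensional vectors.
W = [  # target embeddings, row w is v_w
    [0.2, -0.1, 0.4],
    [0.0, 0.3, -0.2],
    [0.5, 0.1, 0.1],
    [-0.3, 0.2, 0.0],
]
W_ctx = [  # context vectors (W' in the text), row c is v'_c
    [0.1, 0.0, 0.3],
    [-0.2, 0.4, 0.1],
    [0.3, 0.2, -0.1],
    [0.0, -0.1, 0.2],
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_prob(w, c):
    """P(c | w). Note the loop over every row of W_ctx: that is the O(V) cost."""
    scores = [dot(W[w], v_ctx) for v_ctx in W_ctx]
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    return exps[c] / sum(exps)

def log_likelihood(pairs):
    """J = sum over (w, c) pairs of log P(c | w)."""
    return sum(math.log(softmax_prob(w, c)) for w, c in pairs)

pairs = [(0, 2), (1, 3), (2, 0)]  # hypothetical (target, neighbour) index pairs
probs_w0 = [softmax_prob(0, c) for c in range(len(W_ctx))]
J = log_likelihood(pairs)
print(probs_w0, J)
```

With V = 4 the loop is trivial; with a vocabulary of a million words the same loop runs a million times per training pair.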

5.2 The negative-sampling objective (the fast version)

Instead of the full softmax, negative sampling turns each (w, c) pair into (1 + k) binary classifications. The objective for one training pair becomes:

J_{ns}(w, c) = log σ(v_w · v'_c)
              + Σ_{i=1}^k log σ(−v_w · v'_{n_i})

where:

  • σ(x) = 1 / (1 + e⁻ˣ) is the sigmoid.
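  • n_i is the i-th negative sample: a word drawn at random from the vocabulary, not one observed near w.
  • k is the number of negatives per positive (Mikolov et al. suggest 5 to 20 for small training sets and 2 to 5 for large ones).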

Read this aloud:

“Make σ(v_w · v'_c) close to 1 for real neighbours, and make σ(−v_w · v'_n) close to 1 for random negatives (which means the dot product v_w · v'_n is pushed toward negative values).”

We maximise J_ns (i.e., minimise −J_ns) by gradient ascent on each vector involved. There is no softmax and no sum over V: one training example now costs k + 1 dot products instead of V.
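As a rough sketch, the objective for one pair can be written directly from the formula. The function name ns_objective and the example vectors are illustrative assumptions; a real implementation would work on rows of W and W' and draw the negatives itself.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ns_objective(v_w, v_c, negatives):
    """J_ns(w, c): one positive term plus one term per negative sample."""
    positive = math.log(sigmoid(dot(v_w, v_c)))
    negative = sum(math.log(sigmoid(-dot(v_w, v_n))) for v_n in negatives)
    return positive + negative

# Made-up vectors: one target, one true neighbour, k = 2 negatives.
v_w = [0.5, 0.1, -0.2]
v_c = [0.4, 0.0, -0.1]
negatives = [[-0.3, 0.2, 0.1], [0.1, -0.4, 0.3]]

j = ns_objective(v_w, v_c, negatives)
print(j)  # always <= 0, because each term is the log of a value in (0, 1)
```

The work done is k + 1 = 3 dot products, regardless of how large the vocabulary is.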

5.3 The gradient update

Let’s compute the gradient of the positive term with respect to v_w:

∂/∂v_w log σ(v_w · v'_c)
= ( 1 − σ(v_w · v'_c) ) · v'_c

And with respect to v'_c:

∂/∂v'_c log σ(v_w · v'_c)
= ( 1 − σ(v_w · v'_c) ) · v_w

The factor (1 − σ(·)) measures how “wrong” the model currently is about this pair. When the dot product is already very large, σ is near 1 and the gradient is near 0, so almost no update happens. When the dot product is very negative, σ is near 0, the factor is near its maximum of 1, and the update is large.

For a negative sample n:

∂/∂v_w log σ(−v_w · v'_n) = −σ(v_w · v'_n) · v'_n

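The minus sign makes this a repulsive update: it pushes v_w away from v'_n, and the push is stronger the more confidently the model currently (and wrongly) scores the pair as real.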
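One way to convince yourself that these gradient formulas are right is a finite-difference check: nudge each coordinate of v_w slightly and see how the log-sigmoid term changes. The sketch below does that for the positive and the negative term; the helper names and the test vectors are made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def analytic_grad_pos(v_w, v_c):
    # d/dv_w log sigma(v_w . v'_c) = (1 - sigma(v_w . v'_c)) * v'_c
    g = 1.0 - sigmoid(dot(v_w, v_c))
    return [g * x for x in v_c]

def analytic_grad_neg(v_w, v_n):
    # d/dv_w log sigma(-v_w . v'_n) = -sigma(v_w . v'_n) * v'_n
    g = -sigmoid(dot(v_w, v_n))
    return [g * x for x in v_n]

def numeric_grad(f, v, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(v)):
        hi = list(v)
        lo = list(v)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

v_w = [0.5, 0.1, -0.2]
v_c = [0.4, 0.0, -0.1]
v_n = [-0.3, 0.2, 0.1]

num_pos = numeric_grad(lambda v: math.log(sigmoid(dot(v, v_c))), v_w)
num_neg = numeric_grad(lambda v: math.log(sigmoid(-dot(v, v_n))), v_w)

err_pos = max(abs(a - b) for a, b in zip(analytic_grad_pos(v_w, v_c), num_pos))
err_neg = max(abs(a - b) for a, b in zip(analytic_grad_neg(v_w, v_n), num_neg))
print(err_pos, err_neg)  # both should be tiny
```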

The update rule for one training example with learning rate η is then:

v_w     ← v_w     + η · [ (1 − σ(v_w · v'_c)) · v'_c
                          − Σᵢ σ(v_w · v'_{n_i}) · v'_{n_i} ]
v'_c    ← v'_c    + η · (1 − σ(v_w · v'_c)) · v_w
v'_{n_i} ← v'_{n_i} − η · σ(v_w · v'_{n_i}) · v_w

If this looks like a lot, don’t worry. It is only a few lines of code in a real implementation, and the structure is: “pull real pairs together, push fake pairs apart”. That is the whole training dynamic.
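The update rule above can be sketched as a single function. This is an illustrative plain-Python version, not the reference implementation: the name sgd_step, the learning rate of 0.1, and the vectors are assumptions, and real code would update rows of W and W' in place with a vectorised library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ns_objective(v_w, v_c, negatives):
    """J_ns for one (w, c) pair and its k negatives."""
    positive = math.log(sigmoid(dot(v_w, v_c)))
    negative = sum(math.log(sigmoid(-dot(v_w, v_n))) for v_n in negatives)
    return positive + negative

def sgd_step(v_w, v_c, negatives, lr):
    """One gradient-ascent step; returns updated copies of all vectors."""
    g_pos = 1.0 - sigmoid(dot(v_w, v_c))                    # weight of the pull
    g_negs = [sigmoid(dot(v_w, v_n)) for v_n in negatives]  # weights of the pushes

    # Gradient of J_ns with respect to v_w, built from the pre-update vectors.
    grad_w = [g_pos * c_i for c_i in v_c]
    for g, v_n in zip(g_negs, negatives):
        grad_w = [gw - g * n_i for gw, n_i in zip(grad_w, v_n)]

    new_v_w = [x + lr * gw for x, gw in zip(v_w, grad_w)]
    new_v_c = [x + lr * g_pos * w_i for x, w_i in zip(v_c, v_w)]
    new_negs = [
        [x - lr * g * w_i for x, w_i in zip(v_n, v_w)]
        for g, v_n in zip(g_negs, negatives)
    ]
    return new_v_w, new_v_c, new_negs

# Made-up vectors: one target, one true neighbour, k = 2 negatives.
v_w = [0.5, 0.1, -0.2]
v_c = [0.4, 0.0, -0.1]
negatives = [[-0.3, 0.2, 0.1], [0.1, -0.4, 0.3]]

before = ns_objective(v_w, v_c, negatives)
v_w2, v_c2, negatives2 = sgd_step(v_w, v_c, negatives, lr=0.1)
after = ns_objective(v_w2, v_c2, negatives2)
print(before, after)  # the objective should go up after the step
```

After the step, the dot product with the real neighbour is larger and the dot products with the negatives are smaller, which is the “pull together, push apart” behaviour in numbers.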

5.4 Why the analogy trick works — informal geometry

Once trained, the vectors live in a high-dimensional space (300 dimensions is a common choice) where certain directions correspond, approximately, to conceptual axes:

  • A “gender” axis that separates man/woman, king/queen, actor/actress.
  • A “country → capital” axis that separates India/Delhi, Japan/Tokyo, France/Paris.
  • A “verb tense” axis that separates walk/walked, run/ran.

These axes are not explicitly designed — they emerge because the training data contains regularities that are easier to explain if different conceptual dimensions end up pointing in different directions in vector space.

An analogy like king − man + woman traces this geometry:

  1. king − man computes the direction that takes you from “a regular adult male” to “a male monarch”. Roughly, it’s the “royalty” vector.
  2. Adding that to woman takes “a regular adult female” in the royalty direction — landing near “queen”.

The word closest to vec("king") − vec("man") + vec("woman") (by cosine similarity) tends to be “queen” in a well-trained model.

Why cosine similarity? Because the important thing about a word vector is its direction, not its magnitude. Two vectors pointing the same way have cosine similarity near 1; opposite directions give −1; orthogonal directions give 0. The classic post-Word2Vec evaluation takes the top-1 or top-5 nearest vectors by cosine similarity.
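For reference, cosine similarity is the dot product divided by the product of the two lengths. A minimal sketch, with a hypothetical helper named cosine and three hand-picked pairs matching the cases just described (it does not guard against zero-length vectors):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|); depends on direction only, not length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

same_dir = cosine([1.0, 2.0], [2.0, 4.0])     # same direction, different length
opposite = cosine([1.0, 0.0], [-3.0, 0.0])    # opposite directions
orthogonal = cosine([1.0, 0.0], [0.0, 5.0])   # right angle
print(same_dir, opposite, orthogonal)
```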

5.5 Worked example — king − man + woman ≈ queen

We’ll use made-up but plausible 4-dimensional vectors (real Word2Vec uses 300 dimensions, but 4 is enough to illustrate). Say after training we have:

vec(king)   = [ 0.90,  0.50,  0.10,  0.05 ]
vec(man)    = [ 0.80,  0.05,  0.10,  0.00 ]
vec(woman)  = [ 0.05,  0.05,  0.85,  0.00 ]
vec(queen)  = [ 0.10,  0.55,  0.85,  0.05 ]

The first dimension tracks “maleness”, the second “royalty”, the third “femaleness”, the fourth is just some other direction.

Compute the analogy:

vec(king) − vec(man) = [0.90−0.80, 0.50−0.05, 0.10−0.10, 0.05−0.00]
                     = [0.10,  0.45,  0.00,  0.05]
                       ← mostly the "royalty" direction; the maleness has almost cancelled

Add vec(woman):

vec(king) − vec(man) + vec(woman)
  = [0.10+0.05, 0.45+0.05, 0.00+0.85, 0.05+0.00]
  = [0.15, 0.50, 0.85, 0.05]

Now compare to vec(queen) = [0.10, 0.55, 0.85, 0.05]. They’re almost identical: they differ by 0.05 in the first two dimensions and not at all in the other two. Among the four toy vectors above, vec(queen) is by far the closest to our computed analogy vector by cosine similarity (about 0.998, versus about 0.89 for vec(woman)). The math worked.
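The same arithmetic can be checked in a few lines of Python. The sketch below uses exactly the four toy vectors from this section and ranks them by cosine similarity to the computed vector; the variable names are illustrative.

```python
import math

# The toy vectors from this section.
vectors = {
    "king":  [0.90, 0.50, 0.10, 0.05],
    "man":   [0.80, 0.05, 0.10, 0.00],
    "woman": [0.05, 0.05, 0.85, 0.00],
    "queen": [0.10, 0.55, 0.85, 0.05],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# vec(king) - vec(man) + vec(woman), computed coordinate by coordinate.
target = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

# Rank every word in the toy vocabulary by cosine similarity to the target.
ranking = sorted(
    vectors, key=lambda word: cosine(target, vectors[word]), reverse=True
)
for word in ranking:
    print(word, round(cosine(target, vectors[word]), 3))
```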

5.6 Indian variant — Delhi − India + Japan ≈ Tokyo

The same kind of 4-dimensional toy vectors, but now tracking country/capital semantics. After training:

vec(Delhi)  = [ 0.85,  0.95,  0.10,  0.05 ]
vec(India)  = [ 0.10,  0.95,  0.15,  0.05 ]
vec(Japan)  = [ 0.10,  0.20,  0.90,  0.05 ]
vec(Tokyo)  = [ 0.85,  0.20,  0.85,  0.05 ]

Here:

  • Dimension 1 tracks “is-a-capital”.
  • Dimension 2 tracks “India-ness”.
  • Dimension 3 tracks “Japan-ness”.
  • Dimension 4 is noise.

Compute:

vec(Delhi) − vec(India) = [0.75, 0.00, −0.05, 0.00]
                         ← the "is-a-capital, with the India-ness cancelled"
                           direction

vec(Delhi) − vec(India) + vec(Japan)
  = [0.75 + 0.10, 0.00 + 0.20, −0.05 + 0.90, 0.00 + 0.05]
  = [0.85, 0.20, 0.85, 0.05]

Compare to vec(Tokyo) = [0.85, 0.20, 0.85, 0.05]. Exact match.

In a real Word2Vec model you won’t get exact matches, but the nearest neighbour of the computed vector will usually be “Tokyo”. This is why, on the Google analogy test set introduced with the paper, Word2Vec could answer questions like “Delhi is to India as _____ is to Japan” more accurately than the earlier models it was compared against. Section 5.7 gives the numbers and the caveats.
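Another way to read the toy computation: Delhi − India + Japan equals Tokyo exactly when the two “country to capital” offsets are equal, that is, when vec(Delhi) − vec(India) = vec(Tokyo) − vec(Japan). A short sketch using the toy vectors from this section (the helper name sub is illustrative):

```python
# The toy vectors from this section.
delhi = [0.85, 0.95, 0.10, 0.05]
india = [0.10, 0.95, 0.15, 0.05]
japan = [0.10, 0.20, 0.90, 0.05]
tokyo = [0.85, 0.20, 0.85, 0.05]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

offset_india = sub(delhi, india)  # "capital of India" offset
offset_japan = sub(tokyo, japan)  # "capital of Japan" offset

# Largest coordinate-wise difference between the two offsets.
max_gap = max(abs(x - y) for x, y in zip(offset_india, offset_japan))
print(offset_india, offset_japan, max_gap)  # gap is zero up to float rounding
```

In a trained model the two offsets are only roughly parallel, so the gap is small rather than zero, and that is the reason the analogy lands near Tokyo and not exactly on it.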

5.7 A reality-check note

The “analogy arithmetic works perfectly” framing is mostly true but occasionally oversold. In published evaluations:

  • Capital-country, gender, verb-tense analogies: ~70-85% top-1 accuracy.
  • Some categories (like comparative-superlative) work less reliably.
  • The analogy can fail for very rare words, because their vectors are trained on fewer examples.

So “≈” in king − man + woman ≈ queen is doing real work. The actual computed vector is usually close to queen, not exactly equal. “Queen” is typically one of the top 1-3 nearest words, and the question words themselves (“king”, “man”, “woman”) are normally excluded from the candidate list, because otherwise the nearest word is often just “king”.
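To see why excluding the question words matters, here is a small sketch with hypothetical 3-dimensional vectors, deliberately chosen so that man and woman are close to each other and queen is an imperfect fit. These numbers are invented for illustration and are not the toy vectors from Section 5.5.

```python
import math

# Hypothetical vectors: dimensions loosely read as [male, royal, female].
vectors = {
    "king":  [0.7, 0.9, 0.3],
    "man":   [0.6, 0.1, 0.4],
    "woman": [0.4, 0.1, 0.6],
    "queen": [0.3, 0.8, 0.9],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def nearest(target, vectors, exclude=()):
    """Word with the highest cosine similarity to target, skipping excluded words."""
    candidates = [word for word in vectors if word not in exclude]
    return max(candidates, key=lambda word: cosine(target, vectors[word]))

target = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

raw_answer = nearest(target, vectors)
filtered_answer = nearest(target, vectors, exclude=("king", "man", "woman"))
print(raw_answer, filtered_answer)
```

Because man and woman are similar here, subtracting one and adding the other barely moves the vector away from king, so the unfiltered nearest neighbour is the question word itself; only after filtering does queen come out on top.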

This will all be concrete in the next section, where you run it yourself on pretrained Google News vectors.

Next: the code — gensim demo in 25 lines.