Paper 05
Intermediate

Efficient Estimation of Word Representations in Vector Space (Word2Vec)

Word2Vec — Mikolov, Chen, Corrado, Dean (2013)

TL;DR

Before Word2Vec, most NLP systems represented the word “chai” as a one-hot vector such as [0, 0, 0, 1, 0, 0, ...]: all zeros with a single 1 at the word’s index. Under that encoding every word is equally different from every other word, so “chai” and “coffee” are as far apart as “chai” and “submarine”. The representation carries no information about meaning.
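To make the “equally different” claim concrete, here is a minimal sketch in plain Python. The six-word vocabulary and the helper names (one_hot, cosine, euclidean) are invented for illustration; a real vocabulary would have hundreds of thousands of entries.

```python
import math

# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ["the", "coffee", "submarine", "chai", "delhi", "tokyo"]

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0.0] * len(vocab)
    vec[vocab.index(word)] = 1.0
    return vec

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

chai = one_hot("chai")
coffee = one_hot("coffee")
submarine = one_hot("submarine")

print(chai)                      # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print(cosine(chai, coffee))      # 0.0
print(cosine(chai, submarine))   # 0.0
# Any two distinct one-hot vectors are exactly the same distance apart.
print(euclidean(chai, coffee) == euclidean(chai, submarine))  # True
```

Whatever pair of distinct words you pick, the similarity is 0 and the distance is the same, which is why one-hot vectors cannot express that chai is more like coffee than like a submarine.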

Mikolov and colleagues at Google proposed something almost magical: train a very small neural network on a dummy task (predict nearby words in a sentence) and then throw most of the network away. What you keep is the weight matrix between the input and the hidden layer. Those weights, one row per word, turn out to encode meaning.
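The two ingredients of that idea can be sketched in a few lines of plain Python: the “dummy task” is just a list of (word, neighbour) pairs, and the thing you keep is a matrix with one row per word. This is an illustrative sketch, not the paper’s implementation: the sentence, the window size, the 3-dimensional vectors and the function names are all assumptions, and the training loop that would actually adjust the weights is omitted.

```python
import random

def skipgram_pairs(tokens, window=2):
    """Pair each word with the words up to `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical example sentence; real training uses billions of words.
sentence = "i drink hot chai every morning".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs[:3])  # [('i', 'drink'), ('drink', 'i'), ('drink', 'hot')]

# The weight matrix: one row per vocabulary word (random, i.e. untrained, here).
random.seed(0)
vocab = sorted(set(sentence))
dim = 3  # toy size chosen for readability
W = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

def embedding(word):
    """A word's embedding is simply its row of W."""
    return W[vocab.index(word)]

# Multiplying a one-hot vector by W selects exactly that row.
idx = vocab.index("chai")
one_hot = [1.0 if k == idx else 0.0 for k in range(len(vocab))]
product = [sum(one_hot[r] * W[r][c] for r in range(len(vocab)))
           for c in range(dim)]
print(product == embedding("chai"))  # True
```

Training would nudge the rows of W so that words appearing in similar pairs end up with similar rows; once that is done, W is the only part worth saving.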

Suddenly “chai” and “coffee” had vectors close to each other. “Delhi” and “India” were neighbours, as were “Tokyo” and “Japan”. Most famously:

king − man + woman ≈ queen

You could do arithmetic with words. The vectors captured not just similarity but analogical structure.
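The arithmetic itself is simple enough to sketch with made-up numbers. The 3-dimensional vectors below are hand-written for illustration (loosely “royalty”, “maleness”, “beverage”) and are not real Word2Vec output; real embeddings have hundreds of dimensions with no such tidy labels. The analogy function is likewise a hypothetical helper, following the common convention of excluding the three query words from the candidates.

```python
import math

# Hand-made toy vectors, NOT trained embeddings.
vectors = {
    "king":   [0.95, 0.90, 0.05],
    "queen":  [0.92, 0.12, 0.06],
    "man":    [0.10, 0.88, 0.04],
    "woman":  [0.08, 0.10, 0.05],
    "chai":   [0.02, 0.45, 0.95],
    "coffee": [0.03, 0.50, 0.90],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word whose vector is closest to a - b + c."""
    target = [x - y + z
              for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words themselves from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```

Subtracting “man” removes most of the maleness component from “king”, adding “woman” leaves the royalty component intact, and the nearest remaining vector is “queen”. The result is only approximately equal, which is why the analogy is written with ≈ rather than =.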

The paper trained on a few billion words, produced 300-dimensional vectors for a million words in under a day on one machine, and released the code for free. It helped ignite the deep-learning boom in NLP.

The journey in one line

Words as meaningless IDs → train a tiny network on a fake task → the side-effect of training is that meaning appears in the weights.

What you will learn

  1. Why one-hot vectors carry zero meaning.
  2. Two model architectures, CBOW and skip-gram, that learn word embeddings.
  3. Why “predict your neighbours” is a clever proxy for learning meaning.
  4. A worked numerical example: how king − man + woman ends up near queen.
  5. An Indian version: Delhi − India + Japan ≈ Tokyo.
  6. A 25-line Python demo that loads pretrained Word2Vec and plays with it.
  7. Why Word2Vec eventually lost to contextual embeddings (BERT, Paper 11).

Sections

  1. Historical context — NLP before meaning
  2. The problem — one-hot vectors and the sparsity trap
  3. The core idea — meaning as a by-product of prediction
  4. How it works — CBOW, skip-gram, and negative sampling
  5. The math — the loss function, worked king-queen arithmetic
  6. The code — gensim demo in 25 lines
  7. Impact — NLP’s first true breakthrough
  8. Limitations — one vector per word, no context
  9. What came next — GloVe, ELMo, BERT, embeddings everywhere

Resources

  • Glossary — every new term used in this paper
  • Quiz — 5 questions to test your understanding
  • Further reading — blogs, videos, original paper
