Word2Vec — Mikolov, Chen, Corrado, Dean (2013)

TL;DR

Before 2013, most NLP systems represented the word “chai” as a one-hot vector such as [0, 0, 0, 1, 0, 0, ...]: all zeros, with a single 1 at the word’s position in the vocabulary. Every word was equally distant from every other word. “Chai” and “coffee” were as far apart as “chai” and “submarine”. The representation itself carried no sense of meaning.
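To make the "equally different" point concrete, here is a minimal sketch in plain Python. The four-word vocabulary is a hypothetical one chosen only for illustration, and the helper names `one_hot`, `cosine`, and `euclidean` are assumptions, not anything from the paper.

```python
import math

# Hypothetical four-word vocabulary, chosen only for illustration.
vocab = ["chai", "coffee", "submarine", "delhi"]

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0.0] * len(vocab)
    vec[vocab.index(word)] = 1.0
    return vec

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

chai_coffee = cosine(one_hot("chai"), one_hot("coffee"))
chai_submarine = cosine(one_hot("chai"), one_hot("submarine"))

# Both similarities are identical, so the encoding cannot tell
# a related word from an unrelated one.
print(chai_coffee, chai_submarine)
print(euclidean(one_hot("chai"), one_hot("coffee")),
      euclidean(one_hot("chai"), one_hot("submarine")))
```

Any two distinct one-hot vectors have a dot product of zero, so every pair of words looks equally unrelated regardless of what the words mean.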

Mikolov and colleagues at Google proposed something almost magical: train a very small neural network to do a dummy task — predict nearby words in a sentence — and then throw away the network. What you keep is the hidden layer’s weights. Those weights, one row per word, turn out to encode meaning.
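The "predict nearby words" task needs training examples, and a rough sketch of how they might be generated looks like this. The function name `skipgram_pairs`, the example sentence, and the fixed window of 2 are all assumptions made for illustration; this is a simplification, not the authors' implementation.

```python
def skipgram_pairs(tokens, window=2):
    """Build (center, context) pairs: each word is paired with its neighbours.

    A fixed window is a simplification chosen for this sketch.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # clip the window at the sentence start
        hi = min(len(tokens), i + window + 1)   # and at the sentence end
        for j in range(lo, hi):
            if j != i:                          # a word is not its own neighbour
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical example sentence.
sentence = "i drink hot chai every morning".split()
pairs = skipgram_pairs(sentence, window=2)

# The neighbours the network would be asked to predict for "chai".
chai_contexts = [context for center, context in pairs if center == "chai"]
print(chai_contexts)
```

Each pair becomes one training example: the network sees the center word and is nudged towards assigning a high score to the context word. Words that keep appearing with similar neighbours end up with similar rows in the weight matrix.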

Suddenly “chai” and “coffee” had vectors close to each other. “Delhi” and “India” were neighbours, as were “Tokyo” and “Japan”. Most famously:

king − man + woman ≈ queen

You could do arithmetic with words. The vectors captured not just similarity but analogical structure.
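The analogy arithmetic can be sketched with a handful of toy vectors. The 3-dimensional numbers below are hand-made for illustration (loosely: royalty, maleness, femaleness) and are not learned embeddings; the `analogy` helper is likewise a hypothetical name. Real Word2Vec vectors have hundreds of dimensions with no such readable axes.

```python
import math

# Hand-made 3-D toy vectors, NOT learned embeddings (assumption for illustration).
# Axes are loosely: [royalty, maleness, femaleness].
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "chai":  [0.1, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman")
print(result)
```

Subtracting "man" removes the maleness component, adding "woman" puts femaleness in its place, and the royalty component is left untouched, so the nearest remaining word is "queen". Excluding the three input words from the candidates is the usual convention when evaluating analogies.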

The paper trained on a few billion words, produced 300-dimensional vectors for a vocabulary of about a million words in under a day on one machine, and the code was released for free. It is widely regarded as the spark that ignited the deep-learning boom in NLP.

The journey in one line

Words as meaningless IDs → train a tiny network on a fake task → the side-effect of training is that meaning appears in the weights.

What you will learn

Why one-hot vectors carry zero meaning.
Two model architectures, CBOW and skip-gram, that learn word embeddings.
Why “predict your neighbours” is a clever proxy for learning meaning.
A worked numerical example: how king − man + woman ends up near queen.
An Indian version: Delhi − India + Japan ≈ Tokyo.
A 25-line Python demo that loads pretrained Word2Vec and plays with it.
Why Word2Vec eventually lost to contextual embeddings (BERT, Paper 11).

Sections

Historical context — NLP before meaning
The problem — one-hot vectors and the sparsity trap
The core idea — meaning as a by-product of prediction
How it works — CBOW, skip-gram, and negative sampling
The math — the loss function, worked king-queen arithmetic
The code — gensim demo in 25 lines
Impact — NLP’s first true breakthrough
Limitations — one vector per word, no context
What came next — GloVe, ELMo, BERT, and embeddings everywhere

Resources

Glossary — every new term used in this paper
Quiz — 5 questions to test your understanding
Further reading — blogs, videos, original paper

Efficient Estimation of Word Representations in Vector Space (Word2Vec)