1. Historical context — NLP before meaning

In 2012, natural language processing (NLP) was in an odd place. We had beautiful statistical machinery — hidden Markov models, conditional random fields, support vector machines — and a generation of linguists had patiently hand-engineered features: part-of-speech tags, parse trees, named-entity lists, stemming rules, stop-word lists. A good NLP system for, say, Hindi–English translation required a small research group to build over several years.

And yet, a fundamental problem sat unsolved: computers had no intuitive sense of meaning.

The one-hot dark ages

If you wanted to feed a word into a neural network, you had to turn it into numbers somehow. The default choice was a one-hot vector: a vector with one 1 and the rest 0s, one position per word in the vocabulary.

If your vocabulary had 10,000 words:

"chai"       = [0, 0, 0, ..., 1, 0, 0, 0, 0, 0]   (position 4,271)
"coffee"     = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]     (position 2)
"submarine"  = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]     (position 0)

From the computer’s point of view, all three were equally distinct. Chai and coffee were as different as chai and submarine. Every pair of words had the same “distance”.
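The "same distance" claim is easy to check numerically. The sketch below uses a hypothetical five-word vocabulary (the words and their indices are made up for illustration) and NumPy. It is meant only to show that any two distinct one-hot vectors are the same distance apart and have zero overlap.

```python
import numpy as np

# Toy vocabulary (hypothetical); the index order is arbitrary, as in any one-hot scheme.
vocab = ["submarine", "milk", "coffee", "sugar", "chai"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(a - b))

chai, coffee, submarine = one_hot("chai"), one_hot("coffee"), one_hot("submarine")

d_chai_coffee = euclidean(chai, coffee)        # sqrt(2)
d_chai_submarine = euclidean(chai, submarine)  # also sqrt(2)
dot_chai_coffee = float(chai @ coffee)         # 0.0: no shared component at all
```

Whatever two distinct words you pick, the distance is the square root of 2 and the dot product is 0, so the encoding carries no similarity information.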

That’s clearly wrong. Humans know chai and coffee are related — both are hot drinks, both caffeinated, often served together at a chai ki tapri. Computers knew nothing of this. They had no representation of meaning.

The workaround: count-based methods

In the 1990s and 2000s, researchers tried to fix this with count-based word vectors. The idea: for each word, look at the other words that tend to appear near it, and use those co-occurrence counts as a vector.

For example, “chai” appears often with words like “milk”, “sugar”, “adda”, “cutting”, “sip”, “cup”, “kadak”. “Coffee” appears with “bean”, “filter”, “espresso”, “cappuccino”, “sip”, “cup”. They share “sip” and “cup”, so their count vectors are closer to each other than either is to the vector for “submarine”.
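As a rough illustration of the counting idea, here is a minimal sketch in plain Python. The five-sentence corpus, the window size of 2, and the helper names `cooccurrence_vectors` and `cosine` are all assumptions made for this example. Real systems of the period used far larger corpora and usually reweighted the raw counts.

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "sip kadak chai with milk and sugar",
    "sip hot coffee from a cup",
    "chai in a cup with sugar",
    "filter coffee with milk",
    "the submarine dived under the sea",
]

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = cooccurrence_vectors(corpus)
sim_chai_coffee = cosine(vectors["chai"], vectors["coffee"])        # shared neighbours, so > 0
sim_chai_submarine = cosine(vectors["chai"], vectors["submarine"])  # no shared neighbours, so 0
```

Each vector here is stored sparsely as a Counter, but conceptually it has one dimension per vocabulary word, which is where the size problem comes from.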

This worked, sort of. But:

The vectors were enormous (dimension = vocabulary size, often 100,000+).
They were sparse (mostly zeros).
They didn’t capture subtle relationships like analogies (“king is to queen as man is to woman”).
Building them required scanning the entire corpus for every update — slow.

The most sophisticated version of this approach was latent semantic analysis (LSA): take the big count matrix, apply singular value decomposition, and keep the top few hundred dimensions. It helped. But it was still count-based, still slow, and didn’t really learn meaning — it just compressed co-occurrence statistics.
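The LSA recipe can be sketched in a few lines of NumPy. The count matrix below is hypothetical (three words, six context columns, numbers chosen by hand), and `lsa_embed` is an illustrative helper rather than any library's API. Classic LSA also typically reweights counts (for example with tf-idf) before the decomposition, which is omitted here.

```python
import numpy as np

# Hypothetical word-by-context count matrix (rows: words, columns: context words).
words = ["chai", "coffee", "submarine"]
contexts = ["sip", "cup", "milk", "bean", "sea", "dive"]
counts = np.array([
    [4.0, 3.0, 5.0, 0.0, 0.0, 0.0],   # chai
    [5.0, 4.0, 2.0, 6.0, 0.0, 0.0],   # coffee
    [0.0, 0.0, 0.0, 0.0, 7.0, 3.0],   # submarine
])

def lsa_embed(matrix, k):
    """Truncated SVD: keep only the top-k singular directions of the count matrix."""
    u, s, _vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]   # one k-dimensional row per word

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

dense = lsa_embed(counts, k=2)   # 6 sparse dimensions compressed to 2 dense ones
sim_chai_coffee = cosine(dense[0], dense[1])
sim_chai_submarine = cosine(dense[0], dense[2])
```

In this toy case the two drinks collapse onto one shared direction and the submarine gets the other, which shows both the appeal and the limit of the method: it compresses the co-occurrence statistics it is given and nothing more.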

The neural hint

In 2003, Yoshua Bengio and colleagues published “A Neural Probabilistic Language Model”. They trained a neural network to predict the next word given the previous few, and in the process the network learned a dense representation of each word — a “distributed representation” in the language of the time. The word vectors were small (around 100 dimensions), dense, and captured some semantic structure.

Bengio’s approach was beautiful, but slow: training required hours or days for even a small vocabulary. For real applications it was impractical, and most NLP labs kept using count-based methods or bag-of-words.

Enter Mikolov, Google, and the skip-gram trick

In 2012, a Czech researcher named Tomas Mikolov, who had worked on RNN language models during his PhD, joined Google Brain. He and his colleagues asked a practical question: can we get word vectors like Bengio’s, but 100× faster?

They simplified and simplified. They threw out the non-linear hidden layer. They replaced the expensive full softmax with cheaper approximations (hierarchical softmax at first, negative sampling later). They trained not on next-word prediction but on a much simpler “predict the neighbours” task.
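To make the “predict the neighbours” task concrete, the sketch below turns one tokenised sentence into (centre, context) training pairs, which is the raw material a skip-gram model learns from. The function name `skipgram_pairs`, the example sentence, and the window size are assumptions for illustration. The released tool adds further details, such as sub-sampling of frequent words, that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) training pairs from one tokenised sentence."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:   # a word is not its own neighbour
                pairs.append((centre, tokens[j]))
    return pairs

# Hypothetical example sentence.
sentence = "sip kadak chai with milk".split()
pairs = skipgram_pairs(sentence, window=1)
# With window=1, "chai" is paired with "kadak" and "with", but not with "milk".
```

Each pair becomes one tiny prediction problem: given the centre word, score the context word highly. Nothing about the task requires labels, so any raw text will do.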

The result was published in January 2013, as a workshop paper (not even a full conference paper) at ICLR: Efficient Estimation of Word Representations in Vector Space. A second paper later that year — Distributed Representations of Words and Phrases and their Compositionality — polished the method with negative sampling and sub-sampling.

Along with the paper, they released the word2vec C code and a pretrained set of 300-dimensional vectors for 3 million English words and phrases, trained on roughly 100 billion words of Google News. For free. Anybody could download it and start using it in their own system.

Why this landed

Three things made Word2Vec a watershed moment:

Speed. Training went from days to hours. Small labs could join in.
Quality. The famous king − man + woman ≈ queen demonstration — simple to verify, impossible to ignore. The vectors visibly knew things.
Availability. The pretrained vectors meant you didn’t have to train at all. You could grab them and plug them into any NLP pipeline. A 2013 search engineer in Bengaluru could download the same vectors Google used in-house.
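The analogy arithmetic itself is only vector addition followed by a nearest-neighbour search. The sketch below uses hand-made three-dimensional vectors (hypothetical values chosen so the axes loosely mean royalty, male, female), not the real pretrained 300-dimensional ones, so it shows the mechanics rather than reproducing the published result. It also assumes the usual convention of excluding the query words from the candidates.

```python
import numpy as np

# Hand-made toy vectors (hypothetical, NOT trained values).
# Axes are loosely: royalty, male, female.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "chai":  np.array([0.05, 0.5, 0.5]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' by finding the word nearest to b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]   # skip the query words
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy("man", "king", "woman")   # king - man + woman
```

With these toy values the nearest remaining word is "queen". With real pretrained vectors the same few lines apply, only the dictionary is much larger.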

The paper kicked off the “representation learning” era in NLP. Within two years, Word2Vec-style embeddings were the standard first step in most deep-learning NLP systems. By 2018, contextual embeddings (BERT, Paper 11) would replace them, but Word2Vec laid the foundation.

One cultural note

It’s worth pausing on how unglamorous the mechanism is. Word2Vec trains a network on a fake task (predict a word from its neighbours, or, in the skip-gram variant, the neighbours from a word) that nobody actually needs to do. The network learns to do it reasonably well. And then you throw the network away and keep the side-effect: the learned weights. Those weights, which were never the goal, turn out to encode meaning.
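The "train, then throw the network away" pattern can be sketched with a tiny NumPy model. Everything here is an assumption made for illustration: the ten-token corpus, the 8-dimensional vectors, the learning rate, and above all the use of a full softmax, which is the slow baseline rather than the hierarchical softmax or negative sampling that the actual tool relies on. A corpus this small cannot learn real semantics; the point is only the shape of the procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy corpus; far too small to learn anything meaningful.
corpus = "sip hot chai sip hot coffee drink chai drink coffee".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# (centre, context) index pairs with a window of 1.
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

dim = 8
w_in = rng.normal(scale=0.1, size=(len(vocab), dim))    # input weights: the part we keep
w_out = rng.normal(scale=0.1, size=(dim, len(vocab)))   # output weights: discarded later

def softmax(scores):
    exp = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return exp / exp.sum()

def mean_loss():
    """Average cross-entropy of predicting each context word from its centre word."""
    return float(np.mean([-np.log(softmax(w_in[c] @ w_out)[o]) for c, o in pairs]))

loss_before = mean_loss()
learning_rate = 0.1
for epoch in range(200):
    for c, o in pairs:
        probs = softmax(w_in[c] @ w_out)
        grad_scores = probs.copy()
        grad_scores[o] -= 1.0              # gradient of cross-entropy w.r.t. the scores
        grad_in = w_out @ grad_scores      # computed before w_out is updated
        w_out -= learning_rate * np.outer(w_in[c], grad_scores)
        w_in[c] -= learning_rate * grad_in
loss_after = mean_loss()

# The prediction network is no longer needed; the rows of w_in are the word vectors.
embeddings = w_in
```

The loss going down shows the network got better at the fake task, but the only artefact anyone uses afterwards is `embeddings`, one dense row per vocabulary word.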

This is a recurring pattern in deep learning: the explicit objective is often a proxy. The interesting thing is what the network happens to learn while achieving it. Keep this in mind — you’ll see it again in Paper 08 (Transformer) and Paper 10 (GPT-1).

Next: the problem — why one-hot vectors fail.