Efficient Estimation of Word Representations in Vector Space (Word2Vec) 2013

8. Limitations — one vector per word, and the cracks that showed

Word2Vec was a huge leap. It was also, inevitably, an incomplete solution. Six specific limitations emerged in the years that followed, and most of them pointed to follow-up work that would later become famous.

Limitation 1 — one vector per word, forever

The most fundamental limitation is stated in the title of this section. Word2Vec gives each word exactly one vector. But many words have multiple meanings:

  • “bank” in “river bank” vs. “HDFC bank”.
  • “post” in “post office” vs. “post a photo”.
  • “bat” in “cricket bat” vs. “the bat flew into the room”.
  • “charge” in “charge the phone” vs. “charge a customer”.

In Word2Vec, all of these meanings share one vector. The vector ends up being a blurry average of the senses, pulled toward whichever sense is most frequent in the training corpus, so it represents no single meaning precisely. This is sometimes called the polysemy problem, or more vividly, the “blurred average” problem.
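The blurring can be pictured with a toy sketch. This is an idealisation, not what Word2Vec literally computes during training: it assumes two made-up, orthogonal "sense" vectors for "bank" and mixes them in proportion to how often each sense occurs. The names `bank_river`, `bank_money`, and `single_vector` are invented for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical "ideal" sense vectors, chosen orthogonal for clarity.
bank_river = np.array([1.0, 0.0])
bank_money = np.array([0.0, 1.0])

def single_vector(p_money):
    """Mix the two senses in proportion to how often each one occurs."""
    return (1.0 - p_money) * bank_river + p_money * bank_money

balanced = single_vector(0.5)   # both senses equally common
skewed = single_vector(0.9)     # the financial sense dominates the corpus

print(cosine(balanced, bank_river), cosine(balanced, bank_money))
print(cosine(skewed, bank_river), cosine(skewed, bank_money))
```

With equal frequencies the single vector sits halfway between the two senses and matches neither exactly. With skewed frequencies it drifts toward the common sense and the rare sense is nearly lost.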

Why this matters

For most words it doesn’t matter much — words with similar-ish meanings still get similar-ish vectors, and downstream tasks can often cope. But for words like “bank”, “charge”, or the thousands of homographs in English (and their equivalents in every Indian language), Word2Vec has a hard ceiling on accuracy.

The fix: contextual embeddings. ELMo (2018) was the first widely adopted example, and BERT (Paper 11) was the definitive one. In these, a word’s vector depends on the sentence it’s in. “Bank” in “river bank” gets one vector; “bank” in “HDFC bank” gets another. That largely solves the problem, but at the cost of running a full neural network every time you need an embedding.

Limitation 2 — it doesn’t know about morphology

Indian languages are morphologically rich. Hindi alone has dozens of verb forms for each root. Tamil has agglutination that produces word forms you’ve never seen before. Word2Vec treats every form as a separate word with a separate vector:

"chalna"   (to walk)          → one vector
"chalta"   (walks, masc.)     → another vector
"chalti"   (walks, fem.)      → another vector
"chalega"  (will walk, masc.) → another vector

Rare forms end up with weak vectors, because they appear less often and the training signal is thin. Word2Vec has no notion that these are all related via a shared root.

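The fix: fastText (Bojanowski et al., 2016). fastText extends Word2Vec by adding character n-grams: it learns vectors for overlapping sub-word pieces of each word (for “chalta”, pieces such as “cha”, “hal”, and “lta”), and the word vector is the sum of its sub-word vectors. Forms that share a root share most of their pieces, so rare and morphologically complex words get much better vectors, and even an unseen form can be assembled from known pieces. fastText became a common default for static embeddings in Indian languages.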
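The sub-word idea can be sketched in a few lines. This is a minimal illustration, not the fastText library's implementation: it assumes n-grams of 3 to 6 characters with `<` and `>` as word-boundary markers, and it uses a plain dictionary of random toy vectors (`ngram_table`, a made-up name) where the real system uses a trained, hashed table.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    grams.add(wrapped)  # the whole word is kept as its own unit
    return grams

rng = np.random.default_rng(0)
ngram_table = {}  # toy stand-in for a trained n-gram vector table

def word_vector(word, dim=8):
    """Sum the vectors of a word's n-grams, creating toy vectors on demand."""
    vec = np.zeros(dim)
    for gram in sorted(char_ngrams(word)):
        if gram not in ngram_table:
            ngram_table[gram] = rng.normal(size=dim)
        vec += ngram_table[gram]
    return vec

shared = char_ngrams("chalta") & char_ngrams("chalti")
print(sorted(shared))                # pieces the two verb forms have in common
print(word_vector("chalega").shape)  # a form never seen before still gets a vector
```

Because "chalta" and "chalti" differ only in the last letter, every n-gram that lies inside the common prefix is shared, which is what ties the two vectors together.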

Limitation 3 — no sentence-level meaning

A Word2Vec vector is for a single word. To get a vector for a sentence, people resorted to workarounds:

  • Average the word vectors. Works surprisingly well for short, topical similarity but washes out word order, negation, and structure.
  • Weighted averages (SIF, smooth inverse frequency). Slightly better but still ignores syntax.
  • Train a separate model (doc2vec / paragraph vectors). Mikolov’s own follow-up in 2014. Works, but requires training a new model.

None of these handle negation, modification, or compositional meaning cleanly. “I didn’t like the food” and “I liked the food” have almost identical averaged vectors.
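Why averaging washes out structure can be shown with a small sketch. It uses random toy vectors in place of trained embeddings and made-up unigram probabilities in `prob`. The `sif_vector` function only applies the SIF-style weight a / (a + p(w)); the full SIF method also removes a common component afterwards, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["dog", "bites", "man", "the"]
# Toy random vectors standing in for trained Word2Vec embeddings.
emb = {w: rng.normal(size=50) for w in words}
# Hypothetical unigram probabilities, used only for the SIF-style weights.
prob = {"the": 0.05, "dog": 0.0005, "bites": 0.0001, "man": 0.001}

def average_vector(tokens):
    """Plain mean of the word vectors; token order plays no role."""
    return np.mean([emb[t] for t in tokens], axis=0)

def sif_vector(tokens, a=1e-3):
    """Mean weighted by a / (a + p(w)), so frequent words count for less."""
    weights = np.array([a / (a + prob[t]) for t in tokens])
    vectors = np.array([emb[t] for t in tokens])
    return (weights[:, None] * vectors).sum(axis=0) / len(tokens)

s1 = ["the", "dog", "bites", "the", "man"]
s2 = ["the", "man", "bites", "the", "dog"]

print(np.allclose(average_vector(s1), average_vector(s2)))  # True
print(np.allclose(sif_vector(s1), sif_vector(s2)))          # True
```

Both sentences contain the same bag of words, so both schemes return the same vector even though the meanings are opposite. Negation fails for a related reason: one extra token barely moves the average.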

The fix: contextual, sentence-level models. This is what Transformers (Paper 08) and BERT (Paper 11) deliver. A full Transformer produces a vector per position that already encodes the surrounding context.

Limitation 4 — bias bakes in

Word2Vec learns from human text, and human text contains bias. The famous Bolukbasi et al. (2016) paper showed that Word2Vec trained on Google News learned:

  • doctor : man :: nurse : woman
  • computer programmer : man :: homemaker : woman

Other social biases in a training corpus, including racial and caste bias, transfer into the vector geometry in the same way.

This is not a bug in Word2Vec — it’s accurately reflecting statistical patterns in the training data. But the system is then used in downstream applications (resume screening, search ranking, news recommendation) where those biases cause real harm.

Mitigations: “debiased” word embeddings, where researchers try to remove the gender direction explicitly. These help somewhat but don’t solve the problem: much of the bias remains recoverable from the rest of the geometry, and it can be reintroduced downstream. This has become its own research area, and Word2Vec was the original empirical demonstration of the problem at scale.
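Removing the gender direction can be sketched as a projection. This is a simplified illustration with invented 3-dimensional toy vectors: it estimates the direction from a single he/she pair, whereas published methods estimate it from several pairs, and the function name `neutralize` is chosen here for illustration.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

# Toy 3-d vectors, invented for illustration (not trained embeddings).
he = np.array([1.0, 0.2, 0.0])
she = np.array([-1.0, 0.2, 0.0])
nurse = np.array([-0.4, 0.5, 0.7])

# Simplification: one word pair defines the gender direction.
gender_direction = unit(he - she)

def neutralize(v, direction):
    """Remove the component of v along direction, then re-normalise."""
    projection = (v @ direction) * direction
    return unit(v - projection)

before = float(unit(nurse) @ gender_direction)
after = float(neutralize(nurse, gender_direction) @ gender_direction)
print(before, after)
```

After the projection, "nurse" has no component along the chosen direction. Nothing else about the vector changes, which is one way to see why bias that is spread across other dimensions survives this step.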

In an Indian context: Word2Vec trained on Hindi Wikipedia reflects the gender, caste, and regional biases of Wikipedia contributors. Models deployed on Indian language data without bias audits can silently reinforce those patterns — something AI4Bharat and similar labs have been vocal about.

Limitation 5 — you can’t update it incrementally

Word2Vec is trained offline on a fixed corpus. If a new word enters the language — say, “WFH” (work from home) in 2020, or “Jio” in 2016 — the existing model doesn’t know what to do with it. You either treat new words as <UNK> or retrain the whole model.
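A frozen vocabulary behaves roughly like the sketch below. The vocabulary, the vectors, and the `lookup` helper are all toy assumptions. The point is only that every unseen word falls back to the same `<UNK>` row.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy frozen vocabulary; the words and vectors are illustrative only.
vocab = {"<UNK>": 0, "work": 1, "from": 2, "home": 3, "office": 4}
embeddings = rng.normal(size=(len(vocab), 4))  # fixed once training ends

def lookup(word):
    """Return the trained vector, or the shared <UNK> vector for unseen words."""
    return embeddings[vocab.get(word, vocab["<UNK>"])]

# Two unrelated new words collapse onto the same vector.
print(np.array_equal(lookup("WFH"), lookup("Jio")))   # True
print(np.array_equal(lookup("WFH"), lookup("home")))  # False
```

The model cannot tell "WFH" from "Jio", and it has no link between "WFH" and the "work", "from", and "home" vectors it already knows.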

In practice, this means Word2Vec embeddings go stale. A model trained in 2013 has no concept of “Aadhaar” (became ubiquitous 2015+), “PayTM” (exploded 2016+), “Covid” (2020+), or “Jugalbandi AI” (2024+). For many real applications this is acceptable — but it’s a real limitation compared to later models that can handle new words via subword tokenisation.

Limitation 6 — it’s shallow

Technically, Word2Vec is a shallow model: a single linear projection layer (in effect an embedding lookup) with no non-linearity, followed by an output layer that scores candidate words. Everything it “understands” about a word is compressed into that one vector, typically around 300 dimensions. It cannot do reasoning, cannot handle long-range dependencies, and cannot compose meanings the way a human does.
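How little computation is involved can be seen in a sketch of a skip-gram style forward pass. The sizes and random weights are toy assumptions, and a full softmax is used for readability; the paper uses hierarchical softmax to avoid scoring the whole vocabulary. The names `W_in`, `W_out`, and `skipgram_probs` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 300  # toy vocabulary; 300 dimensions as in the text

W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # input (word) vectors
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # output (context) vectors

def skipgram_probs(center_id):
    """Predict a distribution over context words for one centre word."""
    h = W_in[center_id]      # the "hidden layer" is a row lookup, no non-linearity
    scores = W_out @ h       # one matrix-vector product scores every word
    scores -= scores.max()   # numerical stability for the softmax
    exp = np.exp(scores)
    return exp / exp.sum()

probs = skipgram_probs(3)
print(probs.shape, probs.sum())
```

There is no stack of layers and no mixing of information between positions: a lookup, one product, and a normalisation are the whole model.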

This made it perfect for its era — 2013 hardware could train a one-layer model on billions of words. But as GPUs got faster, it became possible to train deeper models that captured much richer structure. By 2018, BERT (a 12- or 24-layer Transformer) had essentially replaced Word2Vec in high-resource settings because deeper models simply knew more.

Putting it together

Word2Vec’s limitations are each a direction pointing toward a later paper:

Limitation                   Next step                   Paper
One vector per word          Contextual embeddings       BERT (Paper 11)
No morphology                Sub-word vectors            fastText (not in our series)
No sentence meaning          Full-sentence encoders      Transformer (Paper 08)
Bias baked in                Fairness-aware training     ongoing research
Can’t update incrementally   Tokeniser + larger models   GPT (Paper 10)
Shallow                      Deep contextual networks    BERT, GPT (Papers 10, 11)

This is why Word2Vec, though no longer state of the art, is still the clearest thing to learn first. Every later paper you’ll read in this series is solving one of its problems while keeping its core idea — “meaning lives in learned vector spaces” — fundamentally intact.

Next: what came after Word2Vec.