Section 07

Limitations of GPT-3

Language Models are Few-Shot Learners 2020

GPT-3 was groundbreaking, but it had real limitations. Knowing them helps explain where follow-up research went.

Limitation 1: Requires Massive Compute to Train

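Training GPT-3’s 175 billion parameters required:

  • Compute: about 3,640 petaflop/s-days (roughly 3.14 × 10^23 floating-point operations), as reported in the paper; on the GPUs of the time this corresponds to hundreds of GPU-years
  • Cost: roughly $5–$10 million USD for a single training run from scratch (third-party estimates)
  • Time: weeks to months of training, even on a large cluster of cutting-edge hardware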
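The scale of the training compute can be sanity-checked with a back-of-envelope calculation. The sketch below assumes the common rule of thumb of about 6 floating-point operations per parameter per training token, and the 300 billion training tokens reported in the GPT-3 paper. The function names and the 30 teraFLOP/s per-GPU throughput are illustrative assumptions, not figures from this section.

```python
# Back-of-envelope training compute for GPT-3.
# Assumptions: ~6 FLOPs per parameter per training token, 300e9 training tokens.

SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate total training FLOPs as 6 * parameters * tokens."""
    return 6.0 * num_params * num_tokens


def petaflop_s_days(total_flops: float) -> float:
    """Convert FLOPs to petaflop/s-days (1e15 FLOP/s sustained for one day)."""
    return total_flops / (1e15 * SECONDS_PER_DAY)


def gpu_years(total_flops: float, sustained_flops_per_gpu: float) -> float:
    """Years a single GPU would need at the given sustained throughput."""
    return total_flops / sustained_flops_per_gpu / SECONDS_PER_YEAR


total = training_flops(175e9, 300e9)
print(f"{total:.2e} FLOPs")
print(f"{petaflop_s_days(total):,.0f} petaflop/s-days")
# Hypothetical sustained throughput of 30 teraFLOP/s per GPU.
print(f"{gpu_years(total, 30e12):,.0f} GPU-years at 30 TFLOP/s per GPU")
```

Under these assumptions the estimate lands close to the roughly 3,640 petaflop/s-days reported in the paper. The GPU-year figure depends heavily on the assumed hardware and utilization, so it should be read as an order of magnitude only.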

This means:

  • Only well-funded labs (such as OpenAI, DeepMind, Google, and Meta) can train a base model at this scale.
  • Most academic labs and startups cannot afford to train a competing model from scratch.
  • The cost of pre-training creates a barrier to entry.

Impact: GPT-3 was released through a paid API rather than as open-source weights, which limited outside experimentation.

Limitation 2: Hallucination and Plausible Falsehoods

GPT-3 generates fluent, grammatical text, but it sometimes produces statements that sound correct and are factually wrong. This is called hallucination.

Example:

Prompt: "Name a famous Indian mathematician born in 1900."

GPT-3 might output: "Ramanujan Srinivasan, born in 1900, 
famous for work in number theory."

Reality: Srinivasa Ramanujan was born in 1887, not 1900.

Why does this happen?

  • The model learns patterns from text, not ground truth.
  • It has no access to external knowledge or a factual database.
  • When generating, it predicts the most likely next token, not the most truthful one.
  • It has memorized facts from training data, but sometimes conflates them or makes errors.

Impact: GPT-3 cannot be trusted for factual claims without verification. This limits its use in high-stakes fields such as medicine, law, and finance.
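One way to picture "verification" is a check of the model's claim against a trusted reference. The sketch below is a deliberately tiny illustration: the lookup table `KNOWN_BIRTH_YEARS` and the function `check_birth_year` are hypothetical, and a real system would need a curated knowledge source and far more robust claim extraction.

```python
import re

# Hypothetical trusted reference; in practice this would be a curated database.
KNOWN_BIRTH_YEARS = {"Srinivasa Ramanujan": 1887}


def check_birth_year(model_output: str, person: str) -> str:
    """Compare the first four-digit year in the output with the reference."""
    expected = KNOWN_BIRTH_YEARS.get(person)
    match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", model_output)
    if expected is None or match is None:
        return "unverifiable"
    return "supported" if int(match.group(1)) == expected else "contradicted"


output = "Ramanujan Srinivasan, born in 1900, famous for work in number theory."
print(check_birth_year(output, "Srinivasa Ramanujan"))
```

The check can only flag claims that the reference happens to cover; everything else comes back as "unverifiable", which is why human review remains necessary in sensitive domains.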

Limitation 3: Sensitive to Prompt Format and Phrasing

The same task, written slightly differently, can produce very different results.

Example: Translation Task

Prompt A (works well):

Translate English to Hindi.

English: "Hello, how are you?"
Hindi: "Namaste, aap kaise hain?"

English: "I love cats."
Hindi:

Output: “Mujhe billiyan pasand hain.” (Correct)

Prompt B (breaks):

English to Hindi translation:
"Hello" = ?
"I love cats" = ?

Output: Might produce nonsense or code-like output.

The model is sensitive to:

  • Example format (whether examples use brackets, bullets, line breaks)
  • Number of examples (2 examples vs. 5 examples)
  • Wording of instructions (“Translate” vs. “Convert”)

This sensitivity makes prompt engineering a skill in its own right. Even a very capable model can fail if the prompt is poorly written.

Impact: Using GPT-3 well requires trial-and-error. There’s no guarantee a prompt that works for one task will work for another.
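Because small format changes matter, it can help to generate prompts from one fixed template instead of writing them by hand. The sketch below reproduces the layout of Prompt A; the function name `build_few_shot_prompt` and its parameters are assumptions made for illustration, and a consistent template reduces accidental variation without guaranteeing good output.

```python
def build_few_shot_prompt(instruction, examples, query,
                          source_label="English", target_label="Hindi"):
    """Render an instruction, worked examples, and a final query in one fixed layout."""
    blocks = [instruction]
    for source, target in examples:
        blocks.append(f'{source_label}: "{source}"\n{target_label}: "{target}"')
    # Leave the final target empty so the model completes it.
    blocks.append(f'{source_label}: "{query}"\n{target_label}:')
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(
    "Translate English to Hindi.",
    [("Hello, how are you?", "Namaste, aap kaise hain?")],
    "I love cats.",
)
print(prompt)
```

Keeping the template in code also makes it easy to vary one factor at a time (number of examples, wording of the instruction) and compare results.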

Limitation 4: Cannot Learn from Feedback Within a Session

A fine-tuned model can learn from its mistakes: show it the error on a task, update the weights, and it improves. GPT-3, used through prompting alone, cannot, because its weights never change.

Example: Correction task

Prompt: "What is 2+2?"
GPT-3 output: "5"

User: "That's wrong. 2+2=4. Now, what is 3+3?"
GPT-3 output: "7" (Still wrong, didn't learn from the correction)

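GPT-3’s weights are fixed at inference time, and each request is independent. A correction can only influence a later answer if it is included in the prompt again, and even then the base model was not trained to follow corrections reliably. Nothing is learned permanently.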

Impact: Conversations with GPT-3 can feel repetitive or stuck if the model makes an error. Users must re-explain the task every time.

(This was later improved in InstructGPT and ChatGPT, which were fine-tuned with human feedback.)
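Since the weights do not change, the only place a correction can live is in the prompt itself. The sketch below shows a caller resending the whole exchange with every request; the `Transcript` class is hypothetical, no model is called, and nothing here guarantees that the base model will actually use the correction.

```python
class Transcript:
    """Keeps the turns of one session so they can be resent with every request."""

    def __init__(self):
        self.turns = []

    def add(self, speaker, text):
        self.turns.append((speaker, text))

    def render(self, next_speaker="GPT-3"):
        """Build the full prompt: all earlier turns plus an open slot for the reply."""
        lines = [f"{speaker}: {text}" for speaker, text in self.turns]
        lines.append(f"{next_speaker}:")
        return "\n".join(lines)


session = Transcript()
session.add("User", "What is 2+2?")
session.add("GPT-3", "5")
session.add("User", "That's wrong. 2+2=4. Now, what is 3+3?")
prompt = session.render()
print(prompt)
```

Every resent turn consumes tokens, so this approach trades context-window space for continuity.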

Limitation 5: Limited Context Window

GPT-3 can attend to at most 2,048 tokens (about 1,500 words) at once, and that budget covers both the prompt and the generated output. If your document is longer, you must truncate or split it.

Example:

  • Book chapter (5,000 words) → GPT-3 only sees about the last 1,500 words
  • Email thread (50 emails) → Only the most recent emails are visible
  • Code file (10,000 lines) → Only the small fraction of the file that fits in 2,048 tokens is attended to

Impact: GPT-3 cannot reason over long documents or maintain context in very long conversations.

(Later models like GPT-4 increased this to 8,000 or 32,000+ tokens.)
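Truncation to the context window can be sketched as keeping only the most recent tokens that fit. In the example below, whitespace splitting stands in for GPT-3's real byte-pair-encoding tokenizer, and `truncate_to_window` and `reserve_for_output` are hypothetical names; real token counts will differ from word counts.

```python
GPT3_CONTEXT_TOKENS = 2048  # context window reported in the GPT-3 paper


def truncate_to_window(tokens, max_tokens=GPT3_CONTEXT_TOKENS, reserve_for_output=0):
    """Keep only the most recent tokens that fit, leaving room for the completion."""
    budget = max_tokens - reserve_for_output
    if budget <= 0:
        raise ValueError("reserve_for_output must be smaller than max_tokens")
    return tokens[-budget:]


# Stand-in tokenizer: whitespace splitting (an assumption, not GPT-3's tokenizer).
document = " ".join(f"word{i}" for i in range(5000))
tokens = document.split()
kept = truncate_to_window(tokens, reserve_for_output=48)
print(len(tokens), "->", len(kept), "tokens; first kept token:", kept[0])
```

Everything before the first kept token is simply invisible to the model, which is why long inputs are often summarized or split into chunks instead.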

Limitation 6: Struggles with Multi-Step Reasoning

GPT-3 can do single-step reasoning and retrieve facts, but multi-step logic is harder.

Example: Logic Chain

Prompt:
"All cats are animals.
Fluffy is a cat.
Therefore, is Fluffy an animal?"

GPT-3: Yes (Correct, but sometimes lucky)

Prompt (harder):
"All cats are animals.
All animals have cells.
All cells have nuclei.
Fluffy is a cat.
Therefore, does Fluffy have nuclei?"

GPT-3: Sometimes says "no" or generates confused output.

Why? The model must chain multiple logical steps. Transformers are good at pattern-matching, but pure logical reasoning (especially over many steps) is not their strength.

Impact: GPT-3 cannot reliably solve math word problems, prove theorems, or reason through complex narratives.

(Chain-of-Thought prompting and fine-tuning later improved this.)
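To see why the harder prompt needs several steps, the chain can be written out explicitly. The sketch below is a tiny symbolic forward-chaining routine over the premises of the example; it illustrates the inference being asked for and is not a description of how GPT-3 computes its answer. The names `RULES` and `derive_chain` are made up for this illustration.

```python
# Each rule reads "everything that is/has KEY also is/has VALUE".
RULES = {
    "cat": "animal",
    "animal": "cells",
    "cells": "nuclei",
}


def derive_chain(start, rules):
    """Follow rules from `start` until none applies; return every step."""
    chain = [start]
    seen = {start}
    while chain[-1] in rules:
        nxt = rules[chain[-1]]
        if nxt in seen:  # guard against circular rules
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain


chain = derive_chain("cat", RULES)
print(" -> ".join(chain))
print("Fluffy has nuclei:", "nuclei" in chain)
```

The answer requires three rule applications in sequence, and a single wrong intermediate step breaks the conclusion. Writing the intermediate steps out is also the intuition behind chain-of-thought prompting.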

Limitation 7: Lacks Persistent Memory

Each conversation with GPT-3 starts fresh. It cannot remember what you said in previous conversations.

Example:

Session 1:
User: "My name is Arun."
GPT-3: "Nice to meet you, Arun."

Session 2 (one hour later):
User: "What's my name?"
GPT-3: "I don't know your name. What is it?"

GPT-3 has no persistent memory across sessions.

Impact: Every conversation requires re-introducing context. Personalized applications (personal assistants, therapists, tutors) are harder to build.
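Applications usually work around this by storing facts outside the model and putting them back into the prompt at the start of each session. The sketch below uses a small JSON file as the store; the file location, the function names, and the prompt wording are all assumptions for illustration, and no model is called.

```python
import json
import os
import tempfile


def load_memory(path):
    """Return saved facts, or an empty dict if nothing was stored yet."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def remember(path, key, value):
    """Add or update one fact and write the store back to disk."""
    memory = load_memory(path)
    memory[key] = value
    with open(path, "w", encoding="utf-8") as f:
        json.dump(memory, f)


def prompt_with_memory(path, user_message):
    """Prepend stored facts so a fresh session can see them."""
    facts = [f"{k}: {v}" for k, v in sorted(load_memory(path).items())]
    header = "Known facts about the user:\n" + "\n".join(facts) + "\n\n" if facts else ""
    return f"{header}User: {user_message}\nGPT-3:"


memory_path = os.path.join(tempfile.mkdtemp(), "memory.json")  # hypothetical location
remember(memory_path, "name", "Arun")                          # session 1
prompt = prompt_with_memory(memory_path, "What's my name?")    # session 2
print(prompt)
```

The memory lives entirely in the application, and every stored fact that is resent counts against the context window.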

Limitation 8: Computational Cost at Inference

Running GPT-3 at scale requires significant inference compute. The API charges per token (for example, at an illustrative rate of $0.002 per 1,000 tokens; actual prices varied by model and over time), which adds up for high-volume applications.

Example:

Sentiment classification, assuming about 1,000 tokens per review (prompt plus output):
1,000 reviews × $0.002 = $2
But 1,000,000 reviews × $0.002 = $2,000

Smaller fine-tuned models can be deployed locally at lower cost, but the GPT-3 API requires an internet connection and per-token fees.

Impact: High-volume applications prefer smaller, locally-deployed models.
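The cost arithmetic can be made explicit with a short calculation. The sketch below uses the illustrative rate of $0.002 per 1,000 tokens from the example and assumes about 1,000 tokens per review (prompt plus completion), which is the assumption under which 1 million reviews cost $2,000. The function name `api_cost` is hypothetical.

```python
def api_cost(num_requests, tokens_per_request, price_per_1k_tokens):
    """Total cost in USD for a batch of requests billed per 1,000 tokens."""
    total_tokens = num_requests * tokens_per_request
    return total_tokens / 1000 * price_per_1k_tokens


PRICE = 0.002             # USD per 1,000 tokens, the illustrative rate from the example
TOKENS_PER_REVIEW = 1000  # assumption: prompt plus completion per review

for n in (1_000, 1_000_000):
    print(f"{n:>9,} reviews: ${api_cost(n, TOKENS_PER_REVIEW, PRICE):,.2f}")
```

Cost grows linearly with both request volume and tokens per request, so shortening prompts (for example, using fewer in-context examples) reduces the bill directly.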


Key Takeaways from This Section

  • Training cost limits who can build models at this scale.
  • Hallucination means GPT-3 cannot be trusted for facts without verification.
  • Prompt sensitivity requires skill and trial-and-error.
  • No session learning means the model can’t adapt within a conversation.
  • Limited context (2,048 tokens) restricts the length of documents.
  • Weak reasoning on multi-step logic problems.
  • No memory across sessions.
  • High inference cost makes large-scale deployment expensive.

These limitations motivated follow-up work: InstructGPT, ChatGPT (with fine-tuning), longer context windows, and newer architectures.

Next: Section 08: Impact