Decoder-only Transformer

Appears in 2 papers

A Transformer that uses only the decoder stack (masked self-attention + feed-forward), without the encoder-decoder cross-attention.

As used in Paper 10 — Improving Language Understanding by Generative Pre-Training →

A Transformer that uses only the decoder stack (masked self-attention + feed-forward), without the encoder-decoder cross-attention. GPT-1, GPT-2, GPT-3, LLaMA, and Claude all use decoder-only architectures.

As used in Paper 12 — Language Models are Few-Shot Learners →

A Transformer architecture that uses only the decoder stack (causal self-attention plus feed-forward layers). Each token can attend to itself and to all previous tokens, but not to future tokens. This makes the architecture a natural fit for autoregressive generation tasks such as language modeling. Contrast with BERT, which is encoder-only and uses bidirectional attention.
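
The attention restriction described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation from either paper: it assumes a single attention head, omits the learned query/key/value projections, and feeds the same toy vectors in as queries, keys, and values. The function names (`causal_mask`, `masked_softmax`, `causal_self_attention`) are hypothetical and chosen only for this example.

```python
import math


def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def masked_softmax(scores, mask):
    """Row-wise softmax that gives exactly zero weight to masked-out positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        visible = [s for s, ok in zip(row, allowed) if ok]
        peak = max(visible)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of equal-length vectors, one per token position.
    Learned projections are omitted (an assumption made for brevity).
    """
    d = len(q[0])
    scores = [
        [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        for qi in q
    ]
    weights = masked_softmax(scores, causal_mask(len(q)))
    output = [
        [sum(w * vj[c] for w, vj in zip(w_row, v)) for c in range(len(v[0]))]
        for w_row in weights
    ]
    return output, weights


# Toy input: three token vectors of dimension 2 (illustrative values only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = causal_self_attention(x, x, x)

for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

In the printed weight matrix, every entry above the diagonal is zero, so the first token attends only to itself and each later token attends to itself and its predecessors. An encoder-only model such as BERT would drop the mask and let every row cover all positions.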