Decoder-only Transformer
A Transformer that uses only the decoder stack (masked self-attention + feed-forward sublayers). With no encoder present, the encoder-decoder cross-attention sublayer is dropped. GPT-1, GPT-2, GPT-3, LLaMA, and Claude all use decoder-only architectures.
A Transformer architecture that uses only the decoder stack (causal self-attention layers). Each token can attend to itself and to all previous tokens, but not to future tokens. This makes the model autoregressive, so it is used for generation tasks such as language modeling. Contrast with BERT, which is encoder-only and uses bidirectional attention.
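The causal constraint described above can be illustrated with a small sketch in plain Python. It is not taken from any particular model's implementation; the function names `causal_mask` and `causal_attention_weights` are hypothetical, and the raw scores are made-up placeholder values standing in for the query-key dot products a real model would compute.

```python
import math

def causal_mask(n):
    """Return an n x n matrix; entry [i][j] is True if position i may attend to position j."""
    # Position i may see itself and everything before it (j <= i).
    return [[j <= i for j in range(n)] for i in range(n)]

def causal_attention_weights(scores):
    """Mask a square matrix of raw attention scores causally, then softmax each row."""
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        # Future positions get -inf so that they receive exactly zero weight.
        masked = [scores[i][j] if mask[i][j] else -math.inf for j in range(n)]
        row_max = max(masked)  # always finite: the diagonal entry is never masked
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Placeholder scores (all zero), so each row spreads weight evenly over visible tokens.
scores = [[0.0] * 3 for _ in range(3)]
weights = causal_attention_weights(scores)
for row in weights:
    print(row)
```

With uniform scores, the first token puts all of its weight on itself, the second splits its weight between the first two positions, and only the last token sees the full sequence. An encoder-only model such as BERT would skip the mask, so every row would cover all positions.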