Weight tying

Appears in 1 paper

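Using the same weight matrix W for both the token input embedding and the output projection: the embedding computes UW from the input tokens U, and the output layer multiplies the final hidden state by Wᵀ.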

As used in Paper 10 — Improving Language Understanding by Generative Pre-Training →

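GPT-1 uses a single token embedding matrix W_e on both sides of the model. On the input side, the context tokens U are embedded as h_0 = U W_e + W_p, where W_p is the position embedding matrix. On the output side, the final hidden state is projected back onto the vocabulary with the transpose, P(u) = softmax(h_n W_eᵀ). Sharing the matrix removes a separate vocabulary-by-hidden-size output matrix from the parameter count and often improves performance.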
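The sharing can be sketched in a few lines of NumPy. This is a minimal illustration, not GPT-1's actual code: the toy sizes, the random initialization, and the helper names `embed`, `logits`, and `softmax` are all assumptions made for the example, and the position embeddings and transformer blocks are left out so that only the tied matrix is visible.

```python
import numpy as np

VOCAB_SIZE, D_MODEL = 10, 4  # hypothetical toy sizes, chosen for illustration

rng = np.random.default_rng(0)
# The single shared matrix: one row per vocabulary entry.
W_e = rng.normal(scale=0.02, size=(VOCAB_SIZE, D_MODEL))


def embed(token_ids):
    """Input side: row lookup in W_e, equivalent to one-hot(U) @ W_e."""
    return W_e[token_ids]


def logits(hidden):
    """Output side: project hidden states onto the vocabulary with W_e transposed."""
    return hidden @ W_e.T


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


token_ids = np.array([3, 1, 7])
h0 = embed(token_ids)        # shape (3, D_MODEL); position embeddings omitted
probs = softmax(logits(h0))  # shape (3, VOCAB_SIZE); transformer blocks omitted

tied_params = W_e.size        # one vocabulary-by-hidden matrix
untied_params = 2 * W_e.size  # what separate input and output matrices would cost
```

Because `embed` and `logits` both read the same array, any update to `W_e` changes the input and output sides together, and the model stores one vocabulary-sized matrix instead of two.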

