Weight tying
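Using the same weight matrix W for both the token input embedding and the output projection: input tokens U are embedded as UW, and the final hidden states are mapped back to vocabulary logits by multiplying with Wᵀ. Because no separate output matrix is learned, this removes one vocabulary-size × hidden-size block of parameters, and it often improves performance. Used in GPT-1.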
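The sharing can be illustrated with a minimal, dependency-free sketch. The class name TiedEmbedding, its methods, and the sizes used are hypothetical and chosen only for illustration. A real model would put transformer layers between the embedding and the output projection.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


class TiedEmbedding:
    """Toy model holding one shared matrix W of shape (vocab_size, d_model)."""

    def __init__(self, vocab_size, d_model, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                  for _ in range(vocab_size)]

    def embed(self, token_ids):
        # Input side: a row lookup, equivalent to one_hot(token_ids) @ W.
        return [list(self.W[t]) for t in token_ids]

    def logits(self, hidden):
        # Output side: hidden @ W^T, reusing the same W (no second matrix).
        return matmul(hidden, transpose(self.W))

    def num_parameters(self):
        return len(self.W) * len(self.W[0])


model = TiedEmbedding(vocab_size=10, d_model=4)
h = model.embed([3, 7])                  # 2 tokens, each a 4-dim vector
scores = model.logits(h)                 # 2 rows of 10 vocabulary logits
tied_params = model.num_parameters()     # one 10 x 4 matrix
untied_params = 2 * tied_params          # separate input and output matrices
```

With untied weights the model would store two matrices of this shape, so tying halves the embedding-related parameter count in this toy setup.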