Fine-tuning
Adapting a pre-trained model to a specific task by continuing training on labelled task data with a small learning rate. Full fine-tuning updates all model weights, not just the classification head.
The second phase of BERT's training. The pre-trained model is loaded and trained further on a small labelled dataset for a specific task (e.g. sentiment analysis). All parameters, including the Transformer encoder, are updated, but the learning rate is kept very small (2e-5 to 5e-5) to avoid catastrophic forgetting of what was learned during pre-training.
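The procedure above (start from pre-trained weights, then take small gradient steps on every parameter) can be sketched with a toy model. This is a minimal illustration, not BERT: the two-parameter "encoder", the starting weights in `PRETRAINED`, the dataset `TASK_DATA`, and the function names are all hypothetical stand-ins. Only the learning rate of 2e-5 is taken from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Toy model: a one-unit 'encoder' followed by a one-unit classification head."""
    h = math.tanh(params["enc_w"] * x + params["enc_b"])   # encoder output
    p = sigmoid(params["head_w"] * h + params["head_b"])   # P(label = 1)
    return h, p

def mean_loss(params, data):
    """Mean binary cross-entropy over (x, label) pairs."""
    total = 0.0
    for x, y in data:
        _, p = forward(params, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def fine_tune(pretrained, data, lr=2e-5, steps=200):
    """Full-batch gradient descent over ALL parameters, encoder included."""
    params = dict(pretrained)  # start from the pre-trained weights, not from scratch
    for _ in range(steps):
        grads = {name: 0.0 for name in params}
        for x, y in data:
            h, p = forward(params, x)
            d_logit = (p - y) / len(data)                        # dLoss / d(head logit)
            d_pre = d_logit * params["head_w"] * (1.0 - h * h)   # back through tanh
            grads["head_w"] += d_logit * h
            grads["head_b"] += d_logit
            grads["enc_w"] += d_pre * x   # the encoder receives gradients too
            grads["enc_b"] += d_pre
        for name in params:
            params[name] -= lr * grads[name]
    return params

# Hypothetical "pre-trained" encoder weights plus a freshly initialised head.
PRETRAINED = {"enc_w": 0.8, "enc_b": -0.1, "head_w": 0.1, "head_b": 0.0}
# Tiny labelled task dataset: (feature, label).
TASK_DATA = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

tuned = fine_tune(PRETRAINED, TASK_DATA)
print("loss before:", mean_loss(PRETRAINED, TASK_DATA))
print("loss after: ", mean_loss(tuned, TASK_DATA))
print("encoder shift:", tuned["enc_w"] - PRETRAINED["enc_w"])
```

With such a small learning rate the encoder weights move only slightly from their pre-trained values, which is the intent: the task loss goes down while the pre-trained representation is largely preserved.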
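Training a pre-trained model on labeled data specific to a particular task. In traditional NLP (before GPT-3), each new task typically required its own fine-tuned model. GPT-3 showed that many tasks can instead be handled with prompt engineering and in-context learning, where the examples are placed in the prompt and no weights are updated.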
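To make the contrast concrete: in-context learning puts the labelled examples into the prompt text instead of using them for gradient updates. The sketch below only builds such a few-shot prompt as a string; the function name `build_few_shot_prompt`, the "Review:/Sentiment:" layout, and the example reviews are illustrative assumptions, and no model is called.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, labelled examples, then the new input."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final item is left unlabelled so the model completes it.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)

# Hypothetical labelled examples; with fine-tuning these would be training data.
examples = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
]

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    examples,
    "A forgettable film with one good scene.",
)
print(prompt)
```

The same labelled pairs serve a different role in each approach: fine-tuning consumes them as training data and changes the weights, while the prompt above leaves the model untouched.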
Training a pre-trained model on smaller amounts of labeled data specific to a downstream task. The scaling laws focus on pre-training; fine-tuning has a different compute profile, typically needing far less data and compute than the pre-training run.