Supervised Fine-Tuning (SFT)

Appears in 2 papers

Training a pretrained model on curated example outputs using standard cross-entropy loss; often the first stage of RLHF.

As used in Paper 15 — Training Language Models to Follow Instructions with Human Feedback →

The first stage of RLHF. Fine-tune a pretrained model (e.g., GPT-3) on human-written demonstrations of good behavior using standard next-token cross-entropy loss. Result: a model that follows instructions better than the base model but has not yet been optimized for human preferences.
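The cross-entropy objective mentioned above can be illustrated with a minimal, dependency-free sketch. The function names `log_softmax` and `sft_loss`, the toy logits, and the choice to mask out prompt tokens so that only response tokens contribute to the loss are assumptions made for illustration; they are not taken from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sft_loss(logits_per_position, target_ids, loss_mask):
    """Mean negative log-likelihood of the target tokens.

    loss_mask[i] == 1 means position i counts toward the loss.
    Masking prompt positions is an assumption of this sketch.
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_position, target_ids, loss_mask):
        if keep:
            total -= log_softmax(logits)[target]
            count += 1
    return total / count if count else 0.0

# Toy example with a vocabulary of 3 tokens and a sequence of 3 positions.
toy_logits = [
    [0.0, 0.0, 0.0],  # prompt position (masked out)
    [2.0, 0.0, 0.0],  # model favors token 0, target is token 0
    [0.0, 3.0, 0.0],  # model favors token 1, target is token 1
]
toy_targets = [1, 0, 1]
toy_mask = [0, 1, 1]

loss = sft_loss(toy_logits, toy_targets, toy_mask)
print(f"SFT loss on toy example: {loss:.4f}")
```

In practice a deep learning framework computes this loss over batches and backpropagates through the model; the sketch only shows what quantity is being minimized.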

As used in Paper 24 — rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking →

Training a model on high-quality examples using standard cross-entropy loss, so that it learns to generate outputs similar to those examples. In rStar-Math, the training examples are solutions generated by Monte Carlo Tree Search (MCTS).
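As a rough sketch of how search-generated solutions might be turned into SFT training pairs, the snippet below keeps only correct candidates and picks the best-scored ones per problem. The `Candidate` structure, the `score` field, the `top_k` parameter, and the selection rule are all hypothetical; this is not rStar-Math's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    problem: str
    solution: str
    score: float       # hypothetical quality score assigned during search
    is_correct: bool   # whether the final answer matches the reference

def build_sft_pairs(candidates, top_k=1):
    """Return (problem, solution) pairs: correct candidates only,
    keeping the top_k highest-scored solutions for each problem."""
    by_problem = {}
    for c in candidates:
        if c.is_correct:
            by_problem.setdefault(c.problem, []).append(c)
    pairs = []
    for problem, group in by_problem.items():
        group.sort(key=lambda c: c.score, reverse=True)
        pairs.extend((problem, c.solution) for c in group[:top_k])
    return pairs

# Toy, made-up candidates purely for illustration.
toy_candidates = [
    Candidate("2 + 2 = ?", "2 + 2 = 4", score=0.9, is_correct=True),
    Candidate("2 + 2 = ?", "2 + 2 = 5", score=0.95, is_correct=False),
    Candidate("2 + 2 = ?", "Add 2 and 2 to get 4", score=0.7, is_correct=True),
]
sft_pairs = build_sft_pairs(toy_candidates, top_k=1)
print(sft_pairs)
```

The resulting pairs would then be used as ordinary SFT examples, trained with the same cross-entropy loss as any other demonstration data.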