GSM8K
A benchmark dataset of about 8,500 grade-school math word problems, ranging from simple arithmetic to multi-step reasoning. Each problem requires 2 to 8 reasoning steps, and some include irrelevant distractors. GSM8K was a key evaluation benchmark in the chain-of-thought (CoT) prompting paper and has since become a standard benchmark for measuring the reasoning capabilities of large language models.
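
To make concrete what evaluating on GSM8K usually involves, here is a minimal sketch of exact-match scoring. It assumes the convention used in the publicly released data, where each reference solution ends with a line of the form "#### <number>". The function names and the sample problem are hypothetical, written for illustration only, and are not part of any official evaluation script.

```python
import re
from typing import List, Optional

# Assumption: each reference solution ends with "#### <final answer>".
# Thousands separators such as "1,250" are tolerated and stripped.
ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(solution: str) -> Optional[str]:
    """Return the number after the '####' marker, or None if it is absent."""
    match = ANSWER_RE.search(solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")


def exact_match_accuracy(predicted: List[Optional[str]],
                         reference_solutions: List[str]) -> float:
    """Fraction of problems whose predicted number equals the reference."""
    if not reference_solutions:
        raise ValueError("need at least one reference solution")
    correct = 0
    for pred, ref in zip(predicted, reference_solutions):
        gold = extract_final_answer(ref)
        if pred is None or gold is None:
            continue  # counted as wrong
        try:
            if float(pred) == float(gold):
                correct += 1
        except ValueError:
            pass  # non-numeric prediction, counted as wrong
    return correct / len(reference_solutions)


# Hypothetical two-step problem in the GSM8K style (not taken from the dataset):
# "Sam buys 3 boxes of 12 pencils and gives away 6. How many are left?"
sample_solution = "3 * 12 = 36 pencils.\n36 - 6 = 30 pencils.\n#### 30"
sample_answer = extract_final_answer(sample_solution)
```

Comparing only the final number is a deliberate simplification: it ignores whether the intermediate reasoning steps are valid, so a model can be scored correct after flawed reasoning.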