Benchmark
Standardised tests for evaluating language model quality (e.g., MMLU for broad knowledge and reasoning, GSM8K for grade-school math, HumanEval for coding). Mistral 7B was reported to outperform the larger Llama 2 13B on most benchmarks, surprising many who assumed that bigger always means better. Benchmarks don't capture every aspect of model quality, but they provide objective comparison points.
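As a rough illustration of how a benchmark turns model answers into a comparable number, here is a minimal sketch of exact-match accuracy scoring. The function name `exact_match_accuracy` and the toy data are hypothetical, and real benchmarks such as MMLU, GSM8K, or HumanEval each use their own prompts, answer extraction, and metrics.

```python
def exact_match_accuracy(predictions, references):
    """Return the fraction of predictions equal to the reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty benchmark")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical toy data, not drawn from any real benchmark.
references = ["4", "9", "12", "7"]
model_a = ["4", "9", "11", "7"]   # 3 of 4 answers match
model_b = ["4", "8", "11", "7"]   # 2 of 4 answers match

score_a = exact_match_accuracy(model_a, references)
score_b = exact_match_accuracy(model_b, references)
print(f"model_a: {score_a:.2f}, model_b: {score_b:.2f}")
```

Because both models are scored on the same questions with the same rule, the two numbers can be compared directly, which is what makes a benchmark a shared reference point.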