Training Data
The text corpus used to train a language model.
The text corpus used to train a language model. Mistral 7B was trained on 15 trillion tokens (a mix of open-source data and proprietary sources). More diverse, high-quality training data generally improves model quality. Mistral's training data details were not fully disclosed.