AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts

Yifan Zhang, Yifan Luo, Yang Yuan, Andrew Chi-Chih Yao

arXiv.org Artificial Intelligence, Feb-12-2024

To improve language models' proficiency in mathematical reasoning via continual pretraining, we introduce a strategy that uses base language models for autonomous data selection. Rather than relying on supervised fine-tuning or on classifiers trained with human-annotated data, our approach uses meta-prompted language models as zero-shot verifiers to evaluate and select high-quality mathematical content autonomously. We release the curated, open-source AutoMathText dataset, which encompasses over 200GB of data. To demonstrate the efficacy of our method, we continually pretrained a 7B-parameter Mistral language model on the AutoMathText dataset, achieving substantial improvements in downstream performance on the MATH dataset while using orders of magnitude fewer tokens than previous continual pretraining works. Our method shows a twofold increase in pretraining token efficiency compared to baselines, underscoring its potential for enhancing models' mathematical reasoning capabilities. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.
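The abstract does not spell out how a zero-shot verifier turns a model's judgment into a selection decision. As a minimal sketch of one plausible reading, the Python below assumes the model is prompted with a yes/no question about a document, that the logits of the "YES" and "NO" answer tokens are read off, and that a two-way softmax over them gives a score in [0, 1] used for thresholding. The function names, the `toy_logits` stand-in, and the threshold value are all hypothetical and are not taken from the released code.

```python
import math
from typing import Callable, Iterable, List, Tuple


def yes_no_score(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax: share of probability mass on the 'YES' answer."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def select_documents(
    docs: Iterable[str],
    get_logits: Callable[[str], Tuple[float, float]],
    threshold: float = 0.6,  # assumed cutoff, chosen only for illustration
) -> List[Tuple[str, float]]:
    """Keep documents whose verifier score reaches the threshold."""
    kept = []
    for doc in docs:
        logit_yes, logit_no = get_logits(doc)
        score = yes_no_score(logit_yes, logit_no)
        if score >= threshold:
            kept.append((doc, score))
    return kept


def toy_logits(doc: str) -> Tuple[float, float]:
    """Hypothetical stand-in for a language model call.

    A real verifier would prompt a base model and read the logits of the
    'YES' and 'NO' tokens; here we just count math-like markers.
    """
    text = doc.lower()
    hits = sum(text.count(tok) for tok in ("=", "theorem", "proof"))
    return float(hits), 1.0
```

In practice `toy_logits` would be replaced by a call into an actual model; because the score is continuous, the threshold can be tuned to trade dataset size against quality.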

