FIRE: Flexible Integration of Data Quality Ratings for Effective Pre-Training

Xu, Liangyu, Zhang, Xuemiao, Duan, Feiyu, Wang, Sirui, Wang, Jingang, Cai, Xunliang

Feb-17-2025–arXiv.org Artificial Intelligence

Selecting high-quality data can significantly improve the pretraining efficiency of large language models (LLMs). Existing methods generally rely on heuristic techniques and single-quality signals, limiting their ability to evaluate data quality comprehensively. In this work, we propose FIRE, a flexible and scalable framework for integrating multiple data quality raters, which allows for a comprehensive assessment of data quality across various dimensions. FIRE aligns multiple quality signals into a unified space, and integrates diverse data quality raters to provide a comprehensive quality signal for each data point. Further, we introduce a progressive data selection scheme based on FIRE that iteratively refines the selection of high-quality data points. Experiments on the SlimPajama dataset reveal that FIRE outperforms other data selection methods and significantly enhances the pretrained model across a wide range of downstream tasks, with a 2.9% average performance improvement over Random and reducing the FLOPs necessary to achieve a certain performance level by more than half.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

Feb-17-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States (0.92)

Genre:
- Research Report > New Finding (0.67)

Industry:
- Energy > Oil & Gas (0.93)

Technology:
- Information Technology
  - Data Science > Data Quality (1.00)
  - Artificial Intelligence
    - Representation & Reasoning (1.00)
    - Machine Learning (1.00)
    - Natural Language > Large Language Model (0.89)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found