Improving Data Efficiency via Curating LLM-Driven Rating Systems
Jinlong Pang, Jiaheng Wei, Ankit Parag Shah, Zhaowei Zhu, Yaxuan Wang, Chen Qian, Yang Liu, Yujia Bao, Wei Wei
arXiv.org Artificial Intelligence
Instruction tuning is critical for adapting large language models (LLMs) to downstream tasks, and recent studies have demonstrated that small amounts of human-curated data can outperform larger datasets, challenging traditional data scaling laws. While LLM-based data quality rating systems offer a cost-effective alternative to human annotation, they often suffer from inaccuracies and biases, even in powerful models like GPT-4. In this work, we introduce DS2, a Diversity-aware Score curation method for Data Selection. By systematically modeling error patterns through a score transition matrix, DS2 corrects LLM-based scores and promotes diversity in the selected data samples. Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks, and matches or surpasses human-aligned datasets such as LIMA with the same sample size (1k samples). These findings challenge conventional data scaling assumptions, highlighting that redundant, low-quality samples can degrade performance and reaffirming that "more can be less."
Oct-9-2024
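
The abstract names two mechanisms: correcting noisy LLM-assigned quality scores via a score transition matrix, and promoting diversity among the selected samples. Below is a minimal Python sketch of how those two steps could be composed. Everything here is an assumption for illustration, not the authors' DS2 implementation: the transition matrix is taken as known, the correction is a simple Bayes-style posterior, the diversity step is a generic maximal-marginal-relevance-style greedy selection, and all function names and toy data are hypothetical.

```python
# Illustrative sketch (not the DS2 code): score correction with an assumed
# transition matrix, then diversity-aware greedy subset selection.
import numpy as np

def correct_scores(observed, transition, prior):
    """Expected true score per sample via Bayes:
    P(true=i | obs=j) ∝ prior[i] * T[i, j].

    observed:   (n,) int array of raw LLM scores in {0..K-1}
    transition: (K, K) row-stochastic matrix, T[i, j] = P(obs=j | true=i)
    prior:      (K,) prior over true scores
    """
    K = transition.shape[0]
    joint = prior[:, None] * transition       # (K, K): P(true=i, obs=j)
    posterior = joint / joint.sum(axis=0)     # normalize each obs column
    expected = np.arange(K) @ posterior       # (K,): E[true | obs=j]
    return expected[observed]

def select_diverse(embeddings, scores, k, lam=0.5):
    """Greedy selection trading corrected score against redundancy
    (an MMR-style stand-in for the paper's diversity step)."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    chosen, candidates = [], set(range(len(scores)))
    while len(chosen) < k and candidates:
        best, best_val = None, -np.inf
        for i in candidates:
            # Redundancy = max cosine similarity to anything already chosen.
            redundancy = max((emb[i] @ emb[j] for j in chosen), default=0.0)
            val = lam * scores[i] - (1 - lam) * redundancy
            if val > best_val:
                best, best_val = i, val
        chosen.append(best)
        candidates.remove(best)
    return chosen

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, n = 6, 500                             # scores 0..5, 500 toy samples
    # Toy transition matrix: mostly correct, small leakage to other scores.
    T = np.full((K, K), 0.02)
    np.fill_diagonal(T, 0.9)
    T /= T.sum(axis=1, keepdims=True)
    observed = rng.integers(0, K, size=n)
    corrected = correct_scores(observed, T, prior=np.full(K, 1 / K))
    emb = rng.normal(size=(n, 32))            # stand-in sample embeddings
    subset = select_diverse(emb, corrected, k=20)
    print(f"selected {len(subset)} samples, e.g. indices {subset[:5]}")
```

In practice the transition matrix would have to be estimated rather than assumed, and the embeddings would come from the instruction-tuning samples themselves; the sketch only shows how corrected scores and a redundancy penalty can jointly drive selection.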