SHED: Shapley-Based Automated Dataset Refinement for Instruction Fine-Tuning

May-27-2025, 13:24:33 GMT–Neural Information Processing Systems

Pre-trained large language models (LLMs) can be adapted to many downstream tasks and aligned with human preferences through fine-tuning. Recent studies have found that LLMs can achieve desirable performance with only a small amount of high-quality data, suggesting that a large portion of the data in extensive fine-tuning datasets is redundant or even harmful. Identifying high-quality data within vast datasets in order to curate small yet effective ones has emerged as a critical challenge. In this paper, we introduce SHED, an automated dataset refinement framework for instruction fine-tuning based on the Shapley value. SHED eliminates the need for human intervention or commercial LLMs.
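The abstract does not spell out how SHED computes or uses Shapley values, so the following is only a generic illustration of the underlying idea, not the authors' procedure. It estimates each sample's Shapley value by averaging its marginal contribution over random orderings (standard Monte Carlo permutation sampling). The function `monte_carlo_shapley`, the toy `coverage` utility, and the `SKILLS` data are all hypothetical names introduced here; in a real setting the utility would presumably be a model-quality measure obtained after fine-tuning on the chosen subset.

```python
import random
from typing import Callable, FrozenSet, List


def monte_carlo_shapley(
    n_items: int,
    utility: Callable[[FrozenSet[int]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate per-item Shapley values from random orderings (generic sketch)."""
    rng = random.Random(seed)
    values = [0.0] * n_items
    order = list(range(n_items))
    for _ in range(n_permutations):
        rng.shuffle(order)
        chosen = set()
        prev = utility(frozenset(chosen))
        for i in order:
            # Marginal gain of adding item i to the items that precede it.
            chosen.add(i)
            cur = utility(frozenset(chosen))
            values[i] += cur - prev
            prev = cur
    return [v / n_permutations for v in values]


# Hypothetical toy data: each sample "covers" a set of skills.
# Samples 0 and 1 are redundant with each other; sample 3 adds nothing.
SKILLS = [{"math"}, {"math"}, {"code"}, set()]


def coverage(subset: FrozenSet[int]) -> float:
    """Toy utility (an assumption): number of distinct skills covered."""
    covered = set()
    for i in subset:
        covered |= SKILLS[i]
    return float(len(covered))


scores = monte_carlo_shapley(len(SKILLS), coverage)
# Keep the two highest-valued samples as a small "refined" subset.
top_k = sorted(range(len(SKILLS)), key=lambda i: scores[i], reverse=True)[:2]
```

In this toy setup the redundant samples split the credit for the shared skill, while the sample that contributes nothing receives a value of zero, which is the kind of signal a refinement step could use to discard redundant or unhelpful data.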

artificial intelligence, large language model, natural language

Genre:
- Research Report (0.61)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)