Disentangling the Roles of Representation and Selection in Data Pruning
Du, Yupei, Song, Yingjin, Wong, Hugh Mee, Ignatev, Daniil, Gatt, Albert, Nguyen, Dong
arXiv.org Artificial Intelligence
Data pruning, selecting small but impactful training subsets, offers a promising way to efficiently scale NLP model training. However, existing methods often involve many different design choices, which have not been systematically studied; this limits future development. In this work, we decompose data pruning into two key components, the data representation and the selection algorithm, and we systematically analyze their influence on instance selection. Our theoretical and empirical results highlight the crucial role of representations: better representations, e.g., training gradients, generally lead to better selections, regardless of the chosen selection algorithm. Furthermore, different selection algorithms excel in different settings, and none consistently outperforms the others. Moreover, selection algorithms do not always align with their intended objectives: for example, algorithms designed for the same objective can select drastically different instances, highlighting the need for careful evaluation.
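To make the decomposition concrete, here is a minimal, hypothetical sketch of the two components: instance representations (here, toy 2-D vectors standing in for embeddings or training gradients) fed into one possible selection algorithm, greedy k-center coverage. This is an illustrative example, not the paper's implementation; the function name, data, and budget are assumptions.

```python
import numpy as np

def k_center_greedy(reps: np.ndarray, budget: int) -> list[int]:
    """Greedy k-center selection: repeatedly pick the instance farthest
    from the current selection, so the subset covers the representation space."""
    selected = [0]  # seed with an arbitrary first instance
    # distance of every instance to its nearest selected center
    dists = np.linalg.norm(reps - reps[0], axis=1)
    while len(selected) < budget:
        idx = int(np.argmax(dists))       # farthest uncovered instance
        selected.append(idx)
        new_d = np.linalg.norm(reps - reps[idx], axis=1)
        dists = np.minimum(dists, new_d)  # update nearest-center distances
    return selected

# Toy "representations": two tight clusters plus one outlier.
rng = np.random.default_rng(0)
reps = np.vstack([
    rng.normal(0.0, 0.1, (10, 2)),   # cluster A, indices 0-9
    rng.normal(5.0, 0.1, (10, 2)),   # cluster B, indices 10-19
    [[10.0, 10.0]],                  # outlier, index 20
])
subset = k_center_greedy(reps, budget=3)
```

Swapping in a different representation (e.g., gradient features instead of raw embeddings) or a different selector (e.g., a difficulty-score heuristic instead of k-center) changes `subset` independently, which is exactly the separation of concerns the analysis exploits.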
Jul-8-2025