Lightweight Dataset Pruning without Full Training via Example Difficulty and Prediction Uncertainty
Cho, Yeseul, Shin, Baekrok, Kang, Changmin, Yun, Chulhee
–arXiv.org Artificial Intelligence
Advancements in deep learning have been significantly driven by large-scale datasets. However, recent studies have revealed a power-law relationship between the generalization capacity of deep neural networks and the size of their training data (Gordon et al., 2021; Hestness et al., 2017; Rosenfeld et al., 2019), meaning that improving model performance becomes increasingly cost-inefficient as the dataset size scales up. Fortunately, Sorscher et al. (2022) demonstrate that the power-law scaling of error can be reduced to exponential scaling with Pareto-optimal data pruning. The main goal of dataset pruning is to identify and retain the most informative samples while discarding redundant data points for training neural networks. This approach can reduce storage and computational costs and improve training efficiency. However, many existing pruning methods require training a model on the full dataset for a number of epochs to measure the importance of each sample, which ironically makes the pruning process more expensive than simply training the model once on the original large dataset. For instance, several score-based methods (Gordon et al., 2021; He et al., 2024; Pleiss et al., 2020; Toneva et al., 2018; Zhang et al., 2024) require full training because they utilize dynamics from the whole training process. Some geometry-based methods (Xia et al., 2022; Yang et al., 2024) leverage features from the penultimate layer of the trained model, so training a model is still required.
(Authors contributed equally to this paper.)
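To make the cost of score-based pruning concrete, here is a minimal sketch (not this paper's method) in the style of the forgetting scores of Toneva et al. (2018): it trains on the full dataset while counting, per sample, how often a previously correct example later becomes misclassified, then keeps the highest-scoring fraction. The names model, train_loader, criterion, and keep_ratio, and the assumption that the loader yields dataset indices, are illustrative rather than taken from the paper.

```python
import torch

def count_forgetting_events(model, optimizer, criterion, train_loader,
                            num_epochs, num_samples, device="cpu"):
    """Train on the full dataset while counting per-sample forgetting events."""
    prev_correct = torch.zeros(num_samples, dtype=torch.bool)
    forgetting = torch.zeros(num_samples, dtype=torch.long)
    model.to(device)
    model.train()
    for _ in range(num_epochs):
        # Assumes the loader yields (inputs, targets, dataset_indices) per batch.
        for inputs, targets, indices in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            correct = (outputs.argmax(dim=1) == targets).cpu()
            # Forgetting event: correct at the previous visit, wrong now.
            forgetting[indices] += (prev_correct[indices] & ~correct).long()
            prev_correct[indices] = correct
    return forgetting

def prune_by_score(scores, keep_ratio=0.5):
    """Keep the indices of the highest-scoring (most informative) samples."""
    k = int(len(scores) * keep_ratio)
    return torch.topk(scores.float(), k).indices
```

The sketch illustrates the cost structure the abstract points out: such scores only exist after the model has processed the full dataset for many epochs, which is precisely the expense that lightweight pruning without full training aims to avoid.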
Feb-9-2025