Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information

Aug-11-2025–arXiv.org Artificial Intelligence

In order to increase the effectiveness of model training, data reduction is essential to data-centric Artificial Intelligence (AI). It achieves this by locating the most instructive examples in massive datasets. To increase data quality and training efficiency, the main difficulty is choosing the best examples rather than the complete datasets. In this paper, we propose an effective data reduction strategy based on Pointwise V-Information (PVI). To enable a static method, we first use PVI to quantify instance difficulty and remove instances with low difficulty. Experiments show that classifier performance is maintained with only a 0.0001% to 0.76% decline in accuracy when 10%-30% of the data is removed. Second, we train the classifiers using a progressive learning strategy on examples sorted by increasing PVI, accelerating convergence and achieving a 0.8% accuracy gain over conventional training. Our findings imply that training a classifier on the chosen optimal subset may improve model performance and increase training efficiency when combined with an efficient data reduction strategy. Furthermore, we have adapted the PVI framework, which was previously limited to English datasets, to a variety of Chinese Natural Language Processing (NLP) tasks and base models, yielding insightful results for faster training and cross-lingual data reduction.

data quality, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

Aug-11-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States (1.00)
- Europe (1.00)
- Asia (1.00)

Genre:
- Instructional Material (0.93)
- Research Report > New Finding (0.48)

Technology:
- Information Technology
  - Data Science > Data Quality (1.00)
  - Artificial Intelligence
    - Natural Language > Text Processing (0.46)
    - Machine Learning > Neural Networks
      - Deep Learning (0.46)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found