Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information
arXiv.org Artificial Intelligence
Data reduction is central to data-centric Artificial Intelligence (AI): it improves training efficiency by identifying the most informative examples in massive datasets. The core challenge is selecting those examples, rather than using the entire dataset, so that both data quality and training efficiency improve. In this paper, we propose an effective data reduction strategy based on Pointwise V-Information (PVI). First, as a static method, we use PVI to quantify instance difficulty and remove low-difficulty instances. Experiments show that classifier performance is maintained, with accuracy declining by only 0.0001% to 0.76%, when 10%-30% of the data is removed. Second, we train the classifiers with a progressive learning strategy on examples sorted by increasing PVI, which accelerates convergence and yields a 0.8% accuracy gain over conventional training. These findings suggest that, combined with an efficient data reduction strategy, training a classifier on the selected subset can improve model performance and training efficiency. Furthermore, we adapt the PVI framework, previously limited to English datasets, to a variety of Chinese Natural Language Processing (NLP) tasks and base models, yielding insights for faster training and cross-lingual data reduction.
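The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard PVI definition, PVI(x -> y) = log2 g[x](y) - log2 g'[null](y), where g is a model fine-tuned on the real inputs and g' is one fine-tuned on empty inputs, and it assumes the usual reading that higher PVI means an easier instance. The function names, the dictionary keys, and the toy probabilities are all hypothetical.

```python
import math


def pvi(p_with_input: float, p_null: float) -> float:
    """Pointwise V-information in bits, under the assumed standard definition.

    p_with_input: probability of the gold label from the model that sees the input.
    p_null: probability of the gold label from the model that sees an empty input.
    """
    return math.log2(p_with_input) - math.log2(p_null)


def reduce_and_order(examples, drop_fraction):
    """Drop the easiest fraction (highest PVI, by assumption) and return the
    remaining examples sorted by increasing PVI for progressive training."""
    scored = sorted(examples, key=lambda e: e["pvi"])
    n_drop = int(len(scored) * drop_fraction)
    return scored[: len(scored) - n_drop]


# Toy gold-label probabilities (made up for illustration only).
toy = [
    {"text": "a", "p_with_input": 0.5, "p_null": 0.5},
    {"text": "b", "p_with_input": 1.0, "p_null": 0.5},
    {"text": "c", "p_with_input": 0.25, "p_null": 0.5},
    {"text": "d", "p_with_input": 1.0, "p_null": 0.25},
]
for example in toy:
    example["pvi"] = pvi(example["p_with_input"], example["p_null"])

# Remove 25% of the data, then order what is left by increasing PVI.
kept = reduce_and_order(toy, drop_fraction=0.25)
print([(e["text"], e["pvi"]) for e in kept])
```

In practice the two probabilities would come from two fine-tuned classifiers evaluated on each training instance; the drop fraction corresponds to the 10%-30% range reported above.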
Aug-11-2025
- Country:
  - Asia
    - China
      - Hong Kong (0.04)
      - Liaoning Province > Dalian (0.04)
    - India > Maharashtra
      - Pune (0.04)
    - Middle East
  - Europe
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
    - Germany > Baden-Württemberg
      - Stuttgart Region > Stuttgart (0.04)
    - Spain
      - Catalonia > Barcelona Province
        - Barcelona (0.04)
      - Valencian Community > Valencia Province
        - Valencia (0.04)
  - North America
    - Cuba (0.04)
    - United States
      - California
        - Los Angeles County > Long Beach (0.04)
        - Santa Clara County > Santa Clara (0.04)
      - Hawaii > Honolulu County
        - Honolulu (0.04)
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
      - Maryland > Baltimore (0.04)
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
      - Tennessee > Davidson County
        - Nashville (0.04)
  - Oceania > Australia
    - New South Wales > Sydney (0.04)
- Genre:
  - Instructional Material (0.93)
  - Research Report > New Finding (0.48)
- Technology: