Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning
Chimoto, Everlyn Asiko, Gala, Jay, Ahia, Orevaoghene, Kreutzer, Julia, Bassett, Bruce A., Hooker, Sara
arXiv.org Artificial Intelligence
Neural Machine Translation models are extremely data- and compute-hungry. However, not all data points contribute equally to model training and generalization. Pruning low-value data points can drastically reduce the compute budget without a significant drop in model performance. In this paper, we propose a new data pruning technique, Checkpoints Across Time (CAT), which leverages early model training dynamics to identify the data points most relevant to model performance. We benchmark CAT against several data pruning techniques, including COMET-QE, LASER and LaBSE. We find that CAT outperforms the benchmarks on Indo-European languages on multiple test sets. When applied to English-German, English-French and English-Swahili translation tasks, CAT achieves performance comparable to using the full dataset while pruning up to 50% of the training data. We inspect the data points that CAT selects and find that it tends to favour longer sentences and sentences with unique or rare words.
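The abstract does not spell out how CAT scores individual examples, so the following is only a minimal sketch of the general idea of pruning from early training dynamics, not the authors' implementation. It assumes that a per-example perplexity has already been recorded at a few early checkpoints, and it assumes (purely for illustration) that the variance of those values is used as the score. The function name `prune_by_checkpoint_variability` and the `keep_fraction` parameter are hypothetical.

```python
from statistics import pvariance


def prune_by_checkpoint_variability(perplexities, keep_fraction=0.5):
    """Return the ids of the examples to keep.

    perplexities: dict mapping an example id to a list of perplexities,
        one per early checkpoint (assumed to be precomputed elsewhere).
    keep_fraction: share of the dataset to retain, in (0, 1].

    Assumption: examples whose perplexity changes most across early
    checkpoints are treated as the most informative. This scoring rule
    is illustrative and is not taken from the abstract.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Score each example by the population variance of its perplexities.
    scores = {ex: pvariance(vals) for ex, vals in perplexities.items()}

    # Keep at least one example, otherwise the requested fraction.
    n_keep = max(1, int(len(scores) * keep_fraction))

    # Highest-variability examples first; ties keep insertion order.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]


example = {
    "a": [10, 10, 10],  # perplexity never changes
    "b": [50, 20, 5],   # large change across checkpoints
    "c": [30, 25, 20],  # moderate change
    "d": [12, 11, 10],  # small change
}
kept = prune_by_checkpoint_variability(example, keep_fraction=0.5)
```

With `keep_fraction=0.5`, half of the toy dataset is retained, mirroring the 50% pruning level mentioned in the abstract; the scoring function can be swapped for any other per-example statistic.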
Jun-21-2024
- Country:
  - Africa > South Africa
    - Western Cape > Cape Town (0.04)
  - Asia
    - China > Hong Kong (0.04)
    - Indonesia > Bali (0.04)
    - Middle East
      - Israel (0.04)
      - Jordan (0.04)
      - UAE > Abu Dhabi Emirate
        - Abu Dhabi (0.04)
    - Philippines > Luzon
      - National Capital Region > City of Manila (0.14)
    - Singapore (0.04)
  - Europe
    - United Kingdom > Scotland
      - City of Edinburgh > Edinburgh (0.04)
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - Spain > Catalonia
      - Barcelona Province > Barcelona (0.04)
    - Portugal > Lisbon
      - Lisbon (0.04)
    - Bulgaria > Sofia City Province
      - Sofia (0.04)
    - Sweden > Uppsala County
      - Uppsala (0.04)
    - Denmark > Capital Region
      - Copenhagen (0.04)
    - Italy > Tuscany
      - Florence (0.04)
  - North America
    - Canada > Ontario
      - Toronto (0.04)
    - Dominican Republic (0.04)
    - United States
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
      - New York (0.04)
      - Pennsylvania > Philadelphia County
        - Philadelphia (0.04)
      - Washington > King County
        - Seattle (0.04)
  - Oceania > Australia
- Genre:
  - Research Report > Experimental Study (0.46)
- Industry:
  - Education > Instructional Theory (0.40)
- Technology: