REDUCR: Robust Data Downsampling Using Class Priority Reweighting
Bankes, William; Hughes, George; Bogunovic, Ilija; Wang, Zi
arXiv.org Artificial Intelligence
Modern machine learning models are becoming increasingly expensive to train for real-world image and text classification tasks, where massive web-scale data is collected in a streaming fashion. To reduce the training cost, online batch selection techniques have been developed to choose the most informative datapoints. However, these techniques can suffer from poor worst-class generalization performance due to class imbalance and distributional shifts. This work introduces REDUCR, a robust and efficient data downsampling method that uses class priority reweighting. REDUCR reduces the training data while preserving worst-class generalization performance. REDUCR assigns priority weights to datapoints in a class-aware manner using an online learning algorithm. We demonstrate the data efficiency and robust performance of REDUCR on vision and text classification tasks. On web-scraped datasets with imbalanced class distributions, REDUCR significantly improves worst-class test accuracy (and average accuracy), surpassing state-of-the-art methods by around 15%.
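The class-aware priority weighting described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it assumes a generic multiplicative-weights (Hedge-style) update as the online learning step, and the names `update_class_weights`, `select_batch`, `eta`, and the per-class score matrix are all hypothetical, introduced only for illustration.

```python
import numpy as np


def update_class_weights(weights, class_losses, eta=1.0):
    """One multiplicative-weights step (assumed update rule, not from the paper).

    Classes with a higher current loss receive a larger share of the
    priority mass, so the worst-performing class is emphasised over time.
    """
    losses = np.asarray(class_losses, dtype=float)
    scaled = np.asarray(weights, dtype=float) * np.exp(eta * losses)
    return scaled / scaled.sum()  # renormalise onto the probability simplex


def select_batch(scores, weights, k):
    """Keep the k candidate points with the highest class-weighted score.

    scores[i, c] is a hypothetical estimate of how useful point i is for
    class c; the class priority weights combine these into one score.
    """
    weighted = np.asarray(scores, dtype=float) @ np.asarray(weights, dtype=float)
    return np.argsort(weighted)[::-1][:k]  # indices sorted by descending score


# Two classes, where the second class currently has the higher loss.
w = update_class_weights([0.5, 0.5], [0.0, np.log(3.0)])
candidates = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
chosen = select_batch(candidates, w, k=2)
```

In this toy run the second class ends up with three times the priority of the first, so the selection favours the candidates that help that class most.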
Dec-1-2023
- Country:
  - Asia > Middle East
    - Jordan (0.04)
  - Europe
    - Portugal > Braga
      - Braga (0.04)
    - Slovenia > Drava
      - Municipality of Benedikt > Benedikt (0.04)
    - United Kingdom > England
      - Cambridgeshire > Cambridge (0.04)
- Genre:
  - Research Report (1.00)
- Industry:
  - Education > Educational Setting
    - Online (0.48)
  - Information Technology (0.46)
- Technology: