A Weighted K-Center Algorithm for Data Subset Selection

Srikumar Ramalingam, Pranjal Awasthi, Sanjiv Kumar

arXiv.org Artificial Intelligence 

This makes us ponder the key factors behind this revolution: is it the availability of large datasets, the learning algorithms themselves, or both? Modern ML models rely on very deep networks and enormous labeled datasets, requiring exorbitant computational and human labeling effort. For example, the market for data annotation crossed one billion US dollars in 2020 and is estimated to reach seven billion by 2027. Human annotation of semantic segmentation labels takes about 45-60 minutes per image [BGC10]. In most vision and NLP applications, unlabeled data is virtually unlimited and usually available at no cost. To directly reduce human annotation costs, this paper focuses on identifying smaller subsets of the training data that lead to accurate models with marginal or no loss in performance relative to models trained on the full dataset. Among the several approaches to subset selection, two have been shown to achieve impressive results: (1) the classical margin sampling algorithm, which selects points based on the uncertainty in their class prediction scores [RS06b], and (2) the k-center clustering algorithm based on core-sets [SS18]. This raises a natural question about extending these two powerful algorithms: is there a principled method that jointly uses both measures to compute more informative subsets?
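To make the two ingredients concrete, the sketch below shows margin-based uncertainty scoring and greedy (farthest-first) k-center selection, plus one naive way the signals might be blended. This is an illustration under stated assumptions, not the paper's weighted k-center algorithm: the function names are hypothetical, and it assumes per-point feature vectors and class probability estimates from a current model are available.

```python
# Illustrative sketch only -- not the paper's weighted k-center method.
import numpy as np

def margin_scores(probs: np.ndarray) -> np.ndarray:
    """Margin sampling score per point: gap between the top two class
    probabilities. A small margin means high uncertainty."""
    top2 = np.sort(probs, axis=1)[:, -2:]    # two largest probs per row
    return top2[:, 1] - top2[:, 0]

def greedy_k_center(feats: np.ndarray, k: int, weights=None, seed=0):
    """Farthest-first traversal, the classic 2-approximation for k-center.
    If `weights` is given (e.g., an uncertainty score per point), the point
    maximizing weight * distance is picked instead -- a naive blend of the
    two signals, not necessarily the paper's formulation."""
    w = np.ones(len(feats)) if weights is None else weights
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(feats)))]
    dists = np.linalg.norm(feats - feats[centers[0]], axis=1)
    while len(centers) < k:
        nxt = int(np.argmax(w * dists))      # farthest (weighted) point
        centers.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(feats - feats[nxt], axis=1))
    return centers

# Hypothetical usage: select 100 points from penultimate-layer features,
# weighting distances by uncertainty (1 - margin) of the current model.
# subset = greedy_k_center(features, 100, weights=1.0 - margin_scores(probs))
```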