
Supplementary Material Appendix Table of Contents

Neural Information Processing Systems

All datasets used in this work are publicly available; the CIFAR10 dataset, for example, is released under an MIT license. We used standard datasets, viz., CIFAR10 and SVHN, to demonstrate the effectiveness of our approach; their descriptions, along with their licenses, are given in Table 1.

The appendix begins by stating and then proving Theorem 1, with the notation used in the proof introduced in subsection B.2; prior work on submodularity is also discussed. Detailed SSL loss formulations for the different SSL algorithms are given as well: Mean Teacher [56] (subsection C.2) generates a more stable target output for unlabeled data points, while FixMatch [53] (subsection C.5) uses the cross-entropy loss between class predictions of weakly and strongly augmented views. We adapt coreset selection to SSL by choosing a representative subset of unlabeled points whose gradients are similar to the unlabeled loss gradients.
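The FixMatch-style unlabeled loss mentioned above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function name, the 0.95 confidence threshold, and the logit shapes are my own assumptions.

```python
import numpy as np

def fixmatch_unlabeled_loss(weak_logits, strong_logits, threshold=0.95):
    """Sketch of a FixMatch-style unlabeled loss.

    A pseudo-label is taken from the weakly augmented view; cross-entropy
    is applied to the strongly augmented view only where the weak-view
    confidence exceeds `threshold`. Inputs are [batch, classes] logits.
    """
    def softmax(x):
        e = np.exp(x - x.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    weak_probs = softmax(weak_logits)
    confidence = weak_probs.max(axis=1)
    pseudo_labels = weak_probs.argmax(axis=1)
    mask = confidence >= threshold  # keep only confident pseudo-labels

    strong_log_probs = np.log(softmax(strong_logits) + 1e-12)
    ce = -strong_log_probs[np.arange(len(pseudo_labels)), pseudo_labels]
    # Average over the confident examples only (0 if none pass the mask).
    return (ce * mask).sum() / max(mask.sum(), 1)
```

Examples whose weak-view prediction is not confident contribute nothing, which is what makes the threshold a key hyperparameter in practice.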



RETRIEVE: Coreset Selection for Efficient and Robust Semi-Supervised Learning

Killamsetty, Krishnateja, Zhao, Xujiang, Chen, Feng, Iyer, Rishabh

arXiv.org Artificial Intelligence

Semi-supervised learning (SSL) algorithms have had great success in recent years in limited labeled data regimes. However, the current state-of-the-art SSL algorithms are computationally expensive and entail significant compute time and energy requirements. This can prove to be a huge limitation for many smaller companies and academic groups. Our main insight is that training on a subset of the unlabeled data instead of the entire unlabeled set enables the current SSL algorithms to converge faster, thereby reducing computational costs significantly. In this work, we propose RETRIEVE, a coreset selection framework for efficient and robust semi-supervised learning. RETRIEVE selects the coreset by solving a mixed discrete-continuous bi-level optimization problem such that the selected coreset minimizes the labeled set loss. We use a one-step gradient approximation and show that the discrete optimization problem is approximately submodular, thereby enabling simple greedy algorithms to obtain the coreset. We empirically demonstrate on several real-world datasets that existing SSL algorithms like VAT, Mean-Teacher, and FixMatch, when used with RETRIEVE, achieve a) faster training times and b) better performance when the unlabeled data contains Out-of-Distribution (OOD) data and class imbalance. More specifically, we show that with minimal accuracy degradation, RETRIEVE achieves a speedup of around 3x in the traditional SSL setting and a speedup of 5x compared to state-of-the-art (SOTA) robust SSL algorithms in the case of imbalance and OOD data.
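The greedy selection enabled by approximate submodularity can be illustrated with a simplified gradient-matching objective. This is only a sketch under assumed inputs (a precomputed per-example gradient matrix): the actual RETRIEVE objective minimizes the labeled-set loss through a bi-level formulation with a one-step gradient approximation, which this toy code does not implement.

```python
import numpy as np

def greedy_gradient_coreset(grads, budget):
    """Toy greedy coreset selection by gradient matching.

    `grads` is an [n, d] matrix of per-example gradients. We greedily pick
    `budget` examples whose scaled gradient sum best approximates the full
    gradient sum. Illustrative only; not the RETRIEVE objective.
    """
    target = grads.sum(axis=0)           # full-set gradient we approximate
    scale = len(grads) / budget          # budget items stand in for all n
    selected = []
    current = np.zeros(grads.shape[1])
    remaining = set(range(len(grads)))
    for _ in range(budget):
        best, best_err = None, np.inf
        for i in remaining:              # greedy step: best marginal gain
            err = np.linalg.norm(target - (current + scale * grads[i]))
            if err < best_err:
                best, best_err = i, err
        selected.append(best)
        current += scale * grads[best]
        remaining.remove(best)
    return selected
```

Each greedy step costs one pass over the remaining pool, so the whole selection is O(budget * n) norm evaluations; the paper's submodularity analysis is what justifies this kind of greedy procedure in place of exhaustive subset search.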