Collaborating Authors

Empirical Perspectives on One-Shot Semi-supervised Learning Machine Learning

One of the greatest obstacles in the adoption of deep neural networks for new applications is that training the network typically requires a large number of manually labeled training samples. We empirically investigate the scenario where one has access to large amounts of unlabeled data but require labeling only a single prototypical sample per class in order to train a deep network (i.e., one-shot semi-supervised learning). Specifically, we investigate the recent results reported in FixMatch for one-shot semi-supervised learning to understand the factors that affect and impede high accuracies and reliability for one-shot semi-supervised learning of Cifar-10. For example, we discover that one barrier to one-shot semi-supervised learning for high-performance image classification is the unevenness of class accuracy during the training. These results point to solutions that might enable more widespread adoption of one-shot semi-supervised training methods for new applications.

Augment your batch: better training with larger batches Machine Learning

Large-batch SGD is important for scaling training of deep neural networks. However, without fine-tuning hyperparameter schedules, the generalization of the model may be hampered. We propose to use batch augmentation: replicating instances of samples within the same batch with different data augmentations. Batch augmentation acts as a regularizer and an accelerator, increasing both generalization and performance scaling. We analyze the effect of batch augmentation on gradient variance and show that it empirically improves convergence for a wide variety of deep neural networks and datasets. Our results show that batch augmentation reduces the number of necessary SGD updates to achieve the same accuracy as the state-of-the-art. Overall, this simple yet effective method enables faster training and better generalization by allowing more computational resources to be used concurrently.

Gradient-based Data Augmentation for Semi-Supervised Learning Machine Learning

In semi-supervised learning (SSL), a technique called consistency regularization (CR) achieves high performance. It has been proved that the diversity of data used in CR is extremely important to obtain a model with high discrimination performance by CR. We propose a new data augmentation (Gradient-based Data Augmentation (GDA)) that is deterministically calculated from the image pixel value gradient of the posterior probability distribution that is the model output. We aim to secure effective data diversity for CR by utilizing three types of GDA. On the other hand, it has been demonstrated that the mixup method for labeled data and unlabeled data is also effective in SSL. We propose an SSL method named MixGDA by combining various mixup methods and GDA. The discrimination performance achieved by MixGDA is evaluated against the 13-layer CNN that is used as standard in SSL research. As a result, for CIFAR-10 (4000 labels), MixGDA achieves the same level of performance as the best performance ever achieved. For SVHN (250 labels, 500 labels and 1000 labels) and CIFAR-100 (10000 labels), MixGDA achieves state-of-the-art performance.

FLAG: Adversarial Data Augmentation for Graph Neural Networks Machine Learning

Data augmentation helps neural networks generalize better, but it remains an open question how to effectively augment graph data to enhance the performance of GNNs (Graph Neural Networks). While most existing graph regularizers focus on augmenting graph topological structures by adding/removing edges, we offer a novel direction to augment in the input node feature space for better performance. We propose a simple but effective solution, FLAG (Free Large-scale Adversarial Augmentation on Graphs), which iteratively augments node features with gradient-based adversarial perturbations during training, and boosts performance at test time. Empirically, FLAG can be easily implemented with a dozen lines of code and is flexible enough to function with any GNN backbone, on a wide variety of large-scale datasets, and in both transductive and inductive settings. Without modifying a model's architecture or training setup, FLAG yields a consistent and salient performance boost across both node and graph classification tasks. Using FLAG, we reach state-of-the-art performance on the large-scale ogbg-molpcba, ogbg-ppa, and ogbg-code datasets. Graph Neural Networks (GNNs) have emerged as powerful architectures for learning and analyzing graph representations. The Graph Convolutional Network (GCN) (Kipf & Welling, 2016) and its variants have been applied to a wide range of tasks, including visual recognition (Zhao et al., 2019; Shen et al., 2018), meta-learning (Garcia & Bruna, 2017), social analysis (Qiu et al., 2018; Li & Goldwasser, 2019), and recommender systems (Ying et al., 2018).

Role-Wise Data Augmentation for Knowledge Distillation Machine Learning

Knowledge Distillation (KD) is a common method for transferring the ``knowledge'' learned by one machine learning model (the \textit{teacher}) into another model (the \textit{student}), where typically, the teacher has a greater capacity (e.g., more parameters or higher bit-widths). To our knowledge, existing methods overlook the fact that although the student absorbs extra knowledge from the teacher, both models share the same input data -- and this data is the only medium by which the teacher's knowledge can be demonstrated. Due to the difference in model capacities, the student may not benefit fully from the same data points on which the teacher is trained. On the other hand, a human teacher may demonstrate a piece of knowledge with individualized examples adapted to a particular student, for instance, in terms of her cultural background and interests. Inspired by this behavior, we design data augmentation agents with distinct roles to facilitate knowledge distillation. Our data augmentation agents generate distinct training data for the teacher and student, respectively. We find empirically that specially tailored data points enable the teacher's knowledge to be demonstrated more effectively to the student. We compare our approach with existing KD methods on training popular neural architectures and demonstrate that role-wise data augmentation improves the effectiveness of KD over strong prior approaches. The code for reproducing our results can be found at