Inductive Learning
Can We Treat Noisy Labels as Accurate?
Zheng, Yuxiang, Han, Zhongyi, Yin, Yilong, Gao, Xin, Liu, Tongliang
Noisy labels significantly hinder the accuracy and generalization of machine learning models, particularly due to ambiguous instance features. Traditional techniques that attempt to correct noisy labels directly, such as those using transition matrices, often fail to address the inherent complexities of the problem sufficiently. In this paper, we introduce EchoAlign, a transformative paradigm shift in learning from noisy labels. Instead of focusing on label correction, EchoAlign treats noisy labels ($\tilde{Y}$) as accurate and modifies corresponding instance features ($X$) to achieve better alignment with $\tilde{Y}$. EchoAlign's core components are (1) EchoMod: Employing controllable generative models, EchoMod precisely modifies instances while maintaining their intrinsic characteristics and ensuring alignment with the noisy labels. (2) EchoSelect: Instance modification inevitably introduces distribution shifts between training and test sets. EchoSelect maintains a significant portion of clean original instances to mitigate these shifts. It leverages the distinct feature similarity distributions between original and modified instances as a robust tool for accurate sample selection. This integrated approach yields remarkable results. In environments with 30% instance-dependent noise, even at 99% selection accuracy, EchoSelect retains nearly twice the number of samples compared to the previous best method. Notably, on three datasets, EchoAlign surpasses previous state-of-the-art techniques with a substantial improvement.
Inexact Unlearning Needs More Careful Evaluations to Avoid a False Sense of Privacy
Hayes, Jamie, Shumailov, Ilia, Triantafillou, Eleni, Khalifa, Amr, Papernot, Nicolas
The high cost of model training makes it increasingly desirable to develop techniques for unlearning. These techniques seek to remove the influence of a training example without having to retrain the model from scratch. Intuitively, once a model has unlearned, an adversary that interacts with the model should no longer be able to tell whether the unlearned example was included in the model's training set or not. In the privacy literature, this is known as membership inference. In this work, we discuss adaptations of Membership Inference Attacks (MIAs) to the setting of unlearning (leading to their "U-MIA" counterparts). We propose a categorization of existing U-MIAs into "population U-MIAs", where the same attacker is instantiated for all examples, and "per-example U-MIAs", where a dedicated attacker is instantiated for each example. We show that the latter category, wherein the attacker tailors its membership prediction to each example under attack, is significantly stronger. Indeed, our results show that the commonly used U-MIAs in the unlearning literature overestimate the privacy protection afforded by existing unlearning techniques on both vision and language models. Our investigation reveals a large variance in the vulnerability of different examples to per-example U-MIAs. In fact, several unlearning algorithms lead to a reduced vulnerability for some, but not all, examples that we wish to unlearn, at the expense of increasing it for other examples. Notably, we find that the privacy protection for the remaining training examples may worsen as a consequence of unlearning. We also discuss the fundamental difficulty of equally protecting all examples using existing unlearning schemes, due to the different rates at which examples are unlearned. We demonstrate that naive attempts at tailoring unlearning stopping criteria to different examples fail to alleviate these issues.
Learning Intersections of Halfspaces with Distribution Shift: Improved Algorithms and SQ Lower Bounds
Klivans, Adam R., Stavropoulos, Konstantinos, Vasilyan, Arsen
Recent work of Klivans, Stavropoulos, and Vasilyan initiated the study of testable learning with distribution shift (TDS learning), where a learner is given labeled samples from training distribution $\mathcal{D}$, unlabeled samples from test distribution $\mathcal{D}'$, and the goal is to output a classifier with low error on $\mathcal{D}'$ whenever the training samples pass a corresponding test. Their model deviates from all prior work in that no assumptions are made on $\mathcal{D}'$. Instead, the test must accept (with high probability) when the marginals of the training and test distributions are equal. Here we focus on the fundamental case of intersections of halfspaces with respect to Gaussian training distributions and prove a variety of new upper bounds including a $2^{(k/\epsilon)^{O(1)}} \mathsf{poly}(d)$-time algorithm for TDS learning intersections of $k$ homogeneous halfspaces to accuracy $\epsilon$ (prior work achieved $d^{(k/\epsilon)^{O(1)}}$). We work under the mild assumption that the Gaussian training distribution contains at least an $\epsilon$ fraction of both positive and negative examples ($\epsilon$-balanced). We also prove the first set of SQ lower-bounds for any TDS learning problem and show (1) the $\epsilon$-balanced assumption is necessary for $\mathsf{poly}(d,1/\epsilon)$-time TDS learning for a single halfspace and (2) a $d^{\tilde{\Omega}(\log 1/\epsilon)}$ lower bound for the intersection of two general halfspaces, even with the $\epsilon$-balanced assumption. Our techniques significantly expand the toolkit for TDS learning. We use dimension reduction and coverings to give efficient algorithms for computing a localized version of discrepancy distance, a key metric from the domain adaptation literature.
Robust Semi-supervised Learning by Wisely Leveraging Open-set Data
Yang, Yang, Jiang, Nan, Xu, Yi, Zhan, De-Chuan
Open-set Semi-supervised Learning (OSSL) holds a realistic setting that unlabeled data may come from classes unseen in the labeled set, i.e., out-of-distribution (OOD) data, which could cause performance degradation in conventional SSL models. To handle this issue, except for the traditional in-distribution (ID) classifier, some existing OSSL approaches employ an extra OOD detection module to avoid the potential negative impact of the OOD data. Nevertheless, these approaches typically employ the entire set of open-set data during their training process, which may contain data unfriendly to the OSSL task that can negatively influence the model performance. This inspires us to develop a robust open-set data selection strategy for OSSL. Through a theoretical understanding from the perspective of learning theory, we propose Wise Open-set Semi-supervised Learning (WiseOpen), a generic OSSL framework that selectively leverages the open-set data for training the model. By applying a gradient-variance-based selection mechanism, WiseOpen exploits a friendly subset instead of the whole open-set dataset to enhance the model's capability of ID classification. Moreover, to reduce the computational expense, we also propose two practical variants of WiseOpen by adopting low-frequency update and loss-based selection respectively. Extensive experiments demonstrate the effectiveness of WiseOpen in comparison with the state-of-the-art.
"Set It Up!": Functional Object Arrangement with Compositional Generative Models
Xu, Yiqing, Mao, Jiayuan, Du, Yilun, Lozáno-Pérez, Tomas, Kaebling, Leslie Pack, Hsu, David
This paper studies the challenge of developing robots capable of understanding under-specified instructions for creating functional object arrangements, such as "set up a dining table for two"; previous arrangement approaches have focused on much more explicit instructions, such as "put object A on the table." We introduce a framework, SetItUp, for learning to interpret under-specified instructions. SetItUp takes a small number of training examples and a human-crafted program sketch to uncover arrangement rules for specific scene types. By leveraging an intermediate graph-like representation of abstract spatial relationships among objects, SetItUp decomposes the arrangement problem into two subproblems: i) learning the arrangement patterns from limited data and ii) grounding these abstract relationships into object poses. SetItUp leverages large language models (LLMs) to propose the abstract spatial relationships among objects in novel scenes as the constraints to be satisfied; then, it composes a library of diffusion models associated with these abstract relationships to find object poses that satisfy the constraints. We validate our framework on a dataset comprising study desks, dining tables, and coffee tables, with the results showing superior performance in generating physically plausible, functional, and aesthetically pleasing object arrangements compared to existing models.
Towards Graph Contrastive Learning: A Survey and Beyond
Ju, Wei, Wang, Yifan, Qin, Yifang, Mao, Zhengyang, Xiao, Zhiping, Luo, Junyu, Yang, Junwei, Gu, Yiyang, Wang, Dongjie, Long, Qingqing, Yi, Siyu, Luo, Xiao, Zhang, Ming
In recent years, deep learning on graphs has achieved remarkable success in various domains. However, the reliance on annotated graph data remains a significant bottleneck due to its prohibitive cost and time-intensive nature. To address this challenge, self-supervised learning (SSL) on graphs has gained increasing attention and has made significant progress. SSL enables machine learning models to produce informative representations from unlabeled graph data, reducing the reliance on expensive labeled data. While SSL on graphs has witnessed widespread adoption, one critical component, Graph Contrastive Learning (GCL), has not been thoroughly investigated in the existing literature. Thus, this survey aims to fill this gap by offering a dedicated survey on GCL. We provide a comprehensive overview of the fundamental principles of GCL, including data augmentation strategies, contrastive modes, and contrastive optimization objectives. Furthermore, we explore the extensions of GCL to other aspects of data-efficient graph learning, such as weakly supervised learning, transfer learning, and related scenarios. We also discuss practical applications spanning domains such as drug discovery, genomics analysis, recommender systems, and finally outline the challenges and potential future directions in this field.
Class-Imbalanced Graph Learning without Class Rebalancing
Liu, Zhining, Qiu, Ruizhong, Zeng, Zhichen, Yoo, Hyunsik, Zhou, David, Xu, Zhe, Zhu, Yada, Weldemariam, Kommy, He, Jingrui, Tong, Hanghang
Class imbalance is prevalent in real-world node classification tasks and poses great challenges for graph learning models. Most existing studies are rooted in a class-rebalancing (CR) perspective and address class imbalance with class-wise reweighting or resampling. In this work, we approach the root cause of class-imbalance bias from an topological paradigm. Specifically, we theoretically reveal two fundamental phenomena in the graph topology that greatly exacerbate the predictive bias stemming from class imbalance. On this basis, we devise a lightweight topological augmentation framework BAT to mitigate the class-imbalance bias without class rebalancing. Being orthogonal to CR, BAT can function as an efficient plug-and-play module that can be seamlessly combined with and significantly boost existing CR techniques. Systematic experiments on real-world imbalanced graph learning tasks show that BAT can deliver up to 46.27% performance gain and up to 72.74% bias reduction over existing techniques. Code, examples, and documentations are available at https://github.com/ZhiningLiu1998/BAT.
SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks
He, Xuanli, Xu, Qiongkai, Wang, Jun, Rubinstein, Benjamin I. P., Cohn, Trevor
Modern NLP models are often trained on public datasets drawn from diverse sources, rendering them vulnerable to data poisoning attacks. These attacks can manipulate the model's behavior in ways engineered by the attacker. One such tactic involves the implantation of backdoors, achieved by poisoning specific training instances with a textual trigger and a target class label. Several strategies have been proposed to mitigate the risks associated with backdoor attacks by identifying and removing suspected poisoned examples. However, we observe that these strategies fail to offer effective protection against several advanced backdoor attacks. To remedy this deficiency, we propose a novel defensive mechanism that first exploits training dynamics to identify poisoned samples with high precision, followed by a label propagation step to improve recall and thus remove the majority of poisoned instances. Compared with recent advanced defense methods, our method considerably reduces the success rates of several backdoor attacks while maintaining high classification accuracy on clean test sets.
Erasing the Bias: Fine-Tuning Foundation Models for Semi-Supervised Learning
Semi-supervised learning (SSL) has witnessed remarkable progress, resulting in the emergence of numerous method variations. However, practitioners often encounter challenges when attempting to deploy these methods due to their subpar performance. In this paper, we present a novel SSL approach named FineSSL that significantly addresses this limitation by adapting pre-trained foundation models. We identify the aggregated biases and cognitive deviation problems inherent in foundation models, and propose a simple yet effective solution by imposing balanced margin softmax and decoupled label smoothing. Through extensive experiments, we demonstrate that FineSSL sets a new state of the art for SSL on multiple benchmark datasets, reduces the training cost by over six times, and can seamlessly integrate various fine-tuning and modern SSL algorithms. The source code is available at https://github.com/Gank0078/FineSSL.
Frugal Algorithm Selection
Kuş, Erdem, Akgün, Özgür, Dang, Nguyen, Miguel, Ian
When solving decision and optimisation problems, many competing algorithms (model and solver choices) have complementary strengths. Typically, there is no single algorithm that works well for all instances of a problem. Automated algorithm selection has been shown to work very well for choosing a suitable algorithm for a given instance. However, the cost of training can be prohibitively large due to running candidate algorithms on a representative set of training instances. In this work, we explore reducing this cost by choosing a subset of the training instances on which to train. We approach this problem in three ways: using active learning to decide based on prediction uncertainty, augmenting the algorithm predictors with a timeout predictor, and collecting training data using a progressively increasing timeout. We evaluate combinations of these approaches on six datasets from ASLib and present the reduction in labelling cost achieved by each option.