Feng, Lei
Is Depth All You Need? An Exploration of Iterative Reasoning in LLMs
Wu, Zongqian, Li, Tianyu, Xu, Baoduo, Yang, Jiaying, Zhan, Mengmeng, Zhu, Xiaofeng, Feng, Lei
Deep iterative chain-of-thought (CoT) reasoning enables LLMs to tackle complex tasks by progressively activating relevant pre-trained knowledge. However, it faces challenges in ensuring continual improvement and in determining a stopping criterion. In this paper, we investigate whether the relevant knowledge that contributes directly to solving the given question can be activated from the initial reasoning path, thus circumventing the need for iterative refinement. Our experiments reveal that increasing the diversity of initial reasoning paths can achieve comparable or superior performance, a concept we term "breadth reasoning". However, existing breadth reasoning approaches, such as self-consistency, offer limited diversity. To address this limitation, we propose a simple yet effective method that enhances reasoning breadth by integrating contextual exploration with reduced sampling randomness. Extensive experiments demonstrate that our approach significantly outperforms deep iterative reasoning. Our code is available at https://github.com/zongqianwu/breadth.
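The core idea can be illustrated with a minimal sketch: sample several reasoning paths whose diversity comes from varied contexts rather than high sampling temperature, then aggregate by majority vote. This is not the paper's implementation; the `generate` function below is a hypothetical stand-in for any LLM completion API.

```python
# A minimal sketch of breadth-style reasoning: diversity from contextual
# exploration at low sampling temperature, aggregated by majority vote.
from collections import Counter

def generate(prompt: str, temperature: float) -> str:
    """Hypothetical LLM call; returns a final answer string."""
    raise NotImplementedError

def breadth_reason(question: str, contexts: list[str],
                   temperature: float = 0.3, n_samples: int = 3) -> str:
    # Breadth comes from varying the context/prompt rather than relying
    # solely on high-temperature sampling randomness.
    answers = []
    for ctx in contexts:
        prompt = f"{ctx}\n\nQuestion: {question}\nLet's think step by step."
        for _ in range(n_samples):
            answers.append(generate(prompt, temperature))
    # Aggregate the diverse initial reasoning paths by majority vote.
    return Counter(answers).most_common(1)[0][0]
```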
Towards Robust Incremental Learning under Ambiguous Supervision
Wang, Rui, Xia, Mingxuan, Yao, Chang, Feng, Lei, Zhao, Junbo, Chen, Gang, Wang, Haobo
Traditional Incremental Learning (IL) aims to handle sequential fully-supervised learning problems where novel classes emerge from time to time. However, due to inherent annotation uncertainty and ambiguity, collecting high-quality annotated data in a dynamic learning system can be extremely expensive. To mitigate this problem, we propose a novel weakly-supervised learning paradigm called Incremental Partial Label Learning (IPLL), where the sequentially arriving data are associated with a set of candidate labels rather than the ground truth. Technically, we develop the Prototype-Guided Disambiguation and Replay Algorithm (PGDR), which leverages class prototypes as a proxy to mitigate two intertwined challenges in IPLL, i.e., label ambiguity and catastrophic forgetting. To handle the former, PGDR encapsulates a momentum-based pseudo-labeling algorithm along with prototype-guided initialization, resulting in a balanced perception of classes. To alleviate forgetting, we develop a memory replay technique that collects well-disambiguated samples while maintaining representativeness and diversity. By jointly distilling knowledge from curated memory data, our framework exhibits a strong disambiguation ability for samples of new tasks and forgets less previously acquired knowledge. Extensive experiments demonstrate that PGDR achieves superior performance.
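A minimal sketch of the momentum-based, prototype-guided pseudo-labeling step might look as follows; the update rule and all names are illustrative assumptions rather than the exact PGDR procedure.

```python
# Momentum pseudo-label update restricted to each sample's candidate set,
# guided by similarity to class prototypes (an illustrative sketch).
import numpy as np

def update_pseudo_labels(feat, candidate_mask, prototypes, pseudo, momentum=0.9):
    """feat: (d,) sample feature; candidate_mask: (C,) 0/1 candidate labels;
    prototypes: (C, d) class prototypes; pseudo: (C,) current pseudo-label."""
    sims = prototypes @ feat                             # similarity to each prototype
    sims = np.where(candidate_mask > 0, sims, -np.inf)   # restrict to candidate labels
    target = np.exp(sims - sims.max())
    target /= target.sum()                               # softmax over candidates
    return momentum * pseudo + (1 - momentum) * target   # momentum update
```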
Enhancing Sample Selection by Cutting Mislabeled Easy Examples
Yuan, Suqin, Feng, Lei, Han, Bo, Liu, Tongliang
Sample selection is a prevalent approach in learning with noisy labels, aiming to identify confident samples for training. Although existing sample selection methods have achieved decent results by reducing the noise rate of the selected subset, they often overlook that not all mislabeled examples harm the model's performance equally. In this paper, we demonstrate that mislabeled examples correctly predicted by the model early in the training process are particularly harmful to model performance. We refer to these examples as Mislabeled Easy Examples (MEEs). To address this, we propose Early Cutting, which introduces a recalibration step that employs the model's later training state to re-select the confident subset identified early in training, thereby avoiding misleading confidence from early learning and effectively filtering out MEEs. Experiments on the CIFAR, WebVision, and full ImageNet-1k datasets demonstrate that our method effectively improves sample selection and model performance by reducing MEEs.
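The recalibration idea can be sketched as follows: keep an early-selected confident sample only if a later model checkpoint still assigns sufficient probability to its observed label. The function and threshold below are illustrative assumptions, not the exact Early Cutting procedure.

```python
# Re-select the early confident subset using a later training state
# (an illustrative sketch of the recalibration step).
import numpy as np

def early_cutting(early_confident_idx, later_probs, labels, threshold=0.5):
    """early_confident_idx: indices selected by an early-epoch criterion;
    later_probs: (N, C) predicted probabilities from a later checkpoint;
    labels: (N,) observed (possibly noisy) labels."""
    keep = []
    for i in early_confident_idx:
        # Mislabeled easy examples are fitted early but tend to lose support
        # under the later model state; drop samples the later model doubts.
        if later_probs[i, labels[i]] >= threshold:
            keep.append(i)
    return np.array(keep)
```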
Early Stopping Against Label Noise Without Validation Data
Yuan, Suqin, Feng, Lei, Liu, Tongliang
Sparing more data for validation from the training data would limit the performance of the learned model, yet insufficient validation data could result in a sub-optimal selection of the desired model. In this paper, we propose a novel early stopping method called Label Wave, which does not require validation data for selecting the desired model in the presence of label noise. It works by tracking the changes in the model's predictions on the training set during the training process, aiming to halt training before the model unduly fits mislabeled data. This method is empirically supported by our observation that the minimum fluctuation in predictions typically occurs at the training epoch before the model excessively fits mislabeled data. Through extensive experiments, we show both the effectiveness of the Label Wave method across various settings and its capability to enhance the performance of existing methods for learning with noisy labels.

Deep Neural Networks (DNNs) are praised for their remarkable expressive power, which allows them to uncover intricate patterns in high-dimensional data (Montufar et al., 2014; LeCun et al., 2015) and even fit data with random labels. However, this strength, often termed Memorization (Zhang et al., 2017), can be a double-edged sword, especially when label noise is encountered. When label noise exists, the inherent capability of DNNs might cause the model to fit mislabeled examples from noisy datasets, which can deteriorate its generalization performance. Specifically, when DNNs are trained on noisy datasets containing both clean and mislabeled examples, it is often observed that the test error initially decreases and subsequently increases. To prevent DNNs from overconfidently learning from mislabeled examples, many existing methods for learning with noisy labels (Xia et al., 2019; Han et al., 2020; Song et al., 2022; Huang et al., 2023) explicitly or implicitly adopt the operation of halting training before the test error increases, a strategy termed "early stopping". Early stopping relies on model selection, aiming to choose a model that aligns most closely with the true concept from a range of candidate models obtained during the training process (Mohri et al., 2018; Bai et al., 2021). To this end, leveraging hold-out validation data to pinpoint an appropriate early stopping point for model selection has become a prevalent approach (Xu & Goodacre, 2018) in deep learning. However, this approach heavily relies on additional validation data that is usually derived by splitting the training set, thereby degrading performance due to insufficient training data.
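The stopping signal can be sketched as follows: count how many training-set predictions flip between consecutive epochs and stop once this fluctuation stops decreasing. Names and the patience rule are illustrative assumptions, not the exact Label Wave implementation.

```python
# Track per-epoch prediction fluctuations on the training set and stop
# near their minimum (an illustrative sketch of the Label Wave idea).
import numpy as np

def count_prediction_flips(prev_preds, curr_preds):
    """Number of training samples whose predicted class changed."""
    return int(np.sum(prev_preds != curr_preds))

class LabelWaveStopper:
    def __init__(self, patience=5):
        self.best_flips = float("inf")
        self.patience = patience
        self.stale = 0

    def step(self, flips: int) -> bool:
        """Call once per epoch; returns True when training should stop."""
        if flips < self.best_flips:
            self.best_flips, self.stale = flips, 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```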
Instance-dependent Early Stopping
Yuan, Suqin, Lin, Runqi, Feng, Lei, Han, Bo, Liu, Tongliang
In machine learning practice, early stopping has been widely used to regularize models and can save computational cost by halting the training process when the model's performance on a validation set stops improving. However, conventional early stopping applies the same stopping criterion to all instances without considering their individual learning statuses, which leads to redundant computation on instances that are already well-learned. To further improve efficiency, we propose an Instance-dependent Early Stopping (IES) method that adapts the early stopping mechanism from the entire training set to the instance level, based on the core principle that once the model has mastered an instance, training on it should stop. IES considers an instance mastered if the second-order differences of its loss value remain within a small range around zero. This offers a more consistent measure of an instance's learning status than directly using the loss value, and thus allows a unified threshold to determine when an instance can be excluded from further backpropagation. We show that excluding mastered instances from backpropagation can increase the gradient norms, thereby accelerating the decrease of the training loss and speeding up the training process. Extensive experiments on benchmarks demonstrate that the IES method can reduce backpropagated instances by 10%-50% while maintaining or even slightly improving the test accuracy and transfer learning performance of a model.
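The mastery test can be sketched as follows, assuming a per-instance loss history; the window length and threshold are illustrative assumptions.

```python
# Decide whether an instance is "mastered" from the discrete second-order
# difference of its recent loss values (an illustrative sketch).
import numpy as np

def mastered(loss_history, eps=1e-3):
    """loss_history: per-epoch losses of one instance (length >= 3)."""
    l = np.asarray(loss_history[-3:])
    second_diff = l[2] - 2 * l[1] + l[0]   # discrete second-order difference
    return abs(second_diff) < eps

# During training, mastered instances are excluded from backpropagation:
# batch = [i for i in batch if not mastered(loss_hist[i])]
```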
Attribute-based Visual Reprogramming for Image Classification with CLIP
Cai, Chengyi, Ye, Zesheng, Feng, Lei, Qi, Jianzhong, Liu, Feng
Visual reprogramming (VR) reuses pre-trained vision models for downstream image classification tasks by adding trainable noise patterns to inputs. When applied to vision-language models (e.g., CLIP), existing VR approaches follow the same pipeline used for vision models (e.g., ResNet, ViT), where ground-truth class labels are inserted into fixed text templates to guide the optimization of VR patterns. This label-based approach, however, overlooks the rich information and diverse attribute-guided textual representations that CLIP can exploit, which may lead to the misclassification of samples. In this paper, we propose Attribute-based Visual Reprogramming (AttrVR) for CLIP, utilizing descriptive attributes (DesAttrs) and distinctive attributes (DistAttrs), which represent common and unique feature descriptions for different classes, respectively. Moreover, as images of the same class may reflect different attributes after VR, AttrVR iteratively refines patterns using the k-nearest DesAttrs and DistAttrs for each image sample, enabling more dynamic and sample-specific optimization. Theoretically, AttrVR is shown to reduce intra-class variance and increase inter-class separation. Empirically, it achieves superior performance on 12 downstream tasks for both ViT-based and ResNet-based CLIP. The success of AttrVR facilitates more effective integration of VR from unimodal vision models into vision-language models. Our code is available at https://github.com/tmlr-group/AttrVR.
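The k-nearest-attribute scoring step can be sketched as follows; tensor shapes and names are illustrative assumptions rather than the exact AttrVR implementation.

```python
# Score each class by the mean similarity between a reprogrammed image
# embedding and its k most similar attribute embeddings (a sketch).
import torch

def attr_scores(img_emb, attr_embs, k=3):
    """img_emb: (d,) normalized image embedding;
    attr_embs: (C, A, d) normalized attribute embeddings per class."""
    sims = attr_embs @ img_emb            # (C, A) cosine similarities
    topk = sims.topk(k, dim=1).values     # k-nearest attributes per class
    return topk.mean(dim=1)               # (C,) per-class scores
```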
Rethinking Chain-of-Thought from the Perspective of Self-Training
Wu, Zongqian, Xu, Baoduo, Cui, Ruochen, Zhan, Mengmeng, Zhu, Xiaofeng, Feng, Lei
Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent capabilities in large language models (LLMs). We observe that CoT shares significant similarities with self-training in terms of their learning processes. Motivated by these parallels, this paper explores the underlying relationship between CoT and self-training, demonstrating how insights from self-training can enhance CoT performance. Specifically, our study first reveals that CoT, like self-training, follows the principle of semantic entropy minimization. Leveraging this insight, we propose a novel CoT framework that incorporates two key components: (i) a task-specific prompt module designed to guide LLMs in generating high-quality initial reasoning processes, and (ii) an adaptive reasoning iteration module for progressively refining the reasoning process.
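The entropy-minimization view can be sketched as follows: sample several reasoning outcomes, cluster semantically equivalent answers, and measure the entropy of the cluster distribution. For simplicity this sketch clusters by exact string match, which is an illustrative assumption.

```python
# Semantic entropy over sampled reasoning outcomes (an illustrative sketch):
# lower entropy signals a more confident, better-converged chain of thought.
import math
from collections import Counter

def semantic_entropy(answers: list[str]) -> float:
    # Exact string match stands in for a real semantic-equivalence test.
    clusters = Counter(a.strip().lower() for a in answers)
    n = sum(clusters.values())
    return -sum((c / n) * math.log(c / n) for c in clusters.values())
```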
Dual-Head Knowledge Distillation: Enhancing Logits Utilization with an Auxiliary Head
Yang, Penghui, Zong, Chen-Chen, Huang, Sheng-Jun, Feng, Lei, An, Bo
Traditional knowledge distillation focuses on aligning the student's predicted probabilities with both ground-truth labels and the teacher's predicted probabilities. However, the transition from logits to predicted probabilities obscures certain indispensable information. To address this issue, it is intuitive to additionally introduce a logit-level loss function as a supplement to the widely used probability-level loss function, to exploit the latent information of logits. Unfortunately, we empirically find that combining the newly introduced logit-level loss with the previous probability-level loss leads to performance degradation, even trailing behind the performance of employing either loss in isolation. We attribute this phenomenon to the collapse of the classification head, which is verified by our theoretical analysis based on neural collapse theory. Drawing on this analysis, we propose a novel method called dual-head knowledge distillation, which partitions the linear classifier into two classification heads responsible for different losses, thereby preserving the beneficial effects of both losses on the backbone while eliminating adverse influences on the classification head. Extensive experiments validate that our method can effectively exploit the information inside the logits and achieve superior performance against state-of-the-art counterparts.

Despite the remarkable success of deep neural networks (DNNs) in various fields, it is a significant challenge to deploy these large models in lightweight terminals (e.g., mobile phones), particularly under the constraint of computational resources or the requirement of short inference time. To mitigate this problem, knowledge distillation (KD) (Hinton et al., 2015) has been widely investigated; it aims to improve the performance of a small network (a.k.a. the "student") by leveraging the extensive knowledge of a large network (a.k.a. the "teacher") to guide the training of the student network. Traditional KD techniques focus on minimizing the disparity in predicted probabilities between the teacher and the student, which are typically the outputs of the softmax function. Nevertheless, the transformation from logits to predictive probabilities via the softmax function may lose some underlying information.
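The dual-head design can be sketched as follows in PyTorch: a shared backbone feeds two separate linear heads, so the probability-level losses and the logit-level loss never compete on the same classifier. Loss weights and the choice of MSE for the logit-level loss are illustrative assumptions.

```python
# Dual-head student: separate heads for probability-level and logit-level
# losses on a shared backbone (an illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHeadStudent(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        self.prob_head = nn.Linear(feat_dim, num_classes)   # probability-level losses
        self.logit_head = nn.Linear(feat_dim, num_classes)  # logit-level loss

    def forward(self, x):
        f = self.backbone(x)
        return self.prob_head(f), self.logit_head(f)

def dual_head_loss(prob_logits, aux_logits, teacher_logits, labels, T=4.0):
    ce = F.cross_entropy(prob_logits, labels)
    kd = F.kl_div(F.log_softmax(prob_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    logit_mse = F.mse_loss(aux_logits, teacher_logits)  # logit-level supplement
    return ce + kd + logit_mse
```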
ELU-GCN: Effectively Label-Utilizing Graph Convolutional Network
Huang, Jincheng, Mo, Yujie, Shi, Xiaoshuang, Feng, Lei, Zhu, Xiaofeng
The message-passing mechanism of graph convolutional networks (GCNs) enables label information to be propagated to a broader range of neighbors, thereby increasing the utilization of labels. However, label information is not always effectively utilized in the traditional GCN framework. To address this issue, we propose a new two-stage framework called ELU-GCN. In the first stage, ELU-GCN conducts graph learning to learn a new graph structure (i.e., the ELU-graph), which enables GCNs to effectively utilize label information. In the second stage, we design a new graph contrastive learning method on the GCN framework for representation learning, exploring the consistency and mutually exclusive information between the learned ELU-graph and the original graph. Moreover, we theoretically demonstrate that the proposed method can ensure the generalization ability of GCNs. Extensive experiments validate the superiority of the proposed method.
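The second-stage consistency objective might be sketched as an InfoNCE-style loss between node embeddings computed under the two graph views; the encoder interface and temperature below are illustrative assumptions, not the exact ELU-GCN objective.

```python
# Contrastive consistency between node embeddings from the learned
# ELU-graph and the original graph (an illustrative sketch).
import torch
import torch.nn.functional as F

def graph_contrastive_loss(z_elu, z_orig, tau=0.5):
    """z_elu, z_orig: (N, d) node embeddings from the two graph views."""
    z1 = F.normalize(z_elu, dim=1)
    z2 = F.normalize(z_orig, dim=1)
    logits = z1 @ z2.t() / tau                            # (N, N) similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # same node = positive
    return F.cross_entropy(logits, targets)
```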
Bayesian-guided Label Mapping for Visual Reprogramming
Cai, Chengyi, Ye, Zesheng, Feng, Lei, Qi, Jianzhong, Liu, Feng
Visual reprogramming (VR) leverages the intrinsic capabilities of pretrained vision models by adapting their input or output interfaces to solve downstream tasks whose labels (i.e., downstream labels) might be totally different from the labels associated with the pretrained models (i.e., pretrained labels). When adapting the output interface, label mapping methods transform the pretrained labels to downstream labels by establishing a gradient-free one-to-one correspondence between the two sets of labels. However, in this paper, we reveal that one-to-one mappings may overlook the complex relationship between pretrained and downstream labels. Motivated by this observation, we propose a Bayesian-guided Label Mapping (BLM) method. BLM constructs an iteratively updated probabilistic label mapping matrix, with each element quantifying a pairwise relationship between pretrained and downstream labels. The assignment of values to the constructed matrix is guided by Bayesian conditional probability, considering the joint distribution of the downstream labels and the labels predicted by the pretrained model on downstream samples. Experiments conducted on both pretrained vision models (e.g., ResNeXt) and vision-language models (e.g., CLIP) demonstrate the superior performance of BLM over existing label mapping methods. The success of BLM also offers a probabilistic lens through which to understand and analyze the effectiveness of VR. Our code is available at https://github.com/tmlr-group/BayesianLM.
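The probabilistic mapping can be sketched as follows: estimate P(downstream label | pretrained prediction) from counts on downstream samples and map pretrained output probabilities through the resulting matrix. The Laplace smoothing and count-based estimate are illustrative simplifications of the iterative Bayesian update.

```python
# Probabilistic label mapping from conditional-probability estimates
# (an illustrative sketch of the Bayesian-guided idea).
import numpy as np

def build_mapping_matrix(pretrained_preds, downstream_labels,
                         n_pre, n_down, alpha=1.0):
    """pretrained_preds: (N,) argmax labels from the pretrained model;
    downstream_labels: (N,) ground-truth downstream labels."""
    counts = np.full((n_pre, n_down), alpha)  # Laplace smoothing
    for p, y in zip(pretrained_preds, downstream_labels):
        counts[p, y] += 1
    return counts / counts.sum(axis=1, keepdims=True)  # rows: P(down | pre)

def map_probs(pretrained_probs, mapping):
    """pretrained_probs: (N, n_pre) -> (N, n_down) downstream probabilities."""
    return pretrained_probs @ mapping
```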