Yang, En-Hui
Leveraging Conditional Mutual Information to Improve Large Language Model Fine-Tuning For Classification
Sivakaran, Thanushon, Yang, En-Hui
Although large language models (LLMs) have demonstrated remarkable capabilities in recent years, the potential of information theory (IT) to enhance LLM development remains underexplored. This paper introduces the information-theoretic principle of Conditional Mutual Information (CMI) to LLM fine-tuning for classification tasks, exploring its promise in two main ways: minimizing CMI to improve a model's standalone performance, and maximizing CMI to enhance knowledge distillation (KD) for more capable student models. To apply CMI in LLM fine-tuning, we adapt, with some modifications, the recently proposed CMI-constrained deep learning framework, which was originally developed for image classification. By minimizing CMI during LLM fine-tuning, we achieve performance gains over BERT on 6 of 8 GLUE classification tasks. Additionally, maximizing CMI during the KD process yields significant performance improvements over DistilBERT on 6 of 8 GLUE classification tasks. These findings demonstrate CMI's adaptability for optimizing both standalone LLMs and student models, showcasing its potential as a robust framework for advancing LLM fine-tuning. Our work bridges the gap between information theory and LLM development, offering new insights for building high-performing language models.
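To make the idea concrete, the following PyTorch-style sketch shows one way a CMI penalty in the spirit of the CMI-constrained framework could be attached to a standard cross-entropy fine-tuning loss; the in-batch CMI estimate (average KL divergence of each sample's output distribution from its class centroid) and the weight lmbda are illustrative assumptions rather than the paper's exact formulation.

import torch.nn.functional as F

def cmi_penalty(probs, labels, num_classes):
    # Rough in-batch estimate of I(X; Yhat | Y): average KL divergence of each
    # sample's output distribution from the mean distribution of its class.
    penalty = probs.new_zeros(())
    for c in range(num_classes):
        p_c = probs[labels == c]
        if p_c.numel() == 0:
            continue
        q_c = p_c.mean(dim=0, keepdim=True)          # class-conditional centroid
        penalty = penalty + (p_c * (p_c.log() - q_c.log())).sum()
    return penalty / probs.size(0)

def cmi_finetuning_loss(logits, labels, num_classes, lmbda=0.1):
    # Cross entropy plus a weighted CMI penalty: minimizing the CMI term tightens
    # the intra-class concentration of the classifier's output distributions.
    probs = F.softmax(logits, dim=1).clamp_min(1e-12)
    return F.cross_entropy(logits, labels) + lmbda * cmi_penalty(probs, labels, num_classes)

For the KD setting described above, the sign of the CMI term would be flipped so that CMI is maximized rather than minimized.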
Conditional Mutual Information Based Diffusion Posterior Sampling for Solving Inverse Problems
Hamidi, Shayan Mohajer, Yang, En-Hui
Inverse problems are prevalent across various disciplines in science and engineering. In the field of computer vision, tasks such as inpainting, deblurring, and super-resolution are commonly formulated as inverse problems. Recently, diffusion models (DMs) have emerged as a promising approach for addressing noisy linear inverse problems, offering effective solutions without requiring additional task-specific training. Specifically, with the prior provided by DMs, one can sample from the posterior by incorporating the likelihood. Since the likelihood is intractable, it is often approximated in the literature. However, this approximation compromises the quality of the generated images. To overcome this limitation and improve the effectiveness of DMs in solving inverse problems, we propose an information-theoretic approach. Specifically, we maximize the conditional mutual information $\mathrm{I}(\boldsymbol{x}_0; \boldsymbol{y} | \boldsymbol{x}_t)$, where $\boldsymbol{x}_0$ represents the reconstructed signal, $\boldsymbol{y}$ is the measurement, and $\boldsymbol{x}_t$ is the intermediate signal at stage $t$. This ensures that the intermediate signals $\boldsymbol{x}_t$ are generated in such a way that the final reconstructed signal $\boldsymbol{x}_0$ retains as much information as possible about the measurement $\boldsymbol{y}$. We demonstrate that this method can be seamlessly integrated with recent approaches and, once incorporated, enhances their performance both qualitatively and quantitatively.
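For context, the likelihood approximation that the abstract refers to is typically the DPS-style one below, where the intractable term $p(\boldsymbol{y} \mid \boldsymbol{x}_t)$ is replaced by a Gaussian centered at the measurement of the posterior mean (here $\boldsymbol{y}=\boldsymbol{A}\boldsymbol{x}_0+\boldsymbol{n}$ is the usual noisy linear measurement model and $\sigma_y$ its noise level):
\[
\nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t \mid \boldsymbol{y})
= \nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t)
+ \nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{y} \mid \boldsymbol{x}_t),
\qquad
p(\boldsymbol{y} \mid \boldsymbol{x}_t)\;\approx\;
\mathcal{N}\bigl(\boldsymbol{y};\,\boldsymbol{A}\,\hat{\boldsymbol{x}}_0(\boldsymbol{x}_t),\,\sigma_y^{2}\boldsymbol{I}\bigr),
\]
with $\hat{\boldsymbol{x}}_0(\boldsymbol{x}_t)=\mathbb{E}[\boldsymbol{x}_0 \mid \boldsymbol{x}_t]$ obtained from the pretrained score network via Tweedie's formula. The proposed criterion adds the requirement that $\mathrm{I}(\boldsymbol{x}_0; \boldsymbol{y} \mid \boldsymbol{x}_t)$ be kept large on top of such a sampler.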
Enhancing Diffusion Models for Inverse Problems with Covariance-Aware Posterior Sampling
Hamidi, Shayan Mohajer, Yang, En-Hui
Inverse problems exist in many disciplines of science and engineering. In computer vision, for example, tasks such as inpainting, deblurring, and super-resolution can be effectively modeled as inverse problems. Recently, denoising diffusion probabilistic models (DDPMs) have been shown to provide a promising solution to noisy linear inverse problems without the need for additional task-specific training. Specifically, with the prior provided by DDPMs, one can sample from the posterior by approximating the likelihood. In the literature, approximations of the likelihood are often based on the mean of the conditional densities of the reverse process, which can be obtained using Tweedie's formula. To obtain a better approximation to the likelihood, in this paper we first derive a closed-form formula for the covariance of the reverse process. Then, we propose a finite-difference method to approximate this covariance so that it can be readily obtained from existing pretrained DDPMs, thereby not increasing the complexity compared to existing approaches. Finally, based on the mean and the approximated covariance of the reverse process, we present a new approximation to the likelihood. We refer to this method as covariance-aware diffusion posterior sampling (CA-DPS). Experimental results show that CA-DPS significantly improves reconstruction performance without requiring hyperparameter tuning. The code for the paper is provided in the supplementary materials.
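For reference, under the usual variance-preserving parameterization with signal scale $\bar{\alpha}_t$, the first- and second-order Tweedie identities underlying the mean and covariance of the reverse process take the following form (the exact constants depend on the noise-schedule convention):
\[
\mathbb{E}[\boldsymbol{x}_0 \mid \boldsymbol{x}_t]
=\frac{1}{\sqrt{\bar{\alpha}_t}}\Bigl(\boldsymbol{x}_t+(1-\bar{\alpha}_t)\,\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t)\Bigr),
\qquad
\mathrm{Cov}[\boldsymbol{x}_0 \mid \boldsymbol{x}_t]
=\frac{1-\bar{\alpha}_t}{\bar{\alpha}_t}\Bigl(\boldsymbol{I}+(1-\bar{\alpha}_t)\,\nabla^{2}_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t)\Bigr).
\]
Since the Hessian $\nabla^{2}\log p_t$ is the Jacobian of the score already provided by a pretrained DDPM, its action on a direction $\boldsymbol{v}$ can be approximated by finite differences, e.g. $\nabla^{2}\log p_t(\boldsymbol{x}_t)\,\boldsymbol{v}\approx\bigl(\boldsymbol{s}_t(\boldsymbol{x}_t+\epsilon\boldsymbol{v})-\boldsymbol{s}_t(\boldsymbol{x}_t)\bigr)/\epsilon$; this illustrates the kind of estimate the abstract refers to, though the paper's specific scheme may differ in detail.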
Knowledge Distillation Based on Transformed Teacher Matching
Zheng, Kaixiang, Yang, En-Hui
As a technique that bridges logit matching and probability distribution matching, temperature scaling plays a pivotal role in knowledge distillation (KD). Conventionally, temperature scaling is applied to both the teacher's logits and the student's logits in KD. Motivated by some recent works, in this paper we instead drop temperature scaling on the student side and systematically study the resulting variant of KD, dubbed transformed teacher matching (TTM). By reinterpreting temperature scaling as a power transform of the probability distribution, we show that, in comparison with the original KD, TTM has an inherent R\'enyi entropy term in its objective function, which serves as an extra regularization term. Extensive experimental results demonstrate that, thanks to this inherent regularization, TTM leads to trained students with better generalization than the original KD. To further enhance the student's capability to match the teacher's power-transformed probability distribution, we introduce a sample-adaptive weighting coefficient into TTM, yielding a novel distillation approach dubbed weighted TTM (WTTM). Comprehensive experiments show that, although simple, WTTM is effective: it improves upon TTM and achieves state-of-the-art accuracy. Our source code is available at https://github.com/zkxufo/TTM.
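Assuming the usual KL-based distillation objective in a PyTorch setting, dropping the student-side temperature amounts to the short sketch below; the temperature T and the way this term would be combined with a cross-entropy term on the ground-truth labels are illustrative choices, not prescriptions from the paper.

import torch.nn.functional as F

def ttm_loss(student_logits, teacher_logits, T=4.0):
    # Transformed teacher matching (sketch): temperature scaling, i.e. a power
    # transform with exponent 1/T, is applied to the teacher only; the student's
    # distribution is left at temperature 1.
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits, dim=1)   # no student-side temperature
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

WTTM would additionally multiply each sample's matching term by a sample-adaptive weight before averaging; the exact weighting coefficient is defined in the paper.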
Bayes Conditional Distribution Estimation for Knowledge Distillation Based on Conditional Mutual Information
Ye, Linfeng, Hamidi, Shayan Mohajer, Tan, Renhao, Yang, En-Hui
It is believed that in knowledge distillation (KD), the role of the teacher is to provide an estimate of the unknown Bayes conditional probability distribution (BCPD) to be used in the student training process. Conventionally, this estimate is obtained by training the teacher using the maximum log-likelihood (MLL) method. To improve this estimate for KD, in this paper we introduce the concept of conditional mutual information (CMI) into the estimation of the BCPD and propose a novel estimator called the maximum CMI (MCMI) method. Specifically, in MCMI estimation, both the log-likelihood and the CMI of the teacher are simultaneously maximized when the teacher is trained. Through Eigen-CAM, it is further shown that maximizing the teacher's CMI value allows the teacher to capture more contextual information in an image cluster. Through a thorough set of experiments, we show that by employing a teacher trained via MCMI estimation rather than one trained via MLL estimation in various state-of-the-art KD frameworks, the student's classification accuracy consistently increases, with gains of up to 3.32%. This suggests that the teacher's BCPD estimate provided by the MCMI method is more accurate than that provided by the MLL method. In addition, we show that such improvements in the student's accuracy are more drastic in zero-shot and few-shot settings. Notably, the student's accuracy increases by up to 5.72% when 5% of the training samples are available to the student (few-shot), and increases from 0% to as high as 84% for an omitted class (zero-shot).

Knowledge distillation (Buciluǎ et al., 2006; Hinton et al., 2015) (KD) has received tremendous attention from both academia and industry in recent years as a highly effective model compression technique, and has been deployed in different settings (Radosavovic et al., 2018; Furlanello et al., 2018; Xie et al., 2020). The crux of KD is to distill the knowledge of a cumbersome model (teacher) into a lightweight model (student). One critical component of KD that has received relatively little attention is the training of the teacher model. In fact, in most existing KD methods, the teacher is trained to maximize its own performance, even though this does not necessarily lead to an improvement in the student's performance (Cho & Hariharan, 2019; Mirzadeh et al., 2020).
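In formula form, the MCMI estimator trains the teacher by jointly maximizing log-likelihood and CMI; the trade-off weight $\lambda$ below and the exact normalization are illustrative:
\[
\max_{\theta}\;\frac{1}{n}\sum_{i=1}^{n}\log P_{\theta}(y_i \mid x_i)\;+\;\lambda\, I_{\theta}(X;\hat{Y}\mid Y),
\qquad
I_{\theta}(X;\hat{Y}\mid Y)=\mathbb{E}_{X,Y}\Bigl[D_{\mathrm{KL}}\bigl(P_{\theta}(\hat{Y}\mid X)\,\big\|\,P_{\theta}(\hat{Y}\mid Y)\bigr)\Bigr],
\]
where $\hat{Y}$ denotes the teacher's output label and $P_{\theta}(\hat{Y}\mid Y=y)$ is the average of the teacher's output distributions over training inputs with label $y$. Enlarging the CMI term keeps the teacher's outputs informative about the input beyond the label itself, which is the contextual information highlighted by the Eigen-CAM visualizations mentioned above.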
AdaFed: Fair Federated Learning via Adaptive Common Descent Direction
Hamidi, Shayan Mohajer, Yang, En-Hui
Federated learning (FL) is a promising technology in which edge devices/clients collaboratively train a machine learning model orchestrated by a server. Learning an unfair model is a critical problem in federated learning, where the trained model may unfairly advantage or disadvantage some of the devices. To tackle this problem, in this work we propose AdaFed. The goal of AdaFed is to find an updating direction for the server along which (i) all the clients' loss functions are decreasing; and (ii) more importantly, the loss functions of the clients with larger loss values decrease at a higher rate. AdaFed adaptively tunes this common direction based on the values of the local gradients and loss functions. We validate the effectiveness of AdaFed on a suite of federated datasets and demonstrate that AdaFed outperforms state-of-the-art fair FL methods.
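The NumPy sketch below illustrates one plausible way to form such a common direction; the softmax-over-losses weighting, the temperature, and the alignment check are assumptions made for illustration, not the exact AdaFed tuning rule.

import numpy as np

def common_descent_direction(grads, losses, temperature=1.0):
    # Sketch of a common update direction in the spirit of AdaFed: a convex
    # combination of normalized client gradients whose weights grow with the
    # clients' loss values, so that higher-loss clients pull the update more.
    grads = np.stack([g / (np.linalg.norm(g) + 1e-12) for g in grads])
    weights = np.exp(np.asarray(losses, dtype=float) / temperature)
    weights = weights / weights.sum()
    d = weights @ grads
    # Desideratum (i): every client's loss decreases (to first order) along -d,
    # which requires <g_i, d> > 0 for all clients.
    all_decreasing = bool(np.all(grads @ d > 0.0))
    return d, all_decreasing

The server would then update the global model along -d with its usual step size; whenever the alignment check fails, the weights have to be re-tuned, which is where adaptive tuning of the kind performed by AdaFed comes in.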
Conditional Mutual Information Constrained Deep Learning for Classification
Yang, En-Hui, Hamidi, Shayan Mohajer, Ye, Linfeng, Tan, Renhao, Yang, Beverly
The concepts of conditional mutual information (CMI) and normalized conditional mutual information (NCMI) are introduced to measure the concentration and separation performance of a classification deep neural network (DNN) in the output probability distribution space of the DNN, where CMI and the ratio between CMI and NCMI represent the intra-class concentration and inter-class separation of the DNN, respectively. By using NCMI to evaluate popular DNNs pretrained over ImageNet in the literature, it is shown that their validation accuracies over the ImageNet validation data set are more or less inversely proportional to their NCMI values. Based on this observation, the standard deep learning (DL) framework is further modified to minimize the standard cross-entropy function subject to an NCMI constraint, yielding CMI-constrained deep learning (CMIC-DL). A novel alternating learning algorithm is proposed to solve such a constrained optimization problem. Extensive experimental results show that DNNs trained within CMIC-DL outperform state-of-the-art models trained within the standard DL framework and with other loss functions in the literature, in terms of both accuracy and robustness against adversarial attacks. In addition, visualizing the evolution of the learning process through the lens of CMI and NCMI is also advocated.
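Schematically, the CMIC-DL formulation can be written as the constrained problem below, where $\Gamma(\theta)$ denotes an inter-class separation measure so that $\mathrm{NCMI}=\mathrm{CMI}/\Gamma$; the penalized form and the notation are meant to illustrate the framework rather than reproduce the paper's exact equations:
\[
\min_{\theta}\;\mathrm{CE}(\theta)\quad\text{subject to}\quad
\mathrm{NCMI}(\theta)=\frac{I_{\theta}(X;\hat{Y}\mid Y)}{\Gamma(\theta)}\le r,
\]
which in practice can be handled through a penalized objective such as $\mathrm{CE}(\theta)+\lambda\,I_{\theta}(X;\hat{Y}\mid Y)-\beta\,\Gamma(\theta)$, with the alternating algorithm switching between updating the network parameters and updating the class-conditional centroid distributions used to evaluate the CMI and separation terms.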