

Label Smoothing is a Pragmatic Information Bottleneck

Kudo, Sota

arXiv.org Artificial Intelligence

This study revisits label smoothing through the lens of the information bottleneck. Under the assumptions of sufficient model flexibility and no conflicting labels for the same input, we demonstrate theoretically and experimentally that the model output obtained through label smoothing explores the optimal solutions of the information bottleneck. On this basis, label smoothing can be interpreted as a practical, simple-to-implement instance of the information bottleneck. As an information bottleneck method, we also show experimentally that label smoothing is insensitive to factors that carry no information about the target, or that provide no additional information about it when conditioned on another variable.
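For orientation, the label-smoothing operation the abstract builds on simply mixes the one-hot target with the uniform distribution over classes. A minimal sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def smooth_labels(one_hot, eps):
    """Mix a one-hot target with the uniform distribution over K classes."""
    k = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / k

# A 3-class one-hot label softened with eps = 0.1:
print(smooth_labels(np.array([0.0, 0.0, 1.0]), 0.1))
```

The result is still a valid probability distribution (it sums to 1), with eps controlling how far the target moves toward uniform.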


Flexible Variational Information Bottleneck: Achieving Diverse Compression with a Single Training

Kudo, Sota, Ono, Naoaki, Kanaya, Shigehiko, Huang, Ming

arXiv.org Artificial Intelligence

Information Bottleneck (IB) is a widely used framework for extracting the information about a target random variable contained in a source random variable. In its objective function, IB controls the trade-off between data compression and predictiveness through the Lagrange multiplier $\beta$. Traditionally, finding the desired trade-off requires searching over $\beta$ through multiple training cycles, which is computationally expensive. In this study, we introduce Flexible Variational Information Bottleneck (FVIB), an innovative framework for classification tasks that obtains optimal models for all values of $\beta$ with a single, computationally efficient training run. We theoretically demonstrate that, across all reasonable values of $\beta$, FVIB can simultaneously maximize an approximation of the objective function of Variational Information Bottleneck (VIB), the conventional IB method. We then empirically show that FVIB learns the VIB objective as effectively as VIB. Furthermore, in terms of calibration performance, FVIB outperforms other IB and calibration methods by enabling continuous optimization of $\beta$. Our code is available at https://github.com/sotakudo/fvib.
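The VIB objective that FVIB approximates combines a prediction term with a $\beta$-weighted rate term. A self-contained numerical sketch under the usual diagonal-Gaussian encoder assumption (names are illustrative; this is not the FVIB training procedure itself):

```python
import numpy as np

def vib_objective(logits, labels, mu, sigma, beta):
    """Cross-entropy plus beta * KL(q(z|x) || N(0, I)) for a
    diagonal-Gaussian encoder with per-sample parameters (mu, sigma)."""
    # numerically stable log-softmax cross-entropy
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    cross_entropy = -log_probs[np.arange(len(labels)), labels].mean()
    # KL between a diagonal Gaussian and the standard normal prior
    kl = 0.5 * (mu**2 + sigma**2 - 2 * np.log(sigma) - 1).sum(axis=1).mean()
    return cross_entropy + beta * kl
```

With mu = 0 and sigma = 1 the rate term vanishes and the objective reduces to the cross-entropy, which is a quick sanity check on the KL formula.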


Adversarial Information Bottleneck

Zhai, Penglong, Zhang, Shihua

arXiv.org Machine Learning

The information bottleneck (IB) principle has been adopted to explain deep learning in terms of information compression and prediction, which are balanced by a trade-off hyperparameter. How to optimize the IB principle for better robustness, and how to characterize the effects of compression as the trade-off hyperparameter varies, are two challenging problems. Previous methods attempted to optimize the IB principle by introducing random noise into the learned representation and achieved state-of-the-art performance in nuisance-information compression and semantic-information extraction. However, their performance at resisting adversarial perturbations is far less impressive. To this end, we propose an adversarial information bottleneck (AIB) method that makes no explicit assumptions about the underlying distribution of the representations and can be optimized effectively by solving a min-max optimization problem. Numerical experiments on synthetic and real-world datasets demonstrate its effectiveness at learning more invariant representations and mitigating adversarial perturbations compared to several competing IB methods. In addition, we analyse the adversarial robustness of diverse IB methods by contrasting it with their IB curves, and reveal that IB models whose hyperparameter $\beta$ corresponds to the knee point of the IB curve achieve the best trade-off between compression and prediction and the best robustness against various attacks.
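The knee-point selection mentioned at the end can be made concrete. Given a sampled IB curve, one common heuristic (an assumption here, not necessarily the paper's exact procedure) picks the point farthest from the chord joining the curve's endpoints:

```python
import numpy as np

def knee_point(i_xt, i_ty):
    """Index of the knee of a sampled IB curve (I(X;T), I(T;Y)),
    taken as the point of maximum perpendicular distance from the
    chord between the curve's endpoints."""
    p0 = np.array([i_xt[0], i_ty[0]])
    p1 = np.array([i_xt[-1], i_ty[-1]])
    chord = (p1 - p0) / np.linalg.norm(p1 - p0)
    rel = np.stack([i_xt, i_ty], axis=1) - p0
    # 2D cross product gives the perpendicular distance to the chord
    dist = np.abs(rel[:, 0] * chord[1] - rel[:, 1] * chord[0])
    return int(np.argmax(dist))
```

On a concave toy curve I(T;Y) = sqrt(I(X;T)) over [0, 1], the knee lands at I(X;T) = 0.25, where the curve's slope matches the chord's.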


The Convex Information Bottleneck Lagrangian

Gálvez, Borja Rodríguez, Thobaben, Ragnar, Skoglund, Mikael

arXiv.org Machine Learning

The information bottleneck (IB) problem tackles the issue of obtaining relevant compressed representations T of some random variable X for the task of predicting Y. It is defined as a constrained optimization problem which maximizes the information the representation has about the task, I(T;Y), while ensuring that a minimum level of compression r is achieved; i.e., I(X;T) <= r. For practical reasons, the problem is usually solved by maximizing the IB Lagrangian for many values of the Lagrange multiplier, thereby drawing the IB curve (i.e., the curve of maximal I(T;Y) for a given I(X;T)) and selecting the representation with the desired predictability and compression. It is known that when Y is a deterministic function of X, the IB curve cannot be explored this way, and other Lagrangians have been proposed to tackle this problem; e.g., the squared IB Lagrangian. In this paper, we (i) present a general family of Lagrangians which allow for the exploration of the IB curve in all scenarios; and (ii) prove that if these Lagrangians are used, there is a known one-to-one mapping between the Lagrange multiplier and the desired compression rate r for known IB curve shapes, hence freeing us from the burden of solving the optimization problem for many values of the Lagrange multiplier. That is, we can solve the original constrained problem with a single optimization.
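The gain from such Lagrangian families is easy to see on the deterministic-case IB curve min(r, H(Y)): with the linear Lagrangian, every multiplier below 1 selects the same corner point, while a strictly convex penalty such as the squared one selects a different point for each multiplier. A toy sketch (H(Y) = 1 and the grid are assumptions for illustration):

```python
import numpy as np

H_Y = 1.0                              # entropy of the deterministic target
r = np.linspace(0, 2, 2001)            # candidate compression levels I(X;T)
ib_curve = np.minimum(r, H_Y)          # IB curve when Y = f(X)

def selected_r(beta, h):
    """Compression level maximizing I(T;Y) - beta * h(I(X;T))."""
    return r[np.argmax(ib_curve - beta * h(r))]

linear = lambda u: u
squared = lambda u: u**2
# Linear Lagrangian: beta = 0.3 and beta = 0.7 both land on the corner r = 1.
# Squared Lagrangian: beta = 0.8 selects the interior point r = 1/(2*beta).
print(selected_r(0.3, linear), selected_r(0.7, linear), selected_r(0.8, squared))
```

Varying beta under the squared penalty sweeps out the whole rising part of the curve, which is exactly the exploration the linear Lagrangian cannot provide in the deterministic case.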


Pathologies in information bottleneck for deterministic supervised learning

Kolchinsky, Artemy, Tracey, Brendan D., Van Kuyk, Steven

arXiv.org Machine Learning

Information bottleneck (IB) is a method for extracting information from one random variable X that is relevant for predicting another random variable Y. To do so, IB identifies an intermediate "bottleneck" variable T that has low mutual information I(X; T) and high mutual information I(Y; T). The IB curve characterizes the set of bottleneck variables that achieve maximal I(Y; T) for a given I(X; T), and is typically explored by optimizing the IB Lagrangian, I(Y; T) - βI(X; T). Recently, there has been interest in applying IB to supervised learning, particularly for classification problems that use neural networks. In most classification problems, the output class Y is a deterministic function of the input X, which we refer to as "deterministic supervised learning". We demonstrate three pathologies that arise when IB is used in any scenario where Y is a deterministic function of X: (1) the IB curve cannot be recovered by optimizing the IB Lagrangian for different values of β; (2) there are "uninteresting" solutions at all points of the IB curve; and (3) for classifiers that achieve low error rates, the activity of different hidden layers will not exhibit a strict tradeoff between compression and prediction, contrary to a recent proposal. To address problem (1), we propose a functional that, unlike the IB Lagrangian, can recover the IB curve in all cases. We finish by demonstrating these issues on the MNIST dataset. The information bottleneck (IB) method (Tishby et al., 1999) provides a principled way to extract information that is present in one variable that is relevant for predicting another variable. Given two random variables X and Y, IB posits a "bottleneck" variable T that obeys the Markov condition Y - X - T. By the data processing inequality (DPI) (Cover & Thomas, 2012), this Markov condition implies that I(X; T) >= I(Y; T), meaning the bottleneck variable cannot contain more information about Y than it does about X.
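The data processing inequality invoked in the last sentence can be checked numerically on a small deterministic example (the specific distributions below are illustrative):

```python
import numpy as np

def mutual_info(pab):
    """I(A;B) in nats from a joint probability table pab[a, b]."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float((pab[mask] * np.log(pab[mask] / (pa @ pb)[mask])).sum())

n = 4
p_x = np.full(n, 1 / n)                 # uniform X over 4 values
y_of_x = np.arange(n) % 2               # deterministic Y = f(X)
channel = 0.9 * np.eye(n) + 0.1 / n     # T: noisy copy of X (rows sum to 1)
p_xt = p_x[:, None] * channel           # joint p(x, t)
p_yt = np.zeros((2, n))
for x in range(n):
    p_yt[y_of_x[x]] += p_xt[x]          # marginalize x out using Y = f(X)

# Markov chain Y - X - T implies I(X;T) >= I(Y;T):
print(mutual_info(p_xt), mutual_info(p_yt))
```

Here T was built from X alone, so the chain Y - X - T holds by construction and I(X; T) dominates I(Y; T), as the DPI requires.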


Gaussian Lower Bound for the Information Bottleneck Limit

Painsky, Amichai, Tishby, Naftali

arXiv.org Machine Learning

The Information Bottleneck (IB) is a conceptual method for extracting the most compact, yet informative, representation of a set of variables with respect to a target. It generalizes the notion of minimal sufficient statistics from classical parametric statistics to a broader information-theoretic sense. The IB curve defines the optimal trade-off between representation complexity and predictive power. Specifically, it is obtained by minimizing the mutual information (MI) between the representation and the original variables, subject to a minimal level of MI between the representation and the target. This problem is NP-hard in general. One important exception is the multivariate Gaussian case, for which the Gaussian IB (GIB) admits an analytical closed-form solution, similar to Canonical Correlation Analysis (CCA). In this work we introduce a Gaussian lower bound to the IB curve; we find an embedding of the data which maximizes its "Gaussian part", on which we apply the GIB. This embedding provides an efficient (and practical) representation of an arbitrary dataset (in the IB sense), which in addition holds the favorable properties of a Gaussian distribution. Importantly, we show that the optimal Gaussian embedding is bounded from above by non-linear CCA. This establishes a fundamental limit on our ability to Gaussianize arbitrary datasets and solve complex problems by linear methods.
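Since the GIB solution is tied to CCA, the linear machinery involved can be sketched compactly: the canonical correlations are the singular values of the whitened cross-covariance matrix. A standard CCA computation (not the paper's embedding procedure; the regularizer eps is an assumption for numerical stability):

```python
import numpy as np

def canonical_correlations(x, y, eps=1e-8):
    """Canonical correlations between samples x (n, p) and y (n, q):
    singular values of the whitened cross-covariance matrix."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    cxx = x.T @ x / len(x) + eps * np.eye(x.shape[1])
    cyy = y.T @ y / len(y) + eps * np.eye(y.shape[1])
    cxy = x.T @ y / len(x)

    def inv_sqrt(c):
        # inverse matrix square root via the symmetric eigendecomposition
        w, v = np.linalg.eigh(c)
        return v @ np.diag(w ** -0.5) @ v.T

    return np.linalg.svd(inv_sqrt(cxx) @ cxy @ inv_sqrt(cyy),
                         compute_uv=False)
```

When y is an invertible linear transform of x, all canonical correlations equal 1; for independent data they shrink toward 0 as the sample size grows.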