Nielsen, Frank
Fast proxy centers for Jeffreys centroids: The Jeffreys-Fisher-Rao and the inductive Gauss-Bregman centers
Nielsen, Frank
The symmetric Kullback-Leibler centroid also called the Jeffreys centroid of a set of mutually absolutely continuous probability distributions on a measure space provides a notion of centrality which has proven useful in many tasks including information retrieval, information fusion, and clustering in image, video and sound processing. However, the Jeffreys centroid is not available in closed-form for sets of categorical or normal distributions, two widely used statistical models, and thus need to be approximated numerically in practice. In this paper, we first propose the new Jeffreys-Fisher-Rao center defined as the Fisher-Rao midpoint of the sided Kullback-Leibler centroids as a plug-in replacement of the Jeffreys centroid. This Jeffreys-Fisher-Rao center admits a generic formula for uni-parameter exponential family distributions, and closed-form formula for categorical and normal distributions, matches exactly the Jeffreys centroid for same-mean normal distributions, and is experimentally observed in practice to be close to the Jeffreys centroid. Second, we define a new type of inductive centers generalizing the principle of Gauss arithmetic-geometric double sequence mean for pairs of densities of any given exponential family. This center is shown experimentally to approximate very well the Jeffreys centroid and is suggested to use when the Jeffreys-Fisher-Rao center is not available in closed form. Moreover, this Gauss-Bregman inductive center always converges and matches the Jeffreys centroid for sets of same-mean normal distributions. We report on our experiments demonstrating the use of the Jeffreys-Fisher-Rao and Gauss-Bregman centers instead of the Jeffreys centroid. Finally, we conclude this work by reinterpreting these fast proxy centers of Jeffreys centroids under the lens of dually flat spaces in information geometry.
pyBregMan: A Python library for Bregman Manifolds
Nielsen, Frank, Soen, Alexander
A Bregman manifold is a synonym for a dually flat space in information geometry which admits as a canonical divergence a Bregman divergence. Bregman manifolds are induced by smooth strictly convex functions like the cumulant or partition functions of regular exponential families, the negative entropy of mixture families, or the characteristic functions of regular cones just to list a few such convex Bregman generators. We describe the design of pyBregMan, a library which implements generic operations on Bregman manifolds and instantiate several common Bregman manifolds used in information sciences. At the core of the library is the notion of Legendre-Fenchel duality inducing a canonical pair of dual potential functions and dual Bregman divergences. The library also implements the Fisher-Rao manifolds of categorical/multinomial distributions and multivariate normal distributions. To demonstrate the use of the pyBregMan kernel manipulating those Bregman and Fisher-Rao manifolds, the library also provides several core algorithms for various applications in statistics, machine learning, information fusion, and so on.
A Rate-Distortion View of Uncertainty Quantification
Apostolopoulou, Ifigeneia, Eysenbach, Benjamin, Nielsen, Frank, Dubrawski, Artur
In supervised learning, understanding an input's proximity to the training data can help a model decide whether it has sufficient evidence for reaching a reliable prediction. While powerful probabilistic models such as Gaussian Processes naturally have this property, deep neural networks often lack it. In this paper, we introduce Distance Aware Bottleneck (DAB), i.e., a new method for enriching deep neural networks with this property. Building on prior information bottleneck approaches, our method learns a codebook that stores a compressed representation of all inputs seen during training. The distance of a new example from this codebook can serve as an uncertainty estimate for the example. The resulting model is simple to train and provides deterministic uncertainty estimates by a single forward pass. Finally, our method achieves better out-of-distribution (OOD) detection and misclassification prediction than prior methods, including expensive ensemble methods, deep kernel Gaussian Processes, and approaches based on the standard information bottleneck.
Tempered Calculus for ML: Application to Hyperbolic Model Embedding
Nock, Richard, Amid, Ehsan, Nielsen, Frank, Soen, Alexander, Warmuth, Manfred K.
Most mathematical distortions used in ML are fundamentally integral in nature: $f$-divergences, Bregman divergences, (regularized) optimal transport distances, integral probability metrics, geodesic distances, etc. In this paper, we unveil a grounded theory and tools which can help improve these distortions to better cope with ML requirements. We start with a generalization of Riemann integration that also encapsulates functions that are not strictly additive but are, more generally, $t$-additive, as in nonextensive statistical mechanics. Notably, this recovers Volterra's product integral as a special case. We then generalize the Fundamental Theorem of calculus using an extension of the (Euclidean) derivative. This, along with a series of more specific Theorems, serves as a basis for results showing how one can specifically design, alter, or change fundamental properties of distortion measures in a simple way, with a special emphasis on geometric- and ML-related properties that are the metricity, hyperbolicity, and encoding. We show how to apply it to a problem that has recently gained traction in ML: hyperbolic embeddings with a "cheap" and accurate encoding along the hyperbolic vs Euclidean scale. We unveil a new application for which the Poincar\'e disk model has very appealing features, and our theory comes in handy: \textit{model} embeddings for boosted combinations of decision trees, trained using the log-loss (trees) and logistic loss (combinations).
Divergences induced by dual subtractive and divisive normalizations of exponential families and their convex deformations
Nielsen, Frank
Exponential families are statistical models which are the workhorses in statistics, information theory, and machine learning among others. An exponential family can either be normalized subtractively by its cumulant or free energy function or equivalently normalized divisively by its partition function. Both subtractive and divisive normalizers are strictly convex and smooth functions inducing pairs of Bregman and Jensen divergences. It is well-known that skewed Bhattacharryya distances between probability densities of an exponential family amounts to skewed Jensen divergences induced by the cumulant function between their corresponding natural parameters, and in limit cases that the sided Kullback-Leibler divergences amount to reverse-sided Bregman divergences. In this paper, we first show that the $\alpha$-divergences between unnormalized densities of an exponential family amounts to scaled $\alpha$-skewed Jensen divergences induced by the partition function. We then show how comparative convexity with respect to a pair of quasi-arithmetic means allows to deform both convex functions and their arguments, and thereby define dually flat spaces with corresponding divergences when ordinary convexity is preserved.
Simplifying Momentum-based Positive-definite Submanifold Optimization with Applications to Deep Learning
Lin, Wu, Duruisseaux, Valentin, Leok, Melvin, Nielsen, Frank, Khan, Mohammad Emtiyaz, Schmidt, Mark
Riemannian submanifold optimization with momentum is computationally challenging because, to ensure that the iterates remain on the submanifold, we often need to solve difficult differential equations. Here, we simplify such difficulties for a class of sparse or structured symmetric positive-definite matrices with the affine-invariant metric. We do so by proposing a generalized version of the Riemannian normal coordinates that dynamically orthonormalizes the metric and locally converts the problem into an unconstrained problem in the Euclidean space. We use our approach to simplify existing approaches for structured covariances and develop matrix-inverse-free $2^\text{nd}$-order optimizers for deep learning with low precision by using only matrix multiplications. Code: https://github.com/yorkerlin/StructuredNGD-DL
The Tempered Hilbert Simplex Distance and Its Application To Non-linear Embeddings of TEMs
Amid, Ehsan, Nielsen, Frank, Nock, Richard, Warmuth, Manfred K.
Tempered Exponential Measures (TEMs) are a parametric generalization of the exponential family of distributions maximizing the tempered entropy function among positive measures subject to a probability normalization of their power densities. Calculus on TEMs relies on a deformed algebra of arithmetic operators induced by the deformed logarithms used to define the tempered entropy. In this work, we introduce three different parameterizations of finite discrete TEMs via Legendre functions of the negative tempered entropy function. In particular, we establish an isometry between such parameterizations in terms of a generalization of the Hilbert log cross-ratio simplex distance to a tempered Hilbert co-simplex distance. Similar to the Hilbert geometry, the tempered Hilbert distance is characterized as a $t$-symmetrization of the oriented tempered Funk distance. We motivate our construction by introducing the notion of $t$-lengths of smooth curves in a tautological Finsler manifold. We then demonstrate the properties of our generalized structure in different settings and numerically examine the quality of its differentiable approximations for optimization in machine learning settings.
Quasi-Arithmetic Mixtures, Divergence Minimization, and Bregman Information
Brekelmans, Rob, Nielsen, Frank
Markov Chain Monte Carlo methods for sampling from complex distributions and estimating normalization constants often simulate samples from a sequence of intermediate distributions along an annealing path, which bridges between a tractable initial distribution and a target density of interest. Prior work has constructed annealing paths using quasi-arithmetic means, and interpreted the resulting intermediate densities as minimizing an expected divergence to the endpoints. We provide a comprehensive analysis of this 'centroid' property using Bregman divergences under a monotonic embedding of the density function, thereby associating common divergences such as Amari's and Renyi's ${\alpha}$-divergences, ${(\alpha,\beta)}$-divergences, and the Jensen-Shannon divergence with intermediate densities along an annealing path. Our analysis highlights the interplay between parametric families, quasi-arithmetic means, and divergence functions using the rho-tau Bregman divergence framework of Zhang 2004,2013.
Non-linear Embeddings in Hilbert Simplex Geometry
Nielsen, Frank, Sun, Ke
A key technique of machine learning and computer vision is to embed discrete weighted graphs into continuous spaces for further downstream processing. Embedding discrete hierarchical structures in hyperbolic geometry has proven very successful since it was shown that any weighted tree can be embedded in that geometry with arbitrary low distortion. Various optimization methods for hyperbolic embeddings based on common models of hyperbolic geometry have been studied. In this paper, we consider Hilbert geometry for the standard simplex which is isometric to a vector space equipped with the variation polytope norm. We study the representation power of this Hilbert simplex geometry by embedding distance matrices of graphs. Our findings demonstrate that Hilbert simplex geometry is competitive to alternative geometries such as the Poincar\'e hyperbolic ball or the Euclidean geometry for embedding tasks while being fast and numerically robust.
Fisher-Rao distance and pullback SPD cone distances between multivariate normal distributions
Nielsen, Frank
Data sets of multivariate normal distributions abound in many scientific areas like diffusion tensor imaging, structure tensor computer vision, radar signal processing, machine learning, just to name a few. In order to process those normal data sets for downstream tasks like filtering, classification or clustering, one needs to define proper notions of dissimilarities between normals and paths joining them. The Fisher-Rao distance defined as the Riemannian geodesic distance induced by the Fisher information metric is such a principled metric distance which however is not known in closed-form excepts for a few particular cases. In this work, we first report a fast and robust method to approximate arbitrarily finely the Fisher-Rao distance between multivariate normal distributions. Second, we introduce a class of distances based on diffeomorphic embeddings of the normal manifold into a submanifold of the higher-dimensional symmetric positive-definite cone corresponding to the manifold of centered normal distributions. We show that the projective Hilbert distance on the cone yields a metric on the embedded normal submanifold and we pullback that cone distance with its associated straight line Hilbert cone geodesics to obtain a distance and smooth paths between normal distributions. Compared to the Fisher-Rao distance approximation, the pullback Hilbert cone distance is computationally light since it requires to compute only the extreme minimal and maximal eigenvalues of matrices. Finally, we show how to use those distances in clustering tasks.