Masiha, Saeed
f-divergences and their applications in lossy compression and bounding generalization error
Masiha, Saeed, Gohari, Amin, Yassaee, Mohammad Hossein
In this paper, we provide three applications of $f$-divergences: (i) we introduce a Sanov-type upper bound on the tail probability of the sum of independent random variables based on super-modular $f$-divergences and show that this generalized Sanov bound strictly improves over the ordinary one; (ii) we consider the lossy compression problem, which studies the set of achievable rates for a given distortion and code length: we extend the rate-distortion function using mutual $f$-information and provide new and strictly better bounds on achievable rates in the finite blocklength regime using super-modular $f$-divergences; and (iii) we provide a connection between the generalization error of algorithms with bounded input/output mutual $f$-information and a generalized rate-distortion problem. This connection allows us to bound the generalization error of learning algorithms using lower bounds on the $f$-rate-distortion function. Our bound is based on a new lower bound on the rate-distortion function that (for some examples) strictly improves over the previously best-known bounds.
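For readers unfamiliar with the central object, the sketch below computes the standard $f$-divergence $D_f(P\|Q)=\sum_x q(x)\,f\big(p(x)/q(x)\big)$ between two finite distributions. It only illustrates the textbook definition; the super-modularity condition and the bounds from the paper are not reproduced here. The function names and the example distributions are illustrative assumptions, not taken from the paper.

```python
import math
from typing import Callable, Sequence


def f_divergence(p: Sequence[float], q: Sequence[float],
                 f: Callable[[float], float]) -> float:
    """Standard f-divergence D_f(P||Q) = sum_x q(x) * f(p(x) / q(x)).

    Assumes (for simplicity) that q(x) > 0 for every x.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same support size")
    if any(qx <= 0 for qx in q):
        raise ValueError("this sketch requires q(x) > 0 everywhere")
    return sum(qx * f(px / qx) for px, qx in zip(p, q))


def kl_generator(t: float) -> float:
    """f(t) = t log t, which yields the KL divergence (with 0 log 0 = 0)."""
    return t * math.log(t) if t > 0 else 0.0


def chi2_generator(t: float) -> float:
    """f(t) = (t - 1)^2, which yields the chi-squared divergence."""
    return (t - 1.0) ** 2


# Hypothetical example distributions on a two-point alphabet.
p = [0.5, 0.5]
q = [0.25, 0.75]
kl = f_divergence(p, q, kl_generator)
chi2 = f_divergence(p, q, chi2_generator)
print(f"KL(P||Q) = {kl:.6f}, chi^2(P||Q) = {chi2:.6f}")
```

Both generators are convex with $f(1)=0$, so each divergence vanishes when $P=Q$.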
Stochastic Second-Order Methods Improve Best-Known Sample Complexity of SGD for Gradient-Dominated Function
Masiha, Saeed, Salehkaleybar, Saber, He, Niao, Kiyavash, Negar, Thiran, Patrick
We study the performance of Stochastic Cubic Regularized Newton (SCRN) on a class of functions satisfying the gradient dominance property with $1\le\alpha\le2$, which holds in a wide range of applications in machine learning and signal processing. This condition ensures that any first-order stationary point is a global optimum. We prove that the total sample complexity of SCRN in achieving an $\epsilon$-global optimum is $\mathcal{O}(\epsilon^{-7/(2\alpha)+1})$ for $1\le\alpha<3/2$ and $\tilde{\mathcal{O}}(\epsilon^{-2/\alpha})$ for $3/2\le\alpha\le2$. SCRN improves the best-known sample complexity of stochastic gradient descent. Even under a weak version of the gradient dominance property, which is applicable to policy-based reinforcement learning (RL), SCRN achieves the same improvement over stochastic policy gradient methods. Additionally, we show that the average sample complexity of SCRN can be reduced to $\mathcal{O}(\epsilon^{-2})$ for $\alpha=1$ using a variance reduction method with time-varying batch sizes. Experimental results in various RL settings showcase the remarkable performance of SCRN compared to first-order methods.
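To make the two regimes of the stated rate concrete, the sketch below evaluates the exponent $k$ in the sample complexity $\epsilon^{-k}$ as a function of $\alpha$, using only the rates quoted in the abstract (logarithmic factors and constants are ignored). The function name is a hypothetical helper, not something defined in the paper.

```python
def scrn_sample_complexity_exponent(alpha: float) -> float:
    """Return k such that the stated SCRN sample complexity is eps^(-k).

    Uses the rates quoted in the abstract, ignoring log factors:
      1 <= alpha < 3/2  ->  k = 7 / (2 * alpha) - 1
      3/2 <= alpha <= 2 ->  k = 2 / alpha
    """
    if not 1.0 <= alpha <= 2.0:
        raise ValueError("alpha must lie in [1, 2]")
    if alpha < 1.5:
        return 7.0 / (2.0 * alpha) - 1.0
    return 2.0 / alpha


for a in (1.0, 1.25, 1.5, 2.0):
    print(f"alpha = {a}: eps^(-{scrn_sample_complexity_exponent(a):.4f})")
```

The two branches agree at $\alpha=3/2$, where both give $k=4/3$. At $\alpha=1$ the exponent is $5/2$, which is the case the variance-reduced variant lowers to $2$.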