Optimization
Orthogonal Multi-view Analysis by Successive Approximations via Eigenvectors
Wang, Li, Zhang, Leihong, Shen, Chungen, Li, Ren-cang
We propose a unified framework for multi-view subspace learning to learn individual orthogonal projections for all views. The framework integrates the correlations within multiple views, supervised discriminant capacity, and distance preservation in a concise and compact way. It not only includes several existing models as special cases, but also inspires new novel models. To demonstrate its versatility to handle different learning scenarios, we showcase three new multi-view discriminant analysis models and two new multi-view multi-label classification ones under this framework. An efficient numerical method based on successive approximations via eigenvectors is presented to solve the associated optimization problem. The method is built upon an iterative Krylov subspace method which can easily scale up for high-dimensional datasets. Extensive experiments are conducted on various real-world datasets for multi-view discriminant analysis and multi-view multi-label classification. The experimental results demonstrate that the proposed models are consistently competitive to and often better than the compared methods that do not learn orthogonal projections.
Quickly Finding a Benign Region via Heavy Ball Momentum in Non-Convex Optimization
Wang, Jun-Kun, Abernethy, Jacob
The Heavy Ball Method (Polyak, 1964), proposed by Polyak over five decades ago, is a first-order method for optimizing continuous functions. While its stochastic counterpart has proven extremely popular in training deep networks, there are almost no known functions where deterministic Heavy Ball is provably faster than the simple and classical gradient descent algorithm in non-convex optimization. The success of Heavy Ball has thus far eluded theoretical understanding. Our goal is to address this gap, and in the present work we identify two non-convex problems where we provably show that the Heavy Ball momentum helps the iterate to enter a benign region that contains a global optimal point faster. We show that Heavy Ball exhibits simple dynamics that clearly reveal the benefit of using a larger value of momentum parameter for the problems. The first of these optimization problems is the phase retrieval problem, which has useful applications in physical science. The second of these optimization problems is the cubic-regularized minimization, a critical subroutine required by Nesterov-Polyak cubic-regularized method (Nesterov & Polyak (2006)) to find second-order stationary points in general smooth non-convex problems. Poylak's Heavy Ball method (Polyak (1964)) has been very popular in modern non-convex optimization and deep learning, and the stochastic version (a.k.a. SGD with momentum) has become the de facto algorithm for training neural nets.
Sharpness-Aware Minimization for Efficiently Improving Generalization
Foret, Pierre, Kleiner, Ariel, Mobahi, Hossein, Neyshabur, Behnam
In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by the connection between geometry of the loss landscape and generalization---including a generalization bound that we prove here---we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently. We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-{10, 100}, ImageNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels.
Policy Gradient with Expected Quadratic Utility Maximization: A New Mean-Variance Approach in Reinforcement Learning
Reinforcement learning (RL) and planning in Markov decision processes (MDPs) is one type of dynamic decisionmaking problem (Puterman, 1994; Bertsekas & Tsitsiklis, 1996; sut, 1998). While the typical objective is to maximize the expected cumulative reward, risk-aware decision-making has attracted attention in real-world applications, such as finance, robotics, and playing games (Geibel & Wysotzki, 2005; García & Fernández, 2015). The notion of risk in RL is related to the fact that even an optimal policy may perform poorly in some cases owing to the stochastic nature of the problem. To capture the risk, various criteria have been proposed, such as Value at Risk (Luenberger, 1998; Chow & Ghavamzadeh, 2014; Chow et al., 2017) and variance (Markowitz, 1952; Markowitz et al., 2000; Tamar et al., 2012; L.A. & Ghavamzadeh, 2013). Among them, we focus on the mean-variance tradeoff in RL problems. Typical mean-variance RL (MVRL) methods attempt to maximize the expected cumulative reward while maintaining the variance threshold (Tamar et al., 2012; L.A. & Ghavamzadeh, 2013; Prashanth & Ghavamzadeh, 2016; Xie et al., 2018; Bisi et al., 2020; Zhang et al., 2020). However, most existing MVRL methods suffer from high computational costs owing to the double sampling issue when approximating the gradient of the variance term (Tamar et al., 2012; L.A. & Ghavamzadeh, 2013; Prashanth & Ghavamzadeh, 2016). To avoid the double sampling issue, Xie et al. (2018) proposed a method based on the Legendre-Fenchel duality (Boyd & Vandenberghe, 2004). Although the method does not suffer from the double sampling issue, we cannot apply a standard policy gradient method and must use a coordinate descent algorithm.
Expectigrad: Fast Stochastic Optimization with Robust Convergence Properties
Daley, Brett, Amato, Christopher
Many popular adaptive gradient methods such as Adam and RMSProp rely on an exponential moving average (EMA) to normalize their stepsizes. While the EMA makes these methods highly responsive to new gradient information, recent research has shown that it also causes divergence on at least one convex optimization problem. We propose a novel method called Expectigrad, which adjusts stepsizes according to a per-component unweighted mean of all historical gradients and computes a bias-corrected momentum term jointly between the numerator and denominator. We prove that Expectigrad cannot diverge on every instance of the optimization problem known to cause Adam to diverge. We also establish a regret bound in the general stochastic nonconvex setting that suggests Expectigrad is less susceptible to gradient variance than existing methods are. Testing Expectigrad on several high-dimensional machine learning tasks, we find it often performs favorably to state-of-the-art methods with little hyperparameter tuning. Efficiently training deep neural networks has proven crucial for achieving state-of-the-art results in machine learning (e.g. At the core of these successes lies the backpropagation algorithm (Rumelhart et al., 1986), which provides a general procedure for computing the gradient of a loss measure with respect to the parameters of an arbitrary network. Because exact gradient computation over an entire dataset is expensive, training is often conducted using randomly sampled minibatches of data instead. Consequently, training can be modeled as a stochastic optimization problem where the loss is minimized in expectation.
A Scalable, Adaptive and Sound Nonconvex Regularizer for Low-rank Matrix Completion
Yao, Quanming, Wang, Yaqing, Kwok, James T.
Low-rank matrix completion recovers incomplete matrix using low-rank assumptions, which is popularly used in many applications. A recent trend is to use nonconvex regularizers that adaptively penalize singular values. They offer good recovery performance and have nice theoretical properties, but are computationally expensive due to repeated access to individual singular values. In this paper, based on the key insight that adaptive shrinkage on singular values improve empirical performance, we propose a new nonconvex low-rank regularizer called "nuclear norm minus Frobenius norm" regularizer, which is scalable, adaptive and sound. We first show it provably holds the adaptive shrinkage property. Further, we discover its factored form which bypasses the computation of singular values and allows fast optimization by general optimization algorithms. Stable recovery and convergence are guaranteed. Extensive experiments on both synthetic and real-world data sets show that the proposed method obtains state-of-the-art performance while being the fastest in comparison to existing methods.
The energy distance for ensemble and scenario reduction
Scenario reduction techniques are widely applied for solving sophisticated dynamic and stochastic programs, especially in energy and power systems, but also used in probabilistic forecasting, clustering and estimating generative adversarial networks (GANs). We propose a new method for ensemble and scenario reduction based on the energy distance which is a special case of the maximum mean discrepancy (MMD). We discuss the choice of energy distance in detail, especially in comparison to the popular Wasserstein distance which is dominating the scenario reduction literature. The energy distance is a metric between probability measures that allows for powerful tests for equality of arbitrary multivariate distributions or independence. Thanks to the latter, it is a suitable candidate for ensemble and scenario reduction problems. The theoretical properties and considered examples indicate clearly that the reduced scenario sets tend to exhibit better statistical properties for the energy distance than a corresponding reduction with respect to the Wasserstein distance. We show applications to a Bernoulli random walk and two real data based examples for electricity demand profiles and day-ahead electricity prices.
Bayesian Optimization with Output-Weighted Optimal Sampling
Blanchard, Antoine, Sapsis, Themistoklis
In Bayesian optimization, accounting for the importance of the output relative to the input is a crucial yet challenging exercise, as it can considerably improve the final result but often involves inaccurate and cumbersome entropy estimations. We approach the problem from the perspective of importance-sampling theory, and advocate the use of the likelihood ratio to guide the search algorithm towards regions of the input space where the objective function to be minimized assumes abnormally small values. The likelihood ratio acts as a sampling weight and can be computed at each iteration without severely deteriorating the overall efficiency of the algorithm. In particular, it can be approximated in a way that makes the approach tractable in high dimensions. The "likelihood-weighted" acquisition functions introduced in this work are found to outperform their unweighted counterparts in a number of applications.
Non-convex Optimization via Adaptive Stochastic Search for End-to-End Learning and Control
Exarchos, Ioannis, Pereira, Marcus A., Wang, Ziyi, Theodorou, Evangelos A.
In this work we propose the use of adaptive stochastic search as a building block for general, non-convex optimization operations within deep neural network architectures. Specifically, for an objective function located at some layer in the network and parameterized by some network parameters, we employ adaptive stochastic search to perform optimization over its output. This operation is differentiable and does not obstruct the passing of gradients during backpropagation, thus enabling us to incorporate it as a component in end-to-end learning. We study the proposed optimization module's properties and benchmark it against two existing alternatives on a synthetic energy-based structured prediction task, and further showcase its use in stochastic optimal control applications.
WeMix: How to Better Utilize Data Augmentation
Xu, Yi, Noy, Asaf, Lin, Ming, Qian, Qi, Li, Hao, Jin, Rong
Data augmentation is a widely used training trick in deep learning to improve the network generalization ability. Despite many encouraging results, several recent studies did point out limitations of the conventional data augmentation scheme in certain scenarios, calling for a better theoretical understanding of data augmentation. In this work, we develop a comprehensive analysis that reveals pros and cons of data augmentation. The main limitation of data augmentation arises from the data bias, i.e. the augmented data distribution can be quite different from the original one. This data bias leads to a suboptimal performance of existing data augmentation methods. To this end, we develop two novel algorithms, termed "AugDrop" and "MixLoss", to correct the data bias in the data augmentation. Our theoretical analysis shows that both algorithms are guaranteed to improve the effect of data augmentation through the bias correction, which is further validated by our empirical studies. Finally, we propose a generic algorithm "WeMix" by combining AugDrop and MixLoss, whose effectiveness is observed from extensive empirical evaluations.