Goto

Collaborating Authors

 Optimization


Adaptive Sampling for Stochastic Risk-Averse Learning

arXiv.org Machine Learning

We consider the problem of training machine learning models in a risk-averse manner. In particular, we propose an adaptive sampling algorithm for stochastically optimizing the Conditional Value-at-Risk (CVaR) of a loss distribution. We use a distributionally robust formulation of the CVaR to phrase the problem as a zero-sum game between two players. Our approach solves the game using an efficient no-regret algorithm for each player. Critically, we can apply these algorithms to large-scale settings because the implementation relies on sampling from Determinantal Point Processes. Finally, we empirically demonstrate its effectiveness on large-scale convex and non-convex learning tasks.


Prior specification via prior predictive matching: Poisson matrix factorization and beyond

arXiv.org Machine Learning

Hyperparameter optimization for machine learning models is typically carried out by some sort of cross-validation procedure or global optimization, both of which require running the learning algorithm numerous times. We show that for Bayesian hierarchical models there is an appealing alternative that allows selecting good hyperparameters without learning the model parameters during the process at all, facilitated by the prior predictive distribution that marginalizes out the model parameters. We propose an approach that matches suitable statistics of the prior predictive distribution with ones provided by an expert and apply the general concept for matrix factorization models. For some Poisson matrix factorization models we can analytically obtain exact hyperparameters, including the number of factors, and for more complex models we propose a model-independent optimization procedure.


Structured Low-Rank Algorithms: Theory, MR Applications, and Links to Machine Learning

arXiv.org Machine Learning

In this survey, we provide a detailed review of recent advances in the recovery of continuous domain multidimensional signals from their few nonuniform (multichannel) measurements using structured low-rank matrix completion formulation. This framework is centered on the fundamental duality between the compactness (e.g., sparsity) of the continuous signal and the rank of a structured matrix, whose entries are functions of the signal. This property enables the reformulation of the signal recovery as a low-rank structured matrix completion, which comes with performance guarantees. We will also review fast algorithms that are comparable in complexity to current compressed sensing methods, which enables the application of the framework to large-scale magnetic resonance (MR) recovery problems. The remarkable flexibility of the formulation can be used to exploit signal properties that are difficult to capture by current sparse and low-rank optimization strategies. We demonstrate the utility of the framework in a wide range of MR imaging (MRI) applications, including highly accelerated imaging, calibration-free acquisition, MR artifact correction, and ungated dynamic MRI. The slow nature of signal acquisition in magnetic resonance imaging (MRI), where the image is formed from a sequence of Fourier samples, often restricts the achievable spatial and temporal resolution in multidimensional static and dynamic imaging applications. Discrete compressed sensing (CS) methods provided a major breakthrough to accelerate the magnetic resonance (MR) signal acquisition by reducing the sampling burden. As described in an introductory article in this special issue [1] these algorithms exploited the sparsity of the discrete signal in a transform domain to recover the images from a few measurements. In this paper, we review a continuous domain extension of CS using a structured low-rank (SLR) framework for the recovery of an image or a series of images from a few measurements using various compactness assumptions [2]-[22]. The general strategy of the SLR framework starts with defining a lifting operation to construct a structured matrix, whose entries are functions of the signal samples. The SLR algorithms exploit the dual relationships between the signal compactness properties (e.g. This dual relationship allows recovery of the signal from a few samples in the measurement domain as an SLR optimization problem. MJ and MM are with the University of Iowa, Iowa City, IA 52242 (emails: mathews-jacob@uiowa.edu,merry-mani@uiowa.edu). JCY is with the Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea (email: jong.ye@kaist.ac.kr).


Convergent Policy Optimization for Safe Reinforcement Learning

arXiv.org Machine Learning

We study the safe reinforcement learning problem with nonlinear function approximation, where policy optimization is formulated as a constrained optimization problem with both the objective and the constraint being nonconvex functions. For such a problem, we construct a sequence of surrogate convex constrained optimization problems by replacing the nonconvex functions locally with convex quadratic functions obtained from policy gradient estimators. We prove that the solutions to these surrogate problems converge to a stationary point of the original nonconvex problem. Furthermore, to extend our theoretical results, we apply our algorithm to examples of optimal control and multi-agent reinforcement learning with safety constraints.


Tensor Q-Rank: A New Data Dependent Tensor Rank

arXiv.org Machine Learning

Recently, the \textit{Tensor Nuclear Norm~(TNN)} regularization based on t-SVD has been widely used in various low tubal-rank tensor recovery tasks. However, these models usually require smooth change of data along the third dimension to ensure their low rank structures. In this paper, we propose a new definition of tensor rank named \textit{tensor Q-rank} by a column orthonormal matrix $\mathbf{Q}$, and further make $\mathbf{Q}$ data-dependent. With $\mathbf{Q}$ satisfying our orthogonal proximal constraint, the data tensor may have a more significant low tensor Q-rank structure than that of low tubal-rank structure. We also provide a corresponding envelope of our rank function and apply it to the low rank tensor completion problem. Then we give an effective algorithm and briefly analyze why our method works better than TNN based methods in the case of complex data with low sampling rate. Finally, experimental results on real-world datasets demonstrate the superiority of our proposed model in the tensor completion problem.


Implicit Posterior Variational Inference for Deep Gaussian Processes

arXiv.org Machine Learning

A multi-layer deep Gaussian process (DGP) model is a hierarchical composition of GP models with a greater expressive power. Exact DGP inference is intractable, which has motivated the recent development of deterministic and stochastic approximation methods. Unfortunately, the deterministic approximation methods yield a biased posterior belief while the stochastic one is computationally costly. This paper presents an implicit posterior variational inference (IPVI) framework for DGPs that can ideally recover an unbiased posterior belief and still preserve time efficiency. Inspired by generative adversarial networks, our IPVI framework achieves this by casting the DGP inference problem as a two-player game in which a Nash equilibrium, interestingly, coincides with an unbiased posterior belief. This consequently inspires us to devise a best-response dynamics algorithm to search for a Nash equilibrium (i.e., an unbiased posterior belief). Empirical evaluation shows that IPVI outperforms the state-of-the-art approximation methods for DGPs.


Neural Spectrum Alignment

arXiv.org Machine Learning

Expressiveness of deep models was recently addressed via the connection between neural networks (NNs) and kernel learning, where first-order dynamics of NN during a gradient-descent (GD) optimization were related to gradient similarity kernel, also known as Neural Tangent Kernel (NTK). In the majority of works this kernel is considered to be time-invariant, with its properties being defined entirely by NN architecture and independent of the learning task at hand. In contrast, in this paper we empirically explore these properties along the optimization and show that in practical applications the NN kernel changes in a very dramatic and meaningful way, with its top eigenfunctions aligning toward the target function learned by NN. Moreover, these top eigenfunctions serve sort of basis functions for NN output - a function represented by NN is spanned almost completely by them for the entire optimization process. Further, since the learning along top eigenfunctions is typically fast, their alignment with the target function improves the overall optimization performance. In addition, we study how the neural spectrum is affected by learning rate decay, typically done by practitioners, showing various trends in the kernel behavior. We argue that the presented phenomena may lead to a more complete theoretical understanding behind NN learning.


Label Smoothing and Logit Squeezing: A Replacement for Adversarial Training?

arXiv.org Machine Learning

A BSTRACT Adversarial training is one of the strongest defenses against adversarial attacks, but it requires adversarial examples to be generated for every mini-batch during optimization. The expense of producing these examples during training often precludes adversarial training from use on complex image datasets. In this study, we explore the mechanisms by which adversarial training improves classifier robustness, and show that these mechanisms can be effectively mimicked using simple regularization methods, including label smoothing and logit squeezing. Remarkably, using these simple regularization methods in combination with Gaussian noise injection, we are able to achieve strong adversarial robustness - often exceeding that of adversarial training - using no adversarial examples. However, the existence of adversarial examples has raised concerns about the security of computer vision systems (Szegedy et al., 2013; Biggio et al., 2013). For example, an attacker may cause a system to mistake a stop sign for another object (Evtimov et al., 2017) or mistake one person for another (Sharif et al., 2016). To address security concerns for high-stakes applications, researchers are searching for ways to make models more robust to attacks. Many defenses have been proposed to combat adversarial examples. Approaches such as feature squeezing, denoising, and encoding (Xu et al., 2017; Samangouei et al., 2018; Shen et al., 2017; Meng & Chen, 2017) have had some success at pre-processing images to remove adversarial perturbations. Other approaches focus on hardening neural classifiers to reduce adversarial susceptibility. This includes specialized non-linearities (Zantedeschi et al., 2017), modified training processes Pa-pernot et al. (2016), and gradient obfuscation Athalye et al. (2018).


Kernelized Wasserstein Natural Gradient

arXiv.org Machine Learning

Many machine learning problems can be expressed as the optimization of some cost functional over a parametric family of probability distributions. It is often beneficial to solve such optimization problems using natural gradient methods. These methods are invariant to the parametrization of the family, and thus can yield more effective optimization. Unfortunately, computing the natural gradient is challenging as it requires inverting a high dimensional matrix at each iteration. We propose a general framework to approximate the natural gradient for the Wasserstein metric, by leveraging a dual formulation of the metric restricted to a Reproducing Kernel Hilbert Space. Our approach leads to an estimator for gradient direction that can trade-off accuracy and computational cost, with theoretical guarantees. We verify its accuracy on simple examples, and show the advantage of using such an estimator in classification tasks on Cifar10 and Cifar100 empirically.


Dynamic Local Regret for Non-convex Online Forecasting

arXiv.org Machine Learning

We consider online forecasting problems for non-convex machine learning models. Forecasting introduces several challenges such as (i) frequent updates are necessary to deal with concept drift issues since the dynamics of the environment change over time, and (ii) the state of the art models are non-convex models. We address these challenges with a novel regret framework. Standard regret measures commonly do not consider both dynamic environment and non-convex models. We introduce a local regret for non-convex models in a dynamic environment. We present an update rule incurring a cost, according to our proposed local regret, which is sublinear in time T. Our update uses time-smoothed gradients. Using a real-world dataset we show that our time-smoothed approach yields several benefits when compared with state-of-the-art competitors: results are more stable against new data; training is more robust to hyperparameter selection; and our approach is more computationally efficient than the alternatives.