Optimization
Maximum a-Posteriori Estimation for the Gaussian Mixture Model via Mixed Integer Nonlinear Programming
Flaherty, Patrick, Wiratchotisatian, Pitchaya, Lee, Ji Ah, Trapp, Andrew C.
We present a global optimization approach for solving the classical maximum a-posteriori (MAP) estimation problem for the Gaussian mixture model. Our approach formulates the MAP estimation problem as a mixed-integer nonlinear optimization problem (MINLP). Our method provides a certificate of global optimality, can accommodate side constraints, and is extendable to other finite mixture models. We propose an approximation to the MINLP hat transforms it into a mixed integer quadratic program (MIQP) which preserves global optimality within desired accuracy and improves computational aspects. Numerical experiments compare our method to standard estimation approaches and show that our method finds the globally optimal MAP for some standard data sets, providing a benchmark for comparing estimation methods.
Bridging Bayesian and Minimax Mean Square Error Estimation via Wasserstein Distributionally Robust Optimization
Nguyen, Viet Anh, Shafieezadeh-Abadeh, Soroosh, Kuhn, Daniel, Esfahani, Peyman Mohajerin
We introduce a distributionally robust minimium mean square error estimation model with a Wasserstein ambiguity set to recover an unknown signal from a noisy observation. The proposed model can be viewed as a zero-sum game between a statistician choosing an estimator---that is, a measurable function of the observation---and a fictitious adversary choosing a prior---that is, a pair of signal and noise distributions ranging over independent Wasserstein balls---with the goal to minimize and maximize the expected squared estimation error, respectively. We show that if the Wasserstein balls are centered at normal distributions, then the zero-sum game admits a Nash equilibrium, where the players' optimal strategies are given by an {\em affine} estimator and a {\em normal} prior, respectively. We further prove that this Nash equilibrium can be computed by solving a tractable convex program. Finally, we develop a Frank-Wolfe algorithm that can solve this convex program orders of magnitude faster than state-of-the-art general purpose solvers. We show that this algorithm enjoys a linear convergence rate and that its direction-finding subproblems can be solved in quasi-closed form.
Penalty Method for Inversion-Free Deep Bilevel Optimization
Bilevel optimizations are at the center of several important machine learning problems such as hyperparameter tuning, data denoising, few-shot learning, data poisoning. Different from simultaneous or multi-objective optimization, obtaining the exact descent direction for continuous bilevel optimization requires computing the inverse of the hessian of the lower-level cost function, even for first order methods. In this paper, we propose a new method for solving bilevel optimization, using the penalty function, which avoids computing the inverse of the hessian. We prove convergence of the method under mild conditions and show that it computes the exact hypergradient asymptotically. Small space and time complexity of our method allows us to solve large-scale bilevel optimization problems involving deep neural networks with up to 3.8M upper-level and 1.4M lower-level variables. We present results of our method for data denoising on MNIST/CIFAR10/SVHN datasets, for few-shot learning on Omniglot/Mini-Imagenet datasets and for training-data poisoning on MNIST/Imagenet datasets. In all experiments, our method outperforms or is comparable to previously proposed methods both in terms of accuracy and run-time.
Inference with Deep Generative Priors in High Dimensions
Pandit, Parthe, Sahraee-Ardakan, Mojtaba, Rangan, Sundeep, Schniter, Philip, Fletcher, Alyson K.
Deep generative priors offer powerful models for complex-structured data, such as images, audio, and text. Using these priors in inverse problems typically requires estimating the input and/or hidden signals in a multi-layer deep neural network from observation of its output. While these approaches have been successful in practice, rigorous performance analysis is complicated by the non-convex nature of the underlying optimization problems. This paper presents a novel algorithm, Multi-Layer Vector Approximate Message Passing (ML-VAMP), for inference in multi-layer stochastic neural networks. ML-VAMP can be configured to compute maximum a priori (MAP) or approximate minimum mean-squared error (MMSE) estimates for these networks. We show that the performance of ML-VAMP can be exactly predicted in a certain high-dimensional random limit. Furthermore, under certain conditions, ML-VAMP yields estimates that achieve the minimum (i.e., Bayes-optimal) MSE as predicted by the replica method. In this way, ML-VAMP provides a computationally efficient method for multi-layer inference with an exact performance characterization and testable conditions for optimality in the large-system limit.
Podcast: How intelligent automation is transforming IT systems management
With IT environments growing more and more complex and demanding, traditional IT systems management is moving into a new era of intelligence-driven automation and optimization. New machine learning and artificial intelligence capabilities, combined with a common management framework, are changing the face of how IT operations are bringing all the pieces together―with automation being key. "The foundation of it begins with automation, because if you can automate, you become repeatable, consistent, and reliable, and those are all good in your data center," says Hewlett Packard Enterprise's Doug de Werd. Automation also supports greater innovation and creativity, he says, because it frees organizations to focus on the business, enabling quicker releases and time to market, among other competitive advantages. Looking forward, organizations will move toward autonomous computing, he adds, where new technologies will adjust compute, storage, and network capacity on the fly in a continuous optimization model. Where is your company on its systems management journey?
Generalized Self-concordant Hessian-barrier algorithms
Dvurechensky, Pavel, Staudigl, Mathias, Uribe, César A.
Many problems in statistical learning, imaging, and computer vision involve the optimization of a non-convex objective function with singularities at the boundary of the feasible set. For such challenging instances, we develop a new interior-point technique building on the Hessian-barrier algorithm recently introduced in Bomze, Mertikopoulos, Schachinger and Staudigl, [SIAM J. Opt. 2019 29(3), pp. 2100-2127], where the Riemannian metric is induced by a generalized self-concordant function. This class of functions is sufficiently general to include most of the commonly used barrier functions in the literature of interior point methods. We prove global convergence to an approximate stationary point of the method, and in cases where the feasible set admits an easily computable self-concordant barrier, we verify worst-case optimal iteration complexity of the method. Applications in non-convex statistical estimation and $L^{p}$-minimization are discussed to given the efficiency of the method.
Generalized Transformation-based Gradient
The reparameterization trick has become one of the most useful tools in the field of variational inference. However, the reparameterization trick is based on the standardization transformation which restricts the scope of application of this method to distributions that have tractable inverse cumulative distribution functions or are expressible as deterministic transformations of such distributions. In this paper, we generalized the reparameterization trick by allowing a general transformation. We discover that the proposed model is a special case of control variate indicating that the proposed model can combine the advantages of CV and generalized reparameterization. Based on the proposed gradient model, we propose a new polynomial-based gradient estimator which has better theoretical performance than the reparameterization trick under certain condition and can be applied to a larger class of variational distributions. In studies of synthetic and real data, we show that our proposed gradient estimator has a significantly lower gradient variance than other state-of-the-art methods thus enabling a faster inference procedure.
Optimizing Millions of Hyperparameters by Implicit Differentiation
Lorraine, Jonathan, Vicol, Paul, Duvenaud, David
We propose an algorithm for inexpensive gradient-based hyperparameter optimization that combines the implicit function theorem (IFT) with efficient inverse Hessian approximations. We present results about the relationship between the IFT and differentiating through optimization, motivating our algorithm. We use the proposed approach to train modern network architectures with millions of weights and millions of hyper-parameters. For example, we learn a data-augmentation network - where every weight is a hyperparameter tuned for validation performance - outputting augmented training examples. Jointly tuning weights and hyperparameters with our approach is only a few times more costly in memory and compute than standard training.
A Programmable Approach to Model Compression
Joseph, Vinu, Muralidharan, Saurav, Garg, Animesh, Garland, Michael, Gopalakrishnan, Ganesh
Deep neural networks frequently contain far more weights, represented at a higher precision, than are required for the specific task which they are trained to perform. Consequently, they can often be compressed using techniques such as weight pruning and quantization that reduce both model size and inference time without appreciable loss in accuracy. Compressing models before they are deployed can therefore result in significantly more efficient systems. However, while the results are desirable, finding the best compression strategy for a given neural network, target platform, and optimization objective often requires extensive experimentation. Moreover, finding optimal hyperparameters for a given compression strategy typically results in even more expensive, frequently manual, trial-and-error exploration. In this paper, we introduce a programmable system for model compression called Condensa. Users programmatically compose simple operators, in Python, to build complex compression strategies. Given a strategy and a user-provided objective, such as minimization of running time, Condensa uses a novel sample-efficient constrained Bayesian optimization algorithm to automatically infer desirable sparsity ratios. Our experiments on three real-world image classification and language modeling tasks demonstrate memory footprint reductions of up to 65x and runtime throughput improvements of up to 2.22x using at most 10 samples per search. We have released a reference implementation of Condensa at https://github.com/NVlabs/condensa.
Energy Efficient Federated Learning Over Wireless Communication Networks
Yang, Zhaohui, Chen, Mingzhe, Saad, Walid, Hong, Choong Seon, Shikh-Bahaei, Mohammad
In this paper, the problem of energy efficient transmission and computation resource allocation for federated learning (FL) over wireless communication networks is investigated. In the considered model, each user exploits limited local computational resources to train a local FL model with its collected data and, then, sends the trained FL model parameters to a base station (BS) which aggregates the local FL model and broadcasts it back to all of the users. Since FL involves an exchange of a learning model between users and the BS, both computation and communication latencies are determined by the learning accuracy level. Meanwhile, due to the limited energy budget of the wireless users, both local computation energy and transmission energy must be considered during the FL process. This joint learning and communication problem is formulated as an optimization problem whose goal is to minimize a weighted sum of the completion time of FL, local computation energy, and transmission energy of all users, that captures the tradeoff of latency and energy consumption for FL. To solve this problem, an iterative algorithm is proposed where, at every step, closed-form solutions for time allocation, bandwidth allocation, power control, computation frequency, and learning accuracy are derived. For the special case that only minimizes the completion time, a bisection-based algorithm is proposed to obtain the optimal solution. Numerical results show that the proposed algorithms can reduce up to 25.6% delay and 37.6% energy consumption compared to conventional FL methods.