Optimization
Graph Ordering: Towards the Optimal by Learning
Zhao, Kangfei, Rong, Yu, Yu, Jeffrey Xu, Huang, Junzhou, Zhang, Hao
Graph representation learning has achieved a remarkable success in many graph-based applications, such as node classification, link prediction, and community detection. These models are usually designed to preserve the vertex information at different granularity and reduce the problems in discrete space to some machine learning tasks in continuous space. However, regardless of the fruitful progress, for some kind of graph applications, such as graph compression and edge partition, it is very hard to reduce them to some graph representation learning tasks. Moreover, these problems are closely related to reformulating a global layout for a specific graph, which is an important NP-hard combinatorial optimization problem: graph ordering. In this paper, we propose to attack the graph ordering problem behind such applications by a novel learning approach. Distinguished from greedy algorithms based on predefined heuristics, we propose a neural network model: Deep Order Network (DON) to capture the hidden locality structure from partial vertex order sets. Supervised by sampled partial order, DON has the ability to infer unseen combinations. Furthermore, to alleviate the combinatorial explosion in the training space of DON and make the efficient partial vertex order sampling , we employ a reinforcement learning model: the Policy Network, to adjust the partial order sampling probabilities during the training phase of DON automatically. To this end, the Policy Network can improve the training efficiency and guide DON to evolve towards a more effective model automatically. Comprehensive experiments on both synthetic and real data validate that DON-RL outperforms the current state-of-the-art heuristic algorithm consistently. Two case studies on graph compression and edge partitioning demonstrate the potential power of DON-RL in real applications.
FlexiBO: Cost-Aware Multi-Objective Optimization of Deep Neural Networks
Iqbal, Md Shahriar, Su, Jianhai, Kotthoff, Lars, Jamshidi, Pooyan
One of the key challenges in designing machine learning systems is to determine the right balance amongst several objectives, which also oftentimes are incommensurable and conflicting. For example, when designing deep neural networks (DNNs), one often has to trade-off between multiple objectives, such as accuracy, energy consumption, and inference time. Typically, there is no single configuration that performs equally well for all objectives. Consequently, one is interested in identifying Pareto-optimal designs. Although different multi-objective optimization algorithms have been developed to identify Pareto-optimal configurations, state-of-the-art multi-objective optimization methods do not consider the different evaluation costs attending the objectives under consideration. This is particularly important for optimizing DNNs: the cost arising on account of assessing the accuracy of DNNs is orders of magnitude higher than that of measuring the energy consumption of pre-trained DNNs. We propose FlexiBO, a flexible Bayesian optimization method, to address this issue. We formulate a new acquisition function based on the improvement of the Pareto hyper-volume weighted by the measurement cost of each objective. Our acquisition function selects the next sample and objective that provides maximum information gain per unit of cost. We evaluated FlexiBO on 7 state-of-the-art DNNs for object detection, natural language processing, and speech recognition. Our results indicate that, when compared to other state-of-the-art methods across the 7 architectures we tested, the Pareto front obtained using FlexiBO has, on average, a 28.44% higher contribution to the true Pareto front and achieves 25.64% better diversity.
Learning Sparse Classifiers: Continuous and Mixed Integer Optimization Perspectives
Dedieu, Antoine, Hazimeh, Hussein, Mazumder, Rahul
We consider a discrete optimization based approach for learning sparse classifiers, where the outcome depends upon a linear combination of a small subset of features. Recent work has shown that mixed integer programming (MIP) can be used to solve (to optimality) $\ell_0$-regularized problems at scales much larger than what was conventionally considered possible in the statistics and machine learning communities. Despite their usefulness, MIP-based approaches are significantly slower compared to relatively mature algorithms based on $\ell_1$-regularization and relatives. We aim to bridge this computational gap by developing new MIP-based algorithms for $\ell_0$-regularized classification. We propose two classes of scalable algorithms: an exact algorithm that can handle $p\approx 50,000$ features in a few minutes, and approximate algorithms that can address instances with $p\approx 10^6$ in times comparable to fast $\ell_1$-based algorithms. Our exact algorithm is based on the novel idea of \textsl{integrality generation}, which solves the original problem (with $p$ binary variables) via a sequence of mixed integer programs that involve a small number of binary variables. Our approximate algorithms are based on coordinate descent and local combinatorial search. In addition, we present new estimation error bounds for a class of $\ell_0$-regularized estimators. Experiments on real and synthetic data demonstrate that our approach leads to models with considerably improved statistical performance (especially, variable selection) when compared to competing toolkits.
Scalable Hyperparameter Optimization with Lazy Gaussian Processes
Ram, Raju, Mรผller, Sabine, Pfreundt, Franz-Josef, Gauger, Nicolas R., Keuper, Janis
Most machine learning methods require careful selection of hyper-parameters in order to train a high performing model with good generalization abilities. Hence, several automatic selection algorithms have been introduced to overcome tedious manual (try and error) tuning of these parameters. Due to its very high sample efficiency, Bayesian Optimization over a Gaussian Processes modeling of the parameter space has become the method of choice. Unfortunately, this approach suffers from a cubic compute complexity due to underlying Cholesky factorization, which makes it very hard to be scaled beyond a small number of sampling steps. In this paper, we present a novel, highly accurate approximation of the underlying Gaussian Process. Reducing its computational complexity from cubic to quadratic allows an efficient strong scaling of Bayesian Optimization while outperforming the previous approach regarding optimization accuracy. The first experiments show speedups of a factor of 162 in single node and further speed up by a factor of 5 in a parallel environment.
Information Newton's flow: second-order optimization method in probability space
We introduce a framework for Newton's flows in probability space with information metrics, named information Newton's flows. Here two information metrics are considered, including both the Fisher-Rao metric and the Wasserstein-2 metric. Several examples of information Newton's flows for learning objective/loss functions are provided, such as Kullback-Leibler (KL) divergence, Maximum mean discrepancy (MMD), and cross entropy. The asymptotic convergence results of proposed Newton's methods are provided. A known fact is that overdamped Langevin dynamics correspond to Wasserstein gradient flows of KL divergence. Extending this fact to Wasserstein Newton's flows of KL divergence, we derive Newton's Langevin dynamics. We provide examples of Newton's Langevin dynamics in both one-dimensional space and Gaussian families. For the numerical implementation, we design sampling efficient variational methods to approximate Wasserstein Newton's directions. Several numerical examples in Gaussian families and Bayesian logistic regression are shown to demonstrate the effectiveness of the proposed method.
A Survey on the Use of Preferences for Virtual Machine Placement in Cloud Data Centers
Alashaikh, Abdulaziz, Alanazi, Eisa, Al-Fuqaha, Ala
With the rapid development of virtualization techniques, cloud data centers allow for cost effective, flexible, and customizable deployments of applications on virtualized infrastructure. Virtual machine (VM) placement aims to assign each virtual machine to a server in the cloud environment. VM Placement is of paramount importance to the design of cloud data centers. Typically, VM placement involves complex relations and multiple design factors as well as local policies that govern the assignment decisions. It also involves different constituents including cloud administrators and customers that might have disparate preferences while opting for a placement solution. Thus, it is often valuable to not only return an optimized solution to the VM placement problem but also a solution that reflects the given preferences of the constituents. In this paper, we provide a detailed review on the role of preferences in the recent literature on VM placement. We further discuss key challenges and identify possible research opportunities to better incorporate preferences within the context of VM placement.
IBM doubles quantum volume in the race for computing supremacy
TechRepublic's Teena Maddox talked to Jamie Garcia, senior manager, Algorithms, Applications and Theory Team at IBM Research, at CES 2020 about about the quantum news that IBM has released this week and what's to come. The following is an edited transcript of their conversation. Jamie Garcia: We just released news that we have achieved a 32-quantum volume, which is in line with us doubling our quantum volume every single year. Another one that just came out is that we hit a 100 partners in our IBM Q network. Teena Maddox: With the race for quantum computing, what does that mean when you've increased the volume?
Resolving learning rates adaptively by locating Stochastic Non-Negative Associated Gradient Projection Points using line searches
Kafka, Dominic, Wilke, Daniel N.
Learning rates in stochastic neural network training are currently determined a priori to training, using expensive manual or automated iterative tuning. This study proposes gradient-only line searches to resolve the learning rate for neural network training algorithms. Stochastic sub-sampling during training decreases computational cost and allows the optimization algorithms to progress over local minima. However, it also results in discontinuous cost functions. Minimization line searches are not effective in this context, as they use a vanishing derivative (first order optimality condition), which often do not exist in a discontinuous cost function and therefore converge to discontinuities as opposed to minima from the data trends. Instead, we base candidate solutions along a search direction purely on gradient information, in particular by a directional derivative sign change from negative to positive (a Non-negative Associative Gradient Projection Point (NN- GPP)). Only considering a sign change from negative to positive always indicates a minimum, thus NN-GPPs contain second order information. Conversely, a vanishing gradient is purely a first order condition, which may indicate a minimum, maximum or saddle point. This insight allows the learning rate of an algorithm to be reliably resolved as the step size along a search direction, increasing convergence performance and eliminating an otherwise expensive hyperparameter.
Sparse Covariance Estimation in Logit Mixture Models
Aboutaleb, Youssef M, Danaf, Mazen, Xie, Yifei, Ben-Akiva, Moshe
This paper introduces a new data-driven methodology for estimating sparse covariance matrices of the random coefficients in logit mixture models. Researchers typically specify covariance matrices in logit mixture models under one of two extreme assumptions: either an unrestricted full covariance matrix (allowing correlations between all random coefficients), or a restricted diagonal matrix (allowing no correlations at all). Our objective is to find optimal subsets of correlated coefficients for which we estimate covariances. We propose a new estimator, called MISC, that uses a mixed-integer optimization (MIO) program to find an optimal block diagonal structure specification for the covariance matrix, corresponding to subsets of correlated coefficients, for any desired sparsity level using Markov Chain Monte Carlo (MCMC) posterior draws from the unrestricted full covariance matrix. The optimal sparsity level of the covariance matrix is determined using out-of-sample validation. We demonstrate the ability of MISC to correctly recover the true covariance structure from synthetic data. In an empirical illustration using a stated preference survey on modes of transportation, we use MISC to obtain a sparse covariance matrix indicating how preferences for attributes are related to one another.
Distributed Learning in the Non-Convex World: From Batch to Streaming Data, and Beyond
Chang, Tsung-Hui, Hong, Mingyi, Wai, Hoi-To, Zhang, Xinwei, Lu, Songtao
Distributed learning has become a critical enabler of the massively connected world envisioned by many. This article discusses four key elements of scalable distributed processing and real-time intelligence -- problems, data, communication and computation. Our aim is to provide a fresh and unique perspective about how these elements should work together in an effective and coherent manner. In particular, we provide a selective review about the recent techniques developed for optimizing non-convex models (i.e., problem classes), processing batch and streaming data (i.e., data types), over the networks in a distributed manner (i.e., communication and computation paradigm). We describe the intuitions and connections behind a core set of popular distributed algorithms, emphasizing how to trade off between computation and communication costs. Practical issues and future research directions will also be discussed. We are living in a highly connected world, and it will become exponentially more connected in a decade. These devices collect a huge amount of real-time data, perform complex computational tasks, and provide vital services which significantly improve our lives and enrich our collective productivity. THC, MH, HTW are ordered alphabetically, and contributed equally. MH is the corresponding author. THC is with the School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China. MH and XZ are with the ECE Department, University of Minnesota, MN, USA. HTW is with the Department of SEEM, The Chinese University of Hong Kong, Hong Kong SAR, China. SL is with IBM Research AI, IBM Thomas J. Watson Research Center Y orktown Heights, New Y ork 10598, USA.