Optimization
Optimizing generalization on the train set: a novel gradient-based framework to train parameters and hyperparameters simultaneously
Lounici, Karim, Meziani, Katia, Riu, Benjamin
Generalization is a central problem in Machine Learning. Most prediction methods require careful calibration of hyperparameters carried out on a hold-out \textit{validation} dataset to achieve generalization. The main goal of this paper is to present a novel approach based on a new measure of risk that allows us to develop novel fully automatic procedures for generalization. We illustrate the pertinence of this new framework in the regression problem. The main advantages of this new approach are: (i) it can simultaneously train the model and perform regularization in a single run of a gradient-based optimizer on all available data without any previous hyperparameter tuning; (ii) this framework can tackle several additional objectives simultaneously (correlation, sparsity,...) $via$ the introduction of regularization parameters. Noticeably, our approach transforms hyperparameter tuning as well as feature selection (a combinatorial discrete optimization problem) into a continuous optimization problem that is solvable via classical gradient-based methods ; (iii) the computational complexity of our methods is $O(npK)$ where $n,p,K$ denote respectively the number of observations, features and iterations of the gradient descent algorithm. We observe in our experiments a significantly smaller runtime for our methods as compared to benchmark methods for equivalent prediction score. Our procedures are implemented in PyTorch (code is available for replication).
Learning to Solve Combinatorial Optimization Problems on Real-World Graphs in Linear Time
Drori, Iddo, Kharkar, Anant, Sickinger, William R., Kates, Brandon, Ma, Qiang, Ge, Suwen, Dolev, Eden, Dietrich, Brenda, Williamson, David P., Udell, Madeleine
Combinatorial optimization algorithms for graph problems are usually designed afresh for each new problem with careful attention by an expert to the problem structure. In this work, we develop a new framework to solve any combinatorial optimization problem over graphs that can be formulated as a single player game defined by states, actions, and rewards, including minimum spanning tree, shortest paths, traveling salesman problem, and vehicle routing problem, without expert knowledge. Our method trains a graph neural network using reinforcement learning on an unlabeled training set of graphs. The trained network then outputs approximate solutions to new graph instances in linear running time. In contrast, previous approximation algorithms or heuristics tailored to NP-hard problems on graphs generally have at least quadratic running time. We demonstrate the applicability of our approach on both polynomial and NP-hard problems with optimality gaps close to 1, and show that our method is able to generalize well: (i) from training on small graphs to testing on large graphs; (ii) from training on random graphs of one type to testing on random graphs of another type; and (iii) from training on random graphs to running on real world graphs.
Convergence of adaptive algorithms for weakly convex constrained optimization
Alacaoglu, Ahmet, Malitsky, Yura, Cevher, Volkan
We analyze the adaptive first order algorithm AMSGrad, for solving a constrained stochastic optimization problem with a weakly convex objective. We prove the $\mathcal{\tilde O}(t^{-1/4})$ rate of convergence for the norm of the gradient of Moreau envelope, which is the standard stationarity measure for this class of problems. It matches the known rates that adaptive algorithms enjoy for the specific case of unconstrained smooth stochastic optimization. Our analysis works with mini-batch size of $1$, constant first and second order moment parameters, and possibly unbounded optimization domains. Finally, we illustrate the applications and extensions of our results to specific problems and algorithms.
The Backbone Method for Ultra-High Dimensional Sparse Machine Learning
Bertsimas, Dimitris, Digalakis, Vassilis Jr
We present the backbone method, a generic framework that enables sparse and interpretable supervised machine learning methods to scale to ultra-high dimensional problems. We solve, in minutes, sparse regression problems with $p\sim10^7$ features and decision tree induction problems with $p\sim10^5$ features. The proposed method operates in two phases; we first determine the backbone set, that consists of potentially relevant features, by solving a number of tractable subproblems; then, we solve a reduced problem, considering only the backbone features. Numerical experiments demonstrate that our method competes with optimal solutions, when exact methods apply, and substantially outperforms baseline heuristics, when exact methods do not scale, both in terms of recovering the true relevant features and in its out-of-sample predictive performance.
Recovery and Generalization in Over-Realized Dictionary Learning
Sulam, Jeremias, You, Chong, Zhu, Zhihui
In over two decades of research, the field of dictionary learning has gathered a large collection of successful applications, and theoretical guarantees for model recovery are known only whenever optimization is carried out in the same model class as that of the underlying dictionary. This work characterizes the surprising phenomenon that dictionary recovery can be facilitated by searching over the space of larger over-realized models. This observation is general and independent of the specific dictionary learning algorithm used. We thoroughly demonstrate this observation in practice and provide a theoretical analysis of this phenomenon by tying recovery measures to generalization bounds. We further show that an efficient and provably correct distillation mechanism can be employed to recover the correct atoms from the over-realized model. As a result, our meta-algorithm provides dictionary estimates with consistently better recovery of the ground-truth model.
Anytime MiniBatch: Exploiting Stragglers in Online Distributed Optimization
Ferdinand, Nuwan, Al-Lawati, Haider, Draper, Stark C., Nokleby, Matthew
Distributed optimization is vital in solving large-scale machine learning problems. A widely-shared feature of distributed optimization techniques is the requirement that all nodes complete their assigned tasks in each computational epoch before the system can proceed to the next epoch. In such settings, slow nodes, called stragglers, can greatly slow progress. To mitigate the impact of stragglers, we propose an online distributed optimization method called Anytime Minibatch. In this approach, all nodes are given a fixed time to compute the gradients of as many data samples as possible. The result is a variable per-node minibatch size. Workers then get a fixed communication time to average their minibatch gradients via several rounds of consensus, which are then used to update primal variables via dual averaging. Anytime Minibatch prevents stragglers from holding up the system without wasting the work that stragglers can complete. We present a convergence analysis and analyze the wall time performance. Our numerical results show that our approach is up to 1.5 times faster in Amazon EC2 and it is up to five times faster when there is greater variability in compute node performance.
Bayesian Experience Reuse for Learning from Multiple Demonstrators
Gimelfarb, Michael, Sanner, Scott, Lee, Chi-Guhn
Learning from demonstrations (LfD) improves the exploration efficiency of a learning agent by incorporating demonstrations from experts. However, demonstration data can often come from multiple experts with conflicting goals, making it difficult to incorporate safely and effectively in online settings. We address this problem in the static and dynamic optimization settings by modelling the uncertainty in source and target task functions using normal-inverse-gamma priors, whose corresponding posteriors are, respectively, learned from demonstrations and target data using Bayesian neural networks with shared features. We use this learned belief to derive a quadratic programming problem whose solution yields a probability distribution over the expert models. Finally, we propose Bayesian Experience Reuse (BERS) to sample demonstrations in accordance with this distribution and reuse them directly in new tasks. We demonstrate the effectiveness of this approach for static optimization of smooth functions, and transfer learning in a high-dimensional supply chain problem with cost uncertainty.
Balancing Fairness and Efficiency in an Optimization Model
Chen, Violet Xinying, Hooker, J. N.
Optimization models generally aim for efficiency by maximizing total benefit or minimizing cost. Yet a trade-off between fairness and efficiency is an important element of many practical decisions. We propose a principled and practical method for balancing these two criteria in an optimization model. Following a critical assessment of existing schemes, we define a set of social welfare functions (SWFs) that combine Rawlsian leximax fairness and utilitarianism and overcome some of the weaknesses of previous approaches. In particular, we regulate the equity/efficiency trade-off with a single parameter that has a meaningful interpretation in practical contexts. We formulate the SWFs using mixed integer constraints and sequentially maximize them subject to constraints that define the problem at hand. After providing practical step-by-step instructions for implementation, we demonstrate the method on problems of realistic size involving healthcare resource allocation and disaster preparation. The solution times are modest, ranging from a fraction of a second to 18 seconds for a given value of the trade-off parameter.
Model-Free Algorithm and Regret Analysis for MDPs with Long-Term Constraints
Bai, Qinbo, Aggarwal, Vaneet, Gattami, Ather
In the optimization of dynamical systems, the variables typically have constraints. Such problems can be modeled as a constrained Markov Decision Process (CMDP). This paper considers a model-free approach to the problem, where the transition probabilities are not known. In the presence of long-term (or average) constraints, the agent has to choose a policy that maximizes the long-term average reward as well as satisfy the average constraints in each episode. The key challenge with the long-term constraints is that the optimal policy is not deterministic in general, and thus standard Q-learning approaches cannot be directly used. This paper uses concepts from constrained optimization and Q-learning to propose an algorithm for CMDP with long-term constraints. For any $\gamma\in(0,\frac{1}{2})$, the proposed algorithm is shown to achieve $O(T^{1/2+\gamma})$ regret bound for the obtained reward and $O(T^{1-\gamma/2})$ regret bound for the constraint violation, where $T$ is the total number of steps. We note that these are the first results on regret analysis for MDP with long-term constraints, where the transition probabilities are not known apriori.
Reinforcement Learning from a Mixture of Interpretable Experts
Akrour, Riad, Tateo, Davide, Peters, Jan
Reinforcement learning (RL) has demonstrated its ability to solve high dimensional tasks by leveraging non-linear function approximators. These successes however are mostly achieved by 'black-box' policies in simulated domains. When deploying RL to the real world, several concerns regarding the use of a 'black-box' policy might be raised. In an effort to make the policies learned by RL more transparent, we propose in this paper a policy iteration scheme that retains a complex function approximator for its internal value predictions but constrains the policy to have a concise, hierarchical, and human-readable structure, based on a mixture of interpretable experts. We show that our proposed algorithm can learn compelling policies on continuous action deep RL benchmarks, matching the performance of neural network based policies, but returns policies that are more amenable to human inspection than neural network or linear-in-feature policies.