Optimization
Cost-aware Bayesian Optimization
Lee, Eric Hans, Perrone, Valerio, Archambeau, Cedric, Seeger, Matthias
Bayesian optimization (BO) is a class of global optimization algorithms, suitable for minimizing an expensive objective function in as few function evaluations as possible. While BO budgets are typically given in iterations, this implicitly measures convergence in terms of iteration count and assumes each evaluation has identical cost. In practice, evaluation costs may vary in different regions of the search space. For example, the cost of neural network training increases quadratically with layer size, which is a typical hyperparameter. Cost-aware BO measures convergence with alternative cost metrics such as time, energy, or money, for which vanilla BO methods are unsuited. We introduce Cost Apportioned BO (CArBO), which attempts to minimize an objective function in as little cost as possible. CArBO combines a cost-effective initial design with a cost-cooled optimization phase which depreciates a learned cost model as iterations proceed. On a set of 20 black-box function optimization problems we show that, given the same cost budget, CArBO finds significantly better hyperparameter configurations than competing methods.
Interpretable machine learning models: a physics-based view
Matei, Ion, de Kleer, Johan, Somarakis, Christoforos, Rai, Rahul, Baras, John S.
To understand changes in physical systems and facilitate decisions, explaining how model predictions are made is crucial. We use model-based interpretability, where models of physical systems are constructed by composing basic constructs that explain locally how energy is exchanged and transformed. We use the port Hamiltonian (p-H) formalism to describe the basic constructs that contain physically interpretable processes commonly found in the behavior of physical systems. We describe how we can build models out of the p-H constructs and how we can train them. In addition we show how we can impose physical properties such as dissipativity that ensure numerical stability of the training process. We give examples on how to build and train models for describing the behavior of two physical systems: the inverted pendulum and swarm dynamics. I. Introduction The necessity for interpretability comes from the fact that it is not always enough to train and model and get an answer, but is also important to understand why a particular answer was given. A simple but meaningful definition of model interpretability given in [17] relates this notion to the degree to which a human can understand the cause of a decision. In our case, since we care about models that describe the behavior of physical systems, we change the definition to the degree to which a human can understand the physical processes that cause a prediction. Throughout this paper we focus on physically-interpretable models: models that embed physical laws that explain how energy is transformed and exchanged in the system. A physically-interpretable model facilitates learning and updating the model when something unexpected happens. This update is done by finding an explanation for an unexpected event. For example, an electrical motor unexpectedly overheats and we ask ourselves: "Why is the motor overheating?".
Towards Automatic Bayesian Optimization: A first step involving acquisition functions
Merchรกn, Eduardo C. Garrido, Pรฉrez, Luis C. Jariego
Bayesian Optimization is the state of the art technique for the optimization of black boxes, i.e., functions where we do not have access to their analytical expression nor its gradients, they are expensive to evaluate and its evaluation is noisy. The most popular application of bayesian optimization is the automatic hyperparameter tuning of machine learning algorithms, where we obtain the best configuration of machine learning algorithms by optimizing the estimation of the generalization error of these algorithms. Despite being applied with success, bayesian optimization methodologies also have hyperparameters that need to be configured such as the probabilistic surrogate model or the acquisition function used. A bad decision over the configuration of these hyperparameters implies obtaining bad quality results. Typically, these hyperparameters are tuned by making assumptions of the objective function that we want to evaluate but there are scenarios where we do not have any prior information about the objective function. In this paper, we propose a first attempt over automatic bayesian optimization by exploring several heuristics that automatically tune the acquisition function of bayesian optimization. We illustrate the effectiveness of these heurisitcs in a set of benchmark problems and a hyperparameter tuning problem of a machine learning algorithm.
Black-box Methods for Restoring Monotonicity
Gergatsouli, Evangelia, Lucier, Brendan, Tzamos, Christos
In many practical applications, heuristic or approximation algorithms are used to efficiently solve the task at hand. However their solutions frequently do not satisfy natural monotonicity properties of optimal solutions. In this work we develop algorithms that are able to restore monotonicity in the parameters of interest. Specifically, given oracle access to a (possibly non-monotone) multi-dimensional real-valued function $f$, we provide an algorithm that restores monotonicity while degrading the expected value of the function by at most $\varepsilon$. The number of queries required is at most logarithmic in $1/\varepsilon$ and exponential in the number of parameters. We also give a lower bound showing that this exponential dependence is necessary. Finally, we obtain improved query complexity bounds for restoring the weaker property of $k$-marginal monotonicity. Under this property, every $k$-dimensional projection of the function $f$ is required to be monotone. The query complexity we obtain only scales exponentially with $k$.
aphBO-2GP-3B: A budgeted asynchronously-parallel multi-acquisition for known/unknown constrained Bayesian optimization on high-performing computing architecture
Tran, Anh, McCann, Scott, Furlan, John M., Pagalthivarthi, Krishnan V., Visintainer, Robert J., Wildey, Tim
High-fidelity complex engineering simulations are highly predictive, but also computationally expensive and often require substantial computational efforts. The mitigation of computational burden is usually enabled through parallelism in high-performance cluster (HPC) architecture. In this paper, an asynchronous constrained batch-parallel Bayesian optimization method is proposed to efficiently solve the computationally-expensive simulation-based optimization problems on the HPC platform, with a budgeted computational resource, where the maximum number of simulations is a constant. The advantages of this method are three-fold. First, the efficiency of the Bayesian optimization is improved, where multiple input locations are evaluated massively parallel in an asynchronous manner to accelerate the optimization convergence with respect to physical runtime. This efficiency feature is further improved so that when each of the inputs is finished, another input is queried without waiting for the whole batch to complete. Second, the method can handle both known and unknown constraints. Third, the proposed method considers several acquisition functions at the same time and sample based on an evolving probability mass distribution function using GP-Hedge scheme, where parameters are corresponding to the performance of each acquisition function. The proposed framework is termed aphBO-2GP-3B, which corresponds to asynchronous parallel hedge Bayesian optimization with two Gaussian processes and three batches. The aphBO-2GP-3B framework is demonstrated using two high-fidelity expensive industrial applications, where the first one is based on finite element analysis (FEA) and the second one is based on computational fluid dynamics (CFD) simulations.
A unified framework for spectral clustering in sparse graphs
Dall'Amico, Lorenzo, Couillet, Romain, Tremblay, Nicolas
One of the most natural tasks in graph theory is community detection, i.e., the identification of similarity groups on a given network. Practically, for an unweighted and undirected graph G(V, E) with V n nodes and E edges, community detection consists in finding a non-overlapping partition of the nodes that identifies underlying communities in a completely unsupervised manner. There is no unique definition of a community, but a general criterion is to impose that nodes in the same community have more interconnections than nodes in different communities, as a consequence of the stronger affinity among members of the same community [17]. There exist many ways of formalizing this intuition, some of them under the form of a cost function to minimize, such as the MinCut, RatioCut, and NormalizedCut costs [53]. The resulting optimizations are however NPhard problems and, as a consequence, many algorithms consist in retrieving relaxed continuous solutions of the problem.
An Inexact Manifold Augmented Lagrangian Method for Adaptive Sparse Canonical Correlation Analysis with Trace Lasso Regularization
Canonical correlation analysis (CCA for short) describes the relationship between two sets of variables by finding some linear combinations of these variables that maximizing the correlation coefficient. However, in high-dimensional settings where the number of variables exceeds sample size, or in the case of that the variables are highly correlated, the traditional CCA is no longer appropriate. In this paper, an adaptive sparse version of CCA (ASCCA for short) is proposed by using the trace Lasso regularization. The proposed ASCCA reduces the instability of the estimator when the covariates are highly correlated, and thus improves its interpretation. The ASCCA is further reformulated to an optimization problem on Riemannian manifolds, and an manifold inexact augmented Lagrangian method is then proposed for the resulting optimization problem. The performance of the ASCCA is compared with the other sparse CCA techniques in different simulation settings, which illustrates that the ASCCA is feasible and efficient.
Learning to simulate and design for structural engineering
Chang, Kai-Hung, Cheng, Chin-Yi
In the architecture and construction industries, structural design for large buildings has always been laborious, time-consuming, and difficult to optimize. It is an iterative process that involves two steps: analyzing the current structural design by a slow and computationally expensive simulation, and then manually revising the design based on professional experience and rules. In this work, we propose an end-to-end learning pipeline to solve the size design optimization problem, which is to design the optimal cross-sections for columns and beams, given the design objectives and building code as constraints. We pre-train a graph neural network as a surrogate model to not only replace the structural simulation for speed but also use its differentiable nature to provide gradient signals to the other graph neural network for size optimization. Our results show that the pre-trained surrogate model can predict simulation results accurately, and the trained optimization model demonstrates the capability of designing convincing cross-section designs for buildings under various scenarios.
Learning Reward Machines for Partially Observable Reinforcement Learning
Icarte, Rodrigo Toro, Waldie, Ethan, Klassen, Toryn, Valenzano, Rick, Castro, Margarita, McIlraith, Sheila
Reward Machines (RMs), originally proposed for specifying problems in Reinforcement Learning (RL), provide a structured, automata-based representation of a reward function that allows an agent to decompose problems into subproblems that can be efficiently learned using off-policy learning. Here we show that RMs can be learned from experience, instead of being specified by the user, and that the resulting problem decomposition can be used to effectively solve partially observable RL problems. We pose the task of learning RMs as a discrete optimization problem where the objective is to find an RM that decomposes the problem into a set of subproblems such that the combination of their optimal memoryless policies is an optimal policy for the original problem. We show the effectiveness of this approach on three partially observable domains, where it significantly outperforms A3C, PPO, and ACER, and discuss its advantages, limitations, and broader potential. Papers published at the Neural Information Processing Systems Conference.
Minimal Variance Sampling in Stochastic Gradient Boosting
Stochastic Gradient Boosting (SGB) is a widely used approach to regularization of boosting models based on decision trees. It was shown that, in many cases, random sampling at each iteration can lead to better generalization performance of the model and can also decrease the learning time. Different sampling approaches were proposed, where probabilities are not uniform, and it is not currently clear which approach is the most effective. In this paper, we formulate the problem of randomization in SGB in terms of optimization of sampling probabilities to maximize the estimation accuracy of split scoring used to train decision trees.This optimization problem has a closed-form nearly optimal solution, and it leads to a new sampling technique, which we call Minimal Variance Sampling (MVS).The method both decreases the number of examples needed for each iteration of boosting and increases the quality of the model significantly as compared to the state-of-the art sampling methods. The superiority of the algorithm was confirmed by introducing MVS as a new default option for subsampling in CatBoost, a gradient boosting library achieving state-of-the-art quality on various machine learning tasks. Papers published at the Neural Information Processing Systems Conference.