Optimization
Hyper-parameter Tuning under a Budget Constraint
Lu, Zhiyun, Chiang, Chao-Kai, Sha, Fei
Hyper-parameter tuning is of crucial importance to designing and deploying machine learning systems. Broadly, hyper-parameters include the architecture of the learning models, regularization parameters, optimization methods and their parameters, and other "knobs" to be tuned. It is challenging to explore the vast space of hyper-parameters efficiently to identify the optimal configuration. Quite a few approaches have been proposed and investigated: random search, Bayesian Optimization (BO) [30, 29], bandits-based Hyperband [17, 24], and meta-learning [5, 1, 10]. Many of those prior studies have focused on the aspect of reducing as much as possible the computation cost to obtain the optimal configuration. In this work, we look at a different but important perspective to hyper-parameter optimization - under a fixed time/computation cost, how we can improve the performance as much as possible.
Discretizing Continuous Action Space for On-Policy Optimization
The combinations of joint atomic actions, which quickly becomes explosion in the number of discrete actions can intractable when M increases. However, a simple fix be efficiently addressed by a policy with factorized is to represent the joint distribution over discrete actions as distribution across action dimensions. We factorized across dimensions, so that the joint policy is still show that the discrete policy achieves significant tractable. As prior works have applied such discretization performance gains with state-of-the-art on-policy method in practice (OpenAI, 2018; Jaลkowski et al., 2018), optimization algorithms (PPO, TRPO, ACKTR) we aim to carry out a systemic study of such straightforward especially on high-dimensional tasks with complex discretization method in simulated environments, and show dynamics. Additionally, we show that an ordinal how they improve upon on-policy optimization baselines.
Combinatorial Bayesian Optimization using Graph Representations
Oh, Changyong, Tomczak, Jakub M., Gavves, Efstratios, Welling, Max
This paper focuses on Bayesian Optimization - typically considered with continuous inputs - for discrete search input spaces, including integer, categorical or graph structured input variables. In Gaussian process-based Bayesian Optimization a problem arises, as it is not straightforward to define a proper kernel on discrete input structures, where no natural notion of smoothness or similarity could be provided. We propose COMBO, a method that represents values of discrete variables as vertices of a graph and then use the diffusion kernel on that graph. As the graph size explodes with the number of categorical variables and categories, we propose the graph Cartesian product to decompose the graph into smaller sub-graphs, enabling kernel computation in linear time with respect to the number of input variables. Moreover, in our formulation we learn a scale parameter per subgraph. In empirical studies on four discrete optimization problems we demonstrate that our method is on par or outperforms the state-of-the-art in discrete Bayesian optimization.
Bayesian active learning for optimization and uncertainty quantification in protein docking
Motivation: Ab initio protein docking represents a major challenge for optimizing a noisy and costly "black box"-like function in a high-dimensional space. Despite progress in this field, there is no docking method available for rigorous uncertainty quantification (UQ) of its solution quality (e.g. interface RMSD or iRMSD). Results: We introduce a novel algorithm, Bayesian Active Learning (BAL), for optimization and UQ of such black-box functions and flexible protein docking. BAL directly models the posterior distribution of the global optimum (or native structures for protein docking) with active sampling and posterior estimation iteratively feeding each other. Furthermore, we use complex normal modes to represent a homogeneous Euclidean conformation space suitable for high-dimension optimization and construct funnel-like energy models for encounter complexes. Over a protein docking benchmark set and a CAPRI set including homology docking, we establish that BAL significantly improve against both starting points by rigid docking and refinements by particle swarm optimization, providing for one third targets a top-3 near-native prediction. BAL also generates tight confidence intervals with half range around 25% of iRMSD and confidence level at 85%. Its estimated probability of a prediction being native or not achieves binary classification AUROC at 0.93 and AUPRC over 0.60 (compared to 0.14 by chance); and also found to help ranking predictions. To the best of our knowledge, this study represents the first uncertainty quantification solution for protein docking, with theoretical rigor and comprehensive assessment. Source codes are available at https://github.com/Shen-Lab/BAL.
Gaussian Conditional Random Fields for Classification
Petroviฤ, Andrija, Nikoliฤ, Mladen, Jovanoviฤ, Miloลก, Delibaลกiฤ, Boris
Gaussian conditional random fields (GCRF) are a well-known used structured model for continuous outputs that uses multiple unstructured predictors to form its features and at the same time exploits dependence structure among outputs, which is provided by a similarity measure. In this paper, a Gaussian conditional random fields model for structured binary classification (GCRFBC) is proposed. The model is applicable to classification problems with undirected graphs, intractable for standard classification CRFs. The model representation of GCRFBC is extended by latent variables which yield some appealing properties. Thanks to the GCRF latent structure, the model becomes tractable, efficient and open to improvements previously applied to GCRF regression models. In addition, the model allows for reduction of noise, that might appear if structures were defined directly between discrete outputs. Additionally, two different forms of the algorithm are presented: GCRFBCb (GCRGBC - Bayesian) and GCRFBCnb (GCRFBC - non Bayesian). The extended method of local variational approximation of sigmoid function is used for solving empirical Bayes in Bayesian GCRFBCb variant, whereas MAP value of latent variables is the basis for learning and inference in the GCRFBCnb variant. The inference in GCRFBCb is solved by Newton-Cotes formulas for one-dimensional integration. Both models are evaluated on synthetic data and real-world data. It was shown that both models achieve better prediction performance than unstructured predictors. Furthermore, computational and memory complexity is evaluated. Advantages and disadvantages of the proposed GCRFBCb and GCRFBCnb are discussed in detail.
ProBO: a Framework for Using Probabilistic Programming in Bayesian Optimization
Neiswanger, Willie, Kandasamy, Kirthevasan, Poczos, Barnabas, Schneider, Jeff, Xing, Eric
Optimizing an expensive-to-query function is a common task in science and engineering, where it is beneficial to keep the number of queries to a minimum. A popular strategy is Bayesian optimization (BO), which leverages probabilistic models for this task. Most BO today uses Gaussian processes (GPs), or a few other surrogate models. However, there is a broad set of Bayesian modeling techniques that we may want to use to capture complex systems and reduce the number of queries. Probabilistic programs (PPs) are modern tools that allow for flexible model composition, incorporation of prior information, and automatic inference. In this paper, we develop ProBO, a framework for BO using only standard operations common to most PPs. This allows a user to drop in an arbitrary PP implementation and use it directly in BO. To do this, we describe black box versions of popular acquisition functions that can be used in our framework automatically, without model-specific derivation, and show how to optimize these functions. We also introduce a model, which we term the Bayesian Product of Experts, that integrates into ProBO and can be used to combine information from multiple models implemented with different PPs. We show empirical results using multiple PP implementations, and compare against standard BO methods.
Contrasting Exploration in Parameter and Action Space: A Zeroth-Order Optimization Perspective
Vemula, Anirudh, Sun, Wen, Bagnell, J. Andrew
Black-box optimizers that explore in parameter space have often been shown to outperform more sophisticated action space exploration methods developed specifically for the reinforcement learning problem. We examine these black-box methods closely to identify situations in which they are worse than action space exploration methods and those in which they are superior. Through simple theoretical analyses, we prove that complexity of exploration in parameter space depends on the dimensionality of parameter space, while complexity of exploration in action space depends on both the dimensionality of action space and horizon length. This is also demonstrated empirically by comparing simple exploration methods on several model problems, including Contextual Bandit, Linear Regression and Reinforcement Learning in continuous control.
An Optimization Framework for Task Sequencing in Curriculum Learning
Foglino, Francesco, Leonetti, Matteo
Abstract--Curriculum learning is gaining popularity in (deep) reinforcement learning. It can alleviate the burden on data collection and provide better exploration policies through transfer and generalization from less complex tasks. Current methods for automatic task sequencing for curriculum learning in reinforcement learning provided initial heuristic solutions, with little to no guarantee on their quality. We introduce an optimization framework for task sequencing composed of the problem definition, several candidate performance metrics for optimization, and three benchmark algorithms. We experimentally show that the two most commonly used baselines (learning with no curriculum, and with a random curriculum) perform worse than a simple greedy algorithm. Furthermore, we show theoretically and demonstrate experimentally that the three proposed algorithms provide increasing solution quality at moderately increasing computational complexity, and show that they constitute better baselines for curriculum learning in reinforcement learning. Reinforcement Learning (RL) has recently been successfully applied to a number of tasks whose complexity would have appeared overwhelming only a few years ago [1], [2]. In such large and complex environments, classical exploration strategies designed for Markov Decision Processes (MDPs), aiming at visiting every state the most efficiently, are inadequate. One approach actively investigated is the use of transfer learning [3] to generalize from previous similar tasks, and more recently the application of transfer learning to sequences of tasks of increasing complexity forming a curriculum . Curriculum Learning is often employed in (Deep) RL [4], [5] to let the agent progress more quickly towards better behaviors, but curricula are mostly designed by hand. Curriculum learning has the potential to greatly increase the quality of the behavior discovered by the agent. However, at the moment, creating an appropriate curriculum requires significant human intuition.
Lower Bounds for Smooth Nonconvex Finite-Sum Optimization
Smooth finite-sum optimization has been widely studied in both convex and nonconvex settings. However, existing lower bounds for finite-sum optimization are mostly limited to the setting where each component function is (strongly) convex, while the lower bounds for nonconvex finite-sum optimization remain largely unsolved. In this paper, we study the lower bounds for smooth nonconvex finite-sum optimization, where the objective function is the average of $n$ nonconvex component functions. We prove tight lower bounds for the complexity of finding $\epsilon$-suboptimal point and $\epsilon$-approximate stationary point in different settings, for a wide regime of the smallest eigenvalue of the Hessian of the objective function (or each component function). Given our lower bounds, we can show that existing algorithms including KatyushaX (Allen-Zhu, 2018), Natasha (Allen-Zhu, 2017), RapGrad (Lan and Yang, 2018) and StagewiseKatyusha (Chen and Yang, 2018) have achieved optimal Incremental First-order Oracle (IFO) complexity (i.e., number of IFO calls) up to logarithm factors for nonconvex finite-sum optimization. We also point out potential ways to further improve these complexity results, in terms of making stronger assumptions or by a different convergence analysis.
Natural Analysts in Adaptive Data Analysis
Adaptive data analysis is frequently criticized for its pessimistic generalization guarantees. The source of these pessimistic bounds is a model that permits arbitrary, possibly adversarial analysts that optimally use information to bias results. While being a central issue in the field, still lacking are notions of natural analysts that allow for more optimistic bounds faithful to the reality that typical analysts aren't adversarial. In this work, we propose notions of natural analysts that smoothly interpolate between the optimal non-adaptive bounds and the best-known adaptive generalization bounds. To accomplish this, we model the analyst's knowledge as evolving according to the rules of an unknown dynamical system that takes in revealed information and outputs new statistical queries to the data. This allows us to restrict the analyst through different natural control-theoretic notions. One such notion corresponds to a recency bias, formalizing an inability to arbitrarily use distant information. Another complementary notion formalizes an anchoring bias, a tendency to weight initial information more strongly. Both notions come with quantitative parameters that smoothly interpolate between the non-adaptive case and the fully adaptive case, allowing for a rich spectrum of intermediate analysts that are neither non-adaptive nor adversarial. Natural not only from a cognitive perspective, we show that our notions also capture standard optimization methods, like gradient descent in various settings. This gives a new interpretation to the fact that gradient descent tends to overfit much less than its adaptive nature might suggest.