Optimization
Should Algorithms for Random SAT and Max-SAT be Different?
We analyze to what extent the random SAT and Max-SAT problems differ in their properties. Our findings suggest that for random $k$-CNF with ratio in a certain range, Max-SAT can be solved by any SAT algorithm with subexponential slowdown, while for formulae with ratios greater than some constant, algorithms under the random walk framework require substantially different heuristics. In light of these results, we propose a novel probabilistic approach for random Max-SAT called ProMS. Experimental results illustrate that ProMS outperforms many state-of-the-art local search solvers on random Max-SAT benchmarks.
Efficient Online Hyperparameter Optimization for Kernel Ridge Regression with Applications to Traffic Time Series Prediction
Zhan, Hongyuan, Gomes, Gabriel, Li, Xiaoye S., Madduri, Kamesh, Wu, Kesheng
Modern sensors generate large amounts of timestamped measurement data. These data sets are critical in a wide range of applications including traffic flow prediction, transportation management, GPS navigation, and city planning. Machine learning-based prediction algorithms typically adjust their parameters automatically based on the data, but also require users to set additional parameters, known as hyperparameters. For example, in a kernel-based regression model, the (ordinary) parameters are the regression weights, whereas the hyperparameters include the kernel scales and regularization constants. These hyperparameters have a strong influence on the prediction accuracy. Often, their values are set based on past experience or through time-consuming grid searches. In applications where the characteristics of the data change, such as unusual traffic pattern due to upcoming concert events, these hyperparameters have to be adjusted dynamically in order to maintain prediction quality. In this paper, we use the term hyperparameter learning, hyperparameter optimization, and hyperparameter selection/tuning interchangeably, referring to the process of configuring the model specification before model fitting.
Functional Nonlinear Sparse Models
Chamon, Luiz F. O., Eldar, Yonina C., Ribeiro, Alejandro
Signal processing in inherently continuous and often nonlinear applications, such as radar, magnetic resonance imaging, and super-resolution microscopy, in which sparsity plays a key role in obtaining state-of-the-art results. Coping with the infinite dimensionality and non-convexity of these estimation problems typically involves discretization and convex relaxations, e.g., using atomic norms. Although successful, this approaches are not without issues. Discretization often leads to high dimensional, potentially ill-conditioned optimization problems. Moreover, due to grid mismatch and other coherence issues, a sparse signal in the continuous domain may no longer be sparse when discretized. Finally, nonlinear problems remain non-convex even after relaxing the sparsity objective. And even in the linear case, performance guarantees for atomic norm relaxations hold under assumptions that may be hard to meet in practice. We propose to address these issues by directly tackling the continuous, nonlinear problem cast as a sparse functional optimization program. We prove that these problems have no duality gap and show that they can be solved efficiently using duality and a (stochastic) subgradient ascent-type algorithm. We illustrate the wide range of applications for this new approach by formulating and solving sparse problems in super-resolution (nonlinear line spectral estimation) and vector field estimation (spectrum cartography).
Deep Structured Prediction with Nonlinear Output Transformations
Graber, Colin, Meshi, Ofer, Schwing, Alexander
Deep structured models are widely used for tasks like semantic segmentation, where explicit correlations between variables provide important prior information which generally helps to reduce the data needs of deep nets. However, current deep structured models are restricted by oftentimes very local neighborhood structure, which cannot be increased for computational complexity reasons, and by the fact that the output configuration, or a representation thereof, cannot be transformed further. Very recent approaches which address those issues include graphical model inference inside deep nets so as to permit subsequent non-linear output space transformations. However, optimization of those formulations is challenging and not well understood. Here, we develop a novel model which generalizes existing approaches, such as structured prediction energy networks, and discuss a formulation which maintains applicability of existing inference techniques.
Adversarially Robust Optimization with Gaussian Processes
Bogunovic, Ilija, Scarlett, Jonathan, Jegelka, Stefanie, Cevher, Volkan
In this paper, we consider the problem of Gaussian process (GP) optimization with an added robustness requirement: The returned point may be perturbed by an adversary, and we require the function value to remain as high as possible even after this perturbation. This problem is motivated by settings in which the underlying functions during optimization and implementation stages are different, or when one is interested in finding an entire region of good inputs rather than only a single point. We show that standard GP optimization algorithms do not exhibit the desired robustness properties, and provide a novel confidence-bound based algorithm StableOpt for this purpose. We rigorously establish the required number of samples for StableOpt to find a near-optimal point, and we complement this guarantee with an algorithm-independent lower bound. We experimentally demonstrate several potential applications of interest using real-world data sets, and we show that StableOpt consistently succeeds in finding a stable maximizer where several baseline methods fail.
Understanding the Acceleration Phenomenon via High-Resolution Differential Equations
Shi, Bin, Du, Simon S., Jordan, Michael I., Su, Weijie J.
Gradient-based optimization algorithms can be studied from the perspective of limiting ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not distinguish between two fundamentally different algorithms---Nesterov's accelerated gradient method for strongly convex functions (NAG-SC) and Polyak's heavy-ball method---we study an alternative limiting process that yields high-resolution ODEs. We show that these ODEs permit a general Lyapunov function framework for the analysis of convergence in both continuous and discrete time. We also show that these ODEs are more accurate surrogates for the underlying algorithms; in particular, they not only distinguish between NAG-SC and Polyak's heavy-ball method, but they allow the identification of a term that we refer to as "gradient correction" that is present in NAG-SC but not in the heavy-ball method and is responsible for the qualitative difference in convergence of the two methods. We also use the high-resolution ODE framework to study Nesterov's accelerated gradient method for (non-strongly) convex functions, uncovering a hitherto unknown result---that NAG-C minimizes the squared gradient norm at an inverse cubic rate. Finally, by modifying the high-resolution ODE of NAG-C, we obtain a family of new optimization methods that are shown to maintain the accelerated convergence rates of NAG-C for smooth convex functions.
Differentiable MPC for End-to-end Planning and Control
Amos, Brandon, Rodriguez, Ivan Dario Jimenez, Sacks, Jacob, Boots, Byron, Kolter, J. Zico
This provides one way of leveraging and combining the advantages of model-free and model-based approaches. Specifically, we differentiate through MPC by using the KKT conditions of the convex approximation at a fixed point of the controller. Using this strategy, we are able to learn the cost and dynamics of a controller via end-to-end learning. Our experiments focus on imitation learning in the pendulum and cartpole domains, where we learn the cost and dynamics terms of an MPC policy class. We show that our MPC policies are significantly more data-efficient than a generic neural network and that our method is superior to traditional system identification in a setting where the expert is unrealizable.
Scalable End-to-End Autonomous Vehicle Testing via Rare-event Simulation
O'Kelly, Matthew, Sinha, Aman, Namkoong, Hongseok, Duchi, John, Tedrake, Russ
Recent breakthroughs in deep learning have accelerated the development of autonomous vehicles (AVs); many research prototypes now operate on real roads alongside human drivers. While advances in computer-vision techniques have made human-level performance possible on narrow perception tasks such as object recognition, several fatal accidents involving AVs underscore the importance of testing whether the perception and control pipeline--when considered as a whole system--can safely interact with humans. Unfortunately, testing AVs in real environments, the most straightforward validation framework for system-level input-output behavior, requires prohibitive amounts of time due to the rare nature of serious accidents [49]. Concretely, a recent study [29] argues that AVs need to drive "hundreds of millions of miles and, under some scenarios, hundreds of billions of miles to create enough data to clearly demonstrate their safety." Alteratively, formally verifying an AV algorithm's "correctness" [34, 2, 47, 37] is difficult since all driving policies are subject to crashes caused by other drivers [49]. It is unreasonable to ask that the policy be safe under all scenarios. Unfortunately, ruling out scenarios where the AV should not be blamed is a task subject to logical inconsistency, combinatorial growth in specification complexity, and subjective assignment of fault. Motivated by the challenges underlying real-world testing and formal verification, we consider a probabilistic paradigm--which we call a risk-based framework--where our goal is to evaluate the probability of an accident under a base distribution representing standard traffic behavior.
Multiple Measurement Vectors Problem: A Decoupling Property and its Applications
Haghighatshoar, Saeid, Caire, Giuseppe
Efficient and reliable estimation in many signal processing problems encountered in applications requires adopting sparsity prior in a suitable basis on the signals and using techniques from compressed sensing (CS). In this paper, we study a CS problem known as Multiple Measurement Vectors (MMV) problem, which arises in joint estimation of multiple signal realizations when the signal samples have a common (joint) support over a fixed known dictionary. Although there is a vast literature on the analysis of MMV, it is not yet fully known how the number of signal samples and their statistical correlations affects the performance of the joint estimation in MMV. Moreover, in many instances of MMV the underlying sparsifying dictionary may not be precisely known, and it is still an open problem to quantify how the dictionary mismatch may affect the estimation performance. In this paper, we focus on $\ell_{2,1}$-norm regularized least squares ($\ell_{2,1}$-LS) as a well-known and widely-used MMV algorithm in the literature. We prove an interesting decoupling property for $\ell_{2,1}$-LS, where we show that it can be decomposed into two phases: i) use all the signal samples to estimate the signal covariance matrix (coupled phase), ii) plug in the resulting covariance estimate as the true covariance matrix into the Minimum Mean Squared Error (MMSE) estimator to reconstruct each signal sample individually (decoupled phase). As a consequence of this decomposition, we are able to provide further insights on the performance of $\ell_{2,1}$-LS for MMV. In particular, we address how the signal correlations and dictionary mismatch affects its estimation performance. We also provide numerical simulations to validate our theoretical results.
A general system of differential equations to model first order adaptive algorithms
da Silva, André Belotto, Gazeau, Maxime
First order optimization algorithms play a major role in large scale machine learning. A new class of methods, called adaptive algorithms, were recently introduced to adjust iteratively the learning rate for each coordinate. Despite great practical success in deep learning, their behavior and performance on more general loss functions are not well understood. In this paper, we derive a non-autonomous system of differential equations, which is the continuous time limit of adaptive optimization methods. We prove global well-posedness of the system and we investigate the numerical time convergence of its forward Euler approximation. We study, furthermore, the convergence of its trajectories and give conditions under which the differential system, underlying all adaptive algorithms, is suitable for optimization. We discuss convergence to a critical point in the non-convex case and give conditions for the dynamics to avoid saddle points and local maxima. For convex and deterministic loss function, we introduce a suitable Lyapunov functional which allow us to study its rate of convergence. Several other properties of both the continuous and discrete systems are briefly discussed. The differential system studied in the paper is general enough to encompass many other classical algorithms (such as Heavy ball and Nesterov's accelerated method) and allow us to recover several known results for these algorithms.