Optimization
Beyond the Worst-Case Analysis of Algorithms (Introduction)
One of the primary goals of the mathematical analysis of algorithms is to provide guidance about which algorithm is the "best" for solving a given computational problem. Worst-case analysis summarizes the performance profile of an algorithm by its worst performance on any input of a given size, implicitly advocating for the algorithm with the best-possible worst-case performance. Strong worst-case guarantees are the holy grail of algorithm design, providing an application-agnostic certification of an algorithm's robustly good performance. However, for many fundamental problems and performance measures, such guarantees are impossible and a more nuanced analysis approach is called for. This chapter surveys several alternatives to worst-case analysis that are discussed in detail later in the book.
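As a small illustration of how a worst-case summary can differ from typical behavior, the sketch below counts comparisons for a textbook quicksort that always picks the first element as pivot, over every input of one fixed size. The function name `quicksort_comparisons` and the choice of algorithm are illustrative assumptions and are not taken from the chapter.

```python
from itertools import permutations


def quicksort_comparisons(items):
    """Count comparisons for quicksort with the first element as pivot.

    Cost model (an assumption of this sketch): each partition step is
    charged one comparison per non-pivot element.
    """
    if len(items) <= 1:
        return 0
    pivot, rest = items[0], items[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    return len(rest) + quicksort_comparisons(smaller) + quicksort_comparisons(larger)


n = 6
# Enumerate every input of size n (all orderings of n distinct keys).
costs = [quicksort_comparisons(list(p)) for p in permutations(range(n))]

worst_case = max(costs)                 # the worst-case summary for size n
average_case = sum(costs) / len(costs)  # the average over uniformly random orderings

print("worst case:", worst_case)
print("average case:", average_case)
```

The worst case is reached on already-sorted input, where every partition is maximally unbalanced. The average over random orderings is noticeably smaller, which is the kind of gap that motivates looking beyond a single worst-case number.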
Intrinsically Motivated Goal Exploration Processes with Automatic Curriculum Learning
Forestier, Sébastien, Portelas, Rémy, Mollard, Yoan, Oudeyer, Pierre-Yves
Intrinsically motivated spontaneous exploration is a key enabler of autonomous lifelong learning in human children. It enables the discovery and acquisition of large repertoires of skills through self-generation, self-selection, self-ordering and self-experimentation of learning goals. We present an algorithmic approach called Intrinsically Motivated Goal Exploration Processes (IMGEP) to enable similar properties of autonomous or self-supervised learning in machines. The IMGEP algorithmic architecture relies on several principles: 1) self-generation of goals, generalized as fitness functions; 2) selection of goals based on intrinsic rewards; 3) exploration with incremental goal-parameterized policy search and exploitation of the gathered data with a batch learning algorithm; 4) systematic reuse of information acquired when targeting a goal for improving towards other goals. We present a particularly efficient form of IMGEP, called Modular Population-Based IMGEP, that uses a population-based policy and an object-centered modularity in goals and mutations. We provide several implementations of this architecture and demonstrate their ability to automatically generate a learning curriculum within several experimental setups, including a real humanoid robot that can explore multiple spaces of goals with several hundred continuous dimensions. While no particular target goal is provided to the system, this curriculum allows the discovery of skills that act as stepping stones for learning more complex skills, e.g., nested tool use. We show that learning diverse spaces of goals with intrinsic motivations is more efficient for learning complex skills than only trying to directly learn these complex skills.
Anticipating the Long-Term Effect of Online Learning in Control
Capone, Alexandre, Hirche, Sandra
Control schemes that learn using measurement data collected online are increasingly promising for the control of complex and uncertain systems. However, in most approaches of this kind, learning is viewed as a side effect that passively improves control performance, e.g., by updating a model of the system dynamics. Determining how improvements in control performance due to learning can be actively exploited in the control synthesis is still an open research question. In this paper, we present AntLer, a design algorithm for learning-based control laws that anticipates learning, i.e., that takes the impact of future learning in uncertain dynamic settings explicitly into account. AntLer expresses system uncertainty using a non-parametric probabilistic model. Given a cost function that measures control performance, AntLer chooses the control parameters such that the expected cost of the closed-loop system is minimized approximately. We show that AntLer approximates an optimal solution arbitrarily accurately with probability one. Furthermore, we apply AntLer to a nonlinear system, which yields better results compared to the case where learning is not anticipated.
Positive Semidefinite Matrix Factorization: A Connection with Phase Retrieval and Affine Rank Minimization
Lahat, Dana, Lang, Yanbin, Tan, Vincent Y. F., Févotte, Cédric
Positive semidefinite matrix factorization (PSDMF) expresses each entry of a nonnegative matrix as the inner product of two positive semidefinite (psd) matrices. When all these psd matrices are constrained to be diagonal, this model is equivalent to nonnegative matrix factorization. Applications include combinatorial optimization, quantum-based statistical models, and recommender systems, among others. However, despite the increasing interest in PSDMF, only a few PSDMF algorithms have been proposed in the literature. In this paper, we show that PSDMF algorithms can be designed based on phase retrieval (PR) and affine rank minimization (ARM) algorithms. This procedure offers a significant shortcut in designing new PSDMF algorithms, as it allows some of the useful numerical properties of existing PR and ARM methods to be carried over to the PSDMF framework. Motivated by this idea, we introduce a new family of PSDMF algorithms based on singular value projection (SVP) and iterative hard thresholding (IHT). This family subsumes previously proposed projected gradient PSDMF methods; additionally, we show a new connection between SVP-based methods and majorization-minimization. Numerical experiments show that, in certain scenarios, our proposed methods outperform state-of-the-art coordinate descent algorithms in terms of convergence speed and computational complexity. In certain cases, our proposed normalized-IHT-based method is the only algorithm able to find a solution. These results support our claim that the PSDMF framework can inherit desired numerical properties from PR and ARM algorithms, leading to more efficient PSDMF algorithms, and motivate further study of the links between these models.
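The PSDMF model itself can be checked numerically on a tiny example. The sketch below builds each entry as a trace inner product of two psd matrices, confirms that the diagonal case reproduces an ordinary nonnegative matrix product, and shows that rank-one factors give squared inner products, the kind of measurement that appears in phase retrieval. All helper names and numeric values are illustrative assumptions, and this is not one of the algorithms proposed in the paper.

```python
def trace_inner(a, b):
    """Entrywise inner product sum_ij A_ij * B_ij (equals trace(AB) for symmetric A, B)."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def diag(values):
    """Diagonal matrix with the given entries (psd when all entries are >= 0)."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def outer(v):
    """Rank-one psd matrix v v^T."""
    return [[x * y for y in v] for x in v]


def psdmf_product(left, right):
    """Matrix whose (i, j) entry is <A_i, B_j> for psd factors A_i and B_j."""
    return [[trace_inner(a, b) for b in right] for a in left]


# Diagonal case: nonnegative vectors w_i and h_j (hypothetical values).
W = [[1.0, 2.0], [0.5, 0.0]]
H_cols = [[3.0, 0.0], [1.0, 4.0]]

x_diag = psdmf_product([diag(w) for w in W], [diag(h) for h in H_cols])
x_nmf = [[sum(wk * hk for wk, hk in zip(w, h)) for h in H_cols] for w in W]
print(x_diag, x_nmf)

# Rank-one case: <a a^T, b b^T> equals the squared inner product (a . b)^2.
a, b = [1.0, 2.0], [3.0, -1.0]
rank_one_entry = trace_inner(outer(a), outer(b))
squared_dot = sum(x * y for x, y in zip(a, b)) ** 2
print(rank_one_entry, squared_dot)
```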
Off-Policy Evaluation via the Regularized Lagrangian
Yang, Mengjiao, Nachum, Ofir, Dai, Bo, Li, Lihong, Schuurmans, Dale
The recently proposed distribution correction estimation (DICE) family of estimators has advanced the state of the art in off-policy evaluation from behavior-agnostic data. While these estimators all perform some form of stationary distribution correction, they arise from different derivations and objective functions. In this paper, we unify these estimators as regularized Lagrangians of the same linear program. The unification allows us to expand the space of DICE estimators to new alternatives that demonstrate improved performance. More importantly, by analyzing the expanded space of estimators both mathematically and empirically we find that dual solutions offer greater flexibility in navigating the tradeoff between optimization stability and estimation bias, and generally provide superior estimates in practice.
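The idea of stationary distribution correction can be illustrated in a tabular toy setting: samples drawn from the data distribution are reweighted by the ratio between the target policy's stationary distribution and the data distribution. In the sketch below both distributions are assumed to be known, which is exactly what DICE estimators do not assume (they estimate the ratio from data), so it shows only the final reweighting step. All names and numbers are hypothetical.

```python
# Hypothetical two-outcome example; each key stands for a state-action pair.
d_data = {"x": 0.5, "y": 0.5}    # distribution that generated the dataset
d_target = {"x": 0.8, "y": 0.2}  # stationary distribution of the target policy
reward = {"x": 1.0, "y": 0.0}

# Correction ratio, assumed known here (a DICE estimator would have to learn it).
ratio = {key: d_target[key] / d_data[key] for key in d_data}

# A small dataset whose empirical frequencies match d_data exactly.
dataset = ["x", "y", "x", "y"]

# Off-policy value estimate: average of ratio-weighted rewards over the data.
estimate = sum(ratio[sample] * reward[sample] for sample in dataset) / len(dataset)

# Ground truth under the target distribution, for comparison.
true_value = sum(d_target[key] * reward[key] for key in d_target)

print(estimate, true_value)
```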
Learning the Solution Manifold in Optimization and Its Application in Motion Planning
Optimization is an essential component for solving problems in wide-ranging fields. Ideally, the objective function should be designed such that the solution is unique and the optimization problem can be solved stably. However, the objective function used in a practical application is usually non-convex, and sometimes it even has an infinite set of solutions. To address this issue, we propose to learn the solution manifold in optimization. We train a model conditioned on the latent variable such that the model represents an infinite set of solutions. In our framework, we reduce this problem to density estimation by using importance sampling, and the latent representation of the solutions is learned by maximizing the variational lower bound. We apply the proposed algorithm to motion-planning problems, which involve the optimization of high-dimensional parameters. The experimental results indicate that the solution manifold can be learned with the proposed algorithm, and the trained model represents an infinite set of homotopic solutions for motion-planning problems.
Group-Fair Online Allocation in Continuous Time
Cayci, Semih, Gupta, Swati, Eryilmaz, Atilla
The theory of discrete-time online learning has been successfully applied in many problems that involve sequential decision-making under uncertainty. However, in many applications, including contractual hiring in online freelancing platforms and server allocation in cloud computing systems, the outcome of each action is observed only after a random and action-dependent time. Furthermore, as a consequence of certain ethical and economic concerns, the controller may impose deadlines on the completion of each task, and require fairness across different groups in the allocation of the total time budget $B$. To address these applications, we consider a continuous-time online learning problem with fairness considerations, and present a novel framework based on continuous-time utility maximization. We show that this formulation recovers reward-maximizing, max-min fair and proportionally fair allocation rules across different groups as special cases. We characterize the optimal offline policy, which allocates the total time between different actions in an optimally fair way (as defined by the utility function), and imposes deadlines to maximize time-efficiency. In the absence of any statistical knowledge, we propose a novel online learning algorithm based on dual ascent optimization for time averages, and prove that it achieves an $\tilde{O}(B^{-1/2})$ regret bound.
Robust Control Synthesis and Verification for Wire-Borne Underactuated Brachiating Robots Using Sum-of-Squares Optimization
Farzan, Siavash, Hu, Ai-Ping, Bick, Michael, Rogers, Jonathan
Control of wire-borne underactuated brachiating robots requires a robust feedback control design that can deal with dynamic uncertainties, actuator constraints and unmeasurable states. In this paper, we develop a robust feedback control for brachiating on flexible cables, building on previous work on optimal trajectory generation and time-varying LQR controller design. We propose a novel simplified model for approximation of the flexible cable dynamics, which enables inclusion of parametric model uncertainties in the system. We then use semidefinite programming (SDP) and sum-of-squares (SOS) optimization to synthesize a time-varying feedback control with formal robustness guarantees to account for model uncertainties and unmeasurable states in the system. Through simulation, hardware experiments and comparison with a time-varying LQR controller, it is shown that the proposed robust controller results in relatively large robust backward reachable sets and is able to reliably track a pre-generated optimal trajectory and achieve the desired brachiating motion in the presence of parametric model uncertainties, actuator limits, and unobservable states.
Online Boosting with Bandit Feedback
We consider the problem of online boosting for regression tasks, when only limited information is available to the learner. We give an efficient regret minimization method that has two implications: an online boosting algorithm with noisy multi-point bandit feedback, and a new projection-free online convex optimization algorithm with stochastic gradient, that improves state-of-the-art guarantees in terms of efficiency.
Batch Policy Learning in Average Reward Markov Decision Processes
Liao, Peng, Qi, Zhengling, Murphy, Susan
We study the problem of policy optimization in Markov Decision Processes over infinite time horizons (Puterman, 1994). We focus on the batch (i.e., off-line) setting, where historical data of multiple trajectories has been previously collected using some behavior policy. Our goal is to learn a new policy with guaranteed performance when implemented in the future. In this work, we develop a data-efficient method to learn the policy that optimizes the long-term average reward in a pre-specified policy class from a training set composed of multiple trajectories. Furthermore, we establish a finite-sample regret guarantee, where regret is the difference between the average reward of the optimal policy in the class and the average reward of the policy estimated by our proposed method. This work is motivated by the development of just-in-time adaptive interventions in mobile health (mHealth) applications (Nahum-Shani et al., 2017). Our method can be used to learn a treatment policy that maps the real-time collected information about the individual's status and context to a particular treatment at each of many decision times to support health behaviors.
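To make the objective concrete, the sketch below computes the long-run average reward of two fixed policies in a toy two-state problem, and the regret of one relative to the other. It assumes the transition matrices are known and the induced chains are ergodic, so it illustrates only the quantities being compared, not the batch estimation method of the paper. The matrices, rewards, and function names are hypothetical.

```python
def stationary_distribution(transition, iterations=1000):
    """Approximate the stationary distribution of an ergodic chain by power iteration."""
    n = len(transition)
    dist = [1.0 / n] * n
    for _ in range(iterations):
        dist = [sum(dist[s] * transition[s][t] for s in range(n)) for t in range(n)]
    return dist


def average_reward(transition, reward):
    """Long-run average reward: expected reward under the stationary distribution."""
    dist = stationary_distribution(transition)
    return sum(d * r for d, r in zip(dist, reward))


# State-to-state transition matrices induced by two hypothetical policies.
policy_a = [[0.9, 0.1], [0.5, 0.5]]
policy_b = [[0.5, 0.5], [0.5, 0.5]]
reward = [1.0, 0.0]  # expected reward per state, assumed equal under both policies

value_a = average_reward(policy_a, reward)
value_b = average_reward(policy_b, reward)

# Regret of policy_b relative to the better policy in this two-element class.
regret_b = max(value_a, value_b) - value_b

print(value_a, value_b, regret_b)
```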