Goto

Collaborating Authors

 Optimization


A Dimension-Insensitive Algorithm for Stochastic Zeroth-Order Optimization

arXiv.org Machine Learning

This paper concerns a convex, stochastic zeroth-order optimization (S-ZOO) problem, where the objective is to minimize the expectation of a cost function and its gradient is not accessible directly. To solve this problem, traditional optimization techniques mostly yield query complexities that grow polynomially with dimensionality, i.e., the number of function evaluations is a polynomial function of the number of decision variables. Consequently, these methods may not perform well in solving massive-dimensional problems arising in many modern applications. Although more recent methods can be provably dimension-insensitive, almost all of them work with arguably more stringent conditions such as everywhere sparse or compressible gradient. Thus, prior to this research, it was unknown whether dimension-insensitive S-ZOO is possible without such conditions. In this paper, we give an affirmative answer to this question by proposing a sparsity-inducing stochastic gradient-free (SI-SGF) algorithm. It is proved to achieve dimension-insensitive query complexity in both convex and strongly convex cases when neither gradient sparsity nor gradient compressibility is satisfied. Our numerical results demonstrate the strong potential of the proposed SI-SGF compared with existing alternatives.


Converting ADMM to a Proximal Gradient for Convex Optimization Problems

arXiv.org Machine Learning

In machine learning and data science, we often consider efficiency for solving problems. In sparse estimation, such as fused lasso and convex clustering, we apply either the proximal gradient method or the alternating direction method of multipliers (ADMM) to solve the problem. It takes time to include matrix division in the former case, while an efficient method such as FISTA (fast iterative shrinkage-thresholding algorithm) has been developed in the latter case. This paper proposes a general method for converting the ADMM solution to the proximal gradient method, assuming that the constraints and objectives are strongly convex. Then, we apply it to sparse estimation problems, such as sparse convex clustering and trend filtering, and we show by numerical experiments that we can obtain a significant improvement in terms of efficiency.


Conditional Selective Inference for Robust Regression and Outlier Detection using Piecewise-Linear Homotopy Continuation

arXiv.org Machine Learning

In practical data analysis under noisy environment, it is common to first use robust methods to identify outliers, and then to conduct further analysis after removing the outliers. In this paper, we consider statistical inference of the model estimated after outliers are removed, which can be interpreted as a selective inference (SI) problem. To use conditional SI framework, it is necessary to characterize the events of how the robust method identifies outliers. Unfortunately, the existing methods cannot be directly used here because they are applicable to the case where the selection events can be represented by linear/quadratic constraints. In this paper, we propose a conditional SI method for popular robust regressions by using homotopy method. We show that the proposed conditional SI method is applicable to a wide class of robust regression and outlier detection methods and has good empirical performance on both synthetic data and real data experiments.


Asymmetric compressive learning guarantees with applications to quantized sketches

arXiv.org Machine Learning

The compressive learning framework reduces the computational cost of training on large-scale datasets. In a sketching phase, the data is first compressed to a lightweight sketch vector, obtained by mapping the data samples through a well-chosen feature map, and averaging those contributions. In a learning phase, the desired model parameters are then extracted from this sketch by solving an optimization problem, which also involves a feature map. When the feature map is identical during the sketching and learning phases, formal statistical guarantees (excess risk bounds) have been proven. However, the desirable properties of the feature map are different during sketching and learning (e.g. quantized outputs, and differentiability, respectively). We thus study the relaxation where this map is allowed to be different for each phase. First, we prove that the existing guarantees carry over to this asymmetric scheme, up to a controlled error term, provided some Limited Projected Distortion (LPD) property holds. We then instantiate this framework to the setting of quantized sketches, by proving that the LPD indeed holds for binary sketch contributions. Finally, we further validate the approach with numerical simulations, including a large-scale application in audio event classification.


Bayesian Optimization is Superior to Random Search for Machine Learning Hyperparameter Tuning: Analysis of the Black-Box Optimization Challenge 2020

arXiv.org Artificial Intelligence

This paper presents the results and insights from the black-box optimization (BBO) challenge at NeurIPS 2020 which ran from July-October, 2020. The challenge emphasized the importance of evaluating derivative-free optimizers for tuning the hyperparameters of machine learning models. This was the first black-box optimization challenge with a machine learning emphasis. It was based on tuning (validation set) performance of standard machine learning models on real datasets. This competition has widespread impact as black-box optimization (e.g., Bayesian optimization) is relevant for hyperparameter tuning in almost every machine learning project as well as many applications outside of machine learning. The final leaderboard was determined using the optimization performance on held-out (hidden) objective functions, where the optimizers ran without human intervention. Baselines were set using the default settings of several open-source black-box optimization packages as well as random search.


Demystify Optimization Challenges in Multilingual Transformers

arXiv.org Artificial Intelligence

Multilingual Transformer improves parameter efficiency and crosslingual transfer. How to effectively train multilingual models has not been well studied. Using multilingual machine translation as a testbed, we study optimization challenges from loss landscape and parameter plasticity perspectives. We found that imbalanced training data poses task interference between high and low resource languages, characterized by nearly orthogonal gradients for major parameters and the optimization trajectory being mostly dominated by high resource. We show that local curvature of the loss surface affects the degree of interference, and existing heuristics of data subsampling implicitly reduces the sharpness, although still face a trade-off between high and low resource languages. We propose a principled multi-objective optimization algorithm, Curvature Aware Task Scaling (CATS), which improves both optimization and generalization especially for low resource. Experiments on TED, WMT and OPUS-100 benchmarks demonstrate that CATS advances the Pareto front of accuracy while being efficient to apply to massive multilingual settings at the scale of 100 languages.


Randomized Algorithms for Scientific Computing (RASC)

arXiv.org Artificial Intelligence

Randomized algorithms have propelled advances in artificial intelligence and represent a foundational research area in advancing AI for Science. Future advancements in DOE Office of Science priority areas such as climate science, astrophysics, fusion, advanced materials, combustion, and quantum computing all require randomized algorithms for surmounting challenges of complexity, robustness, and scalability. This report summarizes the outcomes of that workshop, "Randomized Algorithms for Scientific Computing (RASC)," held virtually across four days in December 2020 and January 2021.


Bayesian Algorithm Execution: Estimating Computable Properties of Black-box Functions Using Mutual Information

arXiv.org Artificial Intelligence

In many real world problems, we want to infer some property of an expensive black-box function f, given a budget of T function evaluations. One example is budget constrained global optimization of f, for which Bayesian optimization is a popular method. Other properties of interest include local optima, level sets, integrals, or graph-structured information induced by f. Often, we can find an algorithm A to compute the desired property, but it may require far more than T queries to execute. Given such an A, and a prior distribution over f, we refer to the problem of inferring the output of A using T evaluations as Bayesian Algorithm Execution (BAX). To tackle this problem, we present a procedure, InfoBAX, that sequentially chooses queries that maximize mutual information with respect to the algorithm's output. Applying this to Dijkstra's algorithm, for instance, we infer shortest paths in synthetic and real-world graphs with black-box edge costs. Using evolution strategies, we yield variants of Bayesian optimization that target local, rather than global, optima. On these problems, InfoBAX uses up to 500 times fewer queries to f than required by the original algorithm. Our method is closely connected to other Bayesian optimal experimental design procedures such as entropy search methods and optimal sensor placement using Gaussian processes.


Complexity Lower Bounds for Nonconvex-Strongly-Concave Min-Max Optimization

arXiv.org Machine Learning

We provide a first-order oracle complexity lower bound for finding stationary points of min-max optimization problems where the objective function is smooth, nonconvex in the minimization variable, and strongly concave in the maximization variable. We establish a lower bound of $\Omega\left(\sqrt{\kappa}\epsilon^{-2}\right)$ for deterministic oracles, where $\epsilon$ defines the level of approximate stationarity and $\kappa$ is the condition number. Our analysis shows that the upper bound achieved in (Lin et al., 2020b) is optimal in the $\epsilon$ and $\kappa$ dependence up to logarithmic factors. For stochastic oracles, we provide a lower bound of $\Omega\left(\sqrt{\kappa}\epsilon^{-2} + \kappa^{1/3}\epsilon^{-4}\right)$. It suggests that there is a significant gap between the upper bound $\mathcal{O}(\kappa^3 \epsilon^{-4})$ in (Lin et al., 2020a) and our lower bound in the condition number dependence.


A Rank based Adaptive Mutation in Genetic Algorithm

arXiv.org Artificial Intelligence

Traditionally Genetic Algorithm has been used for optimization of unimodal and multimodal functions. Earlier researchers worked with constant probabilities of GA control operators like crossover, mutation etc. for tuning the optimization in specific domains. Recent advancements in this field witnessed adaptive approach in probability determination. In Adaptive mutation primarily poor individuals are utilized to explore state space, so mutation probability is usually generated proportionally to the difference between fitness of best chromosome and itself (fMAX - f). However, this approach is susceptible to nature of fitness distribution during optimization. This paper presents an alternate approach of mutation probability generation using chromosome rank to avoid any susceptibility to fitness distribution. Experiments are done to compare results of simple genetic algorithm (SGA) with constant mutation probability and adaptive approaches within a limited resource constraint for unimodal, multimodal functions and Travelling Salesman Problem (TSP). Measurements are done for average best fitness, number of generations evolved and percentage of global optimum achievements out of several trials. The results demonstrate that the rank-based adaptive mutation approach is superior to fitness-based adaptive approach as well as SGA in a multimodal problem space.