Goto

Collaborating Authors

 Optimization


Group-Structured Adversarial Training

arXiv.org Machine Learning

Robust training methods against perturbations to the input data have received great attention in the machine learning literature. A standard approach in this direction is adversarial training which learns a model using adversarially-perturbed training samples. However, adversarial training performs suboptimally against perturbations structured across samples such as universal and group-sparse shifts that are commonly present in biological data such as gene expression levels of different tissues. In this work, we seek to close this optimality gap and introduce Group-Structured Adversarial Training (GSAT) which learns a model robust to perturbations structured across samples. We formulate GSAT as a non-convex concave minimax optimization problem which minimizes a group-structured optimal transport cost. Specifically, we focus on the applications of GSAT for group-sparse and rank-constrained perturbations modeled using group and nuclear norm penalties. In order to solve GSAT's non-smooth optimization problem in those cases, we propose a new minimax optimization algorithm called GDADMM by combining Gradient Descent Ascent (GDA) and Alternating Direction Method of Multipliers (ADMM). We present several applications of the GSAT framework to gain robustness against structured perturbations for image recognition and computational biology datasets.


The Perils of Learning Before Optimizing

arXiv.org Artificial Intelligence

Formulating real-world optimization problems often begins with making predictions from historical data (e.g., an optimizer that aims to recommend fast routes relies upon travel-time predictions). Typically, learning the prediction model used to generate the optimization problem and solving that problem are performed in two separate stages. Recent work has showed how such prediction models can be learned end-to-end by differentiating through the optimization task. Such methods often yield empirical improvements, which are typically attributed to end-to-end making better error tradeoffs than the standard loss function used in a two-stage solution. We refine this explanation and more precisely characterize when end-to-end can improve performance. When prediction targets are stochastic, a two-stage solution must make an a priori choice about which statistics of the target distribution to model -- we consider expectations over prediction targets -- while an end-to-end solution can make this choice adaptively. We show that the performance gap between a two-stage and end-to-end approach is closely related to the \emph{price of correlation} concept in stochastic optimization and show the implications of some existing POC results for our predict-then-optimize problem. We then consider a novel and particularly practical setting, where coefficients in the objective function depend on multiple prediction targets. We give explicit constructions where (1) two-stage performs unboundedly worse than end-to-end; and (2) two-stage is optimal. We identify a large set of real-world applications whose objective functions rely on multiple prediction targets but which nevertheless deploy two-stage solutions. We also use simulations to experimentally quantify performance gaps.


Escaping strict saddle points of the Moreau envelope in nonsmooth optimization

arXiv.org Machine Learning

Recent work has shown that stochastically perturbed gradient methods can efficiently escape strict saddle points of smooth functions. We extend this body of work to nonsmooth optimization, by analyzing an inexact analogue of a stochastically perturbed gradient method applied to the Moreau envelope. The main conclusion is that a variety of algorithms for nonsmooth optimization can escape strict saddle points of the Moreau envelope at a controlled rate. The main technical insight is that typical algorithms applied to the proximal subproblem yield directions that approximate the gradient of the Moreau envelope in relative terms.


On the Role of Entropy-based Loss for Learning Causal Structures with Continuous Optimization

arXiv.org Artificial Intelligence

Causal discovery from observational data is an important but challenging task in many scientific fields. Recently, NOTEARS [Zheng et al., 2018] formulates the causal structure learning problem as a continuous optimization problem using least-square loss with an acyclicity constraint. Though the least-square loss function is well justified under the standard Gaussian noise assumption, it is limited if the assumption does not hold. In this work, we theoretically show that the violation of the Gaussian noise assumption will hinder the causal direction identification, making the causal orientation fully determined by the causal strength as well as the variances of noises in the linear case and the noises of strong non-Gaussianity in the nonlinear case. Consequently, we propose a more general entropy-based loss that is theoretically consistent with the likelihood score under any noise distribution. We run extensive empirical evaluations on both synthetic data and real-world data to validate the effectiveness of the proposed method and show that our method achieves the best in Structure Hamming Distance, False Discovery Rate, and True Positive Rate matrices.


Algorithmic Bias and Data Bias: Understanding the Relation between Distributionally Robust Optimization and Data Curation

arXiv.org Machine Learning

Machine learning systems based on minimizing average error have been shown to perform inconsistently across notable subsets of the data, which is not exposed by a low average error for the entire dataset. In consequential social and economic applications, where data represent people, this can lead to discrimination of underrepresented gender and ethnic groups. Given the importance of bias mitigation in machine learning, the topic leads to contentious debates on how to ensure fairness in practice (data bias versus algorithmic bias). Distributionally Robust Optimization (DRO) seemingly addresses this problem by minimizing the worst expected risk across subpopulations. We establish theoretical results that clarify the relation between DRO and the optimization of the same loss averaged on an adequately weighted training dataset. The results cover finite and infinite number of training distributions, as well as convex and non-convex loss functions. We show that neither DRO nor curating the training set should be construed as a complete solution for bias mitigation: in the same way that there is no universally robust training set, there is no universal way to setup a DRO problem and ensure a socially acceptable set of results. We then leverage these insights to provide a mininal set of practical recommendations for addressing bias with DRO. Finally, we discuss ramifications of our results in other related applications of DRO, using an example of adversarial robustness. Our results show that there is merit to both the algorithm-focused and the data-focused side of the bias debate, as long as arguments in favor of these positions are precisely qualified and backed by relevant mathematics known today.


Taming Nonconvexity in Kernel Feature Selection---Favorable Properties of the Laplace Kernel

arXiv.org Machine Learning

Kernel-based feature selection is an important tool in nonparametric statistics. Despite many practical applications of kernel-based feature selection, there is little statistical theory available to support the method. A core challenge is the objective function of the optimization problems used to define kernel-based feature selection are nonconvex. The literature has only studied the statistical properties of the \emph{global optima}, which is a mismatch, given that the gradient-based algorithms available for nonconvex optimization are only able to guarantee convergence to local minima. Studying the full landscape associated with kernel-based methods, we show that feature selection objectives using the Laplace kernel (and other $\ell_1$ kernels) come with statistical guarantees that other kernels, including the ubiquitous Gaussian kernel (or other $\ell_2$ kernels) do not possess. Based on a sharp characterization of the gradient of the objective function, we show that $\ell_1$ kernels eliminate unfavorable stationary points that appear when using an $\ell_2$ kernel. Armed with this insight, we establish statistical guarantees for $\ell_1$ kernel-based feature selection which do not require reaching the global minima. In particular, we establish model-selection consistency of $\ell_1$-kernel-based feature selection in recovering main effects and hierarchical interactions in the nonparametric setting with $n \sim \log p$ samples.


How Low Can We Go: Trading Memory for Error in Low-Precision Training

arXiv.org Artificial Intelligence

However, we pay a price for the savings: lower precision may yield larger round-off error and hence larger prediction error. As applications proliferate, users must choose which precision to use to train a new model, and chip manufacturers must decide which precisions to manufacture. We view these precision choices as a hyperparameter tuning problem, and borrow ideas from meta-learning to learn the tradeoff between memory and error. In this paper, we introduce Pareto Estimation to Pick the Perfect Precision (PEPPP). We use matrix factorization to find non-dominated configurations (the Pareto frontier) with a limited number of network evaluations. For any given memory budget, the precision that minimizes error is a point on this frontier. Practitioners can use the frontier to trade memory for error and choose the best precision for their goals.


Maxmin-Fair Ranking: Individual Fairness under Group-Fairness Constraints

arXiv.org Artificial Intelligence

The bulk of the algorithmic fairness literature deals with group fairness along the lines of demographic parity [9] or equal opportunity We study a novel problem of fairness in ranking aimed at minimizing [16]: this is typically expressed by means of some fairness the amount of individual unfairness introduced when enforcing constraint requiring that the top-positions (for any) in the ranking group-fairness constraints. Our proposal is rooted in the contain enough elements from some groups that are protected distributional maxmin fairness theory, which uses randomization from discrimination based on sex, race, age, etc. In fact, [6] shows to maximize the expected satisfaction of the worst-off individuals.


TSO: Curriculum Generation using continuous optimization

arXiv.org Artificial Intelligence

The training of deep learning models poses vast challenges of including parameter tuning and ordering of training data. Significant research has been done in Curriculum learning for optimizing the sequence of training data. Recent works have focused on using complex reinforcement learning techniques to find the optimal data ordering strategy to maximize learning for a given network. In this paper, we present a simple and efficient technique based on continuous optimization. We call this new approach Training Sequence Optimization (TSO). There are three critical components in our proposed approach: (a) An encoder network maps/embeds training sequence into continuous space. (b) A predictor network uses the continuous representation of a strategy as input and predicts the accuracy for fixed network architecture. (c) A decoder further maps a continuous representation of a strategy to the ordered training dataset. The performance predictor and encoder enable us to perform gradient-based optimization in the continuous space to find the embedding of optimal training data ordering with potentially better accuracy. Experiments show that we can gain 2AP with our generated optimal curriculum strategy over the random strategy using the CIFAR-100 dataset and have better boosts than the state of the art CL algorithms. We do an ablation study varying the architecture, dataset and sample sizes showcasing our approach's robustness.


Optimum-statistical collaboration towards efficient black-box optimization

arXiv.org Machine Learning

With increasingly more hyperparameters involved in their training, machine learning systems demand a better understanding of hyperparameter tuning automation. This has raised interest in studies of provably black-box optimization, which is made more practical by better exploration mechanism implemented in algorithm design, managing the flux of both optimization and statistical errors. Prior efforts focus on delineating optimization errors, but this is deficient: black-box optimization algorithms can be inefficient without considering heterogeneity among reward samples. In this paper, we make the key delineation on the role of statistical uncertainty in black-box optimization, guiding a more efficient algorithm design. We introduce \textit{optimum-statistical collaboration}, a framework of managing the interaction between optimization error flux and statistical error flux evolving in the optimization process. Inspired by this framework, we propose the \texttt{VHCT} algorithms for objective functions with only local-smoothness assumptions. In theory, we prove our algorithm enjoys rate-optimal regret bounds; in experiments, we show the algorithm outperforms prior efforts in extensive settings.