Goto

Collaborating Authors

 Optimization


Randomized Block-Diagonal Preconditioning for Parallel Learning

arXiv.org Machine Learning

We study preconditioned gradient-based optimization methods where the preconditioning matrix has block-diagonal form. Such a structural constraint comes with the advantage that the update computation is block-separable and can be parallelized across multiple independent tasks. Our main contribution is to demonstrate that the convergence of these methods can significantly be improved by a randomization technique which corresponds to repartitioning coordinates across tasks during the optimization procedure. We provide a theoretical analysis that accurately characterizes the expected convergence gains of repartitioning and validate our findings empirically on various traditional machine learning tasks. From an implementation perspective, block-separable models are well suited for parallelization and, when shared memory is available, randomization can be implemented on top of existing methods very efficiently to improve convergence.


Continuous Submodular Function Maximization

arXiv.org Machine Learning

Continuous submodular functions are a category of generally non-convex/non-concave functions with a wide spectrum of applications. The celebrated property of this class of functions - continuous submodularity - enables both exact minimization and approximate maximization in poly. time. Continuous submodularity is obtained by generalizing the notion of submodularity from discrete domains to continuous domains. It intuitively captures a repulsive effect amongst different dimensions of the defined multivariate function. In this paper, we systematically study continuous submodularity and a class of non-convex optimization problems: continuous submodular function maximization. We start by a thorough characterization of the class of continuous submodular functions, and show that continuous submodularity is equivalent to a weak version of the diminishing returns (DR) property. Thus we also derive a subclass of continuous submodular functions, termed continuous DR-submodular functions, which enjoys the full DR property. Then we present operations that preserve continuous (DR-)submodularity, thus yielding general rules for composing new submodular functions. We establish intriguing properties for the problem of constrained DR-submodular maximization, such as the local-global relation. We identify several applications of continuous submodular optimization, ranging from influence maximization, MAP inference for DPPs to provable mean field inference. For these applications, continuous submodularity formalizes valuable domain knowledge relevant for optimizing this class of objectives. We present inapproximability results and provable algorithms for two problem settings: constrained monotone DR-submodular maximization and constrained non-monotone DR-submodular maximization. Finally, we extensively evaluate the effectiveness of the proposed algorithms.


Safe Learning under Uncertain Objectives and Constraints

arXiv.org Artificial Intelligence

In this paper, we consider non-convex optimization problems under \textit{unknown} yet safety-critical constraints. Such problems naturally arise in a variety of domains including robotics, manufacturing, and medical procedures, where it is infeasible to know or identify all the constraints. Therefore, the parameter space should be explored in a conservative way to ensure that none of the constraints are violated during the optimization process once we start from a safe initialization point. To this end, we develop an algorithm called Reliable Frank-Wolfe (Reliable-FW). Given a general non-convex function and an unknown polytope constraint, Reliable-FW simultaneously learns the landscape of the objective function and the boundary of the safety polytope. More precisely, by assuming that Reliable-FW has access to a (stochastic) gradient oracle of the objective function and a noisy feasibility oracle of the safety polytope, it finds an $\epsilon$-approximate first-order stationary point with the optimal ${\mathcal{O}}({1}/{\epsilon^2})$ gradient oracle complexity (resp. $\tilde{\mathcal{O}}({1}/{\epsilon^3})$ (also optimal) in the stochastic gradient setting), while ensuring the safety of all the iterates. Rather surprisingly, Reliable-FW only makes $\tilde{\mathcal{O}}(({d^2}/{\epsilon^2})\log 1/\delta)$ queries to the noisy feasibility oracle (resp. $\tilde{\mathcal{O}}(({d^2}/{\epsilon^4})\log 1/\delta)$ in the stochastic gradient setting) where $d$ is the dimension and $\delta$ is the reliability parameter, tightening the existing bounds even for safe minimization of convex functions. We further specialize our results to the case that the objective function is convex. A crucial component of our analysis is to introduce and apply a technique called geometric shrinkage in the context of safe optimization.


A Framework for Fairness in Two-Sided Marketplaces

arXiv.org Artificial Intelligence

Many interesting problems in the Internet industry can be framed as a two-sided marketplace problem. Examples include search applications and recommender systems showing people, jobs, movies, products, restaurants, etc. Incorporating fairness while building such systems is crucial and can have deep social and economic impact (applications include job recommendations, recruiters searching for candidates, etc.). In this paper, we propose a definition and develop an end-to-end framework for achieving fairness while building such machine learning systems at scale. We extend prior work [29] to develop an optimization framework that can tackle fairness constraints from both the source and destination sides of the marketplace, as well as dynamic aspects of the problem. The framework is flexible enough to adapt to different definitions of fairness and can be implemented in very large-scale settings. We perform simulations to show the efficacy of our approach.


Differentiable Segmentation of Sequences

arXiv.org Machine Learning

Segmented models are widely used to describe non-stationary sequential data with discrete change points. Their estimation usually requires solving a mixed discrete-continuous optimization problem, where the segmentation is the discrete part and all other model parameters are continuous. A number of estimation algorithms have been developed that are highly specialized for their specific model assumptions. The dependence on non-standard algorithms makes it hard to integrate segmented models in state-of-the-art deep learning architectures that critically depend on gradient-based optimization techniques. In this work, we formulate a relaxed variant of segmented models that enables joint estimation of all model parameters, including the segmentation, with gradient descent. We build on recent advances in learning continuous warping functions and propose a novel family of warping functions based on the two-sided power (TSP) distribution. TSP-based warping functions are differentiable, have simple closed-form expressions, and can represent segmentation functions exactly. Our formulation includes the important class of segmented generalized linear models as a special case, which makes it highly versatile. We use our approach to model the spread of COVID-19 by segmented Poisson regression, perform logistic regression on Fashion-MNIST with artificial concept drift, and demonstrate its capacities for phoneme segmentation.


Sparse-RS: a versatile framework for query-efficient sparse black-box adversarial attacks

arXiv.org Machine Learning

A large body of research has focused on adversarial attacks which require to modify all input features with small $l_2$- or $l_\infty$-norms. In this paper we instead focus on query-efficient sparse attacks in the black-box setting. Our versatile framework, Sparse-RS, based on random search achieves state-of-the-art success rate and query efficiency for different sparse attack models such as $l_0$-bounded perturbations (outperforming established white-box methods), adversarial patches, and adversarial framing. We show the effectiveness of Sparse-RS on different datasets considering problems from image recognition and malware detection and multiple variations of sparse threat models, including targeted and universal perturbations. In particular Sparse-RS can be used for realistic attacks such as universal adversarial patch attacks without requiring a substitute model. The code of our framework is available at https://github.com/fra31/sparse-rs.


Exact Support Recovery in Federated Regression with One-shot Communication

arXiv.org Machine Learning

Federated learning provides a framework to address the challenges of distributed computing, data ownership and privacy over a large number of distributed clients with low computational and communication capabilities. In this paper, we study the problem of learning the exact support of sparse linear regression in the federated learning setup. We provide a simple communication efficient algorithm which only needs one-shot communication with the centralized server to compute the exact support. Our method does not require the clients to solve any optimization problem and thus, can be run on devices with low computational capabilities. Our method is naturally robust to the problems of client failure, model poisoning and straggling clients. We formally prove that our method requires a number of samples per client that is polynomial with respect to the support size, but independent of the dimension of the problem. We require the number of distributed clients to be logarithmic in the dimension of the problem. If the predictor variables are mutually independent then the overall sample complexity matches the optimal sample complexity of the non-federated centralized setting. Furthermore, our method is easy to implement and has an overall polynomial time complexity.


A Dynamical Systems Approach for Convergence of the Bayesian EM Algorithm

arXiv.org Machine Learning

Out of the recent advances in systems and control (S\&C)-based analysis of optimization algorithms, not enough work has been specifically dedicated to machine learning (ML) algorithms and its applications. This paper addresses this gap by illustrating how (discrete-time) Lyapunov stability theory can serve as a powerful tool to aid, or even lead, in the analysis (and potential design) of optimization algorithms that are not necessarily gradient-based. The particular ML problem that this paper focuses on is that of parameter estimation in an incomplete-data Bayesian framework via the popular optimization algorithm known as maximum a posteriori expectation-maximization (MAP-EM). Following first principles from dynamical systems stability theory, conditions for convergence of MAP-EM are developed. Furthermore, if additional assumptions are met, we show that fast convergence (linear or quadratic) is achieved, which could have been difficult to unveil without our adopted S\&C approach. The convergence guarantees in this paper effectively expand the set of sufficient conditions for EM applications, thereby demonstrating the potential of similar S\&C-based convergence analysis of other ML algorithms.


Inexact Derivative-Free Optimization for Bilevel Learning

arXiv.org Machine Learning

Variational regularization techniques are dominant in the field of mathematical imaging. A drawback of these techniques is that they are dependent on a number of parameters which have to be set by the user. A by now common strategy to resolve this issue is to learn these parameters from data. While mathematically appealing this strategy leads to a nested optimization problem (known as bilevel optimization) which is computationally very difficult to handle. A key ingredient in solving the upper-level problem is the exact solution of the lower-level problem which is practically infeasible. In this work we propose to solve these problems using inexact derivative-free optimization algorithms which never require to solve the lower-level problem exactly. We provide global convergence and worst-case complexity analysis of our approach, and test our proposed framework on ROF-denoising and learning MRI sampling patterns. Dynamically adjusting the lower-level accuracy yields learned parameters with similar reconstruction quality as high-accuracy evaluations but with dramatic reductions in computational work (up to 100 times faster in some cases).


Beyond $\mathcal{O}(\sqrt{T})$ Regret for Constrained Online Optimization: Gradual Variations and Mirror Prox

arXiv.org Machine Learning

We study constrained online convex optimization, where the constraints consist of a relatively simple constraint set (e.g. a Euclidean ball) and multiple functional constraints. Projections onto such decision sets are usually computationally challenging. So instead of enforcing all constraints over each slot, we allow decisions to violate these functional constraints but aim at achieving a low regret and a low cumulative constraint violation over a horizon of $T$ time slot. The best known bound for solving this problem is $\mathcal{O}(\sqrt{T})$ regret and $\mathcal{O}(1)$ constraint violation, whose algorithms and analysis are restricted to Euclidean spaces. In this paper, we propose a new online primal-dual mirror prox algorithm whose regret is measured via a total gradient variation $V_*(T)$ over a sequence of $T$ loss functions. Specifically, we show that the proposed algorithm can achieve an $\mathcal{O}(\sqrt{V_*(T)})$ regret and $\mathcal{O}(1)$ constraint violation simultaneously. Such a bound holds in general non-Euclidean spaces, is never worse than the previously known $\big( \mathcal{O}(\sqrt{T}), \mathcal{O}(1) \big)$ result, and can be much better on regret when the variation is small. Furthermore, our algorithm is computationally efficient in that only two mirror descent steps are required during each slot instead of solving a general Lagrangian minimization problem. Along the way, our bounds also improve upon those of previous attempts using mirror-prox-type algorithms solving this problem, which yield a relatively worse $\mathcal{O}(T^{2/3})$ regret and $\mathcal{O}(T^{2/3})$ constraint violation.