Goto

Collaborating Authors

 Optimization


FLO: Fast and Lightweight Hyperparameter Optimization for AutoML

arXiv.org Machine Learning

Integrating ML models in software is of growing interest. Building accurate models requires right choice of hyperparameters for training procedures (learners), when the training dataset is given. AutoML tools provide APIs to automate the choice, which usually involve many trials of different hyperparameters for a given training dataset. Since training and evaluation of complex models can be time and resource consuming, existing AutoML solutions require long time or large resource to produce accurate models for large scale training data. That prevents AutoML to be embedded in a software which needs to repeatedly tune hyperparameters and produce models to be consumed by other components, such as large-scale data systems. We present a fast and lightweight hyperparameter optimization method FLO and use it to build an efficient AutoML solution. Our method optimizes for minimal evaluation cost instead of number of iterations to find accurate models. Our main idea is to leverage a holistic consideration of the relations among model complexity, evaluation cost and accuracy. FLO has a strong anytime performance and significantly outperforms Bayesian Optimization and random search for hyperparameter tuning on a large open source AutoML Benchmark. Our AutoML solution also outperforms top-ranked AutoML libraries in a majority of the tasks on this benchmark.


On Robustness to Adversarial Examples and Polynomial Optimization

arXiv.org Machine Learning

We study the design of computationally efficient algorithms with provable guarantees, that are robust to adversarial (test time) perturbations. While there has been an proliferation of recent work on this topic due to its connections to test time robustness of deep networks, there is limited theoretical understanding of several basic questions like (i) when and how can one design provably robust learning algorithms? (ii) what is the price of achieving robustness to adversarial examples in a computationally efficient manner? The main contribution of this work is to exhibit a strong connection between achieving robustness to adversarial examples, and a rich class of polynomial optimization problems, thereby making progress on the above questions. In particular, we leverage this connection to (a) design computationally efficient robust algorithms with provable guarantees for a large class of hypothesis, namely linear classifiers and degree-2 polynomial threshold functions (PTFs), (b) give a precise characterization of the price of achieving robustness in a computationally efficient manner for these classes, (c) design efficient algorithms to certify robustness and generate adversarial attacks in a principled manner for 2-layer neural networks. We empirically demonstrate the effectiveness of these attacks on real data.


Machine Intelligence at the Edge with Learning Centric Power Allocation

arXiv.org Artificial Intelligence

While machine-type communication (MTC) devices generate considerable amounts of data, they often cannot process the data due to limited energy and computation power. To empower MTC with intelligence, edge machine learning has been proposed. However, power allocation in this paradigm requires maximizing the learning performance instead of the communication throughput, for which the celebrated water-filling and max-min fairness algorithms become inefficient. To this end, this paper proposes learning centric power allocation (LCPA), which provides a new perspective to radio resource allocation in learning driven scenarios. By employing an empirical classification error model that is supported by learning theory, the LCPA is formulated as a nonconvex nonsmooth optimization problem, and is solved by majorization minimization (MM) framework. To get deeper insights into LCPA, asymptotic analysis shows that the transmit powers are inversely proportional to the channel gain, and scale exponentially with the learning parameters. This is in contrast to traditional power allocations where quality of wireless channels is the only consideration. Last but not least, to enable LCPA in large-scale settings, two optimization algorithms, termed mirror-prox LCPA and accelerated LCPA, are further proposed. Extensive numerical results demonstrate that the proposed LCPA algorithms outperform traditional power allocation algorithms, and the large-scale algorithms reduce the computation time by orders of magnitude compared with MM-based LCPA but still achieve competing learning performance.



Bundle Method Sketching for Low Rank Semidefinite Programming

arXiv.org Machine Learning

In this paper, we show that the bundle method can be applied to solve semidefinite programming problems with a low rank solution without ever constructing a full matrix. To accomplish this, we use recent results from randomly sketching matrix optimization problems and from the analysis of bundle methods. Under strong duality and strict complementarity of SDP, we achieve $\tilde{O}(\frac{1}{\epsilon})$ convergence rates for both the primal and the dual sequences, and the algorithm proposed outputs a $O(\sqrt{\epsilon})$ approximate solution $\hat{X}$ (measured by distances) with a low rank representation with at most $\tilde{O}(\frac{1}{\epsilon})$ many iterations.


Error bound of local minima and KL property of exponent 1/2 for squared F-norm regularized factorization

arXiv.org Machine Learning

This paper is concerned with the squared F(robenius)-norm regularized factorization form for noisy low-rank matrix recovery problems. Under a suitable assumption on the restricted condition number of the Hessian matrix of the loss function, we establish an error bound to the true matrix for those local minima whose ranks are not more than the rank of the true matrix. Then, for the least squares loss function, we achieve the KL property of exponent 1/2 for the F-norm regularized factorization function over its global minimum set under a restricted strong convexity assumption. These theoretical findings are also confirmed by applying an accelerated alternating minimization method to the F-norm regularized factorization problem.


The 'Ingredients' of Machine Learning Algorithms

#artificialintelligence

The esoteric nuances of machine learning algorithms and terminology can easily overwhelm the machine learning novice. As I was reading the Deep Learning book by Yoshua Bengio, Aaron Courville, and Ian Goodfellow, I was ecstatic when I reached the section that explained the common "recipe" that almost all machine learning algorithms share -- a dataset, a cost function, an optimization procedure, and a model. In this article, I summarize each universal'ingredient' of machine learning algorithms by dissecting them into their simplest components. With these'ingredients' in mind, you no longer have to view each new machine learning algorithm you encounter as an entity isolated from the others, but rather a unique combination of the four common elements described below. There are many types of machine learning algorithms.


Online Optimization with Predictions and Non-convex Losses

arXiv.org Machine Learning

We study online optimization in a setting where an online learner seeks to optimize a per-round hitting cost, which may be non-convex, while incurring a movement cost when changing actions between rounds. We ask: \textit{under what general conditions is it possible for an online learner to leverage predictions of future cost functions in order to achieve near-optimal costs?} Prior work has provided near-optimal online algorithms for specific combinations of assumptions about hitting and switching costs, but no general results are known. In this work, we give two general sufficient conditions that specify a relationship between the hitting and movement costs which guarantees that a new algorithm, Synchronized Fixed Horizon Control (SFHC), provides a $1+O(1/w)$ competitive ratio, where $w$ is the number of predictions available to the learner. Our conditions do not require the cost functions to be convex, and we also derive competitive ratio results for non-convex hitting and movement costs. Our results provide the first constant, dimension-free competitive ratio for online non-convex optimization with movement costs. Further, we give an example of a natural instance, Convex Body Chasing (CBC), where the sufficient conditions are not satisfied and we can prove that no online algorithm can have a competitive ratio that converges to 1.


Adaptive versus Standard Descent Methods and Robustness Against Adversarial Examples

arXiv.org Machine Learning

Since this phenomenon was first observed, researchers have attempted to develop methods which produce models that are robust to adversarial perturbations under specific attack models (Wong and Kolter (2018); Sinha et al. (2018); Raghunathan et al. (2018); Mirman et al. (2018); Madry et al. (2018); Zhang et al. (2019)). As machine learning proliferates into society, including security-critical settings like health care (Esteva et al. (2017)) or autonomous vehicles (Codevilla et al. (2018)), it is crucial to develop methods that allow us to understand the vulnerability of our models and design appropriate countermeasures. Additionally there is a growing literature on the theory of adversarial examples. Many of these results attempt to understand adversarial examples by constructing examples of learning problems for which it is difficult to construct a classifier that is robust to adversarial perturbations. This difficultly may arise due to sample complexity (Schmidt et al. (2018)), computational constraints (Bubeck et al. (2019); Degwekar et al. (2019)), or the high-dimensional geometry of the initial feature space (Shafahi et al. (2019); Khoury and Hadfield-Menell (2018)).


Preventing Gradient Attenuation in Lipschitz Constrained Convolutional Networks

arXiv.org Machine Learning

Lipschitz constraints under L2 norm on deep neural networks are useful for provable adversarial robustness bounds, stable training, and Wasserstein distance estimation. While heuristic approaches such as the gradient penalty have seen much practical success, it is challenging to achieve similar practical performance while provably enforcing a Lipschitz constraint. In principle, one can design Lipschitz constrained architectures using the composition property of Lipschitz functions, but Anil et al. recently identified a key obstacle to this approach: gradient norm attenuation. They showed how to circumvent this problem in the case of fully connected networks by designing each layer to be gradient norm preserving. We extend their approach to train scalable, expressive, provably Lipschitz convolutional networks. In particular, we present the Block Convolution Orthogonal Parameterization (BCOP), an expressive parameterization of orthogonal convolution operations. We show that even though the space of orthogonal convolutions is disconnected, the largest connected component of BCOP with 2n channels can represent arbitrary BCOP convolutions over n channels. Our BCOP parameterization allows us to train large convolutional networks with provable Lipschitz bounds. Empirically, we find that it is competitive with existing approaches to provable adversarial robustness and Wasserstein distance estimation.