Optimization
A Look at the Effect of Sample Design on Generalization through the Lens of Spectral Analysis
Kailkhura, Bhavya, Thiagarajan, Jayaraman J., Li, Qunwei, Bremer, Peer-Timo
This paper provides a general framework to study the effect of sampling properties of training data on the generalization error of the learned machine learning (ML) models. Specifically, we propose a new spectral analysis of the generalization error, expressed in terms of the power spectra of the sampling pattern and the function involved. The framework is build in the Euclidean space using Fourier analysis and establishes a connection between some high dimensional geometric objects and optimal spectral form of different state-of-the-art sampling patterns. Subsequently, we estimate the expected error bounds and convergence rate of different state-of-the-art sampling patterns, as the number of samples and dimensions increase. We make several observations about generalization error which are valid irrespective of the approximation scheme (or learning architecture) and training (or optimization) algorithms. Our result also sheds light on ways to formulate design principles for constructing optimal sampling methods for particular problems.
Optimal Transport Relaxations with Application to Wasserstein GANs
Mahdian, Saied, Blanchet, Jose, Glynn, Peter
Optimal transport costs, which include the Wasserstein Distance and the Earth-Mover-Distance as special cases, have become useful tools in machine learning and statistics [16, 3, 1, 18, 8, 6]. The optimal transport cost between two distributions is computed (in its primal form) as a minimization problem, in which the cost of transporting one distribution to another is minimized over all possible joint distributions, leading to linear program (see for example,[26]). Optimal transport provides great flexibility when comparing (probability) measures and histograms. The transportation cost function (which we refer to as the cost function) can be used to capture key geometric characteristics [16]. It can be also used to compare discrete vs continuous distributions directly, without introducing smoothing, in contrast to alternatives such as the Kullback-Leibler divergence (see [3, 13] for more details). Also, by judiciously choosing the cost function, a Wasserstein distance can generate either the topology corresponding to weak convergence or the total variation distance. In data-driven applications, one needs to estimate the optimal transport cost by means of sampled data.
Robust Attacks against Multiple Classifiers
Perdomo, Juan C., Singer, Yaron
We address the challenge of designing optimal adversarial noise algorithms for settings where a learner has access to multiple classifiers. We demonstrate how this problem can be framed as finding strategies at equilibrium in a two-player, zero-sum game between a learner and an adversary. In doing so, we illustrate the need for randomization in adversarial attacks. In order to compute Nash equilibrium, our main technical focus is on the design of best response oracles that can then be implemented within a Multiplicative Weights Update framework to boost deterministic perturbations against a set of models into optimal mixed strategies. We demonstrate the practical effectiveness of our approach on a series of image classification tasks using both linear classifiers and deep neural networks.
A General $\mathcal{O}(n^2)$ Hyper-Parameter Optimization for Gaussian Process Regression with Cross-Validation and Non-linearly Constrained ADMM
Xu, Linning, Yin, Feng, Zhang, Jiawei, Luo, Zhi-Quan, Cui, Shuguang
Hyper-parameter optimization remains as the core issue of Gaussian process (GP) for machine learning nowadays. The benchmark method using maximum likelihood (ML) estimation and gradient descent (GD) is impractical for processing big data due to its $O(n^3)$ complexity. Many sophisticated global or local approximation models, for instance, sparse GP, distributed GP, have been proposed to address such complexity issue. In this paper, we propose two novel and general-purpose GP hyper-parameter training schemes (GPCV-ADMM) by replacing ML with cross-validation (CV) as the fitting criterion and replacing GD with a non-linearly constrained alternating direction method of multipliers (ADMM) as the optimization method. The proposed schemes are of $O(n^2)$ complexity for any covariance matrix without special structure. We conduct various experiments based on both synthetic and real data sets, wherein the proposed schemes show excellent performance in terms of convergence, hyper-parameter estimation accuracy, and computational time in comparison with the traditional ML based routines given in the GPML toolbox.
Fair Division Without Disparate Impact
Peysakhovich, Alexander, Kroer, Christian
We consider the problem of dividing items between individuals in a way that is fair both in the sense of distributional fairness and in the sense of not having disparate impact across protected classes. An important existing mechanism for distributionally fair division is competitive equilibrium from equal incomes (CEEI). Unfortunately, CEEI will not, in general, respect disparate impact constraints. We consider two types of disparate impact measures: requiring that allocations be similar across protected classes and requiring that average utility levels be similar across protected classes. We modify the standard CEEI algorithm in two ways: equitable equilibrium from equal incomes, which removes disparate impact in allocations, and competitive equilibrium from equitable incomes which removes disparate impact in attained utility levels. We show analytically that removing disparate impact in outcomes breaks several of CEEI's desirable properties such as envy, regret, Pareto optimality, and incentive compatibility. By contrast, we can remove disparate impact in attained utility levels without affecting these properties. Finally, we experimentally evaluate the tradeoffs between efficiency, equity, and disparate impact in a recommender-system based market.
Complete Dictionary Learning via $\ell^4$-Norm Maximization over the Orthogonal Group
Zhai, Yuexiang, Yang, Zitong, Liao, Zhenyu, Wright, John, Ma, Yi
This paper considers the fundamental problem of learning a complete (orthogonal) dictionary from samples of sparsely generated signals. Most existing methods solve the dictionary (and sparse representations) based on heuristic algorithms, usually without theoretical guarantees for either optimality or complexity. The recent $\ell^1$-minimization based methods do provide such guarantees but the associated algorithms recover the dictionary one column at a time. In this work, we propose a new formulation that maximizes the $\ell^4$-norm over the orthogonal group, to learn the entire dictionary. We prove that under a random data model, with nearly minimum sample complexity, the global optima of the $\ell^4$ norm are very close to signed permutations of the ground truth. Inspired by this observation, we give a conceptually simple and yet effective algorithm based on `matching, stretching, and projection' (MSP). The algorithm provably converges locally at a superlinear (cubic) rate and cost per iteration is merely an SVD. In addition to strong theoretical guarantees, experiments show that the new algorithm is significantly more efficient and effective than existing methods, including KSVD and $\ell^1$-based methods. Preliminary experimental results on real images clearly demonstrate advantages of so learned dictionary over classic PCA bases.
Nonconvex Approach for Sparse and Low-Rank Constrained Models with Dual Momentum
We first propose a novel nonconvex rank surrogate on the general rank minimization problem and apply this to the corrupted image completion problem. Then, we propose that nonconvex rank surrogates can be introduced into two well-known sparse and low-rank models: Robust Principal Component Analysis (RPCA) and Low-Rank Representation (LRR). For optimization, we use alternating direction method of multipliers (ADMM) and propose a trick, which is called the dual momentum. We add the difference of the dual variable between the current and the last iteration with a weight. This trick can avoid the local minimum problem and make the algorithm converge to a solution with smaller recovery error in the nonconvex optimization problem. Also, it can boost the convergence when the variable updates too slowly. We also give a severe proof and verify that the proposed algorithms are convergent. Then, several experiments are conducted, including image completion, denoising, and spectral clustering with outlier detection. These experiments show that the proposed methods are effective in image and signal processing applications, and have the best performance compared with state-of-the-art methods.
Machine Learning and System Identification for Estimation in Physical Systems
In this thesis, we draw inspiration from both classical system identification and modern machine learning in order to solve estimation problems for real-world, physical systems. The main approach to estimation and learning adopted is optimization based. Concepts such as regularization will be utilized for encoding of prior knowledge and basis-function expansions will be used to add nonlinear modeling power while keeping data requirements practical. The thesis covers a wide range of applications, many inspired by applications within robotics, but also extending outside this already wide field. Usage of the proposed methods and algorithms are in many cases illustrated in the real-world applications that motivated the research. Topics covered include dynamics modeling and estimation, model-based reinforcement learning, spectral estimation, friction modeling and state estimation and calibration in robotic machining. In the work on modeling and identification of dynamics, we develop regularization strategies that allow us to incorporate prior domain knowledge into flexible, overparameterized models. We make use of classical control theory to gain insight into training and regularization while using flexible tools from modern deep learning. A particular focus of the work is to allow use of modern methods in scenarios where gathering data is associated with a high cost. In the robotics-inspired parts of the thesis, we develop methods that are practically motivated and ensure that they are implementable also outside the research setting. We demonstrate this by performing experiments in realistic settings and providing open-source implementations of all proposed methods and algorithms.
On the Convergence of SARAH and Beyond
Li, Bingcong, Ma, Meng, Giannakis, Georgios B.
The main theme of this work is a unifying algorithm, abbreviated as L2S, that can deal with (strongly) convex and nonconvex empirical risk minimization (ERM) problems. It broadens a recently developed variance reduction method known as SARAH. L2S enjoys a linear convergence rate for strongly convex problems, which also implies the last iteration of SARAH's inner loop converges linearly. For convex problems, different from SARAH, L2S can afford step and mini-batch sizes not dependent on the data size $n$, and the complexity needed to guarantee $\mathbb{E}[\|\nabla F(\mathbf{x}) \|^2] \leq \epsilon$ is ${\cal O}(n+ n/\epsilon)$. For nonconvex problems on the other hand, the complexity is ${\cal O}(n+ \sqrt{n}/\epsilon)$. Parallel to L2S there are a few side results. Leveraging an aggressive step size, D2S is proposed, which provides a more efficient alternative to L2S and SARAH-like algorithms. Specifically, D2S requires a reduced IFO complexity of ${\cal O}\big( (n+ \bar{\kappa}) \ln (1/\epsilon) \big)$ for strongly convex problems. Moreover, to avoid the tedious selection of the optimal step size, an automatic tuning scheme is developed, which obtains comparable empirical performance with SARAH using judiciously tuned step size.
Fair Distributions from Biased Samples: A Maximum Entropy Optimization Framework
Celis, L. Elisa, Keswani, Vijay, Yildiz, Ozan, Vishnoi, Nisheeth K.
One reason for the emergence of bias in AI systems is biased data -- datasets that may not be true representations of the underlying distributions -- and may over or under-represent groups with respect to protected attributes such as gender or race. We consider the problem of correcting such biases and learning distributions that are "fair", with respect to measures such as proportional representation and statistical parity, from the given samples. Our approach is based on a novel formulation of the problem of learning a fair distribution as a maximum entropy optimization problem with a given expectation vector and a prior distribution. Technically, our main contributions are: (1) a new second-order method to compute the (dual of the) maximum entropy distribution over an exponentially-sized discrete domain that turns out to be faster than previous methods, and (2) methods to construct prior distributions and expectation vectors that provably guarantee that the learned distributions satisfy a wide class of fairness criteria. Our results also come with quantitative bounds on the total variation distance between the empirical distribution obtained from the samples and the learned fair distribution. Our experimental results include testing our approach on the COMPAS dataset and showing that the fair distributions not only improve disparate impact values but when used to train classifiers only incur a small loss of accuracy.