 Optimization


A Smoother Way to Train Structured Prediction Models

arXiv.org Machine Learning

We present a framework to train a structured prediction model by performing smoothing on the inference algorithm it builds upon. Smoothing overcomes the non-smoothness inherent to the maximum margin structured prediction objective, and paves the way for the use of fast primal gradient-based optimization algorithms. We illustrate the proposed framework by developing a novel primal incremental optimization algorithm for the structural support vector machine. The proposed algorithm blends an extrapolation scheme for acceleration and an adaptive smoothing scheme and builds upon the stochastic variance-reduced gradient algorithm. We establish its worst-case global complexity bound and study several practical variants, including extensions to deep structured prediction. We present experimental results on two real-world problems, namely named entity recognition and visual object localization. The experimental results show that the proposed framework allows us to build upon efficient inference algorithms to develop large-scale optimization algorithms for structured prediction which can achieve competitive performance on the two real-world problems.
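As a rough illustration of the smoothing idea described in this abstract, the sketch below replaces a hard maximum over candidate scores with a log-sum-exp surrogate, whose gradient is a softmax. The choice of log-sum-exp smoothing, the function names, and the parameter `mu` are assumptions made for illustration; the abstract does not say which smoother the authors use.

```python
import math

def smooth_max(scores, mu):
    """Log-sum-exp surrogate for max(scores); mu > 0 controls the smoothing."""
    m = max(scores)  # subtract the max before exponentiating, for stability
    return m + mu * math.log(sum(math.exp((s - m) / mu) for s in scores))

def smooth_max_grad(scores, mu):
    """Gradient of smooth_max with respect to scores (a softmax distribution)."""
    m = max(scores)
    weights = [math.exp((s - m) / mu) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical scores of three candidate outputs.
scores = [1.0, 2.0, 4.0]
values = {mu: smooth_max(scores, mu) for mu in (1.0, 0.1, 0.01)}
grad = smooth_max_grad(scores, 0.1)
```

The surrogate always lies between `max(scores)` and `max(scores) + mu * log(len(scores))`, so a smaller `mu` tracks the true maximum more closely at the cost of a less smooth objective.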


Mode Collapse and Regularity of Optimal Transportation Maps

arXiv.org Machine Learning

This work connects the regularity theory of optimal transportation maps, the Monge-Ampère equation, and GANs, which gives a theoretical understanding of the major drawbacks of GANs: convergence difficulty and mode collapse. According to the regularity theory of the Monge-Ampère equation, if the support of the target measure is disconnected or merely non-convex, the optimal transportation map is discontinuous. General DNNs can only approximate continuous mappings. This intrinsic conflict leads to the convergence difficulty and mode collapse in GANs. We test our hypothesis that the supports of real data distributions are in general non-convex, and that the discontinuity is therefore unavoidable, using an autoencoder combined with a discrete optimal transportation map (the AE-OT framework) on the CelebA data set. The test result is positive. Furthermore, we propose to approximate the continuous Brenier potential directly, based on discrete Brenier theory, to tackle mode collapse. Compared with existing methods, this method is more accurate and effective.
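The discontinuity claim can be seen in a one-dimensional toy case, sketched below. In one dimension, the optimal map between two equal-size point sets under a convex cost pairs the sorted source points with the sorted target points. The point sets and names here are hypothetical and are not taken from the paper's experiments.

```python
def monotone_transport_1d(source, target):
    """Pair the i-th smallest source point with the i-th smallest target point."""
    if len(source) != len(target):
        raise ValueError("source and target must contain the same number of points")
    return list(zip(sorted(source), sorted(target)))

n = 100
half = n // 2
# Source: evenly spaced points on the connected interval [0, 1].
source = [i / (n - 1) for i in range(n)]
# Target: two separated clusters, on [-2, -1] and on [1, 2] (disconnected support).
left = [-2.0 + i / (half - 1) for i in range(half)]
right = [1.0 + i / (half - 1) for i in range(half)]

pairs = monotone_transport_1d(source, left + right)
# Change in the mapped value between neighbouring source points.
jumps = [pairs[i + 1][1] - pairs[i][1] for i in range(n - 1)]
largest_jump = max(jumps)
```

Neighbouring source points are about 0.01 apart, yet the mapped value jumps across the gap between the two clusters at the midpoint of the source interval. A continuous function cannot reproduce that jump exactly.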


Cost-Effective Incentive Allocation via Structured Counterfactual Inference

arXiv.org Machine Learning

We address a practical problem ubiquitous in modern industry, in which a mediator tries to learn a policy for allocating strategic financial incentives for customers in a marketing campaign and observes only bandit feedback. In contrast to traditional policy optimization frameworks, we rely on a specific assumption for the reward structure and we incorporate budget constraints. We develop a new two-step method for solving this constrained counterfactual policy optimization problem. First, we cast the reward estimation problem as a domain adaptation problem with supplementary structure. Subsequently, the estimators are used for optimizing the policy with constraints. We establish theoretical error bounds for our estimation procedure and we empirically show that the approach leads to significant improvement on both synthetic and real datasets.


Compatible Natural Gradient Policy Search

arXiv.org Machine Learning

Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use the KL-divergence to bound the trust region, resulting in a natural gradient policy update. We show that the natural gradient and trust region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule, leading to premature convergence. To control entropy reduction, we introduce a new policy search method called compatible policy search (COPOS), which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.


Learning Hierarchical Interactions at Scale: A Convex Optimization Approach

arXiv.org Machine Learning

In many learning settings, it is beneficial to augment the main features with pairwise interactions. Such interaction models can often be enhanced by performing variable selection under the so-called strong hierarchy constraint: an interaction is non-zero only if its associated main features are non-zero. Existing convex optimization based algorithms face difficulties in handling problems where the number of main features is $p \sim 10^3$ (with total number of features $\sim p^2$). In this paper, we study a convex relaxation which enforces strong hierarchy and develop a scalable algorithm for solving it. Our proposed algorithm employs a proximal gradient method along with a novel active-set strategy, specialized screening rules, and decomposition rules towards verifying optimality conditions. Our framework can handle problems having dense design matrices, with $p = 50{,}000$ ($\sim 10^9$ interactions), instances that are much larger than those handled by the current state of the art. Experiments on real and synthetic data suggest that our toolkit hierScale outperforms the state of the art in terms of prediction and variable selection and can achieve over a 1000x speed-up.
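The strong hierarchy constraint and the problem size quoted in this abstract can be made concrete with a small sketch. The function names and the toy coefficients below are hypothetical; this is not the hierScale API, only a direct check of the definition given in the abstract.

```python
def strong_hierarchy_violations(main, interactions, tol=0.0):
    """List interaction pairs (j, k) that are non-zero although main[j] or main[k] is zero."""
    return [
        (j, k)
        for (j, k), coef in sorted(interactions.items())
        if abs(coef) > tol and (abs(main[j]) <= tol or abs(main[k]) <= tol)
    ]

def num_pairwise_interactions(p):
    """Number of distinct pairwise interactions among p main features."""
    return p * (p - 1) // 2

# Toy coefficients: main feature 1 is zero, so interaction (0, 1) breaks strong hierarchy.
main = [1.5, 0.0, -0.3]
interactions = {(0, 2): 0.7, (0, 1): 0.2, (1, 2): 0.0}
violations = strong_hierarchy_violations(main, interactions)
total_pairs = num_pairwise_interactions(50_000)
```

With `p = 50_000`, `total_pairs` is 1,249,975,000, which matches the order of magnitude of $10^9$ interactions mentioned in the abstract.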


A Convergence Analysis of Nonlinearly Constrained ADMM in Deep Learning

arXiv.org Machine Learning

Efficient training of deep neural networks (DNNs) is a challenge due to the associated highly nonconvex optimization. The alternating direction method of multipliers (ADMM) has attracted rising attention in deep learning for its potential for distributed computing. However, it remains an open problem to establish the convergence of ADMM in DNN training due to the nonlinear constraints involved. In this paper, we provide an answer to this problem by establishing the convergence of some nonlinearly constrained ADMM for DNNs with smooth activations. To be specific, we establish the global convergence to a Karush-Kuhn-Tucker (KKT) point at an $\mathcal{O}(1/k)$ rate. To achieve this goal, the key development lies in a new local linear approximation technique which enables us to overcome the hurdle of nonlinear constraints in ADMM for DNNs.
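For readers unfamiliar with ADMM, the sketch below runs its three alternating updates on a deliberately simple scalar problem: minimize $\frac{1}{2}(x - a)^2 + \lambda |z|$ subject to $x = z$. This toy problem has a linear constraint and a convex objective, so it does not capture the nonlinearly constrained DNN setting the paper analyzes. The problem, the names, and the default `rho` are assumptions chosen for illustration.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def admm_scalar_lasso(a, lam, rho=1.0, iters=100):
    """Scaled-form ADMM for: min 0.5 * (x - a)**2 + lam * |z|  subject to  x = z."""
    x = z = u = 0.0
    for _ in range(iters):
        # x-update: minimize the smooth term plus the quadratic penalty.
        x = (a + rho * (z - u)) / (1.0 + rho)
        # z-update: proximal step on the non-smooth term.
        z = soft_threshold(x + u, lam / rho)
        # Dual update: accumulate the constraint residual x - z.
        u += x - z
    return x, z, u

x, z, u = admm_scalar_lasso(a=3.0, lam=1.0)
```

The closed-form solution of this toy problem is `soft_threshold(a, lam)`, which gives a direct check that the iterates converge to the right point.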


Linear Inequality Constraints for Neural Network Activations

arXiv.org Machine Learning

We propose a method to impose linear inequality constraints on neural network activations. The proposed method allows a data-driven training approach to be combined with modeling prior knowledge about the task. Our algorithm computes a suitable parameterization of the feasible set at initialization and uses standard variants of stochastic gradient descent to find solutions to the constrained network. Thus, the modeling constraints are always satisfied during training. Crucially, our approach avoids solving a sub-optimization problem at each training step or manually trading off data and constraint fidelity with additional hyperparameters. We consider constrained generative modeling as an important application domain and experimentally demonstrate the proposed method by constraining a variational autoencoder.
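One way a feasible-set parameterization can keep constraints satisfied at every step is sketched below: if the feasible set is a bounded polytope with known vertices, any convex combination of the vertices is feasible, so unconstrained logits passed through a softmax always yield a valid point. The abstract does not describe the authors' parameterization, so the bounded-polytope assumption, the vertex list, and all names here are hypothetical.

```python
import math

def softmax(logits):
    """Map unconstrained reals to non-negative weights that sum to one."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def feasible_point(vertices, logits):
    """Convex combination of polytope vertices, weighted by softmax(logits)."""
    weights = softmax(logits)
    dim = len(vertices[0])
    return [sum(w * v[d] for w, v in zip(weights, vertices)) for d in range(dim)]

# Assumed feasible set: the triangle x >= 0, y >= 0, x + y <= 1.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
point = feasible_point(triangle, [0.3, -1.2, 2.5])
```

Because the logits are unconstrained, a gradient-based optimizer can update them freely while every resulting `point` stays inside the triangle.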


Robust Regression via Online Feature Selection under Adversarial Data Corruption

arXiv.org Machine Learning

The presence of data corruption in user-generated streaming data, such as social media, motivates a new fundamental problem: learning reliable regression coefficients when the features are not all accessible at one time. Until now, several important challenges still cannot be handled concurrently: 1) corrupted data estimation when only partial features are accessible; 2) online feature selection when data contains adversarial corruption; and 3) scaling to a massive dataset. This paper proposes a novel RObust regression algorithm via Online Feature Selection (RoOFS) that concurrently addresses all the above challenges. Specifically, the algorithm iteratively updates the regression coefficients and the uncorrupted set via a robust online feature substitution method. We also prove that our algorithm has a restricted error bound compared to the optimal solution. Extensive experiments on both synthetic and real-world datasets demonstrate that our new method is superior to existing methods in recovering both the selected features and the regression coefficients, with very competitive efficiency.


Conditioning by adaptive sampling for robust design

arXiv.org Machine Learning

We present a new method for design problems wherein the goal is to maximize or specify the value of one or more properties of interest. For example, in protein design, one may wish to find the protein sequence that maximizes fluorescence. We assume access to one or more, potentially black box, stochastic "oracle" predictive functions, each of which maps from the input design space (e.g., protein sequences) to a distribution over a property of interest (e.g., protein fluorescence). At first glance, this problem can be framed as one of optimizing the oracle(s) with respect to the input. However, many state-of-the-art predictive models, such as neural networks, are known to suffer from pathologies, especially for data far from the training distribution. Thus we need to modulate the optimization of the oracle inputs with prior knowledge about what makes 'realistic' inputs (e.g., proteins that stably fold). Herein, we propose a new method to solve this problem, Conditioning by Adaptive Sampling, which yields state-of-the-art results on a protein fluorescence problem, as compared to other recently published approaches. Formally, our method achieves its success by using model-based adaptive sampling to estimate the conditional distribution of the input sequences given the desired properties.


How It Feels to Learn Data Science in 2019 – Towards Data Science

#artificialintelligence

So I just have to buy a Tableau license and I'm now a data scientist? Okay, let's just take that sales pitch with a grain of salt. I may be clueless, but I know there is more to data science than making pretty visualizations. I can even do that in Excel. You got to admit it is slick marketing though. Charting data is the fun stage, and they leave out the painful and time-consuming parts of working with data: cleaning, wrangling, transforming, and loading it. God help you if you need to write a specialized algorithm with your own domain logic when using closed tools. Yes, and that is why I suspect there is value in learning to code. Maybe you can learn Alteryx.