Optimization
Inhomogeneous Hypergraph Clustering with Applications
Hypergraph partitioning is an important problem in machine learning, computer vision and network analytics. A widely used method for hypergraph partitioning relies on minimizing a normalized sum of the costs of partitioning hyperedges across clusters. Algorithmic solutions based on this approach assume that different partitions of a hyperedge incur the same cost. However, this assumption fails to leverage the fact that different subsets of vertices within the same hyperedge may have different structural importance. We hence propose a new hypergraph clustering technique, termed inhomogeneous hypergraph partitioning, which assigns different costs to different hyperedge cuts. We prove that inhomogeneous partitioning produces a quadratic approximation to the optimal solution if the inhomogeneous costs satisfy submodularity constraints. Moreover, we demonstrate that inhomogenous partitioning offers significant performance improvements in applications such as structure learning of rankings, subspace segmentation and motif clustering.
Practical Bayesian Optimization for Model Fitting with Bayesian Adaptive Direct Search
Computational models in fields such as computational neuroscience are often evaluated via stochastic simulation or numerical approximation. Fitting these models implies a difficult optimization problem over complex, possibly noisy parameter landscapes. Bayesian optimization (BO) has been successfully applied to solving expensive black-box problems in engineering and machine learning. Here we explore whether BO can be applied as a general tool for model fitting. First, we present a novel hybrid BO algorithm, Bayesian adaptive direct search (BADS), that achieves competitive performance with an affordable computational overhead for the running time of typical models. We then perform an extensive benchmark of BADS vs. many common and state-of-the-art nonconvex, derivative-free optimizers, on a set of model-fitting problems with real data and models from six studies in behavioral, cognitive, and computational neuroscience. With default settings, BADS consistently finds comparable or better solutions than other methods, including `vanilla' BO, showing great promise for advanced BO techniques, and BADS in particular, as a general model-fitting tool.
Rate Optimal Estimation and Confidence Intervals for High-dimensional Regression with Missing Covariates
Wang, Yining, Wang, Jialei, Balakrishnan, Sivaraman, Singh, Aarti
Although a majority of the theoretical literature in high-dimensional statistics has focused on settings which involve fully-observed data, settings with missing values and corruptions are common in practice. We consider the problems of estimation and of constructing component-wise confidence intervals in a sparse high-dimensional linear regression model when some covariates of the design matrix are missing completely at random. We analyze a variant of the Dantzig selector [9] for estimating the regression model and we use a de-biasing argument to construct component-wise confidence intervals. Our first main result is to establish upper bounds on the estimation error as a function of the model parameters (the sparsity level s, the expected fraction of observed covariates $\rho_*$, and a measure of the signal strength $\|\beta^*\|_2$). We find that even in an idealized setting where the covariates are assumed to be missing completely at random, somewhat surprisingly and in contrast to the fully-observed setting, there is a dichotomy in the dependence on model parameters and much faster rates are obtained if the covariance matrix of the random design is known. To study this issue further, our second main contribution is to provide lower bounds on the estimation error showing that this discrepancy in rates is unavoidable in a minimax sense. We then consider the problem of high-dimensional inference in the presence of missing data. We construct and analyze confidence intervals using a de-biased estimator. In the presence of missing data, inference is complicated by the fact that the de-biasing matrix is correlated with the pilot estimator and this necessitates the design of a new estimator and a novel analysis. We also complement our mathematical study with extensive simulations on synthetic and semi-synthetic data that show the accuracy of our asymptotic predictions for finite sample sizes.
On- and Off-Policy Monotonic Policy Improvement
Monotonic policy improvement and off-policy learning are two main desirable properties for reinforcement learning algorithms. In this paper, by lower bounding the performance difference of two policies, we show that the monotonic policy improvement is guaranteed from on- and off-policy mixture samples. An optimization procedure which applies the proposed bound can be regarded as an off-policy natural policy gradient method. In order to support the theoretical result, we provide a trust region policy optimization method using experience replay as a naive application of our bound, and evaluate its performance in two classical benchmark problems.
Alternating minimization and alternating descent over nonconvex sets
Ha, Wooseok, Barber, Rina Foygel
We analyze the performance of alternating minimization for loss functions optimized over two variables, where each variable may be restricted to lie in some potentially nonconvex constraint set. This type of setting arises naturally in high-dimensional statistics and signal processing, where the variables often reflect different structures or components within the signals being considered. Our analysis depends strongly on the notion of local concavity coefficients, which have been recently proposed in Barber and Ha (2017) to measure and quantify the concavity of a general nonconvex set. Our results further reveal important distinctions between alternating and non-alternating methods. Since computing the alternating minimization steps may not be tractable for some problems, we also consider an inexact version of the algorithm and provide a set of sufficient conditions to ensure fast convergence of the inexact algorithms. We demonstrate our framework on several examples, including low rank + sparse decomposition and multitask regression, and provide numerical experiments to validate our theoretical results.
Parallel Bayesian Global Optimization of Expensive Functions
Wang, Jialei, Clark, Scott C., Liu, Eric, Frazier, Peter I.
We consider parallel global optimization of derivative-free expensive-to-evaluate functions, and propose an efficient method based on stochastic approximation for implementing a conceptual Bayesian optimization algorithm proposed by Ginsbourger et al. (2007). To accomplish this, we use infinitessimal perturbation analysis (IPA) to construct a stochastic gradient estimator and show that this estimator is unbiased. We also show that the stochastic gradient ascent algorithm using the constructed gradient estimator converges to a stationary point of the q-EI surface, and therefore, as the number of multiple starts of the gradient ascent algorithm and the number of steps for each start grow large, the one-step Bayes optimal set of points is recovered. We show in numerical experiments that our method for maximizing the q-EI is faster than methods based on closed-form evaluation using high-dimensional integration, when considering many parallel function evaluations, and is comparable in speed when considering few. We also show that the resulting one-step Bayes optimal algorithm for parallel global optimization finds high-quality solutions with fewer evaluations than a heuristic based on approximately maximizing the q-EI. A high-quality open source implementation of this algorithm is available in the open source Metrics Optimization Engine (MOE).
Accelerated Sparse Subspace Clustering
Hashemi, Abolfazl, Vikalo, Haris
State-of-the-art algorithms for sparse subspace clustering perform spectral clustering on a similarity matrix typically obtained by representing each data point as a sparse combination of other points using either basis pursuit (BP) or orthogonal matching pursuit (OMP). BP-based methods are often prohibitive in practice while the performance of OMP-based schemes are unsatisfactory, especially in settings where data points are highly similar. In this paper, we propose a novel algorithm that exploits an accelerated variant of orthogonal least-squares to efficiently find the underlying subspaces. We show that under certain conditions the proposed algorithm returns a subspace-preserving solution. Simulation results illustrate that the proposed method compares favorably with BP-based method in terms of running time while being significantly more accurate than OMP-based schemes.
Sampling and Reconstruction of Graph Signals via Weak Submodularity and Semidefinite Relaxation
Hashemi, Abolfazl, Shafipour, Rasoul, Vikalo, Haris, Mateos, Gonzalo
ABSTRACT We study the problem of sampling a bandlimited graph signal in the presence of noise, where the objective is to select a node subset of prescribed cardinality that minimizes the signal reconstruction mean squared error (MSE). To that end, we formulate the task at hand as the minimization of MSE subject to binary constraints, and approximate the resulting NPhard problem via semidefinite programming (SDP) relaxation. Moreover, we provide an alternative formulation based on maximizing a monotone weak submodular function and propose a randomized-greedy algorithm to find a sub-optimal subset. We then derive a worst-case performance guarantee on the MSE returned by the randomized greedy algorithm for general non-stationary graph signals. The efficacy of the proposed methods is illustrated through numerical simulations on synthetic and realworld graphs.
Synth-Validation: Selecting the Best Causal Inference Method for a Given Dataset
Schuler, Alejandro, Jung, Ken, Tibshirani, Robert, Hastie, Trevor, Shah, Nigam
Many decisions in healthcare, business, and other policy domains are made without the support of rigorous evidence due to the cost and complexity of performing randomized experiments. Using observational data to answer causal questions is risky: subjects who receive different treatments also differ in other ways that affect outcomes. Many causal inference methods have been developed to mitigate these biases. However, there is no way to know which method might produce the best estimate of a treatment effect in a given study. In analogy to cross-validation, which estimates the prediction error of predictive models applied to a given dataset, we propose synth-validation, a procedure that estimates the estimation error of causal inference methods applied to a given dataset. In synth-validation, we use the observed data to estimate generative distributions with known treatment effects. We apply each causal inference method to datasets sampled from these distributions and compare the effect estimates with the known effects to estimate error. Using simulations, we show that using synth-validation to select a causal inference method for each study lowers the expected estimation error relative to consistently using any single method.
Regularization via Mass Transportation
Shafieezadeh-Abadeh, Soroosh, Kuhn, Daniel, Esfahani, Peyman Mohajerin
The goal of regression and classification methods in supervised learning is to minimize the empirical risk, that is, the expectation of some loss function quantifying the prediction error under the empirical distribution. When facing scarce training data, overfitting is typically mitigated by adding regularization terms to the objective that penalize hypothesis complexity. In this paper we introduce new regularization techniques using ideas from distributionally robust optimization, and we give new probabilistic interpretations to existing techniques. Specifically, we propose to minimize the worst-case expected loss, where the worst case is taken over the ball of all (continuous or discrete) distributions that have a bounded transportation distance from the (discrete) empirical distribution. By choosing the radius of this ball judiciously, we can guarantee that the worst-case expected loss provides an upper confidence bound on the loss on test data, thus offering new generalization bounds. We prove that the resulting regularized learning problems are tractable and can be tractably kernelized for many popular loss functions. We validate our theoretical out-of-sample guarantees through simulated and empirical experiments.