Statistical Learning
Sparse Nonlinear Regression: Parameter Estimation and Asymptotic Inference
Yang, Zhuoran, Wang, Zhaoran, Liu, Han, Eldar, Yonina C., Zhang, Tong
We study parameter estimation and asymptotic inference for sparse nonlinear regression. More specifically, we assume the data are given by $y = f( x^\top \beta^* ) + \epsilon$, where $f$ is nonlinear. To recover $\beta^*$, we propose an $\ell_1$-regularized least-squares estimator. Unlike classical linear regression, the corresponding optimization problem is nonconvex because of the nonlinearity of $f$. In spite of the nonconvexity, we prove that under mild conditions, every stationary point of the objective enjoys an optimal statistical rate of convergence. In addition, we provide an efficient algorithm that provably converges to a stationary point. We also access the uncertainty of the obtained estimator. Specifically, based on any stationary point of the objective, we construct valid hypothesis tests and confidence intervals for the low dimensional components of the high-dimensional parameter $\beta^*$. Detailed numerical results are provided to back up our theory.
Scalable Gaussian Processes for Characterizing Multidimensional Change Surfaces
Herlands, William, Wilson, Andrew, Nickisch, Hannes, Flaxman, Seth, Neill, Daniel, van Panhuis, Wilbert, Xing, Eric
We present a scalable Gaussian process model for identifying and characterizing smooth multidimensional changepoints, and automatically learning changes in expressive covariance structure. We use Random Kitchen Sink features to flexibly define a change surface in combination with expressive spectral mixture kernels to capture the complex statistical structure. Finally, through the use of novel methods for additive non-separable kernels, we can scale the model to large datasets. We demonstrate the model on numerical and real world data, including a large spatio-temporal disease dataset where we identify previously unknown heterogeneous changes in space and time.
$k$-means: Fighting against Degeneracy in Sequential Monte Carlo with an Application to Tracking
For regular particle filter algorithm or Sequential Monte Carlo (SMC) methods, the initial weights are traditionally dependent on the proposed distribution, the posterior distribution at the current timestamp in the sampled sequence, and the target is the posterior distribution of the previous timestamp. This is technically correct, but leads to algorithms which usually have practical issues with degeneracy, where all particles eventually collapse onto a single particle. In this paper, we propose and evaluate using $k$ means clustering to attack and even take advantage of this degeneracy. Specifically, we propose a Stochastic SMC algorithm which initializes the set of $k$ means, providing the initial centers chosen from the collapsed particles. To fight against degeneracy, we adjust the regular SMC weights, mediated by cluster proportions, and then correct them to retain the same expectation as before. We experimentally demonstrate that our approach has better performance than vanilla algorithms.
Automatic Inference of the Quantile Parameter
Ramamurthy, Karthikeyan Natesan, Aravkin, Aleksandr Y., Thiagarajan, Jayaraman J.
Supervised learning is an active research area, with numerous applications in diverse fields such as data analytics, computer vision, speech and audio processing, and image understanding. In most cases, the loss functions used in machine learning assume symmetric noise models, and seek to estimate the unknown function parameters. However, loss functions such as quantile and quantile Huber generalize the symmetric $\ell_1$ and Huber losses to the asymmetric setting, for a fixed quantile parameter. In this paper, we propose to jointly infer the quantile parameter and the unknown function parameters, for the asymmetric quantile Huber and quantile losses. We explore various properties of the quantile Huber loss and implement a convexity certificate that can be used to check convexity in the quantile parameter. When the loss if convex with respect to the parameter of the function, we prove that it is biconvex in both the function and the quantile parameters, and propose an algorithm to jointly estimate these. Results with synthetic and real data demonstrate that the proposed approach can automatically recover the quantile parameter corresponding to the noise and also provide an improved recovery of function parameters. To illustrate the potential of the framework, we extend the gradient boosting machines with quantile losses to automatically estimate the quantile parameter at each iteration.
Bayesian Analysis of Dynamic Linear Topic Models
Glynn, Chris, Tokdar, Surya T., Banks, David L., Howard, Brian
In dynamic topic modeling, the proportional contribution of a topic to a document depends on the temporal dynamics of that topic's overall prevalence in the corpus. We extend the Dynamic Topic Model of Blei and Lafferty (2006) by explicitly modeling document level topic proportions with covariates and dynamic structure that includes polynomial trends and periodicity. A Markov Chain Monte Carlo (MCMC) algorithm that utilizes Polya-Gamma data augmentation is developed for posterior inference. Conditional independencies in the model and sampling are made explicit, and our MCMC algorithm is parallelized where possible to allow for inference in large corpora. To address computational bottlenecks associated with Polya-Gamma sampling, we appeal to the Central Limit Theorem to develop a Gaussian approximation to the Polya-Gamma random variable. This approximation is fast and reliable for parameter values relevant in the text mining domain. Our model and inference algorithm are validated with multiple simulation examples, and we consider the application of modeling trends in PubMed abstracts. We demonstrate that sharing information across documents is critical for accurately estimating document-specific topic proportions. We also show that explicitly modeling polynomial and periodic behavior improves our ability to predict topic prevalence at future time points.
Bayesian group latent factor analysis with structured sparsity
Zhao, Shiwen, Gao, Chuan, Mukherjee, Sayan, Engelhardt, Barbara E
Latent factor models are the canonical statistical tool for exploratory analyses of low-dimensional linear structure for an observation matrix with p features across n samples. We develop a structured Bayesian group factor analysis model that extends the factor model to multiple coupled observation matrices; in the case of two observations, this reduces to a Bayesian model of canonical correlation analysis. The main contribution of this work is to carefully define a structured Bayesian prior that encourages both element-wise and column-wise shrinkage and leads to desirable behavior on high-dimensional data. In particular, our model puts a structured prior on the joint factor loading matrix, regularizing at three levels, which enables element-wise sparsity and unsupervised recovery of latent factors corresponding to structured variance across arbitrary subsets of the observations. In addition, our structured prior allows for both dense and sparse latent factors so that covariation among either all features or only a subset of features can both be recovered. We use fast parameter-expanded expectation-maximization for parameter estimation in this model. We validate our method on both simulated data with substantial structure and real data, comparing against a number of state-of-the-art approaches. These results illustrate useful properties of our model, including i) recovering sparse signal in the presence of dense effects; ii) the ability to scale naturally to large numbers of observations; iii) flexible observation- and factor-specific regularization to recover factors with a wide variety of sparsity levels and percentage of variance explained; and iv) tractable inference that scales to modern genomic and document data sizes.
Random Multi-Constraint Projection: Stochastic Gradient Methods for Convex Optimization with Many Constraints
Wang, Mengdi, Chen, Yichen, Liu, Jialin, Gu, Yuantao
Consider convex optimization problems subject to a large number of constraints. We focus on stochastic problems in which the objective takes the form of expected values and the feasible set is the intersection of a large number of convex sets. We propose a class of algorithms that perform both stochastic gradient descent and random feasibility updates simultaneously. At every iteration, the algorithms sample a number of projection points onto a randomly selected small subsets of all constraints. Three feasibility update schemes are considered: averaging over random projected points, projecting onto the most distant sample, projecting onto a special polyhedral set constructed based on sample points. We prove the almost sure convergence of these algorithms, and analyze the iterates' feasibility error and optimality error, respectively. We provide new convergence rate benchmarks for stochastic first-order optimization with many constraints. The rate analysis and numerical experiments reveal that the algorithm using the polyhedral-set projection scheme is the most efficient one within known algorithms.
Online Principal Component Analysis in High Dimension: Which Algorithm to Choose?
In the current context of data explosion, online techniques that do not require storing all data in memory are indispensable to routinely perform tasks like principal component analysis (PCA). Recursive algorithms that update the PCA with each new observation have been studied in various fields of research and found wide applications in industrial monitoring, computer vision, astronomy, and latent semantic indexing, among others. This work provides guidance for selecting an online PCA algorithm in practice. We present the main approaches to online PCA, namely, perturbation techniques, incremental methods, and stochastic optimization, and compare their statistical accuracy, computation time, and memory requirements using artificial and real data. Extensions to missing data and to functional data are discussed. All studied algorithms are available in the R package onlinePCA on CRAN.
Instantaneous Modelling and Reverse Engineering of DataConsistent Prime Models in Seconds!
A theoretical framework that supports automated construction of dynamic prime models purely from experimental time series data has been invented and developed, which can automatically generate (construct) data-driven models of any time series data in seconds. This has resulted in the formulation and formalisation of new reverse engineering and dynamic methods for automated systems modelling of complex systems, including complex biological, financial, control, and artificial neural network systems. The systems/model theory behind the invention has been formalised as a new, effective and robust system identification strategy complementary to process-based modelling. The proposed dynamic modelling and network inference solutions often involve tackling extremely difficult parameter estimation challenges, inferring unknown underlying network structures, and unsupervised formulation and construction of smart and intelligent ODE models of complex systems. In underdetermined conditions, i.e., cases of dealing with how best to instantaneously and rapidly construct data-consistent prime models of unknown (or well-studied) complex system from small-sized time series data, inference of unknown underlying network of interaction is more challenging. This article reports a robust step-by-step mathematical and computational analysis of the entire prime model construction process that determines a model from data in less than a minute.
Granger Causality in Multi-variate Time Series using a Time Ordered Restricted Vector Autoregressive Model
Siggiridou, Elsa, Kugiumtzis, Dimitris
Granger causality has been used for the investigation of the inter-dependence structure of the underlying systems of multi-variate time series. In particular, the direct causal effects are commonly estimated by the conditional Granger causality index (CGCI). In the presence of many observed variables and relatively short time series, CGCI may fail because it is based on vector autoregressive models (VAR) involving a large number of coefficients to be estimated. In this work, the VAR is restricted by a scheme that modifies the recently developed method of backward-in-time selection (BTS) of the lagged variables and the CGCI is combined with BTS. Further, the proposed approach is compared favorably to other restricted VAR representations, such as the top-down strategy, the bottom-up strategy, and the least absolute shrinkage and selection operator (LASSO), in terms of sensitivity and specificity of CGCI. This is shown by using simulations of linear and nonlinear, low and high-dimensional systems and different time series lengths. For nonlinear systems, CGCI from the restricted VAR representations are compared with analogous nonlinear causality indices. Further, CGCI in conjunction with BTS and other restricted VAR representations is applied to multi-channel scalp electroencephalogram (EEG) recordings of epileptic patients containing epileptiform discharges. CGCI on the restricted VAR, and BTS in particular, could track the changes in brain connectivity before, during and after epileptiform discharges, which was not possible using the full VAR representation.