AITopics | Statistical Learning

Sparse Generalized Principal Component Analysis for Large-scale Applications beyond Gaussianity

arXiv.org Machine Learning, Jan-27-2016

Principal Component Analysis (PCA) is a dimension reduction technique. It produces inconsistent estimators when the dimensionality is moderate to high, which is often the problem in modern large-scale applications where algorithm scalability and model interpretability are difficult to achieve, not to mention the prevalence of missing values. While existing sparse PCA methods alleviate inconsistency, they are constrained to the Gaussian assumption of classical PCA and fail to address algorithm scalability issues. We generalize sparse PCA to the broad exponential family distributions under high-dimensional setup, with built-in treatment for missing values. Meanwhile we propose a family of iterative sparse generalized PCA (SG-PCA) algorithms such that despite the non-convexity and non-smoothness of the optimization task, the loss function decreases in every iteration. In terms of ease and intuitive parameter tuning, our sparsity-inducing regularization is far superior to the popular Lasso. Furthermore, to promote overall scalability, accelerated gradient is integrated for fast convergence, while a progressive screening technique gradually squeezes out nuisance dimensions of a large-scale problem for feasible optimization. High-dimensional simulation and real data experiments demonstrate the efficiency and efficacy of SG-PCA.

algorithm, dimension, sg-pca, (15 more...)

arXiv: 1512.03883

Country:

North America > United States > Florida > Leon County > Tallahassee (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.50)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Principal Component Analysis (0.61)

Hierarchical Vector Autoregression

Nicholson, William B., Bien, Jacob, Matteson, David S.

arXiv.org Machine Learning, Jan-27-2016

Vector autoregression (VAR) is a fundamental tool for modeling the joint dynamics of multivariate time series. However, as the number of component series is increased, the VAR model quickly becomes overparameterized, making reliable estimation difficult and impeding its adoption as a forecasting tool in high dimensional settings. A number of authors have sought to address this issue by incorporating regularized approaches, such as the lasso, that impose sparse or low-rank structures on the estimated coefficient parameters of the VAR. More traditional approaches attempt to address overparameterization by selecting a low lag order, based on the assumption that dynamic dependence among components is short-range. However, these methods typically assume a single, universal lag order that applies across all components, unnecessarily constraining the dynamic relationship between the components and impeding forecast performance. The lasso-based approaches are more flexible but do not incorporate the notion of lag order selection. We propose a new class of regularized VAR models, called hierarchical vector autoregression (HVAR), that embed the notion of lag selection into a convex regularizer. The key convex modeling tool is a group lasso with nested groups which ensure the sparsity pattern of autoregressive lag coefficients honors the ordered structure inherent to VAR. We provide computationally efficient algorithms for solving HVAR problems that can be parallelized across the components. A simulation study shows the improved performance in forecasting and lag order selection over previous approaches, and a macroeconomic application further highlights forecasting improvements as well as the convenient, interpretable output of an HVAR model.
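
As a rough illustration of the VAR setting in this abstract (not the HVAR estimator itself), the sketch below computes a one-step-ahead forecast from given lag coefficient matrices and reads off, for each pair of components, the largest lag with a nonzero coefficient. The function names `var_forecast` and `lag_orders`, the coefficient values, and the data are all assumptions made up for this example.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def var_forecast(history: List[Vector], coefs: List[Matrix]) -> Vector:
    """One-step-ahead VAR(p) forecast.

    history: past observations, oldest first, most recent last.
    coefs:   coefs[l - 1] is the k x k coefficient matrix for lag l.
    """
    k = len(history[-1])
    forecast = [0.0] * k
    for lag, phi in enumerate(coefs, start=1):
        past = history[-lag]  # observation `lag` steps back
        for i in range(k):
            forecast[i] += sum(phi[i][j] * past[j] for j in range(k))
    return forecast


def lag_orders(coefs: List[Matrix], tol: float = 1e-12) -> List[List[int]]:
    """Largest lag with a nonzero coefficient for each pair (i, j); 0 if none."""
    k = len(coefs[0])
    orders = [[0] * k for _ in range(k)]
    for lag, phi in enumerate(coefs, start=1):
        for i in range(k):
            for j in range(k):
                if abs(phi[i][j]) > tol:
                    orders[i][j] = lag
    return orders


# Hypothetical two-series model with two lags (values are illustrative only).
PHI_1 = [[0.5, 0.1], [0.0, 0.3]]
PHI_2 = [[0.2, 0.0], [0.0, 0.0]]
HISTORY = [[1.0, 2.0], [3.0, 4.0]]

example_forecast = var_forecast(HISTORY, [PHI_1, PHI_2])
example_orders = lag_orders([PHI_1, PHI_2])
```

In this toy model each pair of components has its own effective lag order, which is the kind of component-specific lag structure the abstract contrasts with a single universal lag order.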

artificial intelligence, lag structure, machine learning, (18 more...)

arXiv: 1412.525

Country:

North America > United States (1.00)
South America (0.68)

Genre: Research Report (0.64)

Industry:

Banking & Finance > Economy (1.00)
Government > Regional Government > North America Government > United States Government (0.94)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Learning Model-Based Sparsity via Projected Gradient Descent

Bahmani, Sohail, Boufounos, Petros T., Raj, Bhiksha

arXiv.org Machine Learning, Jan-27-2016

Several convex formulation methods have been proposed previously for statistical estimation with structured sparsity as the prior. These methods often require a carefully tuned regularization parameter, which is often a cumbersome or heuristic exercise. Furthermore, the estimate that these methods produce might not belong to the desired sparsity model, albeit accurately approximating the true parameter. Therefore, greedy-type algorithms could often be more desirable in estimating structured-sparse parameters. So far, these greedy methods have mostly focused on linear statistical models. In this paper we study projected gradient descent with a non-convex structured-sparse parameter model as the constraint set. Should the cost function have a Stable Model-Restricted Hessian, the algorithm produces an approximation for the desired minimizer. As an example we elaborate on the application of the main results to estimation in Generalized Linear Models.
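
To make the idea of projected gradient descent onto a sparsity model concrete, here is a minimal sketch under simplifying assumptions: the sparsity model is plain k-sparsity (the simplest special case of a structured-sparse model), the cost is a squared loss, and the step size is fixed by hand. The function names and the toy data are hypothetical, and this is not claimed to be the algorithm or the analysis of the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project_k_sparse(x: Vector, k: int) -> Vector:
    """Projection onto k-sparse vectors: keep the k largest-magnitude entries."""
    keep = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k]
    out = [0.0] * len(x)
    for i in keep:
        out[i] = x[i]
    return out


def squared_loss_gradient(A: Matrix, y: Vector, x: Vector) -> Vector:
    """Gradient of 0.5 * ||A x - y||^2, which is A^T (A x - y)."""
    residual = [sum(a * b for a, b in zip(row, x)) - yi for row, yi in zip(A, y)]
    return [sum(A[r][j] * residual[r] for r in range(len(A))) for j in range(len(x))]


def projected_gradient_descent(A: Matrix, y: Vector, k: int,
                               step: float, iters: int = 200) -> Vector:
    """Alternate a gradient step with projection onto the sparsity model."""
    x = [0.0] * len(A[0])
    for _ in range(iters):
        g = squared_loss_gradient(A, y, x)
        x = project_k_sparse([xi - step * gi for xi, gi in zip(x, g)], k)
    return x


# Toy problem (illustrative only): identity design, two large and two small entries.
A_TOY = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]
Y_TOY = [3.0, 0.1, -2.0, 0.05]

x_hat = projected_gradient_descent(A_TOY, Y_TOY, k=2, step=0.5)
```

Because the projection is applied at every iteration, the returned estimate always belongs to the sparsity model, which is the property the abstract notes convex relaxations may lack.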

algorithm, artificial intelligence, machine learning, (15 more...)

doi: 10.1109/TIT.2016.2515078

arXiv: 1209.1557

Country:

Europe (0.46)
North America > United States (0.28)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)

Font Identification in Historical Documents Using Active Learning

Gupta, Anshul, Gutierrez-Osuna, Ricardo, Christy, Matthew, Furuta, Richard, Mandell, Laura

arXiv.org Machine Learning, Jan-26-2016

Identifying the type of font (e.g., Roman, Blackletter) used in historical documents can help optical character recognition (OCR) systems produce more accurate text transcriptions. Towards this end, we present an active-learning strategy that can significantly reduce the number of labeled samples needed to train a font classifier. Our approach extracts image-based features that exploit geometric differences between fonts at the word level, and combines them into a bag-of-words representation for each page in a document. We evaluate six sampling strategies based on uncertainty, dissimilarity and diversity criteria, and test them on a database containing over 3,000 historical documents with Blackletter, Roman and Mixed fonts. Our results show that a combination of uncertainty and diversity achieves the highest predictive accuracy (89% of test cases correctly classified) while requiring only a small fraction of the data (17%) to be labeled. We discuss the implications of this result for mass digitization projects of historical documents.

artificial intelligence, machine learning, pattern recognition, (16 more...)

arXiv: 1601.07252

Country: North America > United States > Texas (0.15)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

Supersparse Linear Integer Models for Optimized Medical Scoring Systems

Ustun, Berk, Rudin, Cynthia

arXiv.org Machine Learning, Jan-26-2016

Scoring systems are linear classification models that only require users to add, subtract and multiply a few small numbers in order to make a prediction. These models are in widespread use by the medical community, but are difficult to learn from data because they need to be accurate and sparse, have coprime integer coefficients, and satisfy multiple operational constraints. We present a new method for creating data-driven scoring systems called a Supersparse Linear Integer Model (SLIM). SLIM scoring systems are built by solving an integer program that directly encodes measures of accuracy (the 0-1 loss) and sparsity (the $\ell_0$-seminorm) while restricting coefficients to coprime integers. SLIM can seamlessly incorporate a wide range of operational constraints related to accuracy and sparsity, and can produce highly tailored models without parameter tuning. We provide bounds on the testing and training accuracy of SLIM scoring systems, and present a new data reduction technique that can improve scalability by eliminating a portion of the training data beforehand. Our paper includes results from a collaboration with the Massachusetts General Hospital Sleep Laboratory, where SLIM was used to create a highly tailored scoring system for sleep apnea screening.
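
The sketch below shows what using a scoring system of this kind might look like: a handful of small integer points are added up and the sign of the total gives the prediction. The feature names, point values, and intercept are entirely made up for illustration and are not taken from the paper or from any clinical model; the code only evaluates a given scoring system and does not attempt the integer program that SLIM solves to learn one.

```python
from functools import reduce
from math import gcd
from typing import Dict

# Hypothetical scoring system: names and point values are illustrative only.
POINTS: Dict[str, int] = {
    "age_ge_60": 2,
    "hypertension": 1,
    "bmi_ge_30": 2,
    "female": -1,
}
INTERCEPT = -3


def total_score(features: Dict[str, int]) -> int:
    """Add the points of every feature that is present (features are 0 or 1)."""
    return INTERCEPT + sum(POINTS[name] * features[name] for name in POINTS)


def predict(features: Dict[str, int]) -> int:
    """Linear classifier: +1 if the total score is positive, otherwise -1."""
    return 1 if total_score(features) > 0 else -1


def coefficients_are_coprime(points: Dict[str, int]) -> bool:
    """True when the greatest common divisor of all coefficients is 1."""
    return reduce(gcd, (abs(v) for v in points.values())) == 1


patient_a = {"age_ge_60": 1, "hypertension": 1, "bmi_ge_30": 1, "female": 0}
patient_b = {"age_ge_60": 0, "hypertension": 1, "bmi_ge_30": 0, "female": 1}
```

The coprimality check reflects the constraint mentioned in the abstract: if all coefficients shared a common factor, the same classifier could be written with smaller numbers.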

artificial intelligence, constraint, machine learning, (16 more...)

doi: 10.1007/s10994-015-5528-6

arXiv: 1502.04269

Country: North America > United States > Massachusetts (0.34)

Genre: Research Report > New Finding (0.67)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Health Care Providers & Services (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Conditional distribution variability measures for causality detection

Fonollosa, José A. R.

arXiv.org Machine Learning, Jan-25-2016

In this paper we derive variability measures for the conditional probability distributions of a pair of random variables, and we study their application in the inference of cause-effect relationships. We also study the combination of the proposed measures with standard statistical measures in the framework of the ChaLearn cause-effect pair challenge. The developed model obtains an AUC score of 0.82 on the final test database and ranked second in the challenge.

artificial intelligence, conditional distribution variability measure, machine learning, (13 more...)

arXiv: 1601.0668

Country: North America > United States > New York (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

Minimax Structured Normal Means Inference

Krishnamurthy, Akshay

arXiv.org Machine Learning, Jan-25-2016

The prevalence of high-dimensional signals in modern scientific investigation has inspired an influx of research on recovering structural information from noisy data. These problems arise across a variety of scientific and engineering disciplines; for example, identifying cluster structure in communication or social networks, multiple hypothesis testing in genomics, or anomaly detection in sensor networking. Specific structural assumptions include sparsity [13], low-rankedness [11], cluster structure [15], and many others [8]. The literature in this direction focuses on three inference goals: detection, localization or recovery, and estimation or denoising. Detection tasks involve deciding whether an observation contains some meaningful information or is simply ambient noise, while recovery and estimation tasks involve more precisely characterizing the information contained in a signal. These problems are closely related, but also exhibit important differences, and this paper focuses on the recovery problem, where the goal is to identify, from a finite collection of signals, which signal produced the observed data. One frustration among researchers is that algorithmic and analytic techniques for these problems differ significantly for different structural assumptions. This issue was recently resolved in the context of estimation, where the atomic norm [8] has provided a unifying algorithmic and analytical framework, but no such theory is available for detection and recovery problems. In this paper, we provide a unification for the recovery problem, leading to a deeper understanding of how signal structure affects statistical performance.

data mining, machine learning, minimax risk, (18 more...)

arXiv: 1506.07902

Genre: Research Report (1.00)

Industry: Health & Medicine (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Information Technology > Data Science > Data Mining > Anomaly Detection (0.54)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.46)

Information Limits for Recovering a Hidden Community

Hajek, Bruce, Wu, Yihong, Xu, Jiaming

arXiv.org Machine Learning, Jan-24-2016

We study the problem of recovering a hidden community of cardinality $K$ from an $n \times n$ symmetric data matrix $A$, where for distinct indices $i,j$, $A_{ij} \sim P$ if $i, j$ both belong to the community and $A_{ij} \sim Q$ otherwise, for two known probability distributions $P$ and $Q$ depending on $n$. If $P={\rm Bern}(p)$ and $Q={\rm Bern}(q)$ with $p>q$, it reduces to the problem of finding a densely-connected $K$-subgraph planted in a large Erdős-Rényi graph; if $P=\mathcal{N}(\mu,1)$ and $Q=\mathcal{N}(0,1)$ with $\mu>0$, it corresponds to the problem of locating a $K \times K$ principal submatrix of elevated means in a large Gaussian random matrix. We focus on two types of asymptotic recovery guarantees as $n \to \infty$: (1) weak recovery: expected number of classification errors is $o(K)$; (2) exact recovery: probability of classifying all indices correctly converges to one. Under mild assumptions on $P$ and $Q$, and allowing the community size to scale sublinearly with $n$, we derive a set of sufficient conditions and a set of necessary conditions for recovery, which are asymptotically tight with sharp constants. The results hold in particular for the Gaussian case, and for the case of bounded log likelihood ratio, including the Bernoulli case whenever $\frac{p}{q}$ and $\frac{1-p}{1-q}$ are bounded away from zero and infinity. An important algorithmic implication is that, whenever exact recovery is information theoretically possible, any algorithm that provides weak recovery when the community size is concentrated near $K$ can be upgraded to achieve exact recovery in linear additional time by a simple voting procedure.

artificial intelligence, machine learning, recovery, (19 more...)

arXiv: 1509.07859

Country: North America > United States > California (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants

Reddi, Sashank J., Hefny, Ahmed, Sra, Suvrit, Póczos, Barnabás, Smola, Alex

arXiv.org Machine Learning, Jan-24-2016

We study optimization algorithms based on variance reduction for stochastic gradient descent (SGD). Remarkable recent progress has been made in this direction through the development of algorithms like SAG, SVRG, and SAGA. These algorithms have been shown to outperform SGD, both theoretically and empirically. However, asynchronous versions of these algorithms, a crucial requirement for modern large-scale applications, have not been studied. We bridge this gap by presenting a unifying framework for many variance reduction techniques. Subsequently, we propose an asynchronous algorithm grounded in our framework, and prove its fast convergence. An important consequence of our general approach is that it yields asynchronous versions of variance reduction algorithms such as SVRG and SAGA as a byproduct. Our method achieves near linear speedup in sparse settings common to machine learning. We demonstrate the empirical performance of our method through a concrete realization of asynchronous SVRG.
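
For readers unfamiliar with variance reduction, the following is a minimal serial sketch of SVRG on a least-squares problem, assuming a fixed step size and using the last inner iterate as the next snapshot. It illustrates only the standard synchronous update; the asynchronous variant and the unifying framework proposed in the paper are not reproduced here, and the function names, hyperparameters, and toy data are assumptions made for this example.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def grad_i(w: Vector, a: Vector, b: float) -> Vector:
    """Gradient of the single-sample loss 0.5 * (a . w - b)^2."""
    r = sum(aj * wj for aj, wj in zip(a, w)) - b
    return [r * aj for aj in a]


def full_grad(w: Vector, A: Matrix, b: Vector) -> Vector:
    """Average of the per-sample gradients over the whole data set."""
    g = [0.0] * len(w)
    for a, bi in zip(A, b):
        for j, gj in enumerate(grad_i(w, a, bi)):
            g[j] += gj / len(A)
    return g


def svrg(A: Matrix, b: Vector, step: float, epochs: int,
         inner_steps: int, seed: int = 0) -> Vector:
    rng = random.Random(seed)
    w = [0.0] * len(A[0])
    for _ in range(epochs):
        snapshot = list(w)
        mu = full_grad(snapshot, A, b)  # full gradient at the snapshot
        for _ in range(inner_steps):
            i = rng.randrange(len(A))
            g_w = grad_i(w, A[i], b[i])
            g_s = grad_i(snapshot, A[i], b[i])
            # Variance-reduced direction: stochastic gradient, corrected by
            # the same sample's gradient at the snapshot plus the full gradient.
            w = [wj - step * (gw - gs + m)
                 for wj, gw, gs, m in zip(w, g_w, g_s, mu)]
    return w


# Toy consistent system (illustrative only) whose exact solution is [2, -1].
A_TOY = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
B_TOY = [2.0, -1.0, 1.0, 3.0]

w_hat = svrg(A_TOY, B_TOY, step=0.1, epochs=100, inner_steps=20)
```

The correction term has zero mean over the random index, so the update direction remains an unbiased gradient estimate while its variance shrinks as the iterate and the snapshot approach the minimizer.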

algorithm, artificial intelligence, machine learning, (16 more...)

arXiv: 1506.0684

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Minimax Lower Bounds for Linear Independence Testing

Ramdas, Aaditya, Isenberg, David, Singh, Aarti, Wasserman, Larry

arXiv.org Machine Learning, Jan-23-2016

Linear independence testing is a fundamental information-theoretic and statistical problem that can be posed as follows: given $n$ points $\{(X_i,Y_i)\}^n_{i=1}$ from a $p+q$ dimensional multivariate distribution where $X_i \in \mathbb{R}^p$ and $Y_i \in\mathbb{R}^q$, determine whether $a^T X$ and $b^T Y$ are uncorrelated for every $a \in \mathbb{R}^p, b\in \mathbb{R}^q$ or not. We give a minimax lower bound for this problem (when $p+q,n \to \infty$, $(p+q)/n \leq \kappa < \infty$, without sparsity assumptions). In summary, our results imply that $n$ must be at least as large as $\sqrt {pq}/\|\Sigma_{XY}\|_F^2$ for any procedure (test) to have non-trivial power, where $\Sigma_{XY}$ is the cross-covariance matrix of $X,Y$. We also provide some evidence that the lower bound is tight, by connections to two-sample testing and regression in specific settings.
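
To get a feel for the scale of the bound stated above, here is a small arithmetic illustration. The values of $p$, $q$, and $\|\Sigma_{XY}\|_F^2$ are hypothetical, chosen only to make the numbers round, and constant factors are ignored, so the result indicates an order of magnitude rather than an exact threshold.

```latex
% Hypothetical values: p = 100, q = 400, \|\Sigma_{XY}\|_F^2 = 0.5
\[
  n \;\gtrsim\; \frac{\sqrt{pq}}{\|\Sigma_{XY}\|_F^2}
    \;=\; \frac{\sqrt{100 \cdot 400}}{0.5}
    \;=\; \frac{200}{0.5}
    \;=\; 400 .
\]
```

Halving the squared Frobenius norm of the cross-covariance to $0.25$ would double the required sample size to roughly $800$, which shows how weaker dependence makes the testing problem harder.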

artificial intelligence, independence testing, machine learning, (15 more...)

arXiv: 1601.06259

Country: North America > United States (0.29)

Genre: Research Report (0.70)

Industry: Health & Medicine > Therapeutic Area (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)
