Statistical Learning
DART: Dropouts meet Multiple Additive Regression Trees
Rashmi, K. V., Gilad-Bachrach, Ran
Multiple Additive Regression Trees (MART), an ensemble model of boosted regression trees, is known to deliver high prediction accuracy for diverse tasks, and it is widely used in practice. However, it suffers an issue which we call over-specialization, wherein trees added at later iterations tend to impact the prediction of only a few instances, and make negligible contribution towards the remaining instances. This negatively affects the performance of the model on unseen data, and also makes the model over-sensitive to the contributions of the few, initially added tress. We show that the commonly used tool to address this issue, that of shrinkage, alleviates the problem only to a certain extent and the fundamental issue of over-specialization still remains. In this work, we explore a different approach to address the problem that of employing dropouts, a tool that has been recently proposed in the context of learning deep neural networks. We propose a novel way of employing dropouts in MART, resulting in the DART algorithm. We evaluate DART on ranking, regression and classification tasks, using large scale, publicly available datasets, and show that DART outperforms MART in each of the tasks, with a significant margin. We also show that DART overcomes the issue of over-specialization to a considerable extent.
Re-scale boosting for regression and classification
Lin, Shaobo, Wang, Yao, Xu, Lin
Boosting is a learning scheme that combines weak prediction rules to produce a strong composite estimator, with the underlying intuition that one can obtain accurate prediction rules by combining "rough" ones. Although boosting is proved to be consistent and overfitting-resistant, its numerical convergence rate is relatively slow. The aim of this paper is to develop a new boosting strategy, called the re-scale boosting (RBoosting), to accelerate the numerical convergence rate and, consequently, improve the learning performance of boosting. Our studies show that RBoosting possesses the almost optimal numerical convergence rate in the sense that, up to a logarithmic factor, it can reach the minimax nonlinear approximation rate. We then use RBoosting to tackle both the classification and regression problems, and deduce a tight generalization error estimate. The theoretical and experimental results show that RBoosting outperforms boosting in terms of generalization.
Cats & Co: Categorical Time Series Coclustering
Gay, Dominique, Guigourรจs, Romain, Boullรฉ, Marc, Clรฉrot, Fabrice
We suggest a novel method of clustering and exploratory analysis of temporal event sequences data (also known as categorical time series) based on three-dimensional data grid models. A data set of temporal event sequences can be represented as a data set of three-dimensional points, each point is defined by three variables: a sequence identifier, a time value and an event value. Instantiating data grid models to the 3D-points turns the problem into 3D-coclustering. The sequences are partitioned into clusters, the time variable is discretized into intervals and the events are partitioned into clusters. The cross-product of the univariate partitions forms a multivariate partition of the representation space, i.e., a grid of cells and it also represents a nonparametric estimator of the joint distribution of the sequences, time and events dimensions. Thus, the sequences are grouped together because they have similar joint distribution of time and events, i.e., similar distribution of events along the time dimension. The best data grid is computed using a parameter-free Bayesian model selection approach. We also suggest several criteria for exploiting the resulting grid through agglomerative hierarchies, for interpreting the clusters of sequences and characterizing their components through insightful visualizations. Extensive experiments on both synthetic and real-world data sets demonstrate that data grid models are efficient, effective and discover meaningful underlying patterns of categorical time series data.
Semi-Orthogonal Multilinear PCA with Relaxed Start
Principal component analysis (PCA) is an unsupervised method for learning low-dimensional features with orthogonal projections. Multilinear PCA methods extend PCA to deal with multidimensional data (tensors) directly via tensor-to-tensor projection or tensor-to-vector projection (TVP). However, under the TVP setting, it is difficult to develop an effective multilinear PCA method with the orthogonality constraint. This paper tackles this problem by proposing a novel Semi-Orthogonal Multilinear PCA (SO-MPCA) approach. SO-MPCA learns low-dimensional features directly from tensors via TVP by imposing the orthogonality constraint in only one mode. This formulation results in more captured variance and more learned features than full orthogonality. For better generalization, we further introduce a relaxed start (RS) strategy to get SO-MPCA-RS by fixing the starting projection vectors, which increases the bias and reduces the variance of the learning model. Experiments on both face (2D) and gait (3D) data demonstrate that SO-MPCA-RS outperforms other competing algorithms on the whole, and the relaxed start strategy is also effective for other TVP-based PCA methods.
Trees Assembling Mann Whitney Approach for Detecting Genome-wide Joint Association among Low Marginal Effect loci
Wei, Changshuai, Schaid, Daniel J., Lu, Qing
Common complex diseases are likely influenced by the interplay of hundreds, or even thousands, of genetic variants. Converging evidence shows that genetic variants with low marginal effects (LME) play an important role in disease development. Despite their potential significance, discovering LME genetic variants and assessing their joint association on high dimensional data (e.g., genome wide association studies) remain a great challenge. To facilitate joint association analysis among a large ensemble of LME genetic variants, we proposed a computationally efficient and powerful approach, which we call Trees Assembling Mann whitney (TAMW). Through simulation studies and an empirical data application, we found that TAMW outperformed multifactor dimensionality reduction (MDR) and the likelihood ratio based Mann whitney approach (LRMW) when the underlying complex disease involves multiple LME loci and their interactions. For instance, in a simulation with 20 interacting LME loci, TAMW attained a higher power (power=0.931) than both MDR (power=0.599) and LRMW (power=0.704). In an empirical study of 29 known Crohn's disease (CD) loci, TAMW also identified a stronger joint association with CD than those detected by MDR and LRMW. Finally, we applied TAMW to Wellcome Trust CD GWAS to conduct a genome wide analysis. The analysis of 459K single nucleotide polymorphisms was completed in 40 hours using parallel computing, and revealed a joint association predisposing to CD (p-value=2.763e-19). Further analysis of the newly discovered association suggested that 13 genes, such as ATG16L1 and LACC1, may play an important role in CD pathophysiological and etiological processes.
Achieving a Hyperlocal Housing Price Index: Overcoming Data Sparsity by Bayesian Dynamical Modeling of Multiple Data Streams
Ren, You, Fox, Emily B., Bruce, Andrew
Understanding how housing values evolve over time is important to policy makers, consumers and real estate professionals. Existing methods for constructing housing indices are computed at a coarse spatial granularity, such as metropolitan regions, which can mask or distort price dynamics apparent in local markets, such as neighborhoods and census tracts. A challenge in moving to estimates at, for example, the census tract level is the sparsity of spatiotemporally localized house sales observations. Our work aims at addressing this challenge by leveraging observations from multiple census tracts discovered to have correlated valuation dynamics. Our proposed Bayesian nonparametric approach builds on the framework of latent factor models to enable a flexible, data-driven method for inferring the clustering of correlated census tracts. We explore methods for scalability and parallelizability of computations, yielding a housing valuation index at the level of census tract rather than zip code, and on a monthly basis rather than quarterly. Our analysis is provided on a large Seattle metropolitan housing dataset.
Support Vector Machines for Current Status Data
Travis-Lumer, Yael, Goldberg, Yair
In this paper we aim to develop a general, model free, method for analyzing current status data using machine learning techniques. In particular, we propose a support vector machine (SVM) learning method for estimation of the failure time expectation for current status data. SVM was originally introduced by Vapnik in the 1990's and is firmly related to statistical learning theory (Vapnik, 1999). The choice of SVMs for current status data is motivated by the fact that SVMs can be implemented easily, have fast training speed, produce decision functions that have a strong generalization ability and can guarantee convergence to the optimal solution, under some weak assumptions (Shivaswamy et al., 2007). Current status data is a data format where the failure timeT is restricted to knowledge of whether or notT exceeds a random monitoring timeC . This data format is quite common and includes examples from various fields. Jewell and van der Laan (2004) mention a few examples including: studying the distribution of the age of a child at weaning given observation points; when conducting a partner study of HIV infection over a number of clinic visits; and when a tumor under investigation is occult and an animal is sacrificed at a certain time point in order to determine presence or absence of the tumor.
Generalized Low Rank Models
Udell, Madeleine, Horn, Corinne, Zadeh, Reza, Boyd, Stephen
Principal components analysis (PCA) is a well-known technique for approximating a tabular data set by a low rank matrix. Here, we extend the idea of PCA to handle arbitrary data sets consisting of numerical, Boolean, categorical, ordinal, and other data types. This framework encompasses many well known techniques in data analysis, such as nonnegative matrix factorization, matrix completion, sparse and robust PCA, $k$-means, $k$-SVD, and maximum margin matrix factorization. The method handles heterogeneous data sets, and leads to coherent schemes for compressing, denoising, and imputing missing entries across all data types simultaneously. It also admits a number of interesting interpretations of the low rank factors, which allow clustering of examples or of features. We propose several parallel algorithms for fitting generalized low rank models, and describe implementations and numerical results.
Output-Sensitive Adaptive Metropolis-Hastings for Probabilistic Programs
Tolpin, David, van de Meent, Jan Willem, Paige, Brooks, Wood, Frank
We introduce an adaptive output-sensitive Metropolis-Hastings algorithm for probabilistic models expressed as programs, Adaptive Lightweight Metropolis-Hastings (AdLMH). The algorithm extends Lightweight Metropolis-Hastings (LMH) by adjusting the probabilities of proposing random variables for modification to improve convergence of the program output. We show that AdLMH converges to the correct equilibrium distribution and compare convergence of AdLMH to that of LMH on several test problems to highlight different aspects of the adaptation scheme. We observe consistent improvement in convergence on the test problems.
Penalized versus constrained generalized eigenvalue problems
Gaynanova, Irina, Booth, James, Wells, Martin T.
We investigate the difference between using an $\ell_1$ penalty versus an $\ell_1$ constraint in generalized eigenvalue problems, such as principal component analysis and discriminant analysis. Our main finding is that an $\ell_1$ penalty may fail to provide very sparse solutions; a severe disadvantage for variable selection that can be remedied by using an $\ell_1$ constraint. Our claims are supported both by empirical evidence and theoretical analysis. Finally, we illustrate the advantages of an $\ell_1$ constraint in the context of discriminant analysis and principal component analysis.