Statistical Learning
Semantic Feature Representation to Capture News Impact
Xie, Boyi (Columbia University) | Wang, Dingquan (Columbia University) | Passonneau, Rebecca J. (Columbia University)
This paper presents a study where semantic frames are used to mine financial news so as to quantify the impact of news on the stock market. We represent news documents in a novel semantic tree structure and use tree kernel support vector machines to predict the change of stock price. We achieve an efficient computation through linearization of tree kernels. In addition to two binary classification tasks, we rank news items according to their probability to affect change of price using two ranking methods that require vector space features. We evaluate our rankings based on receiver operating characteristic curves and analyze the predictive power of our semantic features. For both learning tasks, the proposed semantic features provide superior results.
A Structural Approach to Coordinate-Free Statistics
LaGatta, Tom, Hahn, P. Richard
We consider the question of learning in general topological vector spaces. By exploiting known (or parametrized) covariance structures, our Main Theorem demonstrates that any continuous linear map corresponds to a certain isomorphism of embedded Hilbert spaces. By inverting this isomorphism and extending continuously, we construct a version of the Ordinary Least Squares estimator in absolute generality. Our Gauss-Markov theorem demonstrates that OLS is a "best linear unbiased estimator", extending the classical result. We construct a stochastic version of the OLS estimator, which is a continuous disintegration exactly for the class of "uncorrelated implies independent" (UII) measures. As a consequence, Gaussian measures always exhibit continuous disintegrations through continuous linear maps, extending a theorem of the first author. Applying this framework to some problems in machine learning, we prove a useful representation theorem for covariance tensors, and show that OLS defines a good kriging predictor for vector-valued arrays on general index spaces. We also construct a support-vector machine classifier in this setting. We hope that our article shines light on some deeper connections between probability theory, statistics and machine learning, and may serve as a point of intersection for these three communities.
Ridge Fusion in Statistical Learning
Price, Bradley S., Geyer, Charles J., Rothman, Adam J.
We propose a penalized likelihood method to jointly estimate multiple precision matrices for use in quadratic discriminant analysis and model based clustering. A ridge penalty and a ridge fusion penalty are used to introduce shrinkage and promote similarity between precision matrix estimates. Block-wise coordinate descent is used for optimization, and validation likelihood is used for tuning parameter selection. Our method is applied in quadratic discriminant analysis and semi-supervised model based clustering.
Robust Subspace Outlier Detection in High Dimensional Space
Rare data in a large-scale database are called outliers that reveal significant information in the real world. The subspace-based outlier detection is regarded as a feasible approach in very high dimensional space. However, the outliers found in subspaces are only part of the true outliers in high dimensional space, indeed. The outliers hidden in normal-clustered points are sometimes neglected in the projected dimensional subspace. In this paper, we propose a robust subspace method for detecting such inner outliers in a given dataset, which uses two dimensional-projections: detecting outliers in subspaces with local density ratio in the first projected dimensions; finding outliers by comparing neighbor's positions in the second projected dimensions. Each point's weight is calculated by summing up all related values got in the two steps projected dimensions, and then the points scoring the largest weight values are taken as outliers. By taking a series of experiments with the number of dimensions from 10 to 10000, the results show that our proposed method achieves high precision in the case of extremely high dimensional space, and works well in low dimensional space.
Perceptron-like Algorithms and Generalization Bounds for Learning to Rank
Chaudhuri, Sougata, Tewari, Ambuj
Learning to rank is a supervised learning problem where the output space is the space of rankings but the supervision space is the space of relevance scores. We make theoretical contributions to the learning to rank problem both in the online and batch settings. First, we propose a perceptron-like algorithm for learning a ranking function in an online setting. Our algorithm is an extension of the classic perceptron algorithm for the classification problem. Second, in the setting of batch learning, we introduce a sufficient condition for convex ranking surrogates to ensure a generalization bound that is independent of number of objects per query. Our bound holds when linear ranking functions are used: a common practice in many learning to rank algorithms. En route to developing the online algorithm and generalization bound, we propose a novel family of listwise large margin ranking surrogates. Our novel surrogate family is obtained by modifying a well-known pairwise large margin ranking surrogate and is distinct from the listwise large margin surrogates developed using the structured prediction framework. Using the proposed family, we provide a guaranteed upper bound on the cumulative NDCG (or MAP) induced loss under the perceptron-like algorithm. We also show that the novel surrogates satisfy the generalization bound condition.
Automated Attribution and Intertextual Analysis
Brofos, James, Kannan, Ajay, Shu, Rui
In this work, we employ quantitative methods from the realm of statistics and machine learning to develop novel methodologies for author attribution and textual analysis. In particular, we develop techniques and software suitable for applications to Classical study, and we illustrate the efficacy of our approach in several interesting open questions in the field. We apply our numerical analysis techniques to questions of authorship attribution in the case of the Greek tragedian Euripides, to instances of intertextuality and influence in the poetry of the Roman statesman Seneca the Younger, and to cases of "interpolated" text with respect to the histories of Livy.
Why (and When and How) Contrastive Divergence Works
Contrastive divergence (CD) is a promising method of inference in high dimensional distributions with intractable normalizing constants, however, the theoretical foundations justifying its use are somewhat shaky. This document proposes a framework for understanding CD inference, how/when it works, and provides multiple justifications for the CD moment conditions, including framing them as a variational approximation. Algorithms for performing inference are discussed and are applied to social network data using an exponential-family random graph models (ERGM). The framework also provides guidance about how to construct MCMC kernels providing good CD inference, which turn out to be quite different from those used typically to provide fast global mixing.
Nested Hierarchical Dirichlet Processes
Paisley, John, Wang, Chong, Blei, David M., Jordan, Michael I.
We develop a nested hierarchical Dirichlet process (nHDP) for hierarchical topic modeling. The nHDP is a generalization of the nested Chinese restaurant process (nCRP) that allows each word to follow its own path to a topic node according to a document-specific distribution on a shared tree. This alleviates the rigid, single-path formulation of the nCRP, allowing a document to more easily express thematic borrowings as a random effect. We derive a stochastic variational inference algorithm for the model, in addition to a greedy subtree selection method for each document, which allows for efficient inference using massive collections of text documents. We demonstrate our algorithm on 1.8 million documents from The New York Times and 3.3 million documents from Wikipedia.
Fast MLE Computation for the Dirichlet Multinomial
Given a collection of categorical data, we want to find the parameters of a Dirichlet distribution which maximizes the likelihood of that data. Newton's method is typically used for this purpose but current implementations require reading through the entire dataset on each iteration. In this paper, we propose a modification which requires only a single pass through the dataset and substantially decreases running time. Furthermore we analyze both theoretically and empirically the performance of the proposed algorithm, and provide an open source implementation.
Sparse Principal Component Analysis via Rotation and Truncation
Hu, Zhenfang, Pan, Gang, Wang, Yueming, Wu, Zhaohui
Sparse principal component analysis (sparse PCA) aims at finding a sparse basis to improve the interpretability over the dense basis of PCA, meanwhile the sparse basis should cover the data subspace as much as possible. In contrast to most of existing work which deal with the problem by adding some sparsity penalties on various objectives of PCA, in this paper, we propose a new method SPCArt, whose motivation is to find a rotation matrix and a sparse basis such that the sparse basis approximates the basis of PCA after the rotation. The algorithm of SPCArt consists of three alternating steps: rotate PCA basis, truncate small entries, and update the rotation matrix. Its performance bounds are also given. SPCArt is efficient, with each iteration scaling linearly with the data dimension. It is easy to choose parameters in SPCArt, due to its explicit physical explanations. Besides, we give a unified view to several existing sparse PCA methods and discuss the connection with SPCArt. Some ideas in SPCArt are extended to GPower, a popular sparse PCA algorithm, to overcome its drawback. Experimental results demonstrate that SPCArt achieves the state-of-the-art performance. It also achieves a good tradeoff among various criteria, including sparsity, explained variance, orthogonality, balance of sparsity among loadings, and computational speed.