

P-values for high-dimensional regression

arXiv.org Machine Learning

Assigning significance in high-dimensional regression is challenging. Most computationally efficient selection algorithms cannot guard against the inclusion of noise variables, and asymptotically valid p-values are not available. An exception is a recent proposal by Wasserman and Roeder (2008), which splits the data into two parts: the number of variables is reduced to a manageable size using the first part, and classical variable selection techniques are then applied to the remaining variables on the second part. This yields asymptotic error control under minimal conditions. It involves, however, a one-time random split of the data, and results are sensitive to this arbitrary choice: it amounts to a 'p-value lottery' and makes it difficult to reproduce results. Here, we show that inference across multiple random splits can be aggregated while keeping asymptotic control over the inclusion of noise variables. We show that the resulting p-values can be used to control both the family-wise error rate (FWER) and the false discovery rate (FDR). In addition, the proposed aggregation is shown to improve power while substantially reducing the number of falsely selected variables.
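For concreteness, the single-split procedure and a simple aggregation rule can be sketched in Python. This is a minimal illustration, not the paper's full procedure: the lasso screener, the Bonferroni correction over the selected set, and the fixed-quantile (gamma = 1/2) aggregation are assumptions chosen for brevity.

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

def multi_split_pvalues(X, y, n_splits=50, seed=0):
    """Screen on one half with the lasso, test on the other with OLS,
    then aggregate each variable's p-values across random splits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pvals = np.ones((n_splits, p))
    for b in range(n_splits):
        idx = rng.permutation(n)
        i1, i2 = idx[:n // 2], idx[n // 2:]
        sel = np.flatnonzero(LassoCV(cv=5).fit(X[i1], y[i1]).coef_)
        if 0 < sel.size < len(i2) - 1:
            ols = sm.OLS(y[i2], sm.add_constant(X[i2][:, sel])).fit()
            # Bonferroni over the selected set, as in single-split inference
            pvals[b, sel] = np.minimum(ols.pvalues[1:] * sel.size, 1.0)
    # quantile aggregation at gamma = 1/2: twice the per-variable median
    return np.minimum(2.0 * np.median(pvals, axis=0), 1.0)

Variables never selected in a split keep a p-value of one for that split, which is the conservative convention that makes the aggregation valid.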


Regularization methods for learning incomplete matrices

arXiv.org Machine Learning

We use convex relaxation techniques to provide a sequence of solutions to the matrix completion problem. Using the nuclear norm as a regularizer, we provide simple and very efficient algorithms for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm iteratively replaces the missing elements with those obtained from a thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions.
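The iteration described above is short enough to sketch directly. The following is a minimal Python illustration under assumed defaults (dense SVD, fixed tolerance), not the authors' optimized implementation:

import numpy as np

def soft_impute(X, mask, lam, n_iter=200, tol=1e-5, Z0=None):
    """Fill missing entries of X (mask marks observed cells) by repeatedly
    soft-thresholding the singular values of the completed matrix."""
    Z = np.zeros_like(X, dtype=float) if Z0 is None else Z0.copy()
    for _ in range(n_iter):
        filled = np.where(mask, X, Z)                 # observed data + current fit
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        Z_new = (U * np.maximum(s - lam, 0.0)) @ Vt   # shrink the spectrum by lam
        if np.linalg.norm(Z_new - Z) <= tol * max(np.linalg.norm(Z), 1.0):
            return Z_new
        Z = Z_new
    return Z

Sweeping lam from large to small and passing each solution in as Z0 for the next value gives the warm-started regularization path mentioned above.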


Symmetry in Data Mining and Analysis: A Unifying View based on Hierarchy

arXiv.org Machine Learning

Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. The data sets themselves are explicitly linked, as a form of representation, to an observational or otherwise empirical domain of interest. "Structure" has long been understood as symmetry, which can take many forms with respect to a transformation: point, translational, rotational, and many others. Beginning with the role of number theory in expressing data, we show how we can naturally proceed to hierarchical structures. We show how this both encapsulates traditional paradigms in data analysis and opens up new perspectives on issues that are the order of the day, including data mining of massive, high-dimensional, heterogeneous data sets. Linkages with other fields, including computational logic and symbolic dynamics, are also discussed. The structures in data surveyed here are based on hierarchy, represented as p-adic numbers or an ultrametric topology.
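As a concrete handle on the hierarchy/ultrametric correspondence invoked above, the following sketch (a generic illustration on assumed toy data, not code from the paper) verifies that the cophenetic distances induced by a dendrogram satisfy the ultrametric inequality d(i,k) <= max(d(i,j), d(j,k)):

import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                        # toy observations
D = squareform(cophenet(linkage(X, method="average")))
n = D.shape[0]
# every triple must satisfy the strong (ultrametric) triangle inequality
print(all(D[i, k] <= max(D[i, j], D[j, k]) + 1e-9
          for i in range(n) for j in range(n) for k in range(n)))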


Percolation Thresholds of Updated Posteriors for Tracking Causal Markov Processes in Complex Networks

arXiv.org Machine Learning

Percolation on complex networks has been used to study computer viruses, epidemics, and other causal processes. Here, we present conditions for the existence of a network-specific, observation-dependent phase transition in the updated posterior of node states resulting from actively monitoring the network. Since traditional percolation thresholds are derived using observation-independent Markov chains, the threshold of the posterior should more accurately model the true phase transition of a network, as the updated posterior more accurately tracks the process. These conditions should provide insight into modeling the dynamic response of the updated posterior to active intervention and control policies while monitoring large complex networks.
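For background, the classical observation-independent threshold that the paper contrasts with can be seen in a small simulation (a generic illustration, not the paper's posterior-based analysis): bond percolation on an Erdos-Renyi graph with mean degree 3, where a giant component emerges once the edge-retention probability exceeds 1/3.

import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
G = nx.erdos_renyi_graph(n=2000, p=3.0 / 2000, seed=0)
for rho in (0.1, 0.2, 0.3, 0.5, 0.8):
    H = nx.Graph()
    H.add_nodes_from(G)                               # keep all nodes
    H.add_edges_from(e for e in G.edges if rng.random() < rho)
    giant = max(nx.connected_components(H), key=len)
    # relative size of the largest component: near zero below the
    # threshold rho = 1/3, growing steadily above it
    print(rho, len(giant) / G.number_of_nodes())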


A more robust boosting algorithm

arXiv.org Machine Learning

We present a new boosting algorithm, motivated by the large-margins theory for boosting. We give experimental evidence that the new algorithm is significantly more robust against label noise than existing boosting algorithms.


Deformed Statistics Formulation of the Information Bottleneck Method

arXiv.org Machine Learning

The theoretical basis for a candidate variational principle for the information bottleneck (IB) method is formulated within the ambit of the generalized nonadditive statistics of Tsallis. Given a nonadditivity parameter $ q $, the role of the \textit{additive duality} of nonadditive statistics ($ q^*=2-q $) in relating Tsallis entropies for the ranges $ q < 1 $ and $ q > 1 $ of the nonadditivity parameter is described. Defining $ X $, $ \tilde X $, and $ Y $ to be the source alphabet, the compressed reproduction alphabet, and the \textit{relevance variable}, respectively, it is demonstrated that minimization of a generalized IB (gIB) Lagrangian defined in terms of the nonadditivity parameter $ q^* $ self-consistently yields the \textit{nonadditive effective distortion measure} as the \textit{$ q $-deformed} generalized Kullback-Leibler divergence $ D_{K-L}^{q}[p(Y|X)||p(Y|\tilde X)] $. This result is achieved without enforcing any \textit{a priori} assumptions. Next, it is proven that the $ q^* $-deformed nonadditive free energy of the system is non-negative and convex. Finally, the update equations for the gIB method are derived. These results generalize critical features of the IB method to the case of Tsallis statistics.
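For reference, the standard Tsallis-statistics ingredients the abstract relies on are (textbook definitions, not results of this paper): the $ q $-logarithm $ \ln_q(x) = (x^{1-q}-1)/(1-q) $, the Tsallis entropy $ S_q[p] = \frac{1}{q-1}\left(1 - \sum_x p(x)^q\right) $, and the $ q $-deformed Kullback-Leibler divergence $ D_{K-L}^{q}[p||r] = \frac{1}{q-1}\left(\sum_x p(x)^q\, r(x)^{1-q} - 1\right) $, each of which recovers its Shannon counterpart as $ q \to 1 $; the additive duality pairs the descriptions at $ q $ and $ q^* = 2-q $.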


Multiscale Inference for High-Frequency Data

arXiv.org Machine Learning

This paper proposes a novel multiscale estimator for the integrated volatility of an Ito process, in the presence of market microstructure noise (observation error). The multiscale structure of the observed process is represented frequency-by-frequency and the concept of the multiscale ratio is introduced to quantify the bias in the realized integrated volatility due to the observation error. The multiscale ratio is estimated from a single sample path, and a frequency-by-frequency bias correction procedure is proposed, which simultaneously reduces variance. We extend the method to include correlated observation errors and provide the implied time domain form of the estimation procedure. The new method is implemented to estimate the integrated volatility for the Heston and other models, and the improved performance of our method over existing methods is illustrated by simulation studies.
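The bias being corrected can be seen in a small time-domain simulation (a generic illustration of microstructure bias under assumed noise levels, not the authors' frequency-by-frequency estimator): realized volatility computed from noisy prices overstates the integrated volatility, and the overstatement shrinks as the sampling interval coarsens.

import numpy as np

rng = np.random.default_rng(1)
n = 23400                                  # one trading day at 1-second ticks
sigma = 0.2 / np.sqrt(252)                 # daily volatility of the true price
true_log_price = np.cumsum(sigma / np.sqrt(n) * rng.normal(size=n))
obs = true_log_price + 5e-4 * rng.normal(size=n)   # microstructure noise

def realized_vol(x, k):
    r = np.diff(x[::k])                    # returns at sampling interval k
    return float(np.sum(r ** 2))

print("true IV:", sigma ** 2)
for k in (1, 5, 30, 300):
    print(k, realized_vol(obs, k))         # noise bias ~ 2 * (n/k) * noise_var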


On the Distribution of Penalized Maximum Likelihood Estimators: The LASSO, SCAD, and Thresholding

arXiv.org Machine Learning

We study the distributions of the LASSO, SCAD, and thresholding estimators in finite samples and in the large-sample limit. The asymptotic distributions are derived both for the case where the estimators are tuned to perform consistent model selection and for the case where they are tuned to perform conservative model selection. Our findings complement those of Knight and Fu (2000) and Fan and Li (2001). We show that the distributions are typically highly nonnormal regardless of how the estimator is tuned, and that this property persists in large samples. The uniform convergence rate of these estimators is also obtained and is shown to be slower than $ 1/\sqrt{n} $ when the estimator is tuned to perform consistent model selection. An impossibility result regarding estimation of the estimators' distribution function is also provided.
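For intuition about the nonnormality (a standard fact, not specific to this paper): in an orthogonal design the LASSO reduces to coordinatewise soft thresholding of the least-squares estimate, $ \hat\beta_j = \mathrm{sign}(\hat\beta_j^{\mathrm{LS}})\,(|\hat\beta_j^{\mathrm{LS}}| - \lambda)_+ $, so the estimator places positive probability mass exactly at zero, and its distribution is a mixture of a point mass and an absolutely continuous part rather than anything close to a normal.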


The Redundancy of a Computable Code on a Noncomputable Distribution

arXiv.org Machine Learning

We introduce new definitions of universal and superuniversal computable codes, based on a code's ability to approximate Kolmogorov complexity within a prescribed margin for all individual sequences from a given set. Such sets of sequences may be singled out almost surely with respect to certain probability measures. Consider a measure parameterized with a real parameter, and put an arbitrary prior on that parameter. The Bayesian measure is the expectation of the parameterized measure with respect to the prior. It turns out that a modified Shannon-Fano code for any computable Bayesian measure, which we call the Bayesian code, is superuniversal on a set of parameterized-measure-almost-all sequences for prior-almost every parameter. According to this result, in the typical setting of mathematical statistics no computable code enjoys redundancy that is ultimately much less than that of the Bayesian code. We therefore introduce another characteristic of computable codes: the catch-up time, the length of data for which the code length drops below the Kolmogorov complexity plus the prescribed margin. Some codes may have smaller catch-up times than Bayesian codes.
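To fix notation (a standard construction; the paper's precise definitions are not reproduced in the abstract): the Shannon-Fano code for a computable measure $ P $ has code lengths $ \ell_P(x_{1:n}) = \lceil -\log_2 P(x_{1:n}) \rceil $, the Bayesian measure is the mixture $ P_\pi(x_{1:n}) = \int P_\theta(x_{1:n})\, \pi(d\theta) $, and the redundancy of a code with lengths $ \ell $ relative to Kolmogorov complexity $ K $ is $ \ell(x_{1:n}) - K(x_{1:n}) $.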


Bayesian MAP Model Selection of Chain Event Graphs

arXiv.org Machine Learning

The class of chain event graph models is a generalisation of the class of discrete Bayesian networks, retaining most of the structural advantages of the Bayesian network for model interrogation, propagation, and learning, while more naturally encoding asymmetric state spaces and the order in which events happen. In this paper we demonstrate how, with complete sampling, conjugate closed-form model selection based on product Dirichlet priors is possible, and we prove that suitable homogeneity assumptions characterise the product Dirichlet prior on this class of models. We demonstrate our techniques using two educational examples.
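The closed-form score that such conjugacy yields is the Dirichlet-multinomial marginal likelihood, summed over the stages of a candidate model. The sketch below is a generic illustration with made-up counts and uniform priors, not the paper's implementation:

import numpy as np
from scipy.special import gammaln

def dirichlet_multinomial_score(counts, alpha):
    """Log marginal likelihood of counts under a Dirichlet(alpha) prior."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (gammaln(alpha.sum()) - gammaln(alpha.sum() + counts.sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

# a model's score is the sum of its stage scores; compare across stagings
stages = [([30, 12], [1.0, 1.0]), ([5, 40, 7], [1.0, 1.0, 1.0])]
print(sum(dirichlet_multinomial_score(n, a) for n, a in stages))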