arXiv.org Machine Learning
Dimension reduction and variable selection in case control studies via regularized likelihood optimization
Bunea, Florentina, Barbu, Adrian
Dimension reduction and variable selection are performed routinely in case-control studies, but the literature on the theoretical aspects of the resulting estimates is scarce. We bring our contribution to this literature by studying estimators obtained via L1 penalized likelihood optimization. We show that the optimizers of the L1 penalized retrospective likelihood coincide with the optimizers of the L1 penalized prospective likelihood. This extends the results of Prentice and Pyke (1979), obtained for non-regularized likelihoods. We establish both the sup-norm consistency of the odds ratio, after model selection, and the consistency of subset selection of our estimators. The novelty of our theoretical results consists in the study of these properties under the case-control sampling scheme. Our results hold for selection performed over a large collection of candidate variables, with cardinality allowed to depend and be greater than the sample size. We complement our theoretical results with a novel approach of determining data driven tuning parameters, based on the bisection method. The resulting procedure offers significant computational savings when compared with grid search based methods. All our numerical experiments support strongly our theoretical findings.
Data spectroscopy: Eigenspaces of convolution operators and clustering
Shi, Tao, Belkin, Mikhail, Yu, Bin
This paper focuses on obtaining clustering information about a distribution from its i.i.d. samples. We develop theoretical results to understand and use clustering information contained in the eigenvectors of data adjacency matrices based on a radial kernel function with a sufficiently fast tail decay. In particular, we provide population analyses to gain insights into which eigenvectors should be used and when the clustering information for the distribution can be recovered from the sample. We learn that a fixed number of top eigenvectors might at the same time contain redundant clustering information and miss relevant clustering information. We use this insight to design the data spectroscopic clustering (DaSpec) algorithm that utilizes properly selected eigenvectors to determine the number of clusters automatically and to group the data accordingly. Our findings extend the intuitions underlying existing spectral techniques such as spectral clustering and Kernel Principal Components Analysis, and provide new understanding into their usability and modes of failure. Simulation studies and experiments on real-world data are conducted to show the potential of our algorithm. In particular, DaSpec is found to handle unbalanced groups and recover clusters of different shapes better than the competing methods.
Likelihood-based semi-supervised model selection with applications to speech processing
White, Christopher M., Khudanpur, Sanjeev P., Wolfe, Patrick J.
In conventional supervised pattern recognition tasks, model selection is typically accomplished by minimizing the classification error rate on a set of so-called development data, subject to ground-truth labeling by human experts or some other means. In the context of speech processing systems and other large-scale practical applications, however, such labeled development data are typically costly and difficult to obtain. This article proposes an alternative semi-supervised framework for likelihood-based model selection that leverages unlabeled data by using trained classifiers representing each model to automatically generate putative labels. The errors that result from this automatic labeling are shown to be amenable to results from robust statistics, which in turn provide for minimax-optimal censored likelihood ratio tests that recover the nonparametric sign test as a limiting case. This approach is then validated experimentally using a state-of-the-art automatic speech recognition system to select between candidate word pronunciations using unlabeled speech data that only potentially contain instances of the words under test. Results provide supporting evidence for the utility of this approach, and suggest that it may also find use in other applications of machine learning.
A Geometric Approach to Sample Compression
Rubinstein, Benjamin I. P., Rubinstein, J. Hyam
The Sample Compression Conjecture of Littlestone & Warmuth has remained unsolved for over two decades. This paper presents a systematic geometric investigation of the compression of finite maximum concept classes. Simple arrangements of hyperplanes in Hyperbolic space, and Piecewise-Linear hyperplane arrangements, are shown to represent maximum classes, generalizing the corresponding Euclidean result. A main result is that PL arrangements can be swept by a moving hyperplane to unlabeled d-compress any finite maximum class, forming a peeling scheme as conjectured by Kuzmin & Warmuth. A corollary is that some d-maximal classes cannot be embedded into any maximum class of VC dimension d+k, for any constant k. The construction of the PL sweeping involves Pachner moves on the one-inclusion graph, corresponding to moves of a hyperplane across the intersection of d other hyperplanes. This extends the well known Pachner moves for triangulations to cubical complexes.
High-dimensional additive modeling
Meier, Lukas, van de Geer, Sara, Bühlmann, Peter
We propose a new sparsity-smoothness penalty for high-dimensional generalized additive models. The combination of sparsity and smoothness is crucial for mathematical theory as well as performance for finite-sample data. We present a computationally efficient algorithm, with provable numerical convergence properties, for optimizing the penalized likelihood. Furthermore, we provide oracle results which yield asymptotic optimality of our estimator for high dimensional but sparse additive models. Finally, an adaptive version of our sparsity-smoothness penalized approach yields large additional performance gains.
A Hierarchical Bayesian Model for Frame Representation
Chaâri, L., Pesquet, J. -C., Tourneret, J. -Y., Ciuciu, Ph., Benazza-Benyahia, A.
In many signal processing problems, it may be fruitful to represent the signal under study in a frame. If a probabilistic approach is adopted, it becomes then necessary to estimate the hyper-parameters characterizing the probability distribution of the frame coefficients. This problem is difficult since in general the frame synthesis operator is not bijective. Consequently, the frame coefficients are not directly observable. This paper introduces a hierarchical Bayesian model for frame representation. The posterior distribution of the frame coefficients and model hyper-parameters is derived. Hybrid Markov Chain Monte Carlo algorithms are subsequently proposed to sample from this posterior distribution. The generated samples are then exploited to estimate the hyper-parameters and the frame coefficients of the target signal. Validation experiments show that the proposed algorithms provide an accurate estimation of the frame coefficients and hyper-parameters. Application to practical problems of image denoising show the impact of the resulting Bayesian estimation on the recovered signal quality.
Causal Inference on Discrete Data using Additive Noise Models
Peters, Jonas, Janzing, Dominik, Schölkopf, Bernhard
Inferring the causal structure of a set of random variables from a finite sample of the joint distribution is an important problem in science. Recently, methods using additive noise models have been suggested to approach the case of continuous variables. In many situations, however, the variables of interest are discrete or even have only finitely many states. In this work we extend the notion of additive noise models to these cases. We prove that whenever the joint distribution $\prob^{(X,Y)}$ admits such a model in one direction, e.g. $Y=f(X)+N, N \independent X$, it does not admit the reversed model $X=g(Y)+\tilde N, \tilde N \independent Y$ as long as the model is chosen in a generic way. Based on these deliberations we propose an efficient new algorithm that is able to distinguish between cause and effect for a finite sample of discrete variables. In an extensive experimental study we show that this algorithm works both on synthetic and real data sets.
Which graphical models are difficult to learn?
Bento, Jose, Montanari, Andrea
We consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms systematically fail when the Markov random field develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it).
Distinguishing Cause and Effect via Second Order Exponential Models
Janzing, Dominik, Sun, Xiaohai, Schoelkopf, Bernhard
We propose a method to infer causal structures containing both discrete and continuous variables. The idea is to select causal hypotheses for which the conditional density of every variable, given its causes, becomes smooth. We define a family of smooth densities and conditional densities by second order exponential models, i.e., by maximizing conditional entropy subject to first and second statistical moments. If some of the variables take only values in proper subsets of R^n, these conditionals can induce different families of joint distributions even for Markov-equivalent graphs. We consider the case of one binary and one real-valued variable where the method can distinguish between cause and effect. Using this example, we describe that sometimes a causal hypothesis must be rejected because P(effect|cause) and P(cause) share algorithmic information (which is untypical if they are chosen independently). This way, our method is in the same spirit as faithfulness-based causal inference because it also rejects non-generic mutual adjustments among DAG-parameters.
The Cyborg Astrobiologist: Testing a Novelty-Detection Algorithm on Two Mobile Exploration Systems at Rivas Vaciamadrid in Spain and at the Mars Desert Research Station in Utah
McGuire, P. C., Gross, C., Wendt, L., Bonnici, A., Souza-Egipsy, V., Ormo, J., Diaz-Martinez, E., Foing, B. H., Bose, R., Walter, S., Oesker, M., Ontrup, J., Haschke, R., Ritter, H.
(ABRIDGED) In previous work, two platforms have been developed for testing computer-vision algorithms for robotic planetary exploration (McGuire et al. 2004b,2005; Bartolo et al. 2007). The wearable-computer platform has been tested at geological and astrobiological field sites in Spain (Rivas Vaciamadrid and Riba de Santiuste), and the phone-camera has been tested at a geological field site in Malta. In this work, we (i) apply a Hopfield neural-network algorithm for novelty detection based upon color, (ii) integrate a field-capable digital microscope on the wearable computer platform, (iii) test this novelty detection with the digital microscope at Rivas Vaciamadrid, (iv) develop a Bluetooth communication mode for the phone-camera platform, in order to allow access to a mobile processing computer at the field sites, and (v) test the novelty detection on the Bluetooth-enabled phone-camera connected to a netbook computer at the Mars Desert Research Station in Utah. This systems engineering and field testing have together allowed us to develop a real-time computer-vision system that is capable, for example, of identifying lichens as novel within a series of images acquired in semi-arid desert environments. We acquired sequences of images of geologic outcrops in Utah and Spain consisting of various rock types and colors to test this algorithm. The algorithm robustly recognized previously-observed units by their color, while requiring only a single image or a few images to learn colors as familiar, demonstrating its fast learning capability.