Goto

Collaborating Authors

 Country


Quick sensitivity analysis for incremental data modification and its application to leave-one-out CV in linear classification problems

arXiv.org Machine Learning

We introduce a novel sensitivity analysis framework for large scale classification problems that can be used when a small number of instances are incrementally added or removed. For quickly updating the classifier in such a situation, incremental learning algorithms have been intensively studied in the literature. Although they are much more efficient than solving the optimization problem from scratch, their computational complexity yet depends on the entire training set size. It means that, if the original training set is large, completely solving an incremental learning problem might be still rather expensive. To circumvent this computational issue, we propose a novel framework that allows us to make an inference about the updated classifier without actually re-optimizing it. Specifically, the proposed framework can quickly provide a lower and an upper bounds of a quantity on the unknown updated classifier. The main advantage of the proposed framework is that the computational cost of computing these bounds depends only on the number of updated instances. This property is quite advantageous in a typical sensitivity analysis task where only a small number of instances are updated. In this paper we demonstrate that the proposed framework is applicable to various practical sensitivity analysis tasks, and the bounds provided by the framework are often sufficiently tight for making desired inferences.


Generalized Correntropy for Robust Adaptive Filtering

arXiv.org Machine Learning

As a robust nonlinear similarity measure in kernel space, correntropy has received increasing attention in domains of machine learning and signal processing. In particular, the maximum correntropy criterion (MCC) has recently been successfully applied in robust regression and filtering. The default kernel function in correntropy is the Gaussian kernel, which is, of course, not always the best choice. In this work, we propose a generalized correntropy that adopts the generalized Gaussian density (GGD) function as the kernel (not necessarily a Mercer kernel), and present some important properties. We further propose the generalized maximum correntropy criterion (GMCC), and apply it to adaptive filtering. An adaptive algorithm, called the GMCC algorithm, is derived, and the mean square convergence performance is studied. We show that the proposed algorithm is very stable and can achieve zero probability of divergence (POD). Simulation results confirm the theoretical expectations and demonstrate the desirable performance of the new algorithm.


On the Convergence Properties of Optimal AdaBoost

arXiv.org Artificial Intelligence

AdaBoost is one of the most popular machine-learning algorithms. It is simple to implement and often found very effective by practitioners, while still being mathematically elegant and theoretically sound. AdaBoost's behavior in practice, and in particular the test-error behavior, has puzzled many eminent researchers for over a decade: It seems to defy our general intuition in machine learning regarding the fundamental trade-off between model complexity and generalization performance. In this paper, we establish the convergence of "Optimal AdaBoost," a term coined by Rudin, Daubechies, and Schapire in 2004. We prove the convergence, with the number of rounds, of the classifier itself, its generalization error, and its resulting margins for fixed data sets, under certain reasonable conditions. More generally, we prove that the time/per-round average of almost any function of the example weights converges. Our approach is to frame AdaBoost as a dynamical system, to provide sufficient conditions for the existence of an invariant measure, and to employ tools from ergodic theory. Unlike previous work, we do not assume AdaBoost cycles; actually, we present empirical evidence against it on real-world datasets. Our main theoretical results hold under a weaker condition. We show sufficient empirical evidence that Optimal AdaBoost always met the condition on every real-world dataset we tried. Our results formally ground future convergence-rate analyses, and may even provide opportunities for slight algorithmic modifications to optimize the generalization ability of AdaBoost classifiers, thus reducing a practitioner's burden of deciding how long to run the algorithm.


Diffusion Component Analysis: Unraveling Functional Topology in Biological Networks

arXiv.org Machine Learning

Complex biological systems have been successfully modeled by biochemical and genetic interaction networks, typically gathered from high-throughput (HTP) data. These networks can be used to infer functional relationships between genes or proteins. Using the intuition that the topological role of a gene in a network relates to its biological function, local or diffusion based "guilt-by-association" and graph-theoretic methods have had success in inferring gene functions. Here we seek to improve function prediction by integrating diffusion-based methods with a novel dimensionality reduction technique to overcome the incomplete and noisy nature of network data. In this paper, we introduce diffusion component analysis (DCA), a framework that plugs in a diffusion model and learns a low-dimensional vector representation of each node to encode the topological properties of a network. As a proof of concept, we demonstrate DCA's substantial improvement over state-of-the-art diffusion-based approaches in predicting protein function from molecular interaction networks. Moreover, our DCA framework can integrate multiple networks from heterogeneous sources, consisting of genomic information, biochemical experiments and other resources, to even further improve function prediction. Yet another layer of performance gain is achieved by integrating the DCA framework with support vector machines that take our node vector representations as features. Overall, our DCA framework provides a novel representation of nodes in a network that can be used as a plug-in architecture to other machine learning algorithms to decipher topological properties of and obtain novel insights into interactomes.


Scheduled denoising autoencoders

arXiv.org Machine Learning

We present a representation learning method that learns features at multiple different levels of scale. Working within the unsupervised framework of denoising autoencoders, we observe that when the input is heavily corrupted during training, the network tends to learn coarse-grained features, whereas when the input is only slightly corrupted, the network tends to learn fine-grained features. This motivates the scheduled denoising autoencoder, which starts with a high level of noise that lowers as training progresses. We find that the resulting representation yields a significant boost on a later supervised task compared to the original input, or to a standard denoising autoencoder trained at a single noise level. After supervised fine-tuning our best model achieves the lowest ever reported error on the CIFAR-10 data set among permutation-invariant methods.


Gradient of Probability Density Functions based Contrasts for Blind Source Separation (BSS)

arXiv.org Machine Learning

The article derives some novel independence measures and contrast functions for Blind Source Separation (BSS) application. For the $k^{th}$ order differentiable multivariate functions with equal hyper-volumes (region bounded by hyper-surfaces) and with a constraint of bounded support for $k>1$, it proves that equality of any $k^{th}$ order derivatives implies equality of the functions. The difference between product of marginal Probability Density Functions (PDFs) and joint PDF of a random vector is defined as Function Difference (FD) of a random vector. Assuming the PDFs are $k^{th}$ order differentiable, the results on generalized functions are applied to the independence condition. This brings new sets of independence measures and BSS contrasts based on the $L^p$-Norm, $ p \geq 1$ of - FD, gradient of FD (GFD) and Hessian of FD (HFD). Instead of a conventional two stage indirect estimation method for joint PDF based BSS contrast estimation, a single stage direct estimation of the contrasts is desired. The article targets both the efficient estimation of the proposed contrasts and extension of the potential theory for an information field. The potential theory has a concept of reference potential and it is used to derive closed form expression for the relative analysis of potential field. Analogous to it, there are introduced concepts of Reference Information Potential (RIP) and Cross Reference Information Potential (CRIP) based on the potential due to kernel functions placed at selected sample points as basis in kernel methods. The quantities are used to derive closed form expressions for information field analysis using least squares. The expressions are used to estimate $L^2$-Norm of FD and $L^2$-Norm of GFD based contrasts.


High-Dimensional Classification for Brain Decoding

arXiv.org Machine Learning

Brain decoding involves the determination of a subject's cognitive state or an associated stimulus from functional neuroimaging data measuring brain activity. In this setting the cognitive state is typically characterized by an element of a finite set, and the neuroimaging data comprise voluminous amounts of spatiotemporal data measuring some aspect of the neural signal. The associated statistical problem is one of classification from high-dimensional data. We explore the use of functional principal component analysis, mutual information networks, and persistent homology for examining the data through exploratory analysis and for constructing features characterizing the neural signal for brain decoding. We review each approach from this perspective, and we incorporate the features into a classifier based on symmetric multinomial logistic regression with elastic net regularization. The approaches are illustrated in an application where the task is to infer, from brain activity measured with magnetoencephalography (MEG), the type of video stimulus shown to a subject.


Deep Narrow Boltzmann Machines are Universal Approximators

arXiv.org Machine Learning

We show that deep narrow Boltzmann machines are universal approximators of probability distributions on the activities of their visible units, provided they have sufficiently many hidden layers, each containing the same number of units as the visible layer. We show that, within certain parameter domains, deep Boltzmann machines can be studied as feedforward networks. We provide upper and lower bounds on the sufficient depth and width of universal approximators. These results settle various intuitions regarding undirected networks and, in particular, they show that deep narrow Boltzmann machines are at least as compact universal approximators as narrow sigmoid belief networks and restricted Boltzmann machines, with respect to the currently available bounds for those models.


Sample Complexity of Dictionary Learning and other Matrix Factorizations

arXiv.org Machine Learning

Many modern tools in machine learning and signal processing, such as sparse dictionary learning, principal component analysis (PCA), non-negative matrix factorization (NMF), $K$-means clustering, etc., rely on the factorization of a matrix obtained by concatenating high-dimensional vectors from a training collection. While the idealized task would be to optimize the expected quality of the factors over the underlying distribution of training vectors, it is achieved in practice by minimizing an empirical average over the considered collection. The focus of this paper is to provide sample complexity estimates to uniformly control how much the empirical average deviates from the expected cost function. Standard arguments imply that the performance of the empirical predictor also exhibit such guarantees. The level of genericity of the approach encompasses several possible constraints on the factors (tensor product structure, shift-invariance, sparsity \ldots), thus providing a unified perspective on the sample complexity of several widely used matrix factorization schemes. The derived generalization bounds behave proportional to $\sqrt{\log(n)/n}$ w.r.t.\ the number of samples $n$ for the considered matrix factorization techniques.


`local' vs. `global' parameters -- breaking the gaussian complexity barrier

arXiv.org Machine Learning

We show that if $F$ is a convex class of functions that is $L$-subgaussian, the error rate of learning problems generated by independent noise is equivalent to a fixed point determined by `local' covering estimates of the class, rather than by the gaussian averages. To that end, we establish new sharp upper and lower estimates on the error rate for such problems.