Statistical Learning
Condition Number Analysis of Kernel-based Density Ratio Estimation
Kanamori, Takafumi, Suzuki, Taiji, Sugiyama, Masashi
The ratio of two probability densities can be used for solving various machine learning tasks such as covariate shift adaptation (importance sampling), outlier detection (likelihood-ratio test), and feature selection (mutual information). Recently, several methods of directly estimating the density ratio have been developed, e.g., kernel mean matching, maximum likelihood density ratio estimation, and least-squares density ratio fitting. In this paper, we consider a kernelized variant of the least-squares method and investigate its theoretical properties from the viewpoint of the condition number using smoothed analysis techniques--the condition number of the Hessian matrix determines the convergence rate of optimization and the numerical stability. We show that the kernel least-squares method has a smaller condition number than a version of kernel mean matching and other M-estimators, implying that the kernel least-squares method has preferable numerical properties. We further give an alternative formulation of the kernel least-squares estimator which is shown to possess an even smaller condition number. We show that numerical studies meet our theoretical analysis.
Under-determined reverberant audio source separation using a full-rank spatial covariance model
Duong, Ngoc, Vincent, Emmanuel, Gribonval, Remi
This article addresses the modeling of reverberant recording environments in the context of under-determined convolutive blind source separation. We model the contribution of each source to all mixture channels in the time-frequency domain as a zero-mean Gaussian random variable whose covariance encodes the spatial characteristics of the source. We then consider four specific covariance models, including a full-rank unconstrained model. We derive a family of iterative expectationmaximization (EM) algorithms to estimate the parameters of each model and propose suitable procedures to initialize the parameters and to align the order of the estimated sources across all frequency bins based on their estimated directions of arrival (DOA). Experimental results over reverberant synthetic mixtures and live recordings of speech data show the effectiveness of the proposed approach.
Learning an Interactive Segmentation System
Nickisch, Hannes, Kohli, Pushmeet, Rother, Carsten
Many successful applications of computer vision to image or video manipulation are interactive by nature. However, parameters of such systems are often trained neglecting the user. Traditionally, interactive systems have been treated in the same manner as their fully automatic counterparts. Their performance is evaluated by computing the accuracy of their solutions under some fixed set of user interactions. This paper proposes a new evaluation and learning method which brings the user in the loop. It is based on the use of an active robot user - a simulated model of a human user. We show how this approach can be used to evaluate and learn parameters of state-of-the-art interactive segmentation systems. We also show how simulated user models can be integrated into the popular max-margin method for parameter learning and propose an algorithm to solve the resulting optimisation problem.
A Model-Based Approach to Predicting Predator-Prey & Friend-Foe Relationships in Ant Colonies
Understanding predator-prey relationships among insects is a challenging task in the domain of insect-colony research. This is due to several factors involved, such as determining whether a particular behavior is the result of a predator-prey interaction, a friend-foe interaction or another kind of interaction. In this paper, we analyze a series of predator-prey and friend-foe interactions in two colonies of carpenter ants to better understand and predict such behavior. Using the data gathered, we have also come up with a preliminary model for predicting such behavior under the specific conditions the experiment was conducted in. In this paper, we present the results of our data analysis as well as an overview of the processes involved.
Modeling sparse connectivity between underlying brain sources for EEG/MEG
Haufe, Stefan, Tomioka, Ryota, Nolte, Guido, Mueller, Klaus-Robert, Kawanabe, Motoaki
A. Functional brain connectivity The analysis of neural connectivity plays a crucial role for understanding the general functioning of the brain. In the past two decades such analysis has become possible thanks to tremendous progress that has been made in the fields of neuroimaging and mathematical modeling. Today, a multiplicity of imaging modalities exists, allowing to monitor brain dynamics at different spatial and temporal scales. Given multiple simultaneously-recorded time-series reflecting neural activity in different brain regions, a functional (taskrelated) connection (sometimes also called information flow or (causal) interaction in this paper) between two regions is commonly inferred, if a significant time-lagged influence between the corresponding time-series is found. Different measures have been proposed for quantifying this influence, most of them being formulated either in terms of the cross-spectrum (e.g., coherence, phase slope index [1]) or an autoregressive models (e.g., Granger causality [2], directed transfer function [3], partial directed coherence [4], [5]). B. Volume conduction problem in EEG and MEG In electroencephalography (EEG) and magnetoencephalography (MEG), sensors are placed outside the head and the problem of volume conduction arises. That is, rather than measuring activity of only one brain site, each sensor captures a linear superposition of signals from all over the brain. This mixing introduces instantaneous correlations in the data, which can cause traditional analyses to detect spurious connectivity [6].
Closing the Learning-Planning Loop with Predictive State Representations
Boots, Byron, Siddiqi, Sajid M., Gordon, Geoffrey J.
A central problem in artificial intelligence is that of planning to maximize future reward under uncertainty in a partially observable environment. In this paper we propose and demonstrate a novel algorithm which accurately learns a model of such an environment directly from sequences of action-observation pairs. We then close the loop from observations to actions by planning in the learned model and recovering a policy which is near-optimal in the original environment. Specifically, we present an efficient and statistically consistent spectral algorithm for learning the parameters of a Predictive State Representation (PSR). We demonstrate the algorithm by learning a model of a simulated high-dimensional, vision-based mobile robot planning task, and then perform approximate point-based planning in the learned PSR. Analysis of our results shows that the algorithm learns a state space which efficiently captures the essential features of the environment. This representation allows accurate prediction with a small number of parameters, and enables successful and efficient planning.
Hyper-sparse optimal aggregation
Gaïffas, Stéphane, Lecué, Guillaume
In this paper, we consider the problem of "hyper-sparse aggregation". Namely, given a dictionary $F = \{f_1, ..., f_M \}$ of functions, we look for an optimal aggregation algorithm that writes $\tilde f = \sum_{j=1}^M \theta_j f_j$ with as many zero coefficients $\theta_j$ as possible. This problem is of particular interest when $F$ contains many irrelevant functions that should not appear in $\tilde{f}$. We provide an exact oracle inequality for $\tilde f$, where only two coefficients are non-zero, that entails $\tilde f$ to be an optimal aggregation algorithm. Since selectors are suboptimal aggregation procedures, this proves that 2 is the minimal number of elements of $F$ required for the construction of an optimal aggregation procedures in every situations. A simulated example of this algorithm is proposed on a dictionary obtained using LARS, for the problem of selection of the regularization parameter of the LASSO. We also give an example of use of aggregation to achieve minimax adaptation over anisotropic Besov spaces, which was not previously known in minimax theory (in regression on a random design).
How to Explain Individual Classification Decisions
Baehrens, David, Schroeter, Timon, Harmeling, Stefan, Kawanabe, Motoaki, Hansen, Katja, Mueller, Klaus-Robert
After building a classifier with modern tools of machine learning we typically have a black box at hand that is able to predict well for unseen data. Thus, we get an answer to the question what is the most likely label of a given unseen data point. However, most methods will provide no answer why the model predicted the particular label for a single instance and what features were most influential for that particular instance. The only method that is currently able to provide such explanations are decision trees. This paper proposes a procedure which (based on a set of assumptions) allows to explain the decisions of any classification method.
Positive Definite Kernels in Machine Learning
This survey is an introduction to positive definite kernels and the set of methods they have inspired in the machine learning literature, namely kernel methods. We first discuss some properties of positive definite kernels as well as reproducing kernel Hibert spaces, the natural extension of the set of functions $\{k(x,\cdot),x\in\mathcal{X}\}$ associated with a kernel $k$ defined on a space $\mathcal{X}$. We discuss at length the construction of kernel functions that take advantage of well-known statistical models. We provide an overview of numerous data-analysis methods which take advantage of reproducing kernel Hilbert spaces and discuss the idea of combining several kernels to improve the performance on certain tasks. We also provide a short cookbook of different kernels which are particularly useful for certain data-types such as images, graphs or speech segments.
Supervised Feature Selection via Dependence Estimation
Song, Le, Smola, Alex, Gretton, Arthur, Borgwardt, Karsten, Bedo, Justin
The task is to find a functional dependence between x and y, f: x null y, subject to certain optimality conditions. Representative tasks include binary classification, multi-class classification, regression and ranking. We often want to reduce the dimension of the data (the number of features) before the actual learning (Guyon & Elisseeff, 2003); a larger number of features can be associated with higher data collection cost, more difficulty in model interpretation, higher computational cost for the classifier, and decreased generalisationAppearing in Proceedings of the 24 th International Conference on Machine Learning, Corvallis, OR, 2007.