Statistical Learning
Divergences, surrogate loss functions and experimental design
Nguyen, XuanLong, Wainwright, Martin J., Jordan, Michael I.
In this paper, we provide a general theorem that establishes a correspondence betweensurrogate loss functions in classification and the family of f-divergences. Moreover, we provide constructive procedures for determining the f-divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f-divergence. Next we introduce the notion of universal equivalence among loss functions and corresponding f-divergences, and provide necessary andsufficient conditions for universal equivalence to hold. These ideas have applications to classification problems that also involve a component ofexperiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decentralization requirements.
Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
Navot, Amir, Shpigelman, Lavi, Tishby, Naftali, Vaadia, Eilon
We present a nonlinear, simple, yet effective, feature subset selection method for regression and use it in analyzing cortical neural activity. Our algorithm involves a feature-weighted version of the k-nearest-neighbor algorithm. It is able to capture complex dependency of the target function onits input and makes use of the leave-one-out error as a natural regularization. We explain the characteristics of our algorithm on synthetic problemsand use it in the context of predicting hand velocity from spikes recorded in motor cortex of a behaving monkey. By applying feature selectionwe are able to improve prediction quality and suggest a novel way of exploring neural data.
Q-Clustering
Narasimhan, Mukund, Jojic, Nebojsa, Bilmes, Jeff A.
We show that Queyranne's algorithm for minimizing symmetric submodular functionscan be used for clustering with a variety of different objective functions. Two specific criteria that we consider in this paper are the single linkage and the minimum description length criteria. The first criterion triesto maximize the minimum distance between elements of different clusters,and is inherently "discriminative". It is known that optimal clusterings into k clusters for any given k in polynomial time for this criterion can be computed. The second criterion seeks to minimize the description length of the clusters given a probabilistic generative model. We show that the optimal partitioning into 2 clusters, and approximate partitioning (guaranteed to be within a factor of 2 of the the optimal) for more clusters can be computed. To the best of our knowledge, this is the first time that a tractable algorithm for finding the optimal clustering with respect to the MDL criterion for 2 clusters has been given. Besides the optimality result for the MDL criterion, the chief contribution of this paper is to show that the same algorithm can be used to optimize a broad class of criteria, and hence can be used for many application specific criterion for which efficient algorithm are not known.
Stimulus Evoked Independent Factor Analysis of MEG Data with Large Background Activity
Hild, Kenneth, Sekihara, Kensuke, Attias, Hagai T., Nagarajan, Srikantan S.
This paper presents a novel technique for analyzing electromagnetic imaging data obtained using the stimulus evoked experimental paradigm. The technique is based on a probabilistic graphical model, which describes thedata in terms of underlying evoked and interference sources, and explicitly models the stimulus evoked paradigm.
Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
Nadler, Boaz, Lafon, Stephane, Kevrekidis, Ioannis, Coifman, Ronald R.
This paper presents a diffusion based probabilistic interpretation of spectral clustering and dimensionality reduction algorithms that use the eigenvectors of the normalized graph Laplacian. Given the pairwise adjacency matrixof all points, we define a diffusion distance between any two data points and show that the low dimensional representation of the data by the first few eigenvectors of the corresponding Markov matrix is optimal undera certain mean squared error criterion.
Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
Lozano, Aurelie C., Kulkarni, Sanjeev R., Schapire, Robert E.
We study the statistical convergence and consistency of regularized Boosting methods, where the samples are not independent and identically distributed(i.i.d.) but come from empirical processes of stationary ฮฒ-mixing sequences. Utilizing a technique that constructs a sequence of independent blocks close in distribution to the original samples, we prove the consistency of the composite classifiers resulting from a regularization achievedby restricting the 1-norm of the base classifiers' weights. When compared to the i.i.d.
Radial Basis Function Network for Multi-task Learning
We extend radial basis function (RBF) networks to the scenario in which multiple correlated tasks are learned simultaneously, and present the corresponding learningalgorithms. We develop the algorithms for learning the network structure, in either a supervised or unsupervised manner. Training data may also be actively selected to improve the network's generalization totest data. Experimental results based on real data demonstrate the advantage of the proposed algorithms and support our conclusions.
Fusion of Similarity Data in Clustering
Lange, Tilman, Buhmann, Joachim M.
Fusing multiple information sources can yield significant benefits to successfully accomplishlearning tasks. Many studies have focussed on fusing information in supervised learning contexts. We present an approach to utilize multiple information sources in the form of similarity data for unsupervised learning. Based on similarity information, the clustering task is phrased as a nonnegative matrix factorization problem of a mixture ofsimilarity measurements. The tradeoff between the informativeness ofdata sources and the sparseness of their mixture is controlled by an entropy-based weighting mechanism. For the purpose of model selection, astability-based approach is employed to ensure the selection of the most self-consistent hypothesis. The experiments demonstrate the performance of the method on toy as well as real world data sets.
Rodeo: Sparse Nonparametric Regression in High Dimensions
Wasserman, Larry, Lafferty, John D.
We present a method for nonparametric regression that performs bandwidth selectionand variable selection simultaneously. The approach is based on the technique of incrementally decreasing the bandwidth in directions wherethe gradient of the estimator with respect to bandwidth is large. When the unknown function satisfies a sparsity condition, our approach avoids the curse of dimensionality, achieving the optimal minimax rateof convergence, up to logarithmic factors, as if the relevant variables wereknown in advance. The method--called rodeo (regularization of derivative expectation operator)--conducts a sequence of hypothesis tests, and is easy to implement. A modified version that replaces hard with soft thresholding effectively solves a sequence of lasso problems.