Statistical Learning
Cooled and Relaxed Survey Propagation for MRFs
Chieu, Hai L., Lee, Wee S., Teh, Yee W.
We describe a new algorithm, Relaxed Survey Propagation (RSP), for finding MAP configurations in Markov random fields. We compare its performance with state-of-the-art algorithms including the max-product belief propagation, its sequential tree-reweightedvariant, residual (sum-product) belief propagation, and tree-structured expectation propagation. We show that it outperforms all approaches forIsing models with mixed couplings, as well as on a web person disambiguation task formulated as a supervised clustering problem.
Parallelizing Support Vector Machines on Distributed Computers
Zhu, Kaihua, Wang, Hao, Bai, Hongjie, Li, Jian, Qiu, Zhihuan, Cui, Hang, Chang, Edward Y.
Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. Let $n$ denote the number of training instances, $p$ the reduced matrix dimension after factorization ($p$ is significantly smaller than $n$), and $m$ the number of machines. PSVM reduces the memory requirement from $\MO$($n^2$) to $\MO$($np/m$), and improves computation time to $\MO$($np^2/m$). Empirical studies on up to $500$ computers shows PSVM to be effective.
A learning framework for nearest neighbor search
Cayton, Lawrence, Dasgupta, Sanjoy
Can we leverage learning techniques to build a fast nearest-neighbor (NN) retrieval data structure? We present a general learning framework for the NN problem in which sample queries are used to learn the parameters of a data structure that minimize the retrieval time and/or the miss rate. We explore the potential of this novel framework through two popular NN data structures: KD-trees and the rectilinear structures employed by locality sensitive hashing. We derive a generalization theory for these data structure classes and present simple learning algorithms for both. Experimental results reveal that learning often improves on the already strong performance of these data structures.
Discriminative Keyword Selection Using Support Vector Machines
Richardson, Fred, Campbell, William M.
Many tasks in speech processing involve classification of long term characteristics of a speech segment such as language, speaker, dialect, or topic. A natural technique fordetermining these characteristics is to first convert the input speech into a sequence of tokens such as words, phones, etc. From these tokens, we can then look for distinctive sequences, keywords, that characterize the speech. In many applications, a set of distinctive keywords may not be known a priori. In this case, an automatic method of building up keywords from short context units such as phones is desirable. We propose a method for the construction of keywords based upon Support Vector Machines. We cast the problem of keyword selection as a feature selection problem for n-grams of phones. We propose an alternating filter-wrappermethod that builds successively longer keywords. Application of this method to language recognition and topic recognition tasks shows that the technique produces interesting and significant qualitative and quantitative results.
The Distribution Family of Similarity Distances
Burghouts, Gertjan, Smeulders, Arnold, Geusebroek, Jan-mark
Assessing similarity between features is a key step in object recognition and scene categorization tasks. We argue that knowledge on the distribution of distances generated by similarity functions is crucial in deciding whether features are similar or not. Intuitively one would expect that similarities between features could arise from any distribution. In this paper, we will derive the contrary, and report the theoretical result that $L_p$-norms --a class of commonly applied distance metrics-- from one feature vector to other vectors are Weibull-distributed if the feature values are correlated and non-identically distributed. Besides these assumptions being realistic for images, we experimentally show them to hold for various popular feature extraction algorithms, for a diverse range of images. This fundamental insight opens new directions in the assessment of feature similarity, with projected improvements in object and scene recognition algorithms. Erratum: The authors of paper have declared that they have become convinced that the reasoning in the reference is too simple as a proof of their claims. As a consequence, they withdraw their theorems.
FilterBoost: Regression and Classification on Large Datasets
Bradley, Joseph K., Schapire, Robert E.
We study boosting in the filtering setting, where the booster draws examples from an oracle instead of using a fixed training set and so may train efficiently on very large datasets. Our algorithm, which is based on a logistic regression technique proposed by Collins, Schapire, & Singer, requires fewer assumptions to achieve bounds equivalent to or better than previous work. Moreover, we give the first proof that the algorithm of Collins et al. is a strong PAC learner, albeit within the filtering setting. Our proofs demonstrate the algorithm's strong theoretical properties forboth classification and conditional probability estimation, and we validate these results through extensive experiments. Empirically, our algorithm proves more robust to noise and overfitting than batch boosters in conditional probability estimation and proves competitive in classification.
The Tradeoffs of Large Scale Learning
Bottou, Léon, Bousquet, Olivier
This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation--estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithms in non-trivial ways.
Multi-task Gaussian Process Prediction
Bonilla, Edwin V., Chai, Kian M., Williams, Christopher
In this paper we investigate multi-task learning in the context of Gaussian Processes (GP).We propose a model that learns a shared covariance function on input-dependent features and a "free-form" covariance matrix over tasks. This allows forgood flexibility when modelling inter-task dependencies while avoiding the need for large amounts of data for training. We show that under the assumption ofnoise-free observations and a block design, predictions for a given task only depend on its target values and therefore a cancellation of inter-task transfer occurs.We evaluate the benefits of our model on two practical applications: a compiler performance prediction problem and an exam score prediction task. Additionally, we make use of GP approximations and properties of our model in order to provide scalability to large data sets.
Feature Selection Methods for Improving Protein Structure Prediction with Rosetta
Blum, Ben, Baker, David, Jordan, Michael I., Bradley, Philip, Das, Rhiju, Kim, David E.
Rosetta is one of the leading algorithms for protein structure prediction today. It is a Monte Carlo energy minimization method requiring many random restarts to find structures with low energy. In this paper we present a resampling technique for structure prediction of small alpha/beta proteins using Rosetta. From an initial roundof Rosetta sampling, we learn properties of the energy landscape that guide a subsequent round of sampling toward lower-energy structures. Rather than attempt to fit the full energy landscape, we use feature selection methods--both L1-regularized linear regression and decision trees--to identify structural features that give rise to low energy. We then enrich these structural features in the second sampling round. Results are presented across a benchmark set of nine small alpha/beta proteinsdemonstrating that our methods seldom impair, and frequently improve, Rosetta's performance.