Industry
iLSTD: Eligibility Traces and Convergence Analysis
Geramifard, Alborz, Bowling, Michael, Zinkevich, Martin, Sutton, Richard S.
In this paper, we generalize the previous iLSTD algorithm and present three new results: (1) the first convergence proof for an iLSTD algorithm; (2) an extension to incorporate eligibility traces without changing the asymptotic computational complexity; and (3) the first empirical results with an iLSTD algorithm for a problem (mountain car) with feature vectors large enough (n 10, 000) to show substantial computational advantages over LSTD.
Multiple Instance Learning for Computer Aided Diagnosis
Dundar, Murat, Krishnapuram, Balaji, Rao, R. B., Fung, Glenn M.
Many computer aided diagnosis (CAD) problems can be best modelled as a multiple-instance learning (MIL) problem with unbalanced data: i.e., the training data typically consists of a few positive bags, and a very large number of negative instances. Existing MIL algorithms are much too computationally expensive for these datasets. We describe CH, a framework for learning a Convex Hull representation of multiple instances that is significantly faster than existing MIL algorithms. Our CH framework applies to any standard hyperplane-based learning algorithm, and for some algorithms, is guaranteed to find the global optimal solution. Experimental studies on two different CAD applications further demonstrate that the proposed algorithm significantly improves diagnostic accuracy when compared to both MIL and traditional classifiers. Although not designed for standard MIL problems (which have both positive and negative bags and relatively balanced datasets), comparisons against other MIL methods on benchmark problems also indicate that the proposed method is competitive with the state-of-the-art.
Optimal Single-Class Classification Strategies
El-Yaniv, Ran, Nisenson, Mordechai
We consider single-class classification (SCC) as a two-person game between the learner and an adversary. In this game the target distribution is completely known to the learner and the learner's goal is to construct a classifier capable of guaranteeing a given tolerance for the false-positive error while minimizing the false negative error. We identify both "hard" and "soft" optimal classification strategies for different types of games and demonstrate that soft classification can provide a significant advantage. Our optimal strategies and bounds provide worst-case lower bounds for standard, finite-sample SCC and also motivate new approaches to solving SCC.
A Theory of Retinal Population Coding
Doi, Eizaburo, Lewicki, Michael S.
Efficient coding models predict that the optimal code for natural images is a population of oriented Gabor receptive fields. These results match response properties of neurons in primary visual cortex, but not those in the retina. Does the retina use an optimal code, and if so, what is it optimized for? Previous theories of retinal coding have assumed that the goal is to encode the maximal amount of information about the sensory signal. However, the image sampled by retinal photoreceptors is degraded both by the optics of the eye and by the photoreceptor noise. Therefore, de-blurring and de-noising of the retinal signal should be important aspects of retinal coding.
Support Vector Machines on a Budget
The standard Support Vector Machine formulation does not provide its user with the ability to explicitly control the number of support vectors used to define the generated classifier. We present a modified version of SVM that allows the user to set a budget parameter B and focuses on minimizing the loss attained by the B worst-classified examples while ignoring the remaining examples. This idea can be used to derive sparse versions of both L1-SVM and L2-SVM. Technically, we obtain these new SVM variants by replacing the 1-norm in the standard SVM formulation with various interpolation-norms. We also adapt the SMO optimization algorithm to our setting and report on some preliminary experimental results.
Context dependent amplification of both rate and event-correlation in a VLSI network of spiking neurons
Chicca, Elisabetta, Indiveri, Giacomo, Douglas, Rodney J.
Cooperative competitive networks are believed to play a central role in cortical processing and have been shown to exhibit a wide set of useful computational properties. We propose a VLSI implementation of a spiking cooperative competitive network and show how it can perform context dependent computation both in the mean firing rate domain and in spike timing correlation space. In the mean rate case the network amplifies the activity of neurons belonging to the selected stimulus and suppresses the activity of neurons receiving weaker stimuli. In the event correlation case, the recurrent network amplifies with a higher gain the correlation between neurons which receive highly correlated inputs while leaving the mean firing rate unaltered. We describe the network architecture and present experimental data demonstrating its context dependent computation capabilities.
implicit Online Learning with Kernels
Cheng, Li, Schuurmans, Dale, Wang, Shaojun, Caelli, Terry, Vishwanathan, S.v.n.
Our first algorithm, ILK (implicit online learning with kernels), employs a new, implicit update technique that can be applied to a wide variety of convex loss functions. We then introduce a bounded memory version, SILK (sparse ILK), that maintains a compact representation of the predictor without compromising solution quality, even in non-stationary environments. We prove loss bounds and analyze the convergence rate of both. Experimental evidence shows that our proposed algorithms outperform current methods on synthetic and real data.
Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model
Chemudugunta, Chaitanya, Smyth, Padhraic, Steyvers, Mark
Techniques such as probabilistic topic models and latent-semantic indexing have been shown to be broadly useful at automatically extracting the topical or semantic content of documents, or more generally for dimension-reduction of sparse count data. These types of models and algorithms can be viewed as generating an abstraction from the words in a document to a lower-dimensional latent variable representation that captures what the document is generally about beyond the specific words it contains. In this paper we propose a new probabilistic model that tempers this approach by representing each document as a combination of (a) a background distribution over common words, (b) a mixture distribution over general topics, and (c) a distribution over words that are treated as being specific to that document. We illustrate how this model can be used for information retrieval by matching documents both at a general topic level and at a specific word level, providing an advantage over techniques that only match documents at a general level (such as topic models or latent-sematic indexing) or that only match documents at the specific word level (such as TF-IDF).
Max-margin classification of incomplete data
Chechik, Gal, Heitz, Geremy, Elidan, Gal, Abbeel, Pieter, Koller, Daphne
We consider the problem of learning classifiers for structurally incomplete data, where some objects have a subset of features inherently absent due to complex relationships between the features. The common approach for handling missing features is to begin with a preprocessing phase that completes the missing features, and then use a standard classification procedure. In this paper we show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate this task using a geometrically-inspired objective function, and discuss two optimization approaches: The linearly separable case is written as a set of convex feasibility problems, and the non-separable case has a non-convex objective that we optimize iteratively. By avoiding the pre-processing phase in which the data is completed, these approaches offer considerable computational savings. More importantly, we show that by elegantly handling complex patterns of missing values, our approach is both competitive with other methods when the values are missing at random and outperforms them when the missing values have nontrivial structure. We demonstrate our results on two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images.
Greedy Layer-Wise Training of Deep Networks
Bengio, Yoshua, Lamblin, Pascal, Popovici, Dan, Larochelle, Hugo
Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly nonlinear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.