Safe and Efficient Screening For Sparse Support Vector Machine
Assume that $X \in \mathbb{R}^{m \times n}$ is a data set containing $n$ samples, $X = (x_1, \ldots$ … Let $w^*(\lambda)$ be the optimal solution of Eq. (1). All the features with nonzero values in $w^*(\lambda)$ are called active … The Lagrangian multiplier [1] of the problem defined in Eq. (1) is: … Eq. (2) can be reformulated as: … Since the problem defined in Eq. (1) is convex and the optimal value of the … In the preceding equation, $\hat{f}_j = \ldots$, and $Y$ is a diagonal matrix with $Y_{i,i} = \ldots$ When the input is given, it can be obtained in a closed form. The L1-regularized L2-loss SVM in Eq. (1) can be rewritten in an unconstrained … Eq. (22) shows that the necessary condition for a feature $f$ to be active in the … To bound the value of $\theta^\top \hat{f}$, we first need to construct a closed convex set $K$ that … We first study how to construct the convex set $K$. In the following, we construct a closed convex set $K$ based on Eq. (19) and … The proof of this proposition can be found in [2]. Let $\theta_1$ and $\theta_2$ be the optimal solutions of the problem defined in Eq. (19) for … Assume that $\lambda_1$ … $\lambda_2$, and $\theta_1$ is known. In the preceding equations, $\theta_1$, $\lambda_1$, and $\lambda_2$ are known. Figure 1 shows an example of $K$ in a two-dimensional space; $K$ is indicated by the shaded area. Besides the $n$-dimensional hyperball defined in Eq. (32), it is possible to … By applying Proposition 6.1 to the objective function defined in Eq. (33) for $\theta_1$, … Let $t = \ldots \ge 0$. By substituting $\theta = \theta_2$ and $\theta = \theta_1$ into Eq. … and Eq. (35), respectively, and then combining the two obtained equations, the … As the value of $t$ changes from 0 to $\infty$, Eq. (36) generates a series of hyperballs. Eq. (36) reaches its minimum when … The theorem can be proved by minimizing the quantity defined in Eq. (36).
Para-active learning
Agarwal, Alekh, Bottou, Leon, Dudik, Miroslav, Langford, John
Training examples are not all equally informative. Active learning strategies leverage this observation in order to massively reduce the number of examples that need to be labeled. We leverage the same observation to build a generic strategy for parallelizing learning algorithms. This strategy is effective because the search for informative examples is highly parallelizable and because we show that its performance does not deteriorate when the sifting process relies on a slightly outdated model. Parallel active learning is particularly attractive for training nonlinear models with nonlinear representations because there are few practical parallel learning algorithms for such models. We report preliminary experiments using both kernel SVMs and SGD-trained neural networks.
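The sifting idea in this abstract can be illustrated with a small sequential simulation. The sketch below is not the authors' implementation: it assumes a linear model trained with hinge-loss SGD, a margin-based uncertainty test as the sifting rule, and a toy data generator, and all function names (`make_data`, `sift`, `para_active_train`) are hypothetical. Each round, every shard is sifted against a stale snapshot of the model, which is the step that could run in parallel, and only the selected examples are passed to the updater.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def make_data(n, seed=0):
    """Hypothetical toy data: 2-D points labeled by the sign of x0 + x1."""
    rng = random.Random(seed)
    data = []
    while len(data) < n:
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        s = x[0] + x[1]
        if abs(s) >= 0.2:  # keep a gap around the boundary so the toy problem is easy
            data.append((x, 1 if s > 0 else -1))
    return data

def sift(stale_w, shard, margin):
    """Select examples the (possibly outdated) model is uncertain about.

    The label is not consulted here: only selected examples need labeling.
    """
    return [(x, y) for x, y in shard if abs(dot(stale_w, x)) < margin]

def para_active_train(data, num_nodes=4, rounds=20, margin=1.0, lr=0.1):
    w = [0.0] * len(data[0][0])
    shards = [data[i::num_nodes] for i in range(num_nodes)]
    queries_per_round = []
    for _ in range(rounds):
        stale_w = list(w)  # snapshot shared with all sifting nodes
        selected = []
        for shard in shards:  # each shard could be sifted on its own node
            selected.extend(sift(stale_w, shard, margin))
        queries_per_round.append(len(selected))
        for x, y in selected:  # the updater only sees the sifted examples
            if y * dot(w, x) < margin:  # hinge-loss SGD step
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w, queries_per_round

def accuracy(w, data):
    return sum(1 for x, y in data if y * dot(w, x) > 0) / len(data)
```

Calling `para_active_train(make_data(200))` returns the learned weights together with the number of examples selected in each round, which makes it easy to see how the labeling load shrinks as the model improves.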
Necessary and Sufficient Conditions for Novel Word Detection in Separable Topic Models
Ding, Weicong, Ishwar, Prakash, Rohban, Mohammad H., Saligrama, Venkatesh
The simplicial condition and other stronger conditions that imply it have recently played a central role in developing polynomial time algorithms with provable asymptotic consistency and sample complexity guarantees for topic estimation in separable topic models. Of these algorithms, those that rely solely on the simplicial condition are impractical while the practical ones need stronger conditions. In this paper, we demonstrate, for the first time, that the simplicial condition is a fundamental, algorithm-independent, information-theoretic necessary condition for consistent separable topic estimation. Furthermore, under solely the simplicial condition, we present a practical quadratic-complexity algorithm based on random projections which consistently detects all novel words of all topics using only up to second-order empirical word moments. This algorithm is amenable to distributed implementation making it attractive for "big-data" scenarios involving a network of large distributed databases.
Online Ensemble Learning for Imbalanced Data Streams
While both cost-sensitive learning and online learning have been studied extensively, little effort has gone into dealing with the two issues simultaneously. To address this challenging task, a novel learning framework is proposed in this paper. The key idea is to fuse online ensemble algorithms with state-of-the-art batch-mode cost-sensitive bagging/boosting algorithms. Within this framework, two separately developed research areas are bridged, and a set of theoretically sound online cost-sensitive bagging and online cost-sensitive boosting algorithms is proposed for the first time. Unlike other online cost-sensitive learning algorithms, which lack theoretical analysis of asymptotic properties, the proposed algorithms are guaranteed to converge under certain conditions, and experimental evidence on benchmark data sets also validates the effectiveness and efficiency of the proposed methods.
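One way to picture an online cost-sensitive bagging scheme is sketched below. It is an illustration under stated assumptions, not the algorithms proposed in the paper: it takes the standard online bagging device of presenting each example to each base learner a Poisson-distributed number of times and, as an assumption made here, lets the Poisson rate depend on the class so that the costly (minority) class is resampled more often. The base learner, the rate values, and all class names are hypothetical.

```python
import math
import random

def poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

class CentroidLearner:
    """Toy incremental base learner: one running mean per class (1-D inputs)."""
    def __init__(self):
        self.sums = {}
        self.counts = {}

    def update(self, x, y):
        self.sums[y] = self.sums.get(y, 0.0) + x
        self.counts[y] = self.counts.get(y, 0) + 1

    def predict(self, x):
        return min(self.counts, key=lambda y: abs(x - self.sums[y] / self.counts[y]))

class OnlineCostSensitiveBagging:
    def __init__(self, n_learners, rates, seed=0):
        self.learners = [CentroidLearner() for _ in range(n_learners)]
        self.rates = rates  # class label -> Poisson rate (assumed cost-based setting)
        self.rng = random.Random(seed)
        self.updates = {label: 0 for label in rates}

    def update(self, x, y):
        # Each learner sees the example k ~ Poisson(rate of its class) times.
        for learner in self.learners:
            k = poisson(self.rates[y], self.rng)
            for _ in range(k):
                learner.update(x, y)
            self.updates[y] += k

    def predict(self, x):
        # Majority vote over the learners that have seen at least one example.
        votes = [l.predict(x) for l in self.learners if l.counts]
        return max(set(votes), key=votes.count)
```

With rates such as `{0: 1.0, 1: 5.0}`, each minority-class example contributes about five times as many updates per learner as a majority-class example, which is the resampling effect a batch cost-sensitive bagging method would obtain by oversampling.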
Automatic Classification of Variable Stars in Catalogs with missing data
Pichara, Karim, Protopapas, Pavlos
We present an automatic classification method for astronomical catalogs with missing data. We use Bayesian networks, probabilistic graphical models that allow us to perform inference to predict missing values given observed data and dependency relationships between variables. To learn a Bayesian network from incomplete data, we use an iterative algorithm that utilises sampling methods and expectation maximization to estimate the distributions and probabilistic dependencies of variables from data with missing values. To test our model we use three catalogs with missing data (SAGE, 2MASS and UBVI) and one complete catalog (MACHO). We examine how classification accuracy changes when information from missing data catalogs is included, how our method compares to traditional missing data approaches, and at what computational cost. Integrating these catalogs with missing data, we find that classification of variable objects improves by a few percent, and by 15% for quasar detection, while keeping the computational cost the same.
A comparison of bandwidth selectors for mean shift clustering
Chacón, José E., Monfort, Pablo
We explore the performance of several automatic bandwidth selectors, originally designed for density gradient estimation, as data-based procedures for nonparametric, modal clustering. The key tool to obtain a clustering from density gradient estimators is the mean shift algorithm, which yields a partition not only of the data sample but also of the whole space. The results of our simulation study suggest that most of the methods considered here, such as cross-validation and plug-in bandwidth selectors, are useful for cluster analysis via the mean shift algorithm. Keywords: bandwidth selection, mean shift algorithm, modal clustering.
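To make the role of the bandwidth concrete, here is a minimal one-dimensional mean shift sketch. It assumes a Gaussian kernel and a fixed, hand-picked bandwidth rather than any of the data-based selectors studied in the paper, and the function names are hypothetical. Each point is moved repeatedly to the kernel-weighted mean of the sample until it stops at a mode of the density estimate; points that reach the same mode form one cluster.

```python
import math

def mean_shift_point(x, sample, bandwidth, tol=1e-8, max_iter=1000):
    """Follow the mean shift iteration from x to a mode of the Gaussian KDE."""
    for _ in range(max_iter):
        weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample]
        new_x = sum(w * xi for w, xi in zip(weights, sample)) / sum(weights)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

def mean_shift_clusters(sample, bandwidth, merge_tol=1e-3):
    """Assign each sample point to the mode its mean shift path converges to."""
    modes, labels = [], []
    for x in sample:
        m = mean_shift_point(x, sample, bandwidth)
        for k, known in enumerate(modes):
            if abs(m - known) < merge_tol:  # same mode as an earlier point
                labels.append(k)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels
```

Because `mean_shift_point` accepts any starting value, not only sample points, the same routine assigns arbitrary new locations to a cluster, which is how the procedure partitions the whole space. A smaller bandwidth generally produces more modes and therefore more clusters, while a larger one merges them.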
Structured Optimal Transmission Control in Network-coded Two-way Relay Channels
Ding, Ni, Sadeghi, Parastoo, Kennedy, Rodney A.
This paper considers a transmission control problem in network-coded two-way relay channels (NC-TWRC), where the relay buffers random symbol arrivals from two users, and the channels are assumed to be fading. The problem is modeled by a discounted infinite horizon Markov decision process (MDP). The objective is to find a transmission control policy that simultaneously minimizes symbol delay, buffer overflow, transmission power consumption and error rate in the long run. By using the concepts of submodularity, multimodularity and L-natural convexity, we study the structure of the optimal policy found by the dynamic programming (DP) algorithm. We show that the optimal transmission policy is nondecreasing in queue occupancies and/or channel states under certain conditions, such as the chosen values of parameters in the MDP model, the channel modeling method, the modulation scheme and the preservation of stochastic dominance in the transitions of system states. The results derived in this paper can be used to relieve the high complexity of DP and facilitate real-time control.
Trading USDCHF filtered by Gold dynamics via HMM coupling
We devise a USDCHF trading strategy using the dynamics of gold as a filter. Our strategy involves modelling both USDCHF and gold using a coupled hidden Markov model (CHMM). The observations will be indicators, RSI and CCI, which will be used as triggers for our trading signals. Upon decoding the model in each iteration, we can get the next most probable state and the next most probable observation. Hopefully by taking advantage of intermarket analysis and the Markov property implicit in the model, trading with these most probable values will produce profitable results.
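For readers unfamiliar with the indicators used as observations, the sketch below computes the Relative Strength Index (RSI) from a price series. It is an assumption-laden illustration rather than the authors' code: it uses the simple-average variant of RSI (Wilder's original formulation uses a smoothed average instead), a default look-back of 14 periods, and a convention chosen here of returning 50 for a completely flat window.

```python
def rsi(prices, period=14):
    """Simple-average RSI over the last `period` price changes (range 0 to 100)."""
    if len(prices) < period + 1:
        raise ValueError("need at least period + 1 prices")
    window = prices[-(period + 1):]
    changes = [b - a for a, b in zip(window, window[1:])]
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)
    if losses == 0:
        # No down moves: RSI saturates at 100; a flat window is mapped to 50 here.
        return 100.0 if gains > 0 else 50.0
    rs = gains / losses  # relative strength: ratio of summed gains to summed losses
    return 100.0 - 100.0 / (1.0 + rs)
```

The value is high when recent moves were mostly upward and low when they were mostly downward. In a setup like the one described above, a discretized version of such a value could serve as one observation symbol of the hidden Markov model.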
Distributed Matrix Completion and Robust Factorization
Mackey, Lester, Talwalkar, Ameet, Jordan, Michael I.
If learning methods are to scale to the massive sizes of modern datasets, it is essential for the field of machine learning to embrace parallel and distributed computing. Inspired by the recent development of matrix factorization methods with rich theory but poor computational complexity and by the relative ease of mapping matrices onto distributed architectures, we introduce a scalable divide-and-conquer framework for noisy matrix factorization. We present a thorough theoretical analysis of this framework in which we characterize the statistical errors introduced by the "divide" step and control their magnitude in the "conquer" step, so that the overall algorithm enjoys high-probability estimation guarantees comparable to those of its base algorithm. We also present experiments in collaborative filtering and video background modeling that demonstrate the near-linear to superlinear speed-ups attainable with this approach.
Generalized Thompson Sampling for Contextual Bandits
Thompson Sampling, one of the oldest heuristics for solving multi-armed bandits, has recently been shown to demonstrate state-of-the-art performance. This empirical success has led to great interest in the theoretical understanding of this heuristic. In this paper, we approach this problem in a way very different from existing efforts. In particular, motivated by the connection between Thompson Sampling and exponentiated updates, we propose a new family of algorithms called Generalized Thompson Sampling in the expert-learning framework, which includes Thompson Sampling as a special case. Like most expert-learning algorithms, Generalized Thompson Sampling uses a loss function to adjust the experts' weights. General regret bounds are derived, which are also instantiated to two important loss functions: square loss and logarithmic loss. In contrast to existing bounds, our results apply to quite general contextual bandits. More importantly, they quantify the effect of the "prior" distribution on the regret bounds.
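The connection between Thompson Sampling and exponentiated updates can be sketched as follows. This is a simplified illustration of the idea stated in the abstract, not the paper's exact algorithm, and it may omit details of the published method. The assumptions made here are: each expert maps a context to predicted reward probabilities per arm, rewards are binary, an expert is sampled in proportion to its weight and its greedy arm is played, and every expert's weight is then multiplied by exp(-eta * loss). All names (`generalized_ts`, `expert_a`, `expert_b`, `env`) are hypothetical.

```python
import math
import random

def log_loss(p, r):
    return -math.log(p if r == 1 else 1.0 - p)

def square_loss(p, r):
    return (p - r) ** 2

def normalize(log_w):
    """Turn log-weights into a probability vector without underflow."""
    m = max(log_w)
    unnorm = [math.exp(lw - m) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def generalized_ts(experts, env, contexts, eta=1.0, loss=log_loss, prior=None, rng=None):
    rng = rng or random.Random(0)
    n = len(experts)
    prior = prior or [1.0 / n] * n
    log_w = [math.log(p) for p in prior]
    for x in contexts:
        probs = normalize(log_w)
        e = rng.choices(range(n), weights=probs)[0]  # sample an expert
        preds = experts[e](x)
        arm = max(range(len(preds)), key=preds.__getitem__)  # its greedy arm
        r = env(x, arm, rng)  # observe a binary reward for the played arm only
        # Exponentiated update: penalize each expert by its loss on the played arm.
        log_w = [lw - eta * loss(f(x)[arm], r) for lw, f in zip(log_w, experts)]
    return normalize(log_w)

# Hypothetical two-expert problem with contexts 0 and 1; expert_a is the truth.
def expert_a(x):
    return [0.9, 0.1] if x == 0 else [0.2, 0.8]

def expert_b(x):
    return [0.1, 0.9] if x == 0 else [0.8, 0.2]

def env(x, arm, rng):
    return 1 if rng.random() < expert_a(x)[arm] else 0
```

With `loss=log_loss` and `eta=1.0`, the update multiplies each weight by the likelihood that expert assigns to the observed reward, so the weights coincide with a Bayesian posterior over experts and the procedure reduces to ordinary Thompson Sampling over this finite class. Passing `loss=square_loss` or a non-uniform `prior` shows the two knobs the abstract highlights.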