Goto

Collaborating Authors

 Country


A Spectral Algorithm for Learning Hidden Markov Models

arXiv.org Artificial Intelligence

Hidden Markov Models (HMMs) are one of the most fundamental and widely used statistical tools for modeling discrete time series. In general, learning HMMs from data is computationally hard (under cryptographic assumptions), and practitioners typically resort to search heuristics which suffer from the usual local optima issues. We prove that under a natural separation condition (bounds on the smallest singular value of the HMM parameters), there is an efficient and provably correct algorithm for learning HMMs. The sample complexity of the algorithm does not explicitly depend on the number of distinct (discrete) observations---it implicitly depends on this quantity through spectral properties of the underlying HMM. This makes the algorithm particularly applicable to settings with a large number of observations, such as those in natural language processing where the space of observation is sometimes the words in a language. The algorithm is also simple, employing only a singular value decomposition and matrix multiplications.


Training Restricted Boltzmann Machines on Word Observations

arXiv.org Machine Learning

The restricted Boltzmann machine (RBM) is a flexible tool for modeling complex data, however there have been significant computational difficulties in using RBMs to model high-dimensional multinomial observations. In natural language processing applications, words are naturally modeled by K-ary discrete distributions, where K is determined by the vocabulary size and can easily be in the hundreds of thousands. The conventional approach to training RBMs on word observations is limited because it requires sampling the states of K-way softmax visible units during block Gibbs updates, an operation that takes time linear in K. In this work, we address this issue by employing a more general class of Markov chain Monte Carlo operators on the visible units, yielding updates with computational complexity independent of K. We demonstrate the success of our approach by training RBMs on hundreds of millions of word n-grams using larger vocabularies than previously feasible and using the learned features to improve performance on chunking and sentiment classification tasks, achieving state-of-the-art results on the latter.


Higher-Order Partial Least Squares (HOPLS): A Generalized Multi-Linear Regression Method

arXiv.org Artificial Intelligence

A new generalized multilinear regression model, termed the Higher-Order Partial Least Squares (HOPLS), is introduced with the aim to predict a tensor (multiway array) $\tensor{Y}$ from a tensor $\tensor{X}$ through projecting the data onto the latent space and performing regression on the corresponding latent variables. HOPLS differs substantially from other regression models in that it explains the data by a sum of orthogonal Tucker tensors, while the number of orthogonal loadings serves as a parameter to control model complexity and prevent overfitting. The low dimensional latent space is optimized sequentially via a deflation operation, yielding the best joint subspace approximation for both $\tensor{X}$ and $\tensor{Y}$. Instead of decomposing $\tensor{X}$ and $\tensor{Y}$ individually, higher order singular value decomposition on a newly defined generalized cross-covariance tensor is employed to optimize the orthogonal loadings. A systematic comparison on both synthetic data and real-world decoding of 3D movement trajectories from electrocorticogram (ECoG) signals demonstrate the advantages of HOPLS over the existing methods in terms of better predictive ability, suitability to handle small sample sizes, and robustness to noise.


Super-Mixed Multiple Attribute Group Decision Making Method Based on Hybrid Fuzzy Grey Relation Approach Degree

arXiv.org Artificial Intelligence

The feature of our method different from other fuzzy grey relation method for supermixed multiple attribute group decision-making is that all of the subjective and objective weights are obtained by interval grey number and that the group decisionmaking is performed based on the relative approach degree of grey TOPSIS, the relative approach degree of grey incidence and the relative membership degree of grey incidence using 4-dimensional Euclidean distance. The weighted Borda method is used to obtain final rank by using the results of four methods. An example shows the applicability of the proposed approach.


On-the-fly Macros

arXiv.org Artificial Intelligence

We present a domain-independent algorithm that computes macros in a novel way. Our algorithm computes macros "on-the-fly" for a given set of states and does not require previously learned or inferred information, nor prior domain knowledge. The algorithm is used to define new domain-independent tractable classes of classical planning that are proved to include \emph{Blocksworld-arm} and \emph{Towers of Hanoi}.


On the Detection of Concept Changes in Time-Varying Data Stream by Testing Exchangeability

arXiv.org Machine Learning

A martingale framework for concept change detection based on testing data exchangeability was recently proposed (Ho, 2005). In this paper, we describe the proposed change-detection test based on the Doob's Maximal Inequality and show that it is an approximation of the sequential probability ratio test (SPRT). The relationship between the threshold value used in the proposed test and its size and power is deduced from the approximation. The mean delay time before a change is detected is estimated using the average sample number of a SPRT. The performance of the test using various threshold values is examined on five different data stream scenarios simulated using two synthetic data sets. Finally, experimental results show that the test is effective in detecting changes in time-varying data streams simulated using three benchmark data sets.


Two-Way Latent Grouping Model for User Preference Prediction

arXiv.org Machine Learning

We introduce a novel latent grouping model for predicting the relevance of a new document to a user. The model assumes a latent group structure for both users and documents. We compared the model against a state-of-the-art method, the User Rating Profile model, where only users have a latent group structure. We estimate both models by Gibbs sampling. The new method predicts relevance more accurately for new documents that have few known ratings. The reason is that generalization over documents then becomes necessary and hence the twoway grouping is profitable.


Maximum Margin Bayesian Networks

arXiv.org Machine Learning

We consider the problem of learning Bayesian network classifiers that maximize the marginover a set of classification variables. We find that this problem is harder for Bayesian networks than for undirected graphical models like maximum margin Markov networks. The main difficulty is that the parameters in a Bayesian network must satisfy additional normalization constraints that an undirected graphical model need not respect. These additional constraints complicate the optimization task. Nevertheless, we derive an effective training algorithm that solves the maximum margin training problem for a range of Bayesian network topologies, and converges to an approximate solution for arbitrary network topologies. Experimental results show that the method can demonstrate improved generalization performance over Markov networks when the directed graphical structure encodes relevant knowledge. In practice, the training technique allows one to combine prior knowledge expressed as a directed (causal) model with state of the art discriminative learning methods.


PAC-Bayesian Majority Vote for Late Classifier Fusion

arXiv.org Machine Learning

A lot of attention has been devoted to multimedia indexing over the past few years. In the literature, we often consider two kinds of fusion schemes: The early fusion and the late fusion. In this paper we focus on late classifier fusion, where one combines the scores of each modality at the decision level. To tackle this problem, we investigate a recent and elegant well-founded quadratic program named MinCq coming from the Machine Learning PAC-Bayes theory. MinCq looks for the weighted combination, over a set of real-valued functions seen as voters, leading to the lowest misclassification rate, while making use of the voters' diversity. We provide evidence that this method is naturally adapted to late fusion procedure. We propose an extension of MinCq by adding an order- preserving pairwise loss for ranking, helping to improve Mean Averaged Precision measure. We confirm the good behavior of the MinCq-based fusion approaches with experiments on a real image benchmark.


Learning Factor Graphs in Polynomial Time & Sample Complexity

arXiv.org Machine Learning

We study computational and sample complexity of parameter and structure learning in graphical models. Our main result shows that the class of factor graphs with bounded factor size and bounded connectivity can be learned in polynomial time and polynomial number of samples, assuming that the data is generated by a network in this class. This result covers both parameter estimation for a known network structure and structure learning. It implies as a corollary that we can learn factor graphs for both Bayesian networks and Markov networks of bounded degree, in polynomial time and sample complexity. Unlike maximum likelihood estimation, our method does not require inference in the underlying network, and so applies to networks where inference is intractable. We also show that the error of our learned model degrades gracefully when the generating distribution is not a member of the target class of networks.