Asia
Multi-Task Learning via Conic Programming
Kato, Tsuyoshi, Kashima, Hisashi, Sugiyama, Masashi, Asai, Kiyoshi
When we have several related tasks, solving them simultaneously is shown to be more effective than solving them individually. This approach is called multi-task learning (MTL) and has been studied extensively. Existing approaches to MTL often treat all the tasks as \emph{uniformly related to each other and the relatedness of the tasks is controlled globally. For this reason, the existing methods can lead to undesired solutions when some tasks are not highly related to each other, and some pairs of related tasks can have significantly different solutions. In this paper, we propose a novel MTL algorithm that can overcome these problems. Our method makes use of a task network, which describes the relation structure among tasks. This allows us to deal with intricate relation structures in a systematic way. Furthermore, we control the relatedness of the tasks locally, so all pairs of related tasks are guaranteed to have similar solutions. We apply the above idea to support vector machines (SVMs) and show that the optimization problem can be cast as a second order cone program, which is convex and can be solved efficiently. The usefulness of our approach is demonstrated through simulations with protein super-family classification and ordinal regression problems.
Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations
Globerson, Amir, Jaakkola, Tommi S.
We present a novel message passing algorithm for approximating the MAP problem in graphical models. The algorithm is similar in structure to max-product but unlike max-product it always converges, and can be proven to find the exact MAP solution in various settings. The algorithm is derived via block coordinate descent in a dual of the LP relaxation of MAP, but does not require any tunable parameters such as step size or tree weights. We also describe a generalization of the method to cluster based potentials. The new method is tested on synthetic and real-world problems, and compares favorably with previous approaches.
Incremental Natural Actor-Critic Algorithms
Bhatnagar, Shalabh, Ghavamzadeh, Mohammad, Lee, Mark, Sutton, Richard S.
We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learningmethods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility withfunction approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variancein some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learningin the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.
Collapsed Variational Inference for HDP
Teh, Yee W., Kurihara, Kenichi, Welling, Max
A wide variety of Dirichlet-multinomial'topic' models have found interesting applications inrecent years. While Gibbs sampling remains an important method of inference in such models, variational techniques have certain advantages such as easy assessment of convergence, easy optimization without the need to maintain detailed balance, a bound on the marginal likelihood, and sidestepping of issues with topic-identifiability. The most accurate variational technique thus far, namely collapsed variational latent Dirichlet allocation, did not deal with model selection nor did it include inference for hyperparameters. We address both issues by generalizing thetechnique, obtaining the first variational algorithm to deal with the hierarchical Dirichlet process and to deal with hyperparameters of Dirichlet variables.
What makes some POMDP problems easy to approximate?
Lee, Wee S., Rong, Nan, Hsu, David
Point-based algorithms have been surprisingly successful in computing approximately optimalsolutions for partially observable Markov decision processes (POMDPs) in high dimensional belief spaces. In this work, we seek to understand the belief-space properties that allow some POMDP problems to be approximated efficiently and thus help to explain the point-based algorithms' success often observed inthe experiments. We show that an approximately optimal POMDP solution can be computed in time polynomial in the covering number of a reachable belief space, which is the subset of the belief space reachable from a given belief point. We also show that under the weaker condition of having a small covering number for an optimal reachable space, which is the subset of the belief space reachable under an optimal policy, computing an approximately optimal solution is NPhard. However, given a suitable set of points that "cover" an optimal reachable spacewell, an approximate solution can be computed in polynomial time. The covering number highlights several interesting properties that reduce the complexity ofPOMDP planning in practice, e.g., fully observed state variables, beliefs with sparse support, smooth beliefs, and circulant state-transition matrices.
Kernel Measures of Conditional Dependence
Fukumizu, Kenji, Gretton, Arthur, Sun, Xiaohai, Schรถlkopf, Bernhard
We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend onthe choice of kernel in the limit of infinite data, for a wide class of kernels. Atthe same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments.
HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
We present a novel paradigm for statistical machine translation (SMT), based on joint modeling of word alignment and the topical aspects underlying bilingual document pairs via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this new paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alignment of matching words between languages, during likelihood-based training of topic-dependent translational lexicons, as well as topic representations in each language. The resulting trained HM-BiTAM can not only display topic patterns like other methods such as LDA, but now for bilingual corpora; it also offers a principled way of inferring optimal translation in a context-dependent way. Our method integrates the conventional IBM Models based on HMM --- a key component for most of the state-of-the-art SMT systems, with the recently proposed BiTAM model, and we report an extensive empirical analysis (in many way complementary to the description-oriented of our method in three aspects: word alignment, bilingual topic representation, and translation.
Estimating disparity with confidence from energy neurons
Tsang, Eric K., Shi, Bertram E.
Binocular fusion takes place over a limited region smaller than one degree of visual angle (Panum's fusional area), which is on the order of the range of preferred disparities measured in populations of disparity-tuned neurons in the visual cortex. However, the actual range of binocular disparities encountered in natural scenes ranges over tens of degrees. This discrepancy suggests that there must be a mechanism for detecting whether the stimulus disparity is either inside or outside of the range of the preferred disparities in the population. Here, we present a statistical framework to derive feature in a population of V1 disparity neuron to determine the stimulus disparity within the preferred disparity range of the neural population. When optimized for natural images, it yields a feature that can be explained by the normalization which is a common model in V1 neurons. We further makes use of the feature to estimate the disparity in natural images. Our proposed model generates more correct estimates than coarse-to-fine multiple scales approaches and it can also identify regions with occlusion. The approach suggests another critical role for normalization in robust disparity estimation.