Directed Networks
Relational Marginal Problems: Theory and Estimation
Kuลพelka, Ondลej (Cardiff University) | Wang, Yuyi (ETH Zurich) | Davis, Jesse (KU Leuven) | Schockaert, Steven (Cardiff University)
In the propositional setting, the marginal problem is to find a (maximum-entropy) distribution that has some given marginals. We study this problem in a relational setting and make the following contributions. First, we compare two different notions of relational marginals. Second, we show a duality between the resulting relational marginal problems and the maximum likelihood estimation of the parameters of relational models, which generalizes a well-known duality from the propositional setting. Third, by exploiting the relational marginal formulation, we present a statistically sound method to learn the parameters of relational models that will be applied in settings where the number of constants differs between the training and test data. Furthermore, based on a relational generalization of marginal polytopes, we characterize cases where the standard estimators based on feature's number of true groundings needs to be adjusted and we quantitatively characterize the consequences of these adjustments. Fourth, we prove bounds on expected errors of the estimated parameters, which allows us to lower-bound, among other things, the effective sample size of relational training data.
Planning and Learning for Decentralized MDPs With Event Driven Rewards
Gupta, Tarun (International Institute of Information Technology, Hyderabad) | Kumar, Akshat (Singapore Management University) | Paruchuri, Praveen (International Institute of Information Technology, Hyderabad)
Decentralized (PO)MDPs provide a rigorous framework for sequential multiagent decision making under uncertainty. However, their high computational complexity limits the practical impact. To address scalability and real-world impact, we focus on settings where a large number of agents primarily interact through complex joint-rewards that depend on their entire histories of states and actions. Such history-based rewards encapsulate the notion of events or tasks such that the team reward is given only when the joint-task is completed. Algorithmically, we contribute---1) A nonlinear programming (NLP) formulation for such event-based planning model; 2) A probabilistic inference based approach that scales much better than NLP solvers for a large number of agents; 3) A policy gradient based multiagent reinforcement learning approach that scales well even for exponential state-spaces. Our inference and RL-based advances enable us to solve a large real-world multiagent coverage problem modeling schedule coordination of agents in a real urban subway network where other approaches fail to scale.
Dual Transfer Learning for Neural Machine Translation with Marginal Distribution Regularization
Wang, Yijun (University of Science and Technology of China) | Xia, Yingce (University of Science and Technology of China) | Zhao, Li (Microsoft Research Asia) | Bian, Jiang (Microsoft Research Asia) | Qin, Tao (Microsoft Research Asia) | Liu, Guiquan (University of Science and Technology of China) | Liu, Tie-Yan (Microsoft Research Asia)
Neural machine translation (NMT) heavily relies on parallel bilingual data for training. Since large-scale, high-quality parallel corpora are usually costly to collect, it is appealing to exploit monolingual corpora to improve NMT. Inspired by the law of total probability, which connects the probability of a given target-side monolingual sentence to the conditional probability of translating from a source sentence to the target one, we propose to explicitly exploit this connection to learn from and regularize the training of NMT models using monolingual data. The key technical challenge of this approach is that there are exponentially many source sentences for a target monolingual sentence while computing the sum of the conditional probability given each possible source sentence. We address this challenge by leveraging the dual translation model (target-to-source translation) to sample several mostly likely source-side sentences and avoid enumerating all possible candidate source sentences. That is, we transfer the knowledge contained in the dual model to boost the training of the primal model (source-to-target translation), and we call such an approach dual transfer learning. Experiment results on English-French and German-English tasks demonstrate that dual transfer learning achieves significant improvement over several strong baselines and obtains new state-of-the-art results.
Geometric Relationship between Word and Context Representations
Feng, Jiangtao (Fudan University) | Zheng, Xiaoqing (Fudan University)
Pre-trained distributed word representations have been proven to be useful in various natural language processing (NLP) tasks. However, the geometric basis of word representations and their relations to the representations of word's contexts has not been carefully studied yet. In this study, we first investigate such geometric relationship under a general framework, which is abstracted from some typical word representation learning approaches, and find out that only the directions of word representations are well associated to their context vector representations while the magnitudes are not. In order to make better use of the information contained in the magnitudes of word representations, we propose a hierarchical Gaussian model combined with maximum a posteriori estimation to learn word representations, and extend it to represent polysemous words. Our word representations have been evaluated on multiple NLP tasks, and the experimental results show that the proposed model achieved promising results, comparing to several popular word representations.
IMS-DTM: Incremental Multi-Scale Dynamic Topic Models
Chen, Xilun (Arizona State University) | Candan, K. Selcuk (Arizona State University) | Sapino, Maria Luisa (University of Torino)
Dynamic topic models (DTM) are commonly used for mining latent topics in evolving web corpora. In this paper, we note that a major limitation of the conventional DTM based models is that they assume a predetermined and fixed scale of topics. In reality, however, topics may have varying spans and topics of multiple scales can co-exist in a single web or social media data stream. Therefore, DTMs that assume a fixed epoch length may not be able to effectively capture latent topics and thus negatively affect accuracy. In this paper, we propose a Multi-Scale Dynamic Topic Model (MS-DTM) and a complementary Incremental Multi-Scale Dynamic Topic Model (IMS-DTM) inference method that can be used to capture latent topics and their dynamics simultaneously, at different scales. In this model, topic specific feature distributions are generated based on a multi-scale feature distribution of the previous epochs; moreover, multiple scales of the current epoch are analyzed together through a novel multi-scale incremental Gibbs sampling technique. We show that the proposed model significantly improves efficiency and effectiveness compared to the single scale dynamic DTMs and prior models that consider only multiple scales of the past.
A Poisson Gamma Probabilistic Model for Latent Node-Group Memberships in Dynamic Networks
Yang, Sikun (Technische Universitรคt Darmstadt) | Koeppl, Heinz (Technische Universitรคt Darmstadt)
We present a probabilistic model for learning from dynamic relational data, wherein the observed interactions among networked nodes are modeled via the Bernoulli Poisson link function, and the underlying network structure are characterized by nonnegative latent node-group memberships, which are assumed to be gamma distributed. The latent memberships evolve according to Markov processes.The optimal number of latent groups can be determined by data itself. The computational complexity of our method scales with the number of non-zero links, which makes it scalable to large sparse dynamic relational data. We present batch and online Gibbs sampling algorithms to perform model inference. Finally, we demonstrate the model's performance on both synthetic and real-world datasets compared to state-of-the-art methods.
HodgeRank With Information Maximization for Crowdsourced Pairwise Ranking Aggregation
Xu, Qianqian (Institute of Information Engineering, CAS, Beijing) | Xiong, Jiechao (BICMR and School of Mathematical Sciences, Peking University, Beijing) | Chen, Xi (BICMR and School of Mathematical Sciences, Peking University, Beijing) | Huang, Qingming (Stern School of Business, New York University) | Yao, Yuan (University of Chinese Academy of Sciences, Beijing)
Recently, crowdsourcing has emerged as an effective paradigm for human-powered large scale problem solving in various domains. However, task requester usually has a limited amount of budget, thus it is desirable to have a policy to wisely allocate the budget to achieve better quality. In this paper, we study the principle of information maximization for active sampling strategies in the framework of HodgeRank, an approach based on Hodge Decomposition of pairwise ranking data with multiple workers. The principle exhibits two scenarios of active sampling: Fisher information maximization that leads to unsupervised sampling based on a sequential maximization of graph algebraic connectivity without considering labels; and Bayesian information maximization that selects samples with the largest information gain from prior to posterior, which gives a supervised sampling involving the labels collected. Experiments show that the proposed methods boost the sampling efficiency as compared to traditional sampling schemes and are thus valuable to practical crowdsourcing experiments.
Cooperative Learning of Energy-Based Model and Latent Variable Model via MCMC Teaching
Xie, Jianwen (University of California, Los Angeles) | Lu, Yang (University of California, Los Angeles) | Gao, Ruiqi (University of California, Los Angeles) | Wu, Ying Nian (University of California, Los Angeles)
This paper proposes a cooperative learning algorithm to train both the undirected energy-based model and the directed latent variable model jointly. The learning algorithm interweaves the maximum likelihood algorithms for learning the two models, and each iteration consists of the following two steps: (1) Modified contrastive divergence for energy-based model: The learning of the energy-based model is based on the contrastive divergence, but the finite-step MCMC sampling of the model is initialized from the synthesized examples generated by the latent variable model instead of being initialized from the observed examples. (2) MCMC teaching of the latent variable model: The learning of the latent variable model is based on how the MCMC in (1) changes the initial synthesized examples generated by the latent variable model, where the latent variables that generate the initial synthesized examples are known so that the learning is essentially supervised. Our experiments show that the cooperative learning algorithm can learn realistic models of images.
Bayesian Functional Optimization
Vien, Ngo Anh (Queen's University Belfast) | Zimmermann, Heiko (Univeristy of Stuttgart) | Toussaint, Marc (Univeristy of Stuttgart)
Bayesian optimization (BayesOpt) is a derivative-free approach for sequentially optimizing stochastic black-box functions. Standard BayesOpt, which has shown many successes in machine learning applications, assumes a finite dimensional domain which often is a parametric space. The parameter space is defined by the features used in the function approximations which are often selected manually. Therefore, the performance of BayesOpt inevitably depends on the quality of chosen features. This paper proposes a new Bayesian optimization framework that is able to optimize directly on the domain of function spaces. The resulting framework, Bayesian Functional Optimization (BFO), not only extends the application domains of BayesOpt to functional optimization problems but also relaxes the performance dependency on the chosen parameter space. We model the domain of functions as a reproducing kernel Hilbert space (RKHS), and use the notion of Gaussian processes on a real separable Hilbert space. As a result, we are able to define traditional improvement-based (PI and EI) and optimistic acquisition functions (UCB) as functionals. We propose to optimize the acquisition functionals using analytic functional gradients that are also proved to be functions in a RKHS. We evaluate BFO in three typical functional optimization tasks: i) a synthetic functional optimization problem, ii) optimizing activation functions for a multi-layer perceptron neural network, and iii) a reinforcement learning task whose policies are modeled in RKHS.
Mixed Sum-Product Networks: A Deep Architecture for Hybrid Domains
Molina, Alejandro (TU Dortmund) | Vergari, Antonio (University of Bari) | Mauro, Nicola Di (University of Bari) | Natarajan, Sriraam (Indiana University) | Esposito, Floriana (University of Bari) | Kersting, Kristian (TU Darmstadt)
While all kinds of mixed data---from personal data, over panel and scientific data, to public and commercial data---are collected and stored, building probabilistic graphical models for these hybrid domains becomes more difficult. Users spend significant amounts of time in identifying the parametric form of the random variables (Gaussian, Poisson, Logit, etc.) involved and learning the mixed models. To make this difficult task easier, we propose the first trainable probabilistic deep architecture for hybrid domains that features tractable queries. It is based on Sum-Product Networks (SPNs) with piecewise polynomial leaf distributions together with novel nonparametric decomposition and conditioning steps using the Hirschfeld-Gebelein-Renyi Maximum Correlation Coefficient. This relieves the user from deciding a-priori the parametric form of the random variables but is still expressive enough to effectively approximate any distribution and permits efficient learning and inference.Our experiments show that the architecture, called Mixed SPNs, can indeed capture complex distributions across a wide range of hybrid domains.