dependency network
Deriving Language Models from Masked Language Models
Hennigen, Lucas Torroba, Kim, Yoon
Masked language models (MLM) do not explicitly define a distribution over language, i.e., they are not language models per se. However, recent work has implicitly treated them as such for the purposes of generation and scoring. This paper studies methods for deriving explicit joint distributions from MLMs, focusing on distributions over two tokens, which makes it possible to calculate exact distributional properties. We find that an approach based on identifying joints whose conditionals are closest to those of the MLM works well and outperforms existing Markov random field-based approaches. We further find that this derived model's conditionals can even occasionally outperform the original MLM's conditionals.
DiscoVars: A New Data Analysis Perspective -- Application in Variable Selection for Clustering
We present a new data analysis perspective to determine variable importance regardless of the underlying learning task. Traditionally, variable selection is considered an important step in supervised learning for both classification and regression problems. Variable selection also becomes critical when the costs of data collection and storage are high, as in remote sensing. We therefore propose a new methodology that selects important variables by first creating dependency networks among all variables and then ranking the variables (i.e., the nodes) by graph centrality measures. Selecting the top-$n$ variables according to a preferred centrality measure yields a strong candidate subset of variables for further learning tasks, e.g., clustering. We present our tool as a Shiny app, which is a user-friendly interface development environment. For comparison, we also extend the user interface to two well-known unsupervised variable selection methods from the literature.
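The selection procedure described above (build a dependency network over the variables, then rank nodes by a centrality measure) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the network is built or which centrality is used, so the least-squares edge weights, the weighted out-degree score, and the name `rank_variables` are all hypothetical choices, not the DiscoVars implementation.

```python
import numpy as np

def rank_variables(X, top_n):
    """Rank the columns of X by a centrality score in a dependency network.

    Assumptions (not from the source): edges come from per-variable
    least-squares regressions, and centrality is weighted out-degree.
    Columns are assumed to be non-constant.
    """
    X = np.asarray(X, dtype=float)
    # Standardize so coefficients are comparable across variables.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    n_vars = Z.shape[1]
    W = np.zeros((n_vars, n_vars))  # W[i, j]: weight of edge j -> i
    for i in range(n_vars):
        others = [j for j in range(n_vars) if j != i]
        # Regress variable i on all remaining variables.
        coef, *_ = np.linalg.lstsq(Z[:, others], Z[:, i], rcond=None)
        W[i, others] = np.abs(coef)
    # Weighted out-degree: how strongly a variable helps predict the others.
    centrality = W.sum(axis=0)
    order = np.argsort(-centrality)
    return order[:top_n].tolist(), centrality
```

Calling `rank_variables(X, 3)` on a data matrix returns the indices of the three highest-scoring columns together with the full score vector; any other centrality measure could be substituted for the column sum.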
Reconsidering Dependency Networks from an Information Geometry Perspective
Takabatake, Kazuya, Akaho, Shotaro
Dependency networks (Heckerman et al., 2000) are potential probabilistic graphical models for systems comprising a large number of variables. Like Bayesian networks, the structure of a dependency network is represented by a directed graph, and each node has a conditional probability table. Learning and inference are realized locally on individual nodes; therefore, computation remains tractable even with a large number of variables. However, the dependency network's learned distribution is the stationary distribution of a Markov chain called pseudo-Gibbs sampling and has no closed-form expressions. This technical disadvantage has impeded the development of dependency networks. In this paper, we consider a certain manifold for each node. Then, we can interpret pseudo-Gibbs sampling as iterative m-projections onto these manifolds. This interpretation provides a theoretical bound for the location where the stationary distribution of pseudo-Gibbs sampling exists in distribution space. Furthermore, this interpretation involves structure and parameter learning algorithms as optimization problems. In addition, we compare dependency and Bayesian networks experimentally. The results demonstrate that the dependency network and the Bayesian network have roughly the same performance in terms of the accuracy of their learned distributions. The results also show that the dependency network can learn much faster than the Bayesian network.
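To make the pseudo-Gibbs sampling mentioned above concrete, here is a minimal sketch for binary variables. It assumes each node exposes a local conditional as a plain function; the two conditionals, their numbers, and the name `pseudo_gibbs` are hypothetical and chosen only for illustration. The learned distribution is approximated by the empirical frequencies of the chain, reflecting the abstract's point that the stationary distribution has no closed form.

```python
import random
from collections import Counter

def pseudo_gibbs(conditionals, n_sweeps, burn_in=100, seed=0):
    """Estimate the stationary distribution of pseudo-Gibbs sampling.

    conditionals[i](state) must return P(x_i = 1 | the other entries of state).
    """
    rng = random.Random(seed)
    state = [0] * len(conditionals)
    counts = Counter()
    for sweep in range(burn_in + n_sweeps):
        # Resample every node in turn from its own local conditional.
        for i, cond in enumerate(conditionals):
            state[i] = 1 if rng.random() < cond(state) else 0
        if sweep >= burn_in:
            counts[tuple(state)] += 1
    return {s: c / n_sweeps for s, c in counts.items()}

# Hypothetical conditionals, as if each node had been fitted independently.
# Their implied odds ratios differ, so no single joint has both as conditionals.
conditionals = [
    lambda s: 0.9 if s[1] == 1 else 0.2,  # P(x0 = 1 | x1)
    lambda s: 0.7 if s[0] == 1 else 0.1,  # P(x1 = 1 | x0)
]
dist = pseudo_gibbs(conditionals, n_sweeps=20000)
```

Because the local conditionals are learned separately, the resulting `dist` generally depends on the order in which nodes are visited, which is one reason the stationary distribution is hard to characterize analytically.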
VAEM: a Deep Generative Model for Heterogeneous Mixed Type Data
Ma, Chao, Tschiatschek, Sebastian, Hernández-Lobato, José Miguel, Turner, Richard, Zhang, Cheng
Deep generative models often perform poorly in real-world applications due to the heterogeneity of natural data sets. Heterogeneity arises from data containing different types of features (categorical, ordinal, continuous, etc.) and features of the same type having different marginal distributions. We propose an extension of variational autoencoders (VAEs) called VAEM to handle such heterogeneous data. VAEM is a deep generative model that is trained in a two stage manner such that the first stage provides a more uniform representation of the data to the second stage, thereby sidestepping the problems caused by heterogeneous data. We provide extensions of VAEM to handle partially observed data, and demonstrate its performance in data generation, missing data prediction and sequential feature selection tasks. Our results show that VAEM broadens the range of real-world applications where deep generative models can be successfully deployed.
Predicting Program Properties from 'Big Code'
We present a new approach for predicting program properties from large codebases (aka "Big Code"). Our approach learns a probabilistic model from "Big Code" and uses this model to predict properties of new, unseen programs. The key idea of our work is to transform the program into a representation that allows us to formulate the problem of inferring program properties as structured prediction in machine learning. This enables us to leverage powerful probabilistic models such as Conditional Random Fields (CRFs) and perform joint prediction of program properties. As an example of our approach, we built a scalable prediction engine called JSNICE for solving two kinds of tasks in the context of JavaScript: predicting (syntactic) names of identifiers and predicting (semantic) type annotations of variables. Experimentally, JSNICE predicts correct names for 63% of name identifiers and its type annotation predictions are correct in 81% of cases. Since its public release at http://jsnice.org, JSNice has become a popular system with hundreds of thousands of uses. By formulating the problem of inferring program properties as structured prediction, our work opens up the possibility for a range of new "Big Code" applications such as de-obfuscators, decompilers, invariant generators, and others. Recent years have seen significant progress in the area of programming languages driven by advances in type systems, constraint solving, program analysis, and synthesis techniques. Fundamentally, these methods reason about each program in isolation and while powerful, the effectiveness of programming tools based on these techniques is approaching its inherent limits. Thus, a more disruptive change is needed if a significant improvement is to take place. At the same time, creating probabilistic models from large datasets (also called "Big Data") has transformed a number of areas such as natural language processing, computer vision, recommendation systems, and many others. 
However, despite the overwhelming success of "Big Data" in a variety of application domains, learning from large datasets of programs has previously not had tangible impact on programming tools. Yet, with the tremendous growth of publicly available source code in repositories such as GitHub and BitBucket (referred to as "Big Code" by a recent DARPA initiative) comes the opportunity to create new kinds of programming tools based on probabilistic models of such data.
Core Dependency Networks
Molina, Alejandro (TU Dortmund) | Munteanu, Alexander (TU Dortmund) | Kersting, Kristian (TU Darmstadt)
Many applications infer the structure of a probabilistic graphical model from data to elucidate the relationships between variables. But how can we train graphical models on a massive data set? In this paper, we show how to construct coresets (compressed data sets which can be used as proxy for the original data and have provably bounded worst case error) for Gaussian dependency networks (DNs), i.e., cyclic directed graphical models over Gaussians, where the parents of each variable are its Markov blanket. Specifically, we prove that Gaussian DNs admit coresets of size independent of the size of the data set. Unfortunately, this does not extend to DNs over members of the exponential family in general. As we will prove, Poisson DNs do not admit small coresets. Despite this worst-case result, we will provide an argument why our coreset construction for DNs can still work well in practice on count data. To corroborate our theoretical results, we empirically evaluated the resulting Core DNs on real data sets. The results demonstrate significant gains over no or naive sub-sampling, even in the case of count data.
VT: An Expert Elevator Designer That Uses Knowledge-Based Backtracking
Even least commitment systems such as MOLGEN (Stefik 1981a, 1981b) are sometimes forced to guess. In the course of designing genetics experiments, MOLGEN tries to avoid making a decision until all constraints that might affect the decision are known. In some cases, this postponement is not possible, and the system becomes stuck; none of the pending decisions can be made with complete confidence. In such a case, a decision based on partial information is needed, and such a decision might be wrong. In this case, a problem solver needs the ability either to backtrack to correct bad decisions or to maintain parallel solutions corresponding to the alternatives at the stuck decision point. However, if alternative guesses exist at each point, and there are many such decision points on each solution path, a commitment to examine every possible combination of alternatives proves unwieldy.
Coresets for Dependency Networks
Molina, Alejandro, Munteanu, Alexander, Kersting, Kristian
But how can we train graphical models on a massive data set? In this paper, we show how to construct coresets--compressed data sets which can be used as proxy for the original data and have provably bounded worst case error--for Gaussian dependency networks (DNs), i.e., cyclic directed graphical models over Gaussians, where the parents of each variable are its Markov blanket. Specifically, we prove that Gaussian DNs admit coresets of size independent of the size of the data set. Unfortunately, this does not extend to DNs over members of the exponential family in general. As we will prove, Poisson DNs do not admit small coresets. Despite this worst-case result, we will provide an argument why our coreset construction for DNs can still work well in practice on count data. To corroborate our theoretical results, we empirically evaluated the resulting Core DNs on real data sets. The results demonstrate significant gains over no or naive sub-sampling, even in the case of count data.