Country
The reproducing Stein kernel approach for post-hoc corrected sampling
Hodgkinson, Liam, Salomone, Robert, Roosta, Fred
The reproducing Stein kernel approach for post-hoc corrected sampling Liam Hodgkinson 1,, Robert Salomone 2,, and Fred Roosta 3,, † 1 Department of Statistics, UC Berkeley, Berkeley, CA, 94720, USA. Abstract: Stein importance sampling [42] is a widely applicable technique based on kernelized Stein discrepancy [43], which corrects the output of approximate sampling algorithms by reweighting the empirical distribution of the samples. A general analysis of this technique is conducted for the previously unconsidered setting where samples are obtained via the simulation of a Markov chain, and applies to an arbitrary underlying Polish space. We prove that Stein importance sampling yields consistent estimators for quantities related to a target distribution of interest by using samples obtained from a geometrically ergodic Markov chain with a possibly unknown invariant measure that differs from the desired target. The approach is shown to be valid under conditions that are satisfied for a large number of unadjusted samplers, and is capable of retaining consistency when data subsampling is used. Along the way, a universal theory of reproducing Stein kernels is established, which enables the construction of kernelized Stein discrepancy on general Polish spaces, and provides sufficient conditions for kernels to be convergence-determining on such spaces. These results are of independent interest for the development of future methodology based on kernelized Stein discrepancies. 1. Introduction Our problem of interest is the efficient computation of integrals with respect to some target probability measure π . Adopting the Monte Carlo approach, π is approximated by an empirical distribution formed from samples drawn according to π . However, in many problems of interest, it is not possible to simulate according to π exactly, and so further approximate methods must be used. Arguably the most widely employed and general approach is Markov Chain Monte Carlo (MCMC); successively drawing samples as a realization of a Markov chain. The dominant approach to MCMC involves the simulation of a process that is π -ergodic, often constructed by the Metropolis-Hastings algorithm from an underlying irreducible and aperiodic Markov chain [58]. However, there has been significant recent interest in so-called unadjusted MCMC approaches [14, 19, 29, 45]. A common strategy with these methods is the approximate numer-All authors are supported in part by the Australian Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS), under Australian Research Council grant CE140100049. For the same computational effort, one can achieve substantially lower variance of estimates at the cost of incurring additional (asymptotic) bias. Despite poorer asymptotic guarantees [21], the ensuing Markov chains are often rapidly mixing, and perform particularly well in high dimensional settings [20].
Introduction of Quantification in Frame Semantics
Feature Structures (FSs) are a widespread tool used for decompositional frameworks of Attribute-Value associations. Even though they thrive in simple systems, they lack a way of representing higher-order entities and relations. This is however needed in Frame Semantics, where semantic dependencies should be able to connect groups of individuals and their properties, especially to model quantification. To answer this issue, this master report introduces wrappings as a way to envelop a sub-FS and treat it as a node. Following the work of [Kallmeyer, Osswald 2013], we extend its syntax, semantics and some properties (translation to FOL, subsumption, unification). We can then expand the proposed pipeline. Lexical minimal model sets are generated from formulas. They unify by FS value equations obtained by LTAG parsing to an underspecified sentence representation. The syntactic approach of quantifiers allows us to use existing methods to produce any possible reading. Finally, we give a transcription to type-logical formulas to interact with the context in the view of dynamic semantics. Supported by ideas of Frame Types, this system provides a workable and tractable tool for higher-order relations with FS.
Generation-Distillation for Efficient Natural Language Understanding in Low-Data Settings
Melas-Kyriazi, Luke, Han, George, Liang, Celine
Over the past year, the emergence of transfer learning with large-scale language models (LM) has led to dramatic performance improvements across a broad range of natural language understanding tasks. However, the size and memory footprint of these large LMs makes them difficult to deploy in many scenarios (e.g. on mobile phones). Recent research points to knowledge distillation as a potential solution, showing that when training data for a given task is abundant, it is possible to distill a large (teacher) LM into a small task-specific (student) network with minimal loss of performance. However, when such data is scarce, there remains a significant performance gap between large pretrained LMs and smaller task-specific models, even when training via distillation. In this paper, we bridge this gap with a novel training approach, called generation-distillation, that leverages large finetuned LMs in two ways: (1) to generate new (unlabeled) training examples, and (2) to distill their knowledge into a small network using these examples. Across three low-resource text classification datsets, we achieve comparable performance to BERT while using 300 fewer parameters, and we outperform prior approaches to distillation for text classification while using 3 fewer parameters. 1 Introduction Over the past year, rapid progress in unsupervised language representation learning has led to the development of increasingly powerful and gener-alizable language models (Radford et al., 2019; Devlin et al., 2018).
Inference in Multi-Layer Networks with Matrix-Valued Unknowns
Pandit, Parthe, Sahraee-Ardakan, Mojtaba, Rangan, Sundeep, Schniter, Philip, Fletcher, Alyson K.
We consider the problem of inferring the input and hidden variables of a stochastic multi-layer neural network from an observation of the output. The hidden variables in each layer are represented as matrices. This problem applies to signal recovery via deep generative prior models, multi-task and mixed regression and learning certain classes of two-layer neural networks. A unified approximation algorithm for both MAP and MMSE inference is proposed by extending a recently-developed Multi-Layer Vector Approximate Message Passing (ML-VAMP) algorithm to handle matrix-valued unknowns. It is shown that the performance of the proposed Multi-Layer Matrix VAMP (ML-Mat-VAMP) algorithm can be exactly predicted in a certain random large-system limit, where the dimensions $N\times d$ of the unknown quantities grow as $N\rightarrow\infty$ with $d$ fixed. In the two-layer neural-network learning problem, this scaling corresponds to the case where the number of input features and training samples grow to infinity but the number of hidden nodes stays fixed. The analysis enables a precise prediction of the parameter and test error of the learning.
Regime Switching Bandits
Zhou, Xiang, Chen, Ningyuan, Gao, Xuefeng, Xiong, Yi
We study a multi-armed bandit problem where the rewards exhibit regime-switching. Specifically, the distributions of the random rewards generated from all arms depend on a common underlying state modeled as a finite-state Markov chain. The agent does not observe the underlying state and has to learn the unknown transition probability matrix as well as the reward distribution. We propose an efficient learning algorithm for this problem, building on spectral method-of-moments estimations for hidden Markov models and upper confidence bound methods for reinforcement learning. We also establish $O(T^{2/3}\sqrt{\log T})$ bound on the regret of the proposed learning algorithm where $T$ is the unknown horizon. Finally, we conduct numerical experiments to illustrate the effectiveness of the learning algorithm.
GraphAF: a Flow-based Autoregressive Model for Molecular Graph Generation
Shi, Chence, Xu, Minkai, Zhu, Zhaocheng, Zhang, Weinan, Zhang, Ming, Tang, Jian
Molecular graph generation is a fundamental problem for drug discovery and has been attracting growing attention. The problem is challenging since it requires not only generating chemically valid molecular structures but also optimizing their chemical properties in the meantime. Inspired by the recent progress in deep generative models, in this paper we propose a flow-based autoregressive model for graph generation called GraphAF. GraphAF combines the advantages of both autoregressive and flow-based approaches and enjoys: (1) high model flexibility for data density estimation; (2) efficient parallel computation for training; (3) an iterative sampling process, which allows leveraging chemical domain knowledge for valency checking. Experimental results show that GraphAF is able to generate 68% chemically valid molecules even without chemical knowledge rules and 100% valid molecules with chemical rules. The training process of GraphAF is two times faster than the existing state-of-the-art approach GCPN. After fine-tuning the model for goal-directed property optimization with reinforcement learning, GraphAF achieves state-of-the-art performance on both chemical property optimization and constrained property optimization.
Particle-Gibbs Sampling For Bayesian Feature Allocation Models
Bouchard-Côté, Alexandre, Roth, Andrew
Bayesian feature allocation models are a popular tool for modelling data with a combinatorial latent structure. Exact inference in these models is generally intractable and so practitioners typically apply Markov Chain Monte Carlo (MCMC) methods for posterior inference. The most widely used MCMC strategies rely on an element wise Gibbs update of the feature allocation matrix. These element wise updates can be inefficient as features are typically strongly correlated. To overcome this problem we have developed a Gibbs sampler that can update an entire row of the feature allocation matrix in a single move. However, this sampler is impractical for models with a large number of features as the computational complexity scales exponentially in the number of features. We develop a Particle Gibbs sampler that targets the same distribution as the row wise Gibbs updates, but has computational complexity that only grows linearly in the number of features. We compare the performance of our proposed methods to the standard Gibbs sampler using synthetic data from a range of feature allocation models. Our results suggest that row wise updates using the PG methodology can significantly improve the performance of samplers for feature allocation models.
Robust Submodular Minimization with Applications to Cooperative Modeling
Robust Optimization is becoming increasingly important in machine learning applications. This paper studies the problem of robust submodular minimization subject to combinatorial constraints. Constrained Submodular Minimization arises in several applications such as co-operative cuts in image segmentation, co-operative matchings in image correspondence, etc. Many of these models are defined over clusterings of data points (for example pixels in images), and it is important for these models to be robust to perturbations and uncertainty in the data. While several existing papers have studied robust submodular maximization, ours is the first work to study the minimization version under a broad range of combinatorial constraints including cardinality, knapsack, matroid as well as graph-based constraints such as cuts, paths, matchings, and trees. In each case, we provide scalable approximation algorithms and also study hardness bounds. Finally, we empirically demonstrate the utility of our algorithms on synthetic and real-world datasets.
COR-GAN: Correlation-Capturing Convolutional Neural Networks for Generating Synthetic Healthcare Records
Torfi, Amirsina, Fox, Edward A.
Deep learning models have demonstrated high-quality performance in areas such as image classification and speech processing. However, creating a deep learning model using electronic health record (EHR) data, requires addressing particular privacy challenges that are unique to researchers in this domain. This matter focuses attention on generating realistic synthetic data while ensuring privacy. In this paper, we propose a novel framework called correlation-capturing Generative Adversarial Network (corGAN), to generate synthetic healthcare records. In corGAN we utilize Convolutional Neural Networks to capture the correlations between adjacent medical features in the data representation space by combining Convolutional Generative Adversarial Networks and Convolutional Autoencoders. To demonstrate the model fidelity, we show that corGAN generates synthetic data with performance similar to that of real data in various Machine Learning settings such as classification and prediction. We also give a privacy assessment and report on statistical analysis regarding realistic characteristics of the synthetic data. The software of this work is open-source and is available at: https://github.com/astorfi/cor-gan.
An Analysis of Word2Vec for the Italian Language
Di Gennaro, Giovanni, Buonanno, Amedeo, Di Girolamo, Antonio, Ospedale, Armando, Palmieri, Francesco A. N., Fedele, Gianfranco
Word representation is fundamental in NLP tasks, because it is precisely from the coding of semantic closeness between words that it is possible to think of teaching a machine to understand text. Despite the spread of word embedding concepts, still few are the achievements in linguistic contexts other than English. In this work, analysing the semantic capacity of the Word2Vec algorithm, an embedding for the Italian language is produced. Parameter setting such as the number of epochs, the size of the context window and the number of negatively backpropagated samples is explored. Keywords: Word2Vec, Word Embedding, NLP 1 Introduction In order to make human language comprehensible to a computer, it is obviously essential to provide some word encoding. The simplest approach is the one-hot encoding, where each word is represented by a sparse vector with dimension equal to the vocabulary size.