Markov Models
Language Generation via Combinatorial Constraint Satisfaction: A Tree Search Enhanced Monte-Carlo Approach
Zhang, Maosen, Jiang, Nan, Li, Lei, Xue, Yexiang
Generating natural language under complex constraints is a principled formulation towards controllable text generation. We present a framework to allow specification of combinatorial constraints for sentence generation. We propose TSMH, an efficient method to generate high likelihood sentences with respect to a pre-trained language model while satisfying the constraints. Our approach is highly flexible, requires no task-specific training, and leverages efficient constraint satisfaction solving techniques. To better handle the combinatorial constraints, a tree search algorithm is embedded into the proposal process of the Markov chain Monte Carlo (MCMC) to explore candidates that satisfy more constraints. Compared to existing MCMC approaches, our sampling approach has a better mixing performance. Experiments show that TSMH achieves consistent and significant improvement on multiple language generation tasks.
12th Asian Conference on Machine Learning
Last week saw the virtual running of the 12th Asian Conference on Machine Learning (ACML). The event had been due to be held in Thailand, but instead went online and the organisers decided to make all content freely available. You can watch all of the invited talks, tutorials, workshops, and video presentations of the contributed papers. Also, find out who won the conference awards. There were four invited speakers: Suriya Gunasekar (Microsoft Research, USA) Rethinking the role of optimization in learning This talk presented an overview of recent results towards understanding how we learn large capacity machine learning models.
Diluted Near-Optimal Expert Demonstrations for Guiding Dialogue Stochastic Policy Optimisation
Cordier, Thibault, Urvoy, Tanguy, Rojas-Barahona, Lina M., Lefèvre, Fabrice
These interactions can be taken from either human-to-human or human-machine conversations. However, human interactions are scarce and costly, making learning from few interactions essential. One solution to speedup the learning process is to guide the agent's exploration with the help of an expert. We present in this paper several imitation learning strategies for dialogue policy where the guiding expert is a near-optimal handcrafted policy. We incorporate these strategies with state-of-the-art reinforcement learning methods based on Q-learning and actorcritic. We notably propose a randomised exploration policy which allows for a seamless hybridisation of the learned policy and the expert, which can be seen as a dilution of the expert's demonstration into the resulting policy. Our experiments show that our hybridisation strategy outperforms several baselines, and that it could accelerate the learning when facing real humans.
Consistency-aware and Inconsistency-aware Graph-based Multi-view Clustering
Horie, Mitsuhiko, Kasai, Hiroyuki
Multi-view data analysis has gained increasing popularity because multi-view data are frequently encountered in machine learning applications. A simple but promising approach for clustering of multi-view data is multi-view clustering (MVC), which has been developed extensively to classify given subjects into some clustered groups by learning latent common features that are shared across multi-view data. Among existing approaches, graph-based multi-view clustering (GMVC) achieves state-of-the-art performance by leveraging a shared graph matrix called the unified matrix. However, existing methods including GMVC do not explicitly address inconsistent parts of input graph matrices. Consequently, they are adversely affected by unacceptable clustering performance. To this end, this paper proposes a new GMVC method that incorporates consistent and inconsistent parts lying across multiple views. This proposal is designated as CI-GMVC. Numerical evaluations of real-world datasets demonstrate the effectiveness of the proposed CI-GMVC.
Discovering Causal Structure with Reproducing-Kernel Hilbert Space $\epsilon$-Machines
Brodu, Nicolas, Crutchfield, James P.
We merge computational mechanics' definition of causal states (predictively-equivalent histories) with reproducing-kernel Hilbert space (RKHS) representation inference. The result is a widely-applicable method that infers causal structure directly from observations of a system's behaviors whether they are over discrete or continuous events or time. A structural representation -- a finite- or infinite-state kernel $\epsilon$-machine -- is extracted by a reduced-dimension transform that gives an efficient representation of causal states and their topology. In this way, the system dynamics are represented by a stochastic (ordinary or partial) differential equation that acts on causal states. We introduce an algorithm to estimate the associated evolution operator. Paralleling the Fokker-Plank equation, it efficiently evolves causal-state distributions and makes predictions in the original data space via an RKHS functional mapping. We demonstrate these techniques, together with their predictive abilities, on discrete-time, discrete-value infinite Markov-order processes generated by finite-state hidden Markov models with (i) finite or (ii) uncountably-infinite causal states and (iii) a continuous-time, continuous-value process generated by a thermally-driven chaotic flow. The method robustly estimates causal structure in the presence of varying external and measurement noise levels.
Logarithmic Regret for Reinforcement Learning with Linear Function Approximation
He, Jiafan, Zhou, Dongruo, Gu, Quanquan
Designing efficient algorithms that learn and plan in sequential decision-making tasks with large state and action spaces has become a central task of modern reinforcement learning (RL) in recent years. RL often assumes the environment as a Markov Decision Process (MDP), described by a tuple of state space, action space, reward function, and transition probability function. Due to a large number of possible states and actions, traditional tabular reinforcement learning methods such as Q-learning (Watkins, 1989), which directly access each state-action pair, are computationally intractable. A common approach to cope with high-dimensional state and action spaces is to utilize feature mappings such as linear functions or neural networks to map states and actions to a low-dimensional space. Recently, a large body of literature has been devoted to provide regret bounds for online RL with linear function approximation. These works can be divided into two main categories. The first category of works is of model-free style, which directly parameterizes the action-value function as a linear function of some given feature mapping. For instance, Jin et al. (2020) studied the episodic MDPs with linear MDP assumption, which assumes that both transition probability function and reward function can be represented as a linear function of a given feature mapping.
Introduction to Reinforcement Learning (RL) -- Part 4 -- "Dynamic Programming"
Starting in this chapter, the assumption is that the environment is a finite Markov Decision Process (finite MDP). In this chapter we'll see how we can use DP algorithms to compute the value functions in a slightly different, less intractable way. The general idea is to take these 2 equations, and turn them into update rules for for improving the approximations of our value functions. It will make more sense later on. Policy Evaluation Policy evaluation means computing the state-value function Vπ for an arbitrary policy π.
Graph Signal Recovery Using Restricted Boltzmann Machines
Mohan, Ankith, Nakano, Aiichiro, Ferrara, Emilio
We propose a model-agnostic pipeline to recover graph signals from an expert system by exploiting the content addressable memory property of restricted Boltzmann machine and the representational ability of a neural network. The proposed pipeline requires the deep neural network that is trained on a downward machine learning task with clean data, data which is free from any form of corruption or incompletion. We show that denoising the representations learned by the deep neural networks is usually more effective than denoising the data itself. Although this pipeline can deal with noise in any dataset, it is particularly effective for graph-structured datasets.
Concentration inequality for U-statistics of order two for uniformly ergodic Markov chains, and applications
Duchemin, Quentin, de Castro, Yohann, Lacour, Claire
We prove a new concentration inequality for U-statistics of order two for uniformly ergodic Markov chains. Working with bounded $\pi$-canonical kernels, we show that we can recover the convergence rate of Arcones and Gine (1993) who proved a concentration result for U-statistics of independent random variables and canonical kernels. Our proof relies on an inductive analysis where we use martingale techniques, uniform ergodicity, Nummelin splitting and Bernstein's type inequality where the spectral gap of the chain emerges. Our result allows us to conduct three applications. First, we establish a new exponential inequality for the estimation of spectra of trace class integral operators with MCMC methods. The novelty is that this result holds for kernels with positive and negative eigenvalues, which is new as far as we know. In addition, we investigate generalization performance of online algorithms working with pairwise loss functions and Markov chain samples. We provide an online-to-batch conversion result by showing how we can extract a low risk hypothesis from the sequence of hypotheses generated by any online learner. We finally give a non-asymptotic analysis of a goodness-of-fit test on the density of the invariant measure of a Markov chain. We identify the classes of alternatives over which our test based on the L2 distance has a prescribed power.
Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge?
de Witt, Christian Schroeder, Gupta, Tarun, Makoviichuk, Denys, Makoviychuk, Viktor, Torr, Philip H. S., Sun, Mingfei, Whiteson, Shimon
Most recently developed approaches to cooperative multi-agent reinforcement learning in the \emph{centralized training with decentralized execution} setting involve estimating a centralized, joint value function. In this paper, we demonstrate that, despite its various theoretical shortcomings, Independent PPO (IPPO), a form of independent learning in which each agent simply estimates its local value function, can perform just as well as or better than state-of-the-art joint learning approaches on popular multi-agent benchmark suite SMAC with little hyperparameter tuning. We also compare IPPO to several variants; the results suggest that IPPO's strong performance may be due to its robustness to some forms of environment non-stationarity.