Markov Models


Model-Based Episodic Memory Induces Dynamic Hybrid Controls

arXiv.org Artificial Intelligence

Episodic control enables sample efficiency in reinforcement learning by recalling past experiences from an episodic memory. We propose a new model-based episodic memory of trajectories addressing current limitations of episodic control. Our memory estimates trajectory values, guiding the agent towards good policies. Built upon the memory, we construct a complementary learning model via a dynamic hybrid control unifying model-based, episodic and habitual learning into a single architecture. Experiments demonstrate that our model allows significantly faster and better learning than other strong reinforcement learning agents across a variety of environments including stochastic and non-Markovian settings.


Learning to Cooperate with Unseen Agent via Meta-Reinforcement Learning

arXiv.org Artificial Intelligence

The ad hoc teamwork problem describes situations where an agent has to cooperate with previously unseen agents to achieve a common goal. To succeed in these scenarios, an agent needs suitable cooperative skills. One could build such skills into an agent by using domain knowledge to design its behavior. However, in complex domains, domain knowledge might not be available. Therefore, it is worthwhile to explore how to learn cooperative skills directly from data. In this work, we apply a meta-reinforcement learning (meta-RL) formulation to the ad hoc teamwork problem. Our empirical results show that such a method can produce robust cooperative agents in two cooperative environments with different cooperative circumstances: social compliance and language interpretation. (This is the full-paper version of an extended abstract.)


Learning for Structured Prediction

#artificialintelligence

Structured prediction is an umbrella term for supervised machine learning techniques that predict structured objects rather than scalar discrete or real values. As with other commonly used supervised learning techniques, structured prediction models are normally trained on observed data, in which the true value is used to adjust the model parameters. Both prediction with a trained model and the training itself are frequently computationally infeasible.


Speech Recognition Transformation

#artificialintelligence

Voice technology has reached maturity. Speech recognition accuracy surpassed 95 percent in 2020, the same quality as normal communication between human beings, and the effect is now being felt. The latest Microsoft Windows update heavily promotes its voice feature, a mechanism that lets the user dictate messages at the speed of normal speech, which is four times faster than typing.


Efficient Learning of the Parameters of Non-Linear Models using Differentiable Resampling in Particle Filters

arXiv.org Machine Learning

It has been widely documented that the sampling and resampling steps in particle filters cannot be differentiated. The reparameterisation trick was introduced to allow the sampling step to be reformulated into a differentiable function. We extend the reparameterisation trick to include the stochastic input to resampling, thereby limiting the discontinuities in the gradient calculation after this step. Knowing the gradients of the prior and likelihood allows us to run particle Markov Chain Monte Carlo (p-MCMC) and use the No-U-Turn Sampler (NUTS) as the proposal when estimating parameters. We compare the Metropolis-adjusted Langevin algorithm (MALA), Hamiltonian Monte Carlo with different numbers of steps, and NUTS. We consider two state-space models and show that NUTS improves the mixing of the Markov chain and can produce more accurate results in less computational time.
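As a minimal illustration of the reparameterisation trick named in the abstract above (this is not the paper's particle filter code), a Gaussian sample can be written as a deterministic function of its parameters and an independently drawn noise term. With the noise held fixed, the sample is differentiable in the parameters. The function names and numeric values below are hypothetical, and the derivatives are checked by finite differences rather than by an autodiff library.

```python
import random

def reparameterised_sample(mu, sigma, eps):
    """Return a N(mu, sigma^2) sample as a deterministic function of
    (mu, sigma), given a noise draw eps ~ N(0, 1)."""
    return mu + sigma * eps

def finite_difference(f, x, h=1e-6):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)  # noise drawn once, independent of the parameters

mu, sigma = 1.5, 0.7       # hypothetical parameter values
x = reparameterised_sample(mu, sigma, eps)

# With eps held fixed, dx/dmu = 1 and dx/dsigma = eps.
dx_dmu = finite_difference(lambda m: reparameterised_sample(m, sigma, eps), mu)
dx_dsigma = finite_difference(lambda s: reparameterised_sample(mu, s, eps), sigma)
```

The abstract's contribution is to apply the same idea to the stochastic input of the resampling step; that step is not shown here.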


A Review of Dialogue Systems: From Trained Monkeys to Stochastic Parrots

arXiv.org Artificial Intelligence

In spoken dialogue systems, we aim to deploy artificial intelligence to build automated dialogue agents that can converse with humans. Dialogue systems are increasingly being designed to move beyond just imitating conversation and also improve from such interactions over time. In this survey, we present a broad overview of methods developed to build dialogue systems over the years. Different use cases for dialogue systems, ranging from task-based systems to open-domain chatbots, motivate and necessitate specific systems. Starting from simple rule-based systems, research has progressed towards increasingly complex architectures, such as deep learning systems, trained on massive corpora of data. Motivated by the intuition of resembling human dialogues, progress has been made towards incorporating emotions into the natural language generator using reinforcement learning. While we see a trend of highly marginal improvement on some metrics, we find that limited justification exists for the metrics, and evaluation practices are not uniform. To conclude, we flag these concerns and highlight possible research directions.


Learning to Explore by Reinforcement over High-Level Options

arXiv.org Artificial Intelligence

Autonomous 3D environment exploration is a fundamental task for various applications such as navigation. The goal of exploration is to investigate a new environment and build its occupancy map efficiently. In this paper, we propose a new method which grants an agent two intertwined options of behaviors: "look-around" and "frontier navigation". This is implemented by an option-critic architecture and trained by reinforcement learning algorithms. In each timestep, an agent produces an option and a corresponding action according to the policy. We also take advantage of macro-actions by incorporating classic path-planning techniques to increase training efficiency. We demonstrate the effectiveness of the proposed method on two publicly available 3D environment datasets and the results show our method achieves higher coverage than competing techniques with better efficiency.


Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning

arXiv.org Artificial Intelligence

Reinforcement learning (RL) is one of the most important paradigms in machine learning. What makes RL different from other paradigms is that it models the long-term effects in decision-making problems. For instance, in a finite-horizon Markov decision process (MDP), which is one of the most fundamental models for RL, an agent interacts with the environment for a total of H steps and receives a sequence of H random reward values, along with stochastic state transitions, as feedback. The goal of the agent is to find a policy that maximizes the expected sum of these reward values instead of any single one of them. Since decisions made at early stages could significantly impact the future, the agent must take possible future transitions into consideration when choosing the policy. On the other hand, when H = 1, RL reduces to the contextual bandits problem, in which it suffices to act myopically to achieve optimality. Due to the important role of the horizon length in RL, Jiang and Agarwal [JA18] propose to study how the sample complexity of RL depends on the horizon length. More formally, let us consider the episodic RL setting, where the horizon length is H and the underlying MDP has unknown and time-invariant transition probabilities and rewards.
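The contrast between acting myopically at H = 1 and planning ahead at larger H can be sketched with backward induction on a toy finite-horizon MDP. The two-state MDP below is made up for illustration and its transitions are deterministic for readability; it assumes known transition probabilities and rewards, unlike the learning setting the abstract studies.

```python
# Hypothetical toy MDP. P[s][a] is a list of (probability, next_state);
# R[s][a] is the expected immediate reward.
P = {
    0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
    1: {0: [(1.0, 1)], 1: [(1.0, 1)]},
}
R = {
    0: {0: 1.0, 1: 0.0},
    1: {0: 3.0, 1: 3.0},
}

def backward_induction(P, R, H):
    """Return (V, policy): V[h][s] is the optimal expected reward sum from
    step h in state s, and policy[h][s] is a maximizing action."""
    states = list(P)
    V = [{s: 0.0 for s in states} for _ in range(H + 1)]
    policy = [{} for _ in range(H)]
    for h in range(H - 1, -1, -1):
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in P[s]:
                # Immediate reward plus expected value of the remaining steps.
                q = R[s][a] + sum(p * V[h + 1][s2] for p, s2 in P[s][a])
                if q > best_q:
                    best_a, best_q = a, q
            V[h][s] = best_q
            policy[h][s] = best_a
    return V, policy

V1, pi1 = backward_induction(P, R, H=1)
V3, pi3 = backward_induction(P, R, H=3)
```

In this toy example, with H = 1 the best first action in state 0 is the one with the larger immediate reward (action 0), while with H = 3 it is better to give up that reward and move to state 1 (action 1).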


Fast Global Convergence of Policy Optimization for Constrained MDPs

arXiv.org Artificial Intelligence

We address the issue of safety in reinforcement learning. We pose the problem in a discounted infinite-horizon constrained Markov decision process framework. Existing results have shown that gradient-based methods are able to achieve an $\mathcal{O}(1/\sqrt{T})$ global convergence rate both for the optimality gap and the constraint violation. We exhibit a natural policy gradient-based algorithm that has a faster convergence rate $\mathcal{O}(\log(T)/T)$ for both the optimality gap and the constraint violation. When Slater's condition is satisfied and known a priori, zero constraint violation can be further guaranteed for a sufficiently large $T$ while maintaining the same convergence rate.
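To give a rough sense of the gap between the two rates, the following worked comparison ignores the constant factors hidden by the $\mathcal{O}$ notation and assumes the natural logarithm; the choice $T = 10^4$ is arbitrary and only illustrative.

```latex
\[
\frac{1}{\sqrt{T}}\bigg|_{T = 10^{4}} = 10^{-2},
\qquad
\frac{\log T}{T}\bigg|_{T = 10^{4}} = \frac{4 \ln 10}{10^{4}} \approx 9.2 \times 10^{-4}.
\]
```

Under these assumptions the faster rate gives a bound roughly ten times smaller at this $T$, and the ratio between the two bounds, $\sqrt{T}/\log T$, keeps growing as $T$ increases.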


Intrusion Prevention through Optimal Stopping

arXiv.org Artificial Intelligence

We study automated intrusion prevention using reinforcement learning. Following a novel approach, we formulate the problem of intrusion prevention as an (optimal) multiple stopping problem. This formulation gives us insight into the structure of optimal policies, which we show to have threshold properties. For most practical cases, it is not feasible to obtain an optimal defender policy using dynamic programming. We therefore develop a reinforcement learning approach to approximate an optimal policy. Our method for learning and validating policies includes two systems: a simulation system where defender policies are incrementally learned and an emulation system where statistics are produced that drive simulation runs and where learned policies are evaluated. We show that our approach can produce effective defender policies for a practical IT infrastructure of limited size. Inspection of the learned policies confirms that they exhibit threshold properties.
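The idea of a threshold policy for a multiple stopping problem can be sketched as follows. This is an illustrative sketch, not the paper's method: it assumes the defender tracks a scalar belief in [0, 1] that an intrusion is in progress and has a fixed budget of stop actions, with one threshold per remaining stop. The function names, the belief sequence, and the threshold values are all hypothetical.

```python
def threshold_policy(belief, stops_remaining, thresholds):
    """Return True (take a stop action) when the belief reaches the
    threshold associated with the number of stops remaining."""
    if stops_remaining <= 0:
        return False
    return belief >= thresholds[stops_remaining]

def run_episode(beliefs, max_stops, thresholds):
    """Apply the policy to a sequence of beliefs and return the time steps
    at which a stop action was taken."""
    stop_times = []
    stops_remaining = max_stops
    for t, b in enumerate(beliefs):
        if threshold_policy(b, stops_remaining, thresholds):
            stop_times.append(t)
            stops_remaining -= 1
    return stop_times

# Hypothetical thresholds, keyed by the number of stops remaining.
thresholds = {2: 0.5, 1: 0.8}
beliefs = [0.1, 0.3, 0.6, 0.7, 0.9, 0.95]
stop_times = run_episode(beliefs, max_stops=2, thresholds=thresholds)
```

With these made-up numbers the first stop is taken once the belief reaches 0.5 and the second once it reaches 0.8; after the budget is used up, no further stops are taken.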