Undirected Networks
Estimating Disentangled Belief about Hidden State and Hidden Task for Meta-RL
Akuzawa, Kei, Iwasawa, Yusuke, Matsuo, Yutaka
There is considerable interest in designing meta-reinforcement learning (meta-RL) algorithms, which enable autonomous agents to adapt to new tasks from a small amount of experience. In meta-RL, the specification (such as the reward function) of the current task is hidden from the agent. In addition, states are hidden within each task owing to sensor noise or limitations in realistic environments. Therefore, the meta-RL agent faces the challenge of identifying both the hidden task and the hidden states from a small amount of experience. To address this, we propose estimating a disentangled belief about task and states, leveraging an inductive bias that the task and states can be regarded as global and local features of each task. Specifically, we train a hierarchical state-space model (HSSM) parameterized by deep neural networks as an environment model, whose global and local latent variables correspond to task and states, respectively. Because the HSSM does not allow analytical computation of the posterior distribution, i.e., the belief, we employ amortized inference to approximate it. After the belief is obtained, we can augment the observations of a model-free policy with the belief to train the policy efficiently. Moreover, because task and state information are factorized and interpretable, downstream policy training is easier than with prior methods that did not consider the hierarchical nature. Empirical validations on a GridWorld environment confirm that the HSSM can separate the hidden task and state information. Then, we compare the meta-RL agent with the HSSM to prior meta-RL methods in MuJoCo environments, and confirm that our agent requires less training data and reaches higher final performance.
A cell type-specific cortico-subcortical brain circuit for investigatory and novelty-seeking behavior
Curiosity is what drives organisms to investigate each other and their environment. It is considered by many to be as intrinsic as hunger and thirst, but the neurobiological mechanisms behind curiosity have remained elusive. In mice, Ahmadlou et al. found that a specific population of genetically identified γ-aminobutyric acid (GABA)-ergic neurons in a brain region called the zona incerta receive excitatory input in the form of novelty and/or arousal information from the prelimbic cortex, and these neurons send inhibitory projections to the periaqueductal gray region (see the Perspective by Farahbakhsh and Siciliano). This circuitry is necessary for the exploration of new objects and conspecifics. Science, this issue p. [eabe9681][1]; see also p. [684][2] ### INTRODUCTION Motivational drives are internal states that can be different even in similar interactions with external stimuli. Curiosity, the motivational drive for novelty seeking and for investigating the surrounding environment, is as essential and intrinsic to survival as hunger. Curiosity, hunger, and appetitive aggression drive three different goal-directed behaviors (novelty seeking, food eating, and hunting), but these behaviors are composed of similar actions in animals. This similarity of actions has made it challenging to study novelty seeking and distinguish it from eating and hunting in nonarticulating animals. The brain mechanisms underlying this basic survival drive, curiosity, and novelty-seeking behavior have remained unclear. ### RATIONALE In spite of having well-developed techniques to study mouse brain circuits, there are many controversial and different results in the field of motivational behavior. This has left the functions of motivational brain regions such as the zona incerta (ZI) still uncertain. Not having a transparent, nonreinforced, and easily replicable paradigm is one of the main causes of this uncertainty.
Therefore, we chose a simple solution to conduct our research: giving the mouse freedom to choose what it wants (double free-access choice). By examining mice in an experimental battery of object free-access double-choice (FADC) and social interaction tests, using optogenetics, chemogenetics, calcium fiber photometry, multichannel recording electrophysiology, and multicolor mRNA in situ hybridization, we uncovered a cell type-specific cortico-subcortical brain circuit of the curiosity and novelty-seeking behavior. ### RESULTS We analyzed the transitions within action sequences in object FADC and social interaction tests. Frequency and hidden Markov model analyses showed that mice choose different action sequences in interaction with novel objects and in early periods of interaction with novel conspecifics compared with interaction with familiar objects or later periods of interaction with conspecifics, which we categorized as deep and shallow investigation, respectively. This finding helped us to define a measure of depth of investigation that indicates how much a mouse prefers deep over shallow investigation and reflects the mouse's motivational level to investigate, regardless of total duration of investigation. Optogenetic activation of inhibitory neurons in medial ZI (ZIm), ZImGAD2 neurons, showed a dramatic increase in positive arousal level, depth of investigation, and duration of interaction with conspecifics and novel objects compared with familiar objects, crickets, and food. Optogenetic or chemogenetic deactivation of these neurons decreased depth and duration of investigation. Moreover, we found that ZImGAD2 neurons are more active during deep investigation than during shallow investigation. We found that activation of prelimbic cortex (PL) axons into ZIm increases arousal level, and chemogenetic deactivation of these axons decreases the duration and depth of investigation.
Calcium fiber photometry of these axons showed no difference in activity between shallow and deep investigation, suggesting a nonspecific motivation. Optogenetic activation of ZImGAD2 axons into lateral periaqueductal gray (lPAG) increases the arousal level, whereas chemogenetic deactivation of these axons decreases duration and depth of investigation. Calcium fiber photometry of these axons showed high activity during deep investigation and no significant activity during shallow investigation, suggesting a thresholding mechanism. Last, we found a new subpopulation of inhibitory neurons in ZIm expressing tachykinin 1 (TAC1) that monosynaptically receive PL inputs and project to lPAG. Optogenetic activation and deactivation of these neurons, respectively, increased and decreased depth and duration of investigation. ### CONCLUSION Our experiments revealed different action sequences based on the motivational level of novelty seeking. Moreover, we uncovered a new brain circuit underlying curiosity and novelty-seeking behavior, connecting excitatory neurons of PL to lPAG through TAC1+ inhibitory neurons of ZIm. Figure: Brain mechanism of curiosity. (A) How we mapped motivational level to action sequences. (B) Experimental battery to distinguish novelty-seeking behavior from food eating and hunting in mice with photoactivation of ZImGAD2 neurons. (C) Schematic of calcium activity in PL→ZIm, ZIm, and ZIm→PAG during shallow and deep investigation. (D) TAC1+ neurons as a subpopulation of ZImGAD2 neurons receive input from PL and project to PAG. HMM, hidden Markov model. Exploring the physical and social environment is essential for understanding the surrounding world. We do not know how novelty-seeking motivation initiates the complex sequence of actions that make up investigatory behavior.
We found in mice that inhibitory neurons in the medial zona incerta (ZIm), a subthalamic brain region, are essential for the decision to investigate an object or a conspecific. These neurons receive excitatory input from the prelimbic cortex to signal the initiation of exploration. This signal is modulated in the ZIm by the level of investigatory motivation. Increased activity in the ZIm instigates deep investigative action by inhibiting the periaqueductal gray region. A subpopulation of inhibitory ZIm neurons expressing tachykinin 1 (TAC1) modulates the investigatory behavior. [1]: /lookup/doi/10.1126/science.abe9681 [2]: /lookup/doi/10.1126/science.abi7270
Reinforcement Learning Based Safe Decision Making for Highway Autonomous Driving
Mohammadhasani, Arash, Mehrivash, Hamed, Lynch, Alan, Shu, Zhan
In this paper, we develop a safe decision-making method for self-driving cars in a multi-lane, single-agent setting. The proposed approach utilizes deep reinforcement learning (RL) to achieve a high-level policy for safe tactical decision-making. We address two major challenges that arise solely in autonomous navigation. First, the proposed algorithm ensures that collisions never happen, and therefore accelerates the learning process. Second, the proposed algorithm takes into account the unobservable states in the environment. These states appear mainly due to the unpredictable behavior of other agents, such as cars and pedestrians, and make the Markov Decision Process (MDP) problematic when dealing with autonomous navigation. Simulations from a well-known self-driving car simulator demonstrate the applicability of the proposed method.
Identity testing of reversible Markov chains
We consider the problem of identity testing of Markov chains based on a single trajectory of observations under the distance notion introduced by Daskalakis et al. [2018a] and further analyzed by Cherapanamjeri and Bartlett [2019]. Both works made the restrictive assumption that the Markov chains under consideration are symmetric. In this work we relax the symmetry assumption to the more natural assumption of reversibility, still assuming that both the reference and the unknown Markov chains share the same stationary distribution.
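To make the difference between the two assumptions concrete, here is a minimal sketch of the detailed-balance condition that defines reversibility, pi_i P_ij = pi_j P_ji. It is only an illustration and not the testing procedure of the paper; the function names and the two example chains are hypothetical. A symmetric chain is the special case in which the stationary distribution is uniform.

```python
def is_stationary(P, pi, tol=1e-9):
    """Check that pi P = pi for a row-stochastic matrix P (list of rows)."""
    n = len(P)
    return all(
        abs(sum(pi[i] * P[i][j] for i in range(n)) - pi[j]) <= tol
        for j in range(n)
    )


def is_reversible(P, pi, tol=1e-9):
    """Check detailed balance: pi[i] * P[i][j] == pi[j] * P[j][i] for all i, j."""
    n = len(P)
    return all(
        abs(pi[i] * P[i][j] - pi[j] * P[j][i]) <= tol
        for i in range(n)
        for j in range(i + 1, n)
    )


def is_symmetric(P, tol=1e-9):
    """Check P[i][j] == P[j][i] for all i, j."""
    n = len(P)
    return all(abs(P[i][j] - P[j][i]) <= tol for i in range(n) for j in range(i + 1, n))


# Hypothetical example 1: a birth-death chain, reversible but not symmetric.
birth_death = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
birth_death_pi = [0.25, 0.50, 0.25]

# Hypothetical example 2: a biased cycle. Its stationary distribution is
# uniform, but probability flows mostly in one direction, so detailed
# balance fails.
cycle = [
    [0.0, 0.9, 0.1],
    [0.1, 0.0, 0.9],
    [0.9, 0.1, 0.0],
]
cycle_pi = [1 / 3, 1 / 3, 1 / 3]
```

The birth-death chain passes the reversibility check even though its transition matrix is not symmetric, which is the kind of chain the relaxed assumption admits.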
Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech
Popov, Vadim, Vovk, Ivan, Gogoryan, Vladimir, Sadekova, Tasnima, Kudinov, Mikhail
Recently, denoising diffusion probabilistic models and generative score matching have shown high potential in modelling complex data distributions, while stochastic calculus has provided a unified point of view on these techniques, allowing for flexible inference schemes. In this paper we introduce Grad-TTS, a novel text-to-speech model with a score-based decoder that produces mel-spectrograms by gradually transforming noise predicted by the encoder and aligned with the text input by means of Monotonic Alignment Search. The framework of stochastic differential equations helps us to generalize conventional diffusion probabilistic models to the case of reconstructing data from noise with different parameters, and makes this reconstruction flexible by explicitly controlling the trade-off between sound quality and inference speed. Subjective human evaluation shows that Grad-TTS is competitive with state-of-the-art text-to-speech approaches in terms of Mean Opinion Score. We will make the code publicly available shortly.
Recent Advances in Deep Learning-based Dialogue Systems
Ni, Jinjie, Young, Tom, Pandelea, Vlad, Xue, Fuzhao, Adiga, Vinay, Cambria, Erik
Dialogue systems are a popular Natural Language Processing (NLP) task because they are promising in real-life applications. It is also a complicated task, since it involves many NLP tasks that deserve study. As a result, a multitude of novel works on this task have been carried out, and most of them are deep learning-based owing to their outstanding performance. In this survey, we mainly focus on deep learning-based dialogue systems. We comprehensively review state-of-the-art research outcomes in dialogue systems and analyze them from two angles: model type and system type. Specifically, from the angle of model type, we discuss the principles, characteristics, and applications of different models that are widely used in dialogue systems. This will help researchers become acquainted with these models and see how they are applied in state-of-the-art frameworks, which is helpful when designing a new dialogue system. From the angle of system type, we discuss task-oriented and open-domain dialogue systems as two streams of research, providing insight into the related hot topics. Furthermore, we comprehensively review the evaluation methods and datasets for dialogue systems to pave the way for future research. Finally, some possible research trends are identified based on the recent research outcomes. To the best of our knowledge, this survey is the most comprehensive and up-to-date one at present in the area of dialogue systems and dialogue-related tasks, extensively covering the popular frameworks, topics, and datasets. Keywords: Dialogue Systems, Chatbots, Conversational AI, Task-oriented, Open Domain, Chit-chat, Question Answering, Artificial Intelligence, Natural Language Processing, Information Retrieval, Deep Learning, Neural Networks, CNN, RNN, Hierarchical Recurrent Encoder-Decoder, Memory Networks, Attention, Transformer, Pointer Net, CopyNet, Reinforcement Learning, GANs, Knowledge Graph, Survey, Review
MMGET: A Markov model for generalized evidence theory
In real life, lots of information merges from time to time. To appropriately describe the actual situations, many theories have been proposed. Among them, Dempster-Shafer evidence theory is a very useful tool for managing uncertain information. To better adapt to the complex situations of the open world, a generalized evidence theory was designed. However, everything occurs in sequence and has some underlying relationships with everything else. In order to further embody the details of information and better conform to real-world situations, a Markov model is introduced into the generalized evidence theory, which helps extract the complete information volume from the evidence provided. In addition, some numerical examples are offered to verify the correctness and rationality of the proposed method.
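For readers unfamiliar with the background, the following is a minimal sketch of the classical Dempster combination rule from Dempster-Shafer evidence theory. It is not the generalized evidence theory or the Markov model of this paper; the function name and the example mass functions are hypothetical. Mass functions are assumed to be dictionaries mapping `frozenset` focal elements to masses that sum to 1.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions with the classical Dempster rule.

    Returns the normalized combined mass function and the conflict K,
    i.e. the total mass assigned to pairs of disjoint focal elements.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: the classical rule is undefined")
    normalizer = 1.0 - conflict
    return {k: v / normalizer for k, v in combined.items()}, conflict


# Hypothetical example over the frame {A, B}.
A, B, AB = frozenset("A"), frozenset("B"), frozenset("AB")
m1 = {A: 0.6, AB: 0.4}
m2 = {B: 0.5, AB: 0.5}
fused, K = dempster_combine(m1, m2)
```

The classical rule discards the conflicting mass K through normalization, which presumes a closed world; the abstract's motivation is to handle open-world situations that this rule does not model.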
Characterizing Uniform Convergence in Offline Policy Evaluation via model-based approach: Offline Learning, Task-Agnostic and Reward-Free
We study the statistical limits of uniform convergence for offline policy evaluation (OPE) problems (uniform OPE for short) with model-based methods under the episodic MDP setting. Uniform OPE $\sup_\Pi|Q^\pi-\hat{Q}^\pi|<\epsilon$ (initiated by Yin et al. 2021) is a stronger measure than point-wise (fixed policy) OPE and ensures offline policy learning when $\Pi$ contains all policies (we call it the global policy class). In this paper, we establish an $\Omega(H^2 S/d_m\epsilon^2)$ lower bound (over the model-based family) for global uniform OPE, where $d_m$ is the minimal state-action distribution induced by the behavior policy. The order $S/d_m\epsilon^2$ reveals that the global uniform OPE task is intrinsically harder than offline policy learning due to the extra $S$ factor. Next, our main result establishes an episode complexity of $\tilde{O}(H^2/d_m\epsilon^2)$ for \emph{local} uniform convergence that applies to all \emph{near-empirically optimal} policies for MDPs with \emph{stationary} transitions. The result implies the optimal sample complexity for offline learning and separates local uniform OPE from the global case. Most importantly, the model-based method combined with our new analysis technique (singleton absorbing MDP) can be adapted to two new settings, offline task-agnostic and offline reward-free, with optimal complexity $\tilde{O}(H^2\log(K)/d_m\epsilon^2)$ ($K$ is the number of tasks) and $\tilde{O}(H^2S/d_m\epsilon^2)$ respectively, which provides a unified framework for simultaneously solving different offline RL problems.
Online POMDP Planning via Simplification
Sztyglic, Ori, Indelman, Vadim
In this paper, we consider online planning in partially observable domains. Solving the corresponding POMDP problem is a very challenging task, particularly in an online setting. Our key contribution is a novel algorithmic approach, Simplified Information Theoretic Belief Space Planning (SITH-BSP), which aims to speed up POMDP planning with belief-dependent rewards, without compromising the solution's accuracy. We do so by mathematically relating the simplified elements of the problem to the corresponding counterparts of the original problem. Specifically, we focus on belief simplification and use it to formulate bounds on the corresponding original belief-dependent rewards. These bounds in turn are used to perform branch pruning over the belief tree, in the process of calculating the optimal policy. We further introduce the notion of adaptive simplification, re-using calculations between different simplification levels, and exploit it to prune, at each level in the belief tree, all branches but one. Therefore, our approach is guaranteed to find the optimal solution of the original problem but with substantial speedup. As a second key contribution, we derive novel analytical bounds for differential entropy, considering a sampling-based belief representation, which we believe are of interest on their own. We validate our approach in simulation using these bounds, with simplification corresponding to reducing the number of samples, and observe a significant computational speedup while still obtaining the optimal solution.
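The bound-based pruning idea can be illustrated with a small sketch. This is a generic illustration under stated assumptions, not the SITH-BSP algorithm itself: the function names, the action labels, and the numeric bounds are all hypothetical, and each simplification level is assumed to supply valid lower and upper bounds on the value of every candidate branch.

```python
def prune_branches(bounds):
    """Keep only branches whose upper bound reaches the best lower bound.

    bounds maps a branch label to a (lower, upper) pair. A branch whose
    upper bound lies below another branch's lower bound cannot be optimal.
    """
    best_lower = max(lower for lower, _ in bounds.values())
    return {
        branch: (lower, upper)
        for branch, (lower, upper) in bounds.items()
        if upper >= best_lower
    }


def select_branches(bound_levels):
    """Prune with progressively tighter bounds until one branch remains.

    bound_levels is a list of bounds dictionaries ordered from the
    coarsest (cheapest) to the finest simplification level. Only branches
    that survived the previous level are examined at the next level.
    """
    survivors = None
    for bounds in bound_levels:
        if survivors is not None:
            bounds = {b: v for b, v in bounds.items() if b in survivors}
        survivors = prune_branches(bounds)
        if len(survivors) == 1:
            break
    return sorted(survivors)


# Hypothetical bounds for three candidate actions at two levels.
coarse = {"left": (0.0, 3.0), "stay": (1.0, 4.0), "right": (-2.0, 0.5)}
fine = {"left": (1.2, 1.8), "stay": (2.0, 2.6), "right": (-1.0, -0.5)}
chosen = select_branches([coarse, fine])
```

In this example the coarse level already eliminates `right`, and the finer level is needed only to separate `left` from `stay`. Because pruning relies solely on valid bounds, the surviving branch matches the choice an exact evaluation would make.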
Boltzmann machines as two-dimensional tensor networks
Li, Sujie, Pan, Feng, Zhou, Pengfei, Zhang, Pan
Restricted Boltzmann machines (RBM) and deep Boltzmann machines (DBM) are important models in machine learning, and have recently found numerous applications in quantum many-body physics. We show that there are fundamental connections between them and tensor networks. In particular, we demonstrate that any RBM and DBM can be exactly represented as a two-dimensional tensor network. This representation gives an understanding of the expressive power of RBM and DBM using entanglement structures of the tensor networks, and also provides an efficient tensor network contraction algorithm for computing the partition function of RBM and DBM. Using numerical experiments, we demonstrate that the proposed algorithm is much more accurate than state-of-the-art machine learning methods in estimating the partition function of restricted Boltzmann machines and deep Boltzmann machines, and has potential applications in training deep Boltzmann machines for general machine learning tasks.
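To show what quantity is being estimated, here is a minimal sketch that computes the RBM partition function exactly for a tiny model. It does not implement the tensor network contraction algorithm of the paper; it uses brute-force enumeration and the standard analytic sum over hidden units, both of which scale exponentially and are feasible only for very small models. The function names, the {0, 1} unit convention, and the parameter values are assumptions made for illustration.

```python
from itertools import product
from math import exp


def partition_brute_force(W, a, b):
    """Sum exp(a.v + b.h + v^T W h) over all binary v and h."""
    n_v, n_h = len(a), len(b)
    total = 0.0
    for v in product((0, 1), repeat=n_v):
        for h in product((0, 1), repeat=n_h):
            score = sum(a[i] * v[i] for i in range(n_v))
            score += sum(b[j] * h[j] for j in range(n_h))
            score += sum(v[i] * W[i][j] * h[j] for i in range(n_v) for j in range(n_h))
            total += exp(score)
    return total


def partition_hidden_summed(W, a, b):
    """Same quantity, with the hidden units summed out analytically.

    For fixed v the hidden units are independent, so their sum factorizes
    into a product of (1 + exp(b_j + sum_i v_i W_ij)) terms.
    """
    n_v, n_h = len(a), len(b)
    total = 0.0
    for v in product((0, 1), repeat=n_v):
        term = exp(sum(a[i] * v[i] for i in range(n_v)))
        for j in range(n_h):
            term *= 1.0 + exp(b[j] + sum(v[i] * W[i][j] for i in range(n_v)))
        total += term
    return total


# Hypothetical parameters: 2 visible units, 3 hidden units.
W = [[0.5, -1.0, 0.2], [0.3, 0.8, -0.4]]
a = [0.1, -0.2]
b = [0.0, 0.4, -0.3]
z_brute = partition_brute_force(W, a, b)
z_summed = partition_hidden_summed(W, a, b)
```

Both functions return the same value up to floating-point error; the cost of even the cheaper one grows as 2 to the number of visible units, which is why approximate or structured methods are needed for realistic model sizes.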