 Undirected Networks


Estimation in Tensor Ising Models

arXiv.org Machine Learning

The p-tensor Ising model is a one-parameter discrete exponential family for modeling dependent binary data, where the sufficient statistic is a multi-linear form of degree p ≥ 2. This is a natural generalization of the matrix Ising model that provides a convenient mathematical framework for capturing not just pairwise but also higher-order dependencies in complex relational data. In this paper, we consider the problem of estimating the natural parameter of the p-tensor Ising model given a single sample from the distribution on N nodes. Our estimate is based on the maximum pseudo-likelihood (MPL) method, which provides a computationally efficient algorithm for estimating the parameter that avoids computing the intractable partition function. We derive general conditions under which the MPL estimate is √N-consistent, that is, it converges to the true parameter at rate 1/√N. Our conditions are robust enough to handle a variety of commonly used tensor Ising models, including spin glass models with random interactions and models where the rate of estimation undergoes a phase transition. In particular, this includes results on √N-consistency of the MPL estimate in the well-known p-spin Sherrington-Kirkpatrick (SK) model, spin systems on general p-uniform hypergraphs, and Ising models on the hypergraph stochastic block model (HSBM). In fact, for the HSBM we pin down the exact location of the phase transition threshold, which is determined by the positivity of a certain mean-field variational problem, such that above this threshold the MPL estimate is √N-consistent, while below the threshold no estimator is consistent. Finally, we derive the precise fluctuations of the MPL estimate in the special case of the p-tensor Curie-Weiss model, which is the Ising model on the complete p-uniform hypergraph. 
An interesting consequence of our results is that the MPL estimate in the Curie-Weiss model saturates the Cramer-Rao lower bound at all points above the estimation threshold, that is, the MPL estimate incurs no loss in asymptotic statistical efficiency in the estimability regime, even though it is obtained by minimizing only an approximation of the true likelihood function for computational tractability.
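As a rough illustration of the MPL idea in this abstract, the sketch below solves the one-dimensional pseudo-likelihood score equation by bisection. It assumes that, given the other spins, each spin x_i has conditional probability proportional to exp(β x_i m_i), where m_i is a local field computed from the interaction tensor. The function names, the bracket [0, beta_max], and the toy data are illustrative assumptions and are not taken from the paper.

```python
import math

def mpl_score(beta, spins, fields):
    """Derivative in beta of the log pseudo-likelihood
    sum_i [beta * x_i * m_i - log(2 * cosh(beta * m_i))]."""
    return sum(m * (x - math.tanh(beta * m)) for x, m in zip(spins, fields))

def mpl_estimate(spins, fields, beta_max=10.0, tol=1e-10):
    """Bisection root of the score on [0, beta_max] (hypothetical bracket).

    The score is non-increasing in beta, so bisection applies whenever it
    changes sign on the bracket; otherwise the matching endpoint is returned.
    """
    if mpl_score(0.0, spins, fields) <= 0.0:
        return 0.0
    if mpl_score(beta_max, spins, fields) >= 0.0:
        return beta_max
    lo, hi = 0.0, beta_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mpl_score(mid, spins, fields) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy data: four spins, each with local field 1 (illustrative values only).
spins = [1, 1, 1, -1]
fields = [1.0, 1.0, 1.0, 1.0]
beta_hat = mpl_estimate(spins, fields)
```

In this toy case the score equation reduces to tanh(β) = 1/2, so the estimate can be checked by hand. No partition function is evaluated at any point, which is the computational appeal of the method.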


Efficiently Solving MDPs with Stochastic Mirror Descent

arXiv.org Machine Learning

Markov decision processes (MDPs) are a fundamental mathematical abstraction for sequential decision making under uncertainty and they serve as a basic modeling tool in reinforcement learning (RL) and stochastic control [5, 24, 30]. Two prominent classes of MDPs are average-reward MDPs (AMDPs) and discounted MDPs (DMDPs). Each has been studied extensively; AMDPs are applicable to optimal control, learning automata, and various real-world reinforcement learning settings [17, 3, 22], and DMDPs have a number of nice theoretical properties including reward convergence and operator monotonicity [6]. In this paper we consider the prevalent computational learning problem of finding an approximately optimal policy of an MDP given only restricted access to the model. In particular, we consider the problem of computing an ε-optimal policy, i.e., a policy with an additive ε error in expected cumulative reward over an infinite horizon, under the standard assumption of a generative model [14, 13], which allows one to sample from state-transitions given the current state-action pair. This problem is well-studied and there are multiple known upper and lower bounds on its sample complexity [4, 32, 28, 31]. In this work, we provide a unified framework based on primal-dual stochastic mirror descent (SMD) for learning ε-optimal policies for both AMDPs and DMDPs with a generative model.
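The abstract does not spell out the mirror map or how gradient estimates are formed, so the following is only a generic building block: one mirror-descent step on the probability simplex with the negative-entropy mirror map (a multiplicative-weights update). The names `entropic_mirror_step` and `g_hat` are hypothetical; `g_hat` stands in for whatever stochastic gradient estimate a generative model would supply.

```python
import math

def entropic_mirror_step(point, gradient, step_size):
    """One mirror-descent step on the simplex with the negative-entropy
    mirror map: x_i <- x_i * exp(-step_size * g_i), then renormalize."""
    weights = [x * math.exp(-step_size * g) for x, g in zip(point, gradient)]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative values: a uniform start and a placeholder gradient estimate.
x = [0.5, 0.5]
g_hat = [1.0, 0.0]
x_next = entropic_mirror_step(x, g_hat, step_size=math.log(2.0))
```

With this step size the first coordinate's weight is halved before renormalization, so the iterate moves from (1/2, 1/2) to (1/3, 2/3) and stays on the simplex without an explicit projection.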


Strategies for navigating a dynamic world

Science

One of the most difficult problems for an adaptable agent is gauging how to behave in a nonstationary environment. When conditions are stable, an organism generally pursues a strategy known to provide the best outcome. However, when environmental conditions change, an organism abandons the current action plan and searches for a new best option. The most challenging aspect of this search, calculating the exact time point at which to change strategies, requires the brain to integrate past and present observations and evaluate whether they remain consistent with current environmental conditions. On page 1076 of this issue, Domenech et al. (1) report on the modeling of rare direct electrical recordings from the prefrontal cortices (PFCs) of a small group of human epilepsy patients as they flexibly negotiated a nonstationary environment. To understand the brain's mode of navigation, consider for example a sailor at sea (see the figure). The winds and the currents determine the waves that drive the sailor to continuously adjust the rudder so as to stay on course. By observing the wave patterns, he can anticipate the navigational effects of his actions and adapt accordingly. But when the currents or the weather changes, the sailor must adapt his course to reach the next port of call. At that time, the sailor observes essentially the same stimulus (the waves) but has to remap his action plan (rudder adjustments) to the new wind conditions and currents. This difficult-decision problem (how to detect and then adapt to a nonstationary environment) is captured perfectly in the exploration-exploitation dilemma: When should I stop exploiting my current action plan and start exploring different ways to reach my goals? An optimal solution tracks the discounted sum of normalized future rewards. However, this approach applies strictly to stationary environments and thus does not capture the dynamic changes that organisms encounter in their daily lives (2). 
Yet the human brain and those of other species seem to smoothly solve the exploration-exploitation dilemma in nonstationary environments. Decision neuroscience has investigated the flexible adaptation to changing environmental contingencies with diverse experimental paradigms and assorted computational models. The simplest paradigm is probabilistic reversal learning, in which the agent has to search for reward among two options with complementary reward probabilities. This adaptation problem can be solved by hidden Markov models (3), which are well-approximated by reinforcement learning (RL) models that also update nonchosen actions (4). Extension of this paradigm to include independently changing reward probabilities reveals two distinct neural responses: Expected-value signals, which reflect “exploitative” choices, spur activation of the ventromedial prefrontal cortex (vmPFC); and “explorative” choices (that is, the choosing of a currently lesser valued option) activate the frontopolar cortex (5).
Figure: A sailor solves a dilemma at sea. As the ship nears bad weather, the sailor's ventromedial prefrontal cortex (vmPFC) evaluates the ongoing (orange) action plan (exploitation) and the prospective (brown, red) plans (exploration). Once the red (calm waters) plan is exploited, the sailor's dorsomedial PFC (dmPFC) uses trial-and-error learning to map the proper rudder adjustments. Graphic: A. Kitterman/Science
Another task with both rapid and slow changes in the reward probabilities of various options was used to develop a hierarchical Bayesian model that estimates the volatility of the environment and adjusts the learning rate accordingly (6). This model has found its generalization in the hierarchical Gaussian filter (HGF) framework (7), which is widely used in modeling social and nonsocial human decision-making in nonstationary environments. 
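To make the reversal-learning idea above concrete, here is a minimal sketch of an RL update that also adjusts the nonchosen option. It assumes two options with complementary rewards, so that a reward of r on the chosen option implies a counterfactual outcome of 1 - r on the other. The function name, the delta-rule form, and the parameter values are illustrative assumptions; the models in the cited studies may differ in detail.

```python
def update_values(values, chosen, reward, learning_rate):
    """Delta-rule update for a two-option task with complementary rewards.

    The chosen option moves toward the observed reward; the nonchosen
    option moves toward 1 - reward (an assumed counterfactual outcome).
    """
    other = 1 - chosen
    updated = list(values)
    updated[chosen] += learning_rate * (reward - values[chosen])
    updated[other] += learning_rate * ((1.0 - reward) - values[other])
    return updated

# One rewarded choice of option 0, starting from indifferent values.
values = update_values([0.5, 0.5], chosen=0, reward=1.0, learning_rate=0.5)
```

After a single rewarded trial the two values separate symmetrically, so the agent can reverse its preference quickly once the reward contingencies flip.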
Although these computational modeling frameworks differ, all are trying to solve similar problems: How to infer the latent structure of the world from discrete observations and how to detect transitions between different states of the world. Domenech et al. address the same problems with yet another experimental paradigm, this one carried out with a small group of human epilepsy patients. Electrodes deeply implanted in the patients' PFCs delivered direct electrical recordings from the vmPFC and dorsomedial PFC (dmPFC) while the patients performed a multioption decision task. The participants had to associate three different stimuli with three distinct actions, thus constituting an action plan. The mapping changed every 33 to 57 trials, and participants had to relearn the association of the same stimuli with a different combination of actions, much like our sailor at sea who faces changes in weather and currents that alter wave patterns. The computational model (8) generates a reliability value for the ongoing action plan and other concurrently monitored plans. When the ongoing action plan is deemed reliable, the model is in “exploitation” mode and learns the stimulus-action mapping through RL mechanisms. When the ongoing action plan is deemed unreliable, the model switches to “exploration” mode. New provisional action plans are created and evaluated, until one emerges as a reliable predictor for successful stimulus-action mapping (see the figure). Using a state-of-the-art model-based analysis that associates the model-derived variables with the brain activity in various frequency bands of the neural recordings, the authors found a delicate interplay between the vmPFC and dmPFC that supports a predictive coding interpretation for resolution of the exploration-exploitation dilemma. vmPFC monitors and represents the reliability of the ongoing action plan. vmPFC relays the ongoing action plan to the dmPFC as either a “stay” or “switch” trial. 
A stay trial triggers additional learning through RL mechanisms in the dmPFC. In contrast, the dmPFC responds to a switch trial by suppressing activity related to maintaining the ongoing action plan. These findings resonate with and extend earlier results obtained with functional neuroimaging (5, 9). These computational approaches to the problem of behavioral flexibility in a nonstationary environment share one commonality: They are all building a model of the environment and the transitions therein, either explicitly (as in the HGF framework) or implicitly (by evaluating the ongoing action plan, as in the Domenech et al. study). Although all of these models strive for generality, each was developed for a specific experimental context. It remains to be seen which of these provides the best account of flexible decision-making in humans and other species, preferably using a unified experimental paradigm. A model-free RL account (10) likely will not suffice, as several studies have demonstrated the superiority of more-complex models over this “vanilla” RL model. Rather, an agent requires a rich representation of the environment and its dynamic transitions (often referred to as model-based learning) (10) to solve the exploration-exploitation dilemma and flexibly respond to a changing world.
1. P. Domenech, S. Rheims, E. Koechlin, Science 369, eabb0184 (2020).
2. J. D. Cohen, S. M. McClure, A. J. Yu, Philos. Trans. R. Soc. London Ser. B 362, 933 (2007).
3. A. N. Hampton, P. Bossaerts, J. P. O'Doherty, J. Neurosci. 26, 8360 (2006).
4. J. Gläscher, A. N. Hampton, J. P. O'Doherty, Cereb. Cortex 19, 483 (2009).
5. N. D. Daw, J. P. O'Doherty, P. Dayan, B. Seymour, R. J. Dolan, Nature 441, 876 (2006).
6. T. E. J. Behrens, M. W. Woolrich, M. E. Walton, M. F. S. Rushworth, Nat. Neurosci. 10, 1214 (2007).
7. C. Mathys, J. Daunizeau, K. J. Friston, K. E. Stephan, Front. Hum. Neurosci. 5, 39 (2011).
8. A. Collins, E. Koechlin, PLOS Biol. 10, e1001293 (2012).
9. M. Donoso, A. G. E. Collins, E. Koechlin, Science 344, 1481 (2014).
10. N. D. Daw, P. Dayan, Philos. Trans. R. Soc. London Ser. B 369, 20130478 (2014).


Document-editing Assistants and Model-based Reinforcement Learning as a Path to Conversational AI

arXiv.org Artificial Intelligence

Intelligent assistants that follow commands or answer simple questions, such as Siri and Google search, are among the most economically important applications of AI. Future conversational AI assistants promise even greater capabilities and a better user experience through a deeper understanding of the domain, the user, or the user's purposes. But what domain and what methods are best suited to researching and realizing this promise? In this article we argue for the domain of voice document editing and for the methods of model-based reinforcement learning. The primary advantages of voice document editing are that the domain is tightly scoped and that it provides something for the conversation to be about (the document) that is delimited and fully accessible to the intelligent assistant. The advantages of reinforcement learning in general are that its methods are designed to learn from interaction without explicit instruction and that it formalizes the purposes of the assistant. Model-based reinforcement learning is needed in order to genuinely understand the domain of discourse and thereby work efficiently with the user to achieve their goals. Together, voice document editing and model-based reinforcement learning comprise a promising research direction for achieving conversational AI.


Dynamic Models Applied to Value Learning in Artificial Intelligence

arXiv.org Artificial Intelligence

Experts in Artificial Intelligence (AI) development predict that advances in the development of intelligent systems and agents will reshape vital areas in our society. Nevertheless, if such an advance is not made prudently and with critical reflection, it can result in negative outcomes for humanity. For this reason, several researchers in the area are trying to develop a robust, beneficial, and safe concept of AI for the preservation of humanity and the environment. Currently, several of the open problems in the field of AI research arise from the difficulty of avoiding unwanted behaviors of intelligent agents and systems while at the same time specifying what we want such systems to do, especially when we consider intelligent agents acting in several domains over the long term. It is of utmost importance that artificial intelligent agents have their values aligned with human values, given that we cannot expect an AI to develop human moral values simply because of its intelligence, as discussed in the Orthogonality Thesis. Perhaps this difficulty comes from the way we are addressing the problem of expressing objectives, values, and ends, using representational cognitive methods. A solution to this problem would be the dynamic approach proposed by Dreyfus, whose phenomenological philosophy shows that the human experience of being-in-the-world is, in several aspects, not well represented by the symbolic or connectionist cognitive method, especially with regard to the question of learning values. A possible approach to this problem would be to use theoretical models such as SED (situated embodied dynamics) to address the value learning problem in AI.


Semi-supervised Learning with the EM Algorithm: A Comparative Study between Unstructured and Structured Prediction

arXiv.org Machine Learning

Semi-supervised learning aims to learn prediction models from both labeled and unlabeled samples. There has been extensive research in this area. Among existing work, generative mixture models trained with Expectation-Maximization (EM) are a popular method due to their clear statistical properties. However, existing literature on EM-based semi-supervised learning largely focuses on unstructured prediction, assuming that samples are independent and identically distributed. Studies on EM-based semi-supervised approaches in structured prediction are limited. This paper aims to fill the gap through a comparative study between unstructured and structured methods in EM-based semi-supervised learning. Specifically, we compare their theoretical properties and find that both methods can be considered a generalization of self-training with soft class assignment of unlabeled samples, but the structured method additionally considers structural constraints in soft class assignment. We conducted a case study on real-world flood mapping datasets to compare the two methods. Results show that structured EM is more robust to class confusion caused by noise and obstacles in features in the context of the flood mapping application.
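The unstructured case described in this abstract can be sketched with a small one-dimensional, two-class Gaussian mixture: labeled samples keep hard class assignments, while unlabeled samples receive soft assignments in the E-step. This is a minimal sketch under stated assumptions (a fixed shared unit variance, initialization from labeled class means, at least one labeled sample per class, and toy data), not the paper's model; the structured variant is not shown.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def semi_supervised_em(labeled, unlabeled, n_iter=20, var=1.0):
    """labeled: list of (x, y) with y in {0, 1}; unlabeled: list of x.

    Assumes each class has at least one labeled sample and that unlabeled
    points are not so far from both means that the densities underflow.
    """
    classes = (0, 1)
    # Initialize class means from the labeled samples only.
    means = []
    for k in classes:
        xs = [x for x, y in labeled if y == k]
        means.append(sum(xs) / len(xs))
    priors = [0.5, 0.5]
    resp = []
    n_total = len(labeled) + len(unlabeled)
    for _ in range(n_iter):
        # E-step: soft class assignment of every unlabeled sample.
        resp = []
        for x in unlabeled:
            w = [priors[k] * gaussian_pdf(x, means[k], var) for k in classes]
            total = sum(w)
            resp.append([wk / total for wk in w])
        # M-step: labeled samples count with weight 1 for their own class.
        for k in classes:
            lab_x = [x for x, y in labeled if y == k]
            weight = len(lab_x) + sum(r[k] for r in resp)
            weighted_sum = sum(lab_x) + sum(r[k] * x for r, x in zip(resp, unlabeled))
            means[k] = weighted_sum / weight
            priors[k] = weight / n_total
    return means, priors, resp

# Toy data (illustrative only): one labeled point per class, four unlabeled.
labeled = [(-2.0, 0), (2.0, 1)]
unlabeled = [-2.5, -1.5, 1.5, 2.5]
means, priors, resp = semi_supervised_em(labeled, unlabeled)
```

Replacing the soft responsibilities with their arg-max would turn the E-step into hard pseudo-labeling, which is the sense in which this procedure generalizes self-training.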


Decision-making for Autonomous Vehicles on Highway: Deep Reinforcement Learning with Continuous Action Horizon

arXiv.org Artificial Intelligence

A decision-making strategy for autonomous vehicles describes a sequence of driving maneuvers to achieve a certain navigational mission. This paper utilizes the deep reinforcement learning (DRL) method to address the continuous-horizon decision-making problem on the highway. First, the vehicle kinematics and driving scenario on the freeway are introduced. The running objective of the ego automated vehicle is to execute an efficient and smooth policy without collision. Then, the particular algorithm, named proximal policy optimization (PPO)-enhanced DRL, is illustrated. Applied to overcome slow training and sample inefficiency, this algorithm can achieve high learning efficiency and excellent control performance. Finally, the PPO-DRL-based decision-making strategy is evaluated from multiple perspectives, including optimality, learning efficiency, and adaptability. Its potential for online application is discussed by applying it to similar driving scenarios.
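For readers unfamiliar with PPO, the sketch below computes the standard clipped surrogate objective from the original PPO formulation. The abstract does not say which PPO variant or hyperparameters the paper uses, so the function name, the clipping range of 0.2, and the sample values are assumptions for illustration.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples.

    ratios: new-policy / old-policy action probabilities.
    advantages: advantage estimates for the same samples.
    """
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - clip_eps, min(r, 1.0 + clip_eps))
        terms.append(min(r * a, clipped_r * a))
    return sum(terms) / len(terms)

# Two illustrative samples: one with a positive, one with a negative advantage.
objective = ppo_clipped_objective([1.5, 0.5], [1.0, -1.0])
```

The clipping removes any incentive to push the probability ratio outside the interval [1 - eps, 1 + eps], which limits the size of each policy update.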


Reputation-driven Decision-making in Networks of Stochastic Agents

arXiv.org Artificial Intelligence

This paper studies multi-agent systems that involve networks of self-interested agents. We propose a Markov Decision Process-derived framework, called RepNet-MDP, tailored to domains in which agent reputation is a key driver of the interactions between agents. Its fundamentals are based on the principles of RepNet-POMDP, a framework developed by Rens et al. [11] in 2018, but it addresses that framework's mathematical inconsistencies and alleviates its intractability by considering only fully observable environments. We furthermore use an online learning algorithm for finding approximate solutions to RepNet-MDPs. In a series of experiments, RepNet agents are shown to be able to adapt their own behavior to the past behavior and reliability of the remaining agents of the network. Finally, our work identifies a limitation of the framework in its current formulation that prevents its agents from learning in circumstances in which they are not a primary actor.


Constrained Markov Decision Processes via Backward Value Functions

arXiv.org Machine Learning

Although Reinforcement Learning (RL) algorithms have found tremendous success in simulated domains, they often cannot directly be applied to physical systems, especially in cases where there are hard constraints to satisfy (e.g. on safety or resources). In standard RL, the agent is incentivized to explore any behavior as long as it maximizes rewards, but in the real world, undesired behavior can damage either the system or the agent in a way that breaks the learning process itself. In this work, we model the problem of learning with constraints as a Constrained Markov Decision Process and provide a new on-policy formulation for solving it. A key contribution of our approach is to translate cumulative cost constraints into state-based constraints. Through this, we define a safe policy improvement method which maximizes returns while ensuring that the constraints are satisfied at every step. We provide theoretical guarantees under which the agent converges while ensuring safety over the course of training. We also highlight the computational advantages of this approach. The effectiveness of our approach is demonstrated on safe navigation tasks and in safety-constrained versions of MuJoCo environments, with deep neural networks.


Imitative Planning using Conditional Normalizing Flow

arXiv.org Artificial Intelligence

We explore the application of normalizing flows for improving the performance of trajectory planning for autonomous vehicles (AVs). Normalizing flows provide an invertible mapping from a known prior distribution to a potentially complex, multi-modal target distribution and allow for fast sampling with exact PDF inference. By modeling a trajectory planner's cost manifold as an energy function, we learn a scene-conditioned mapping from the prior to a Boltzmann distribution over the AV control space. This mapping allows for control samples and their associated energy to be generated jointly and in parallel. We propose using neural autoregressive flow (NAF) as part of an end-to-end deep learned system that allows for utilizing sensors, map, and route information to condition the flow mapping. Finally, we demonstrate the effectiveness of our approach over IL and hand-constructed trajectory sampling techniques on real-world datasets.
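The claim that a flow gives fast sampling together with exact density evaluation can be seen in the simplest possible case: a one-dimensional affine map applied to a standard normal prior. This is only a toy stand-in, since a neural autoregressive flow is far more expressive and is conditioned on scene inputs. The function names and parameter values below are illustrative assumptions.

```python
import math
import random

def standard_normal_logpdf(z):
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def flow_forward(z, scale, shift):
    """Invertible map from prior sample z to target sample x (scale != 0)."""
    return scale * z + shift

def flow_log_density(x, scale, shift):
    """Exact log-density via change of variables:
    log p_x(x) = log p_z(z) - log|scale|, with z = (x - shift) / scale."""
    z = (x - shift) / scale
    return standard_normal_logpdf(z) - math.log(abs(scale))

def sample_with_log_density(rng, scale, shift):
    """Draw one sample and return it together with its exact log-density."""
    z = rng.gauss(0.0, 1.0)
    x = flow_forward(z, scale, shift)
    return x, flow_log_density(x, scale, shift)

rng = random.Random(0)
sample, log_p = sample_with_log_density(rng, scale=2.0, shift=1.0)
```

If the target is a Boltzmann distribution proportional to exp(-E(x)), the negative log-density of a sample recovers its energy up to an additive constant, which is how a sample and its energy can be produced jointly.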