AITopics

2008.12882

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Jin, Yujia, Sidford, Aaron

Efficiently Solving MDPs with Stochastic Mirror Descent

arXiv.org Machine Learning, Aug-28-2020

Markov decision processes (MDPs) are a fundamental mathematical abstraction for sequential decision making under uncertainty and they serve as a basic modeling tool in reinforcement learning (RL) and stochastic control [5, 24, 30]. Two prominent classes of MDPs are average-reward MDPs (AMDPs) and discounted MDPs (DMDPs). Each has been studied extensively; AMDPs are applicable to optimal control, learning automata, and various real-world reinforcement learning settings [17, 3, 22] and DMDPs have a number of nice theoretical properties including reward convergence and operator monotonicity [6]. In this paper we consider the prevalent computational learning problem of finding an approximately optimal policy of an MDP given only restricted access to the model. In particular, we consider the problem of computing an ɛ-optimal policy, i.e. a policy with an additive ɛ error in expected cumulative reward over an infinite horizon, under the standard assumption of a generative model [14, 13], which allows one to sample from state-transitions given the current state-action pair. This problem is well-studied and there are multiple known upper and lower bounds on its sample complexity [4, 32, 28, 31]. In this work, we provide a unified framework based on primal-dual stochastic mirror descent (SMD) for learning ɛ-optimal policies for both AMDPs and DMDPs with a generative model.
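To make the generative-model access pattern concrete, here is a minimal sketch. It is not the paper's primal-dual SMD method. The two-state MDP, its numbers, and the names `generative_model` and `estimate_value` are all hypothetical, chosen only for illustration. The sketch shows how a sampler that returns a next state for a given state-action pair can be used to estimate the discounted return of a fixed policy by Monte Carlo rollouts.

```python
import random

# Hypothetical 2-state, 2-action discounted MDP (all numbers are illustrative).
# P[s][a] lists next-state probabilities; R[s][a] is a reward in [0, 1].
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.0, 1.0]}}
R = {0: {0: 0.0, 1: 0.1},
     1: {0: 0.5, 1: 1.0}}


def generative_model(state, action, rng):
    """Sample a next state for a state-action pair (the only model access used)."""
    return rng.choices([0, 1], weights=P[state][action])[0]


def estimate_value(policy, start, gamma, horizon, episodes, rng):
    """Monte Carlo estimate of the discounted return of `policy` from `start`.

    The infinite horizon is truncated after `horizon` steps.
    """
    total = 0.0
    for _ in range(episodes):
        state, discount, ret = start, 1.0, 0.0
        for _ in range(horizon):
            action = policy[state]
            ret += discount * R[state][action]
            state = generative_model(state, action, rng)
            discount *= gamma
        total += ret
    return total / episodes


if __name__ == "__main__":
    rng = random.Random(0)
    always_one = {0: 1, 1: 1}  # hypothetical policy: take action 1 everywhere
    print(estimate_value(always_one, start=0, gamma=0.9,
                         horizon=200, episodes=2000, rng=rng))
```

Because rewards here lie in [0, 1], truncating at `horizon` steps changes the return by at most `gamma ** horizon / (1 - gamma)`, so the truncation error can be pushed below any target ɛ by lengthening the rollouts.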

artificial intelligence, machine learning, reinforcement learning, (20 more...)

2008.12776

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (0.64)

Industry:

Education (0.48)
Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.89)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Science, Aug-27-2020, 17:40:24 GMT

Strategies for navigating a dynamic world

One of the most difficult problems for an adaptable agent is gauging how to behave in a nonstationary environment. When conditions are stable, an organism generally pursues a strategy known to provide the best outcome. However, when environmental conditions change, an organism abandons the current action plan and searches for a new best option. The most challenging aspect of this search, calculating the exact time point at which to change strategies, requires the brain to integrate past and present observations and evaluate whether they remain consistent with current environmental conditions. On page 1076 of this issue, Domenech et al. (1) report on the modeling of rare direct electrical recordings from the prefrontal cortices (PFCs) of a small group of human epilepsy patients as they flexibly negotiated a nonstationary environment. To understand the brain's mode of navigation, consider for example a sailor at sea (see the figure). The winds and the currents determine the waves that drive the sailor to continuously adjust the rudder so as to stay on course. By observing the wave patterns, he can anticipate the navigational effects of his actions and adapt accordingly. But when the currents or the weather changes, the sailor must adapt his course to reach the next port of call. At that time, the sailor observes essentially the same stimulus (the waves) but has to remap his action plan (rudder adjustments) to the new wind conditions and currents. This difficult-decision problem (how to detect and then adapt to a nonstationary environment) is captured perfectly in the exploration-exploitation dilemma: When should I stop exploiting my current action plan and start exploring different ways to reach my goals? An optimal solution tracks the discounted sum of normalized future rewards. However, this approach applies strictly to stationary environments and thus does not capture the dynamic changes that organisms encounter in their daily lives (2).
Yet the human brain and those of other species seem to smoothly solve the exploration-exploitation dilemma in nonstationary environments. Decision neuroscience has investigated the flexible adaptation to changing environmental contingencies with diverse experimental paradigms and assorted computational models. The simplest paradigm is probabilistic reversal learning, in which the agent has to search for reward among two options with complementary reward probabilities. This adaptation problem can be solved by hidden Markov models (3), which are well-approximated by reinforcement learning (RL) models that also update nonchosen actions (4). Extension of this paradigm to include independently changing reward probabilities reveals two distinct neural responses: Expected-value signals, which reflect “exploitative” choices, spur activation of the ventromedial prefrontal cortex (vmPFC); and “explorative” choices (that is, the choosing of a currently lesser valued option) activate the frontopolar cortex (5).

Figure: A sailor solves a dilemma at sea. As the ship nears bad weather, the sailor's ventromedial prefrontal cortex (vmPFC) evaluates the ongoing (orange) action plan (exploitation) and the prospective (brown, red) plans (exploration). Once the red (calm waters) plan is exploited, the sailor's dorsomedial PFC (dmPFC) uses trial-and-error learning to map the proper rudder adjustments. Graphic: A. Kitterman/Science

Another task with both rapid and slow changes in the reward probabilities of various options was used to develop a hierarchical Bayesian model that estimates the volatility of the environment and adjusts the learning rate accordingly (6). This model has found its generalization in the hierarchical Gaussian filter (HGF) framework (7), which is widely used in modeling social and nonsocial human decision-making in nonstationary environments.
Although these computational modeling frameworks differ, all are trying to solve similar problems: How to infer the latent structure of the world from discrete observations and how to detect transitions between different states of the world. Domenech et al. address the same problems with yet another experimental paradigm, this one carried out with a small group of human epilepsy patients. Electrodes deeply implanted in the patients' PFCs delivered direct electrical recordings from the vmPFC and dorsomedial PFC (dmPFC) while the patients performed a multioption decision task. The participants had to associate three different stimuli with three distinct actions, thus constituting an action plan. The mapping changed every 33 to 57 trials, and participants had to relearn the association of the same stimuli with a different combination of actions, much like our sailor at sea who faces changes in weather and currents that alter wave patterns. The computational model (8) generates a reliability value for the ongoing action plan and other concurrently monitored plans. When the ongoing action plan is deemed reliable, the model is in “exploitation” mode and learns the stimulus-action mapping through RL mechanisms. When the ongoing action plan is deemed unreliable, the model switches to “exploration” mode. New provisional action plans are created and evaluated, until one emerges as a reliable predictor for successful stimulus-action mapping (see the figure). Using a state-of-the-art model-based analysis that associates the model-derived variables with the brain activity in various frequency bands of the neural recordings, the authors found a delicate interplay between the vmPFC and dmPFC that supports a predictive coding interpretation for resolution of the exploration-exploitation dilemma. vmPFC monitors and represents the reliability of the ongoing action plan. vmPFC relays the ongoing action plan to the dmPFC as either a “stay” or “switch” trial.
A stay trial triggers additional learning through RL mechanisms in the dmPFC. In contrast, the dmPFC responds to a switch trial by suppressing activity related to maintaining the ongoing action plan. These findings resonate with and extend earlier results obtained with functional neuroimaging (5, 9). These computational approaches to the problem of behavioral flexibility in a nonstationary environment share one commonality: They are all building a model of the environment and the transition therein, either explicitly (as in the HGF framework) or implicitly (by evaluating the ongoing action plan, as in the Domenech et al. study). Although all of these models strive for generality, each was developed for a specific experimental context. It remains to be seen which of these provides the best account of flexible decision-making in humans and other species, preferably using a unified experimental paradigm. A model-free RL account (10) likely will not suffice, as several studies have demonstrated the superiority of more-complex models over this “vanilla” RL model. Rather, an agent requires a rich representation of the environment and its dynamic transitions (often referred to as model-based learning) (10) to solve the exploration-exploitation dilemma and flexibly respond to a changing world.

1. P. Domenech, S. Rheims, E. Koechlin, Science 369, eabb0184 (2020).
2. J. D. Cohen, S. M. McClure, A. J. Yu, Philos. Trans. R. Soc. London Ser. B 362, 933 (2007).
3. A. N. Hampton, P. Bossaerts, J. P. O'Doherty, J. Neurosci. 26, 8360 (2006).
4. J. Gläscher, A. N. Hampton, J. P. O'Doherty, Cereb. Cortex 19, 483 (2009).
5. N. D. Daw, J. P. O'Doherty, P. Dayan, B. Seymour, R. J. Dolan, Nature 441, 876 (2006).
6. T. E. J. Behrens, M. W. Woolrich, M. E. Walton, M. F. S. Rushworth, Nat. Neurosci. 10, 1214 (2007).
7. C. Mathys, J. Daunizeau, K. J. Friston, K. E. Stephan, Front. Hum. Neurosci. 5, 39 (2011).
8. A. Collins, E. Koechlin, PLOS Biol. 10, e1001293 (2012).
9. M. Donoso, A. G. E. Collins, E. Koechlin, Science 344, 1481 (2014).
10. N. D. Daw, P. Dayan, Philos. Trans. R. Soc. London Ser. B 369, 20130478 (2014).

artificial intelligence, dynamic world, machine learning, (1 more...)

Science

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine > Therapeutic Area > Neurology > Epilepsy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.87)

Kudashkina, Katya, Pilarski, Patrick M., Sutton, Richard S.

Document-editing Assistants and Model-based Reinforcement Learning as a Path to Conversational AI

arXiv.org Artificial Intelligence, Aug-27-2020

Intelligent assistants that follow commands or answer simple questions, such as Siri and Google search, are among the most economically important applications of AI. Future conversational AI assistants promise even greater capabilities and a better user experience through a deeper understanding of the domain, the user, or the user's purposes. But what domain and what methods are best suited to researching and realizing this promise? In this article we argue for the domain of voice document editing and for the methods of model-based reinforcement learning. The primary advantages of voice document editing are that the domain is tightly scoped and that it provides something for the conversation to be about (the document) that is delimited and fully accessible to the intelligent assistant. The advantages of reinforcement learning in general are that its methods are designed to learn from interaction without explicit instruction and that it formalizes the purposes of the assistant. Model-based reinforcement learning is needed in order to genuinely understand the domain of discourse and thereby work efficiently with the user to achieve their goals. Together, voice document editing and model-based reinforcement learning comprise a promising research direction for achieving conversational AI.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2008.12095

Country:

North America > Canada > Alberta (0.14)
North America > Canada > Ontario > Toronto (0.14)
Europe > France (0.04)
(23 more...)

Genre: Research Report (0.40)

Industry:

Information Technology > Services (1.00)
Health & Medicine (1.00)
Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.67)

Corrêa, Nicholas Kluge, de Oliveira, Nythamar

Dynamic Models Applied to Value Learning in Artificial Intelligence

arXiv.org Artificial Intelligence, Aug-27-2020

Experts in Artificial Intelligence (AI) development predict that advances in the development of intelligent systems and agents will reshape vital areas in our society. Nevertheless, if such an advance is not made prudently and with critical reflection, it can result in negative outcomes for humanity. For this reason, several researchers in the area are trying to develop a robust, beneficial, and safe concept of AI for the preservation of humanity and the environment. Currently, several of the open problems in the field of AI research arise from the difficulty of avoiding unwanted behaviors of intelligent agents and systems, and at the same time specifying what we want such systems to do, especially when we look for the possibility of intelligent agents acting in several domains over the long term. It is of utmost importance that artificially intelligent agents have their values aligned with human values, given the fact that we cannot expect an AI to develop human moral values simply because of its intelligence, as discussed in the Orthogonality Thesis. Perhaps this difficulty comes from the way we are addressing the problem of expressing objectives, values, and ends, using representational cognitive methods. A solution to this problem would be the dynamic approach proposed by Dreyfus, whose phenomenological philosophy shows that the human experience of being-in-the-world in several aspects is not well represented by the symbolic or connectionist cognitive method, especially with regard to the question of learning values. A possible approach to this problem would be to use theoretical models such as SED (situated embodied dynamics) to address the value learning problem in AI.

agent, artificial intelligence, machine learning, (14 more...)

doi: 10.13140/RG.2.2.35369.01126/2

2005.05538

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.28)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(11 more...)

Genre: Research Report (0.64)

Industry:

Law (0.93)
Health & Medicine > Therapeutic Area > Neurology (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

arXiv.org Machine Learning, Aug-27-2020

Semi-supervised Learning with the EM Algorithm: A Comparative Study between Unstructured and Structured Prediction

He, Wenchong, Jiang, Zhe

Semi-supervised learning aims to learn prediction models from both labeled and unlabeled samples. There has been extensive research in this area. Among existing work, generative mixture models with Expectation-Maximization (EM) are a popular method due to their clear statistical properties. However, existing literature on EM-based semi-supervised learning largely focuses on unstructured prediction, assuming that samples are independent and identically distributed. Studies on EM-based semi-supervised approaches in structured prediction are limited. This paper aims to fill the gap through a comparative study between unstructured and structured methods in EM-based semi-supervised learning. Specifically, we compare their theoretical properties and find that both methods can be considered as a generalization of self-training with soft class assignment of unlabeled samples, but the structured method additionally considers structural constraints in soft class assignment. We conducted a case study on real-world flood mapping datasets to compare the two methods. Results show that structured EM is more robust to class confusion caused by noise and obstacles in features in the context of the flood mapping application.
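The idea of EM as self-training with soft class assignment can be sketched for the unstructured (i.i.d.) case. This is a minimal illustration assuming a two-class, one-dimensional Gaussian mixture. The function names, the variance floor, and the toy data are hypothetical and are not taken from the paper, and the structured variant is not shown.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def semi_supervised_em(labeled, unlabeled, iterations=50):
    """EM for a two-class 1-D Gaussian mixture with partly labeled data.

    labeled: list of (x, y) with y in {0, 1}; unlabeled: list of x.
    Returns (priors, means, variances, soft), where soft[i] is the
    posterior P(y = 1 | x_i) for the i-th unlabeled sample.
    """
    # Initialise the parameters from the labeled samples only.
    priors, means, variances = [], [], []
    for k in (0, 1):
        xs = [x for x, y in labeled if y == k]
        m = sum(xs) / len(xs)
        means.append(m)
        variances.append(max(sum((x - m) ** 2 for x in xs) / len(xs), 1e-3))
        priors.append(len(xs) / len(labeled))

    all_x = [x for x, _ in labeled] + list(unlabeled)
    soft = []
    for _ in range(iterations):
        # E-step: soft class assignment of the unlabeled samples.
        soft = []
        for x in unlabeled:
            p0 = priors[0] * gaussian_pdf(x, means[0], variances[0])
            p1 = priors[1] * gaussian_pdf(x, means[1], variances[1])
            soft.append(p1 / (p0 + p1))
        # M-step: weighted re-estimation; labeled samples keep hard weights.
        for k in (0, 1):
            weights = [1.0 if y == k else 0.0 for _, y in labeled]
            weights += [r if k == 1 else 1.0 - r for r in soft]
            w_sum = sum(weights)
            m = sum(w * x for w, x in zip(weights, all_x)) / w_sum
            v = sum(w * (x - m) ** 2 for w, x in zip(weights, all_x)) / w_sum
            priors[k] = w_sum / len(all_x)
            means[k] = m
            variances[k] = max(v, 1e-3)  # floor avoids a collapsed component
    return priors, means, variances, soft


if __name__ == "__main__":
    labeled = [(-0.5, 0), (0.5, 0), (3.5, 1), (4.5, 1)]
    unlabeled = [-0.2, 0.1, 0.3, 3.8, 4.0, 4.3]
    print(semi_supervised_em(labeled, unlabeled))
```

Replacing each soft weight with a hard 0/1 decision would reduce the loop to ordinary self-training, which is the sense in which EM generalizes it.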

artificial intelligence, inductive learning, machine learning, (18 more...)

2008.12442

Country:

North America > United States > Alabama > Tuscaloosa County > Tuscaloosa (0.14)
North America > United States > Texas (0.05)
North America > United States > North Carolina (0.04)
(3 more...)

Genre: Research Report > New Finding (0.34)

Industry: Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.30)

arXiv.org Artificial Intelligence, Aug-26-2020

Decision-making for Autonomous Vehicles on Highway: Deep Reinforcement Learning with Continuous Action Horizon

Liu, Teng, Wang, Hong, Lu, Bing, Li, Jun, Cao, Dongpu

A decision-making strategy for autonomous vehicles describes a sequence of driving maneuvers to achieve a certain navigational mission. This paper utilizes the deep reinforcement learning (DRL) method to address the continuous-horizon decision-making problem on the highway. First, the vehicle kinematics and driving scenario on the freeway are introduced. The running objective of the ego automated vehicle is to execute an efficient and smooth policy without collision. Then, the particular algorithm named proximal policy optimization (PPO)-enhanced DRL is illustrated. To overcome the challenges of slow training and sample inefficiency, the applied algorithm can achieve high learning efficiency and excellent control performance. Finally, the PPO-DRL-based decision-making strategy is evaluated from multiple perspectives, including optimality, learning efficiency, and adaptability. Its potential for online application is discussed by applying it to similar driving scenarios.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2008.11852

Country:

North America > Canada > Ontario > Waterloo Region > Waterloo (0.14)
Asia > China > Beijing > Beijing (0.06)
Asia > China > Chongqing Province > Chongqing (0.05)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Industry:

Transportation > Ground > Road (1.00)
Automobiles & Trucks (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Maoujoud, David, Rens, Gavin

Reputation-driven Decision-making in Networks of Stochastic Agents

arXiv.org Artificial Intelligence, Aug-26-2020

This paper studies multi-agent systems that involve networks of self-interested agents. We propose a Markov Decision Process-derived framework, called RepNet-MDP, tailored to domains in which agent reputation is a key driver of the interactions between agents. The framework builds on the principles of RepNet-POMDP, a framework developed by Rens et al. [11] in 2018, but addresses its mathematical inconsistencies and alleviates its intractability by only considering fully observable environments. We furthermore use an online learning algorithm for finding approximate solutions to RepNet-MDPs. In a series of experiments, RepNet agents are shown to be able to adapt their own behavior to the past behavior and reliability of the remaining agents of the network. Finally, our work identifies a limitation of the framework in its current formulation that prevents its agents from learning in circumstances in which they are not a primary actor.

agent, artificial intelligence, machine learning, (16 more...)

2008.11791

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.89)

Satija, Harsh, Amortila, Philip, Pineau, Joelle

Constrained Markov Decision Processes via Backward Value Functions

arXiv.org Machine Learning, Aug-26-2020

Although Reinforcement Learning (RL) algorithms have found tremendous success in simulated domains, they often cannot directly be applied to physical systems, especially in cases where there are hard constraints to satisfy (e.g. on safety or resources). In standard RL, the agent is incentivized to explore any behavior as long as it maximizes rewards, but in the real world, undesired behavior can damage either the system or the agent in a way that breaks the learning process itself. In this work, we model the problem of learning with constraints as a Constrained Markov Decision Process and provide a new on-policy formulation for solving it. A key contribution of our approach is to translate cumulative cost constraints into state-based constraints. Through this, we define a safe policy improvement method which maximizes returns while ensuring that the constraints are satisfied at every step. We provide theoretical guarantees under which the agent converges while ensuring safety over the course of training. We also highlight the computational advantages of this approach. The effectiveness of our approach is demonstrated on safe navigation tasks and in safety-constrained versions of MuJoCo environments, with deep neural networks.

constraint, machine learning, reinforcement learning, (13 more...)

2008.11811

Country:

North America > Canada > Quebec > Montreal (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Agarwal, Shubhankar, Sikchi, Harshit, Gulino, Cole, Wilkinson, Eric

Imitative Planning using Conditional Normalizing Flow

arXiv.org Artificial Intelligence, Aug-25-2020

We explore the application of normalizing flows for improving the performance of trajectory planning for autonomous vehicles (AVs). Normalizing flows provide an invertible mapping from a known prior distribution to a potentially complex, multi-modal target distribution and allow for fast sampling with exact PDF inference. By modeling a trajectory planner's cost manifold as an energy function we learn a scene conditioned mapping from the prior to a Boltzmann distribution over the AV control space. This mapping allows for control samples and their associated energy to be generated jointly and in parallel. We propose using neural autoregressive flow (NAF) as part of an end-to-end deep learned system that allows for utilizing sensors, map, and route information to condition the flow mapping. Finally, we demonstrate the effectiveness of our approach on real world datasets over IL and hand constructed trajectory sampling techniques.
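The claim that a flow gives fast sampling with exact PDF inference follows from the change-of-variables formula, which a one-dimensional affine flow is enough to illustrate. The sketch below is not the neural autoregressive flow used in the paper and has no scene conditioning. The class `AffineFlow` and its parameters are hypothetical and serve only to show how a sample and its exact log-density come out of the same forward pass.

```python
import math
import random


def standard_normal_logpdf(z):
    """Log-density of the standard normal prior."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


class AffineFlow:
    """Invertible map x = shift + scale * z with z drawn from N(0, 1)."""

    def __init__(self, shift, scale):
        if scale == 0:
            raise ValueError("scale must be nonzero for the map to be invertible")
        self.shift = shift
        self.scale = scale

    def forward(self, z):
        return self.shift + self.scale * z

    def inverse(self, x):
        return (x - self.shift) / self.scale

    def log_prob(self, x):
        # Change of variables: log p_x(x) = log p_z(z) - log |dx/dz|.
        z = self.inverse(x)
        return standard_normal_logpdf(z) - math.log(abs(self.scale))

    def sample_with_log_prob(self, rng):
        # The sample and its exact log-density are produced together.
        z = rng.gauss(0.0, 1.0)
        x = self.forward(z)
        return x, standard_normal_logpdf(z) - math.log(abs(self.scale))


if __name__ == "__main__":
    flow = AffineFlow(shift=2.0, scale=3.0)
    x, logp = flow.sample_with_log_prob(random.Random(0))
    print(x, logp, flow.log_prob(x))
```

If the target is read as a Boltzmann distribution, the negative log-density plays the role of an energy up to an additive constant, which is one way to interpret samples and their energies being generated jointly.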

machine learning, reinforcement learning, trajectory, (19 more...)

2007.16162

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Asia > South Korea > Seoul > Seoul (0.04)

Genre:

Research Report (0.64)
Instructional Material > Course Syllabus & Notes (0.46)

Industry: Transportation > Ground > Road (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
(2 more...)