AITopics | bandit task

Predictive Coding Enhances Meta-RLTo Achieve Interpretable Bayes-Optimal Belief Representation Under Partial Observability

Neural Information Processing SystemsJun-15-2026, 19:32:40 GMT

Learning a compact representation of history is critical for planning and generalization in partially observable environments. While meta-reinforcement learning (RL) agents can attain near Bayes-optimal policies, they often fail to learn the compact, interpretable Bayes-optimal belief states. This representational inefficiency potentially limits the agent's adaptability and generalization capacity. Inspired by predictive coding in neuroscience--which suggests that the brain predicts sensory inputs as a neural implementation of Bayesian inference--and by auxiliary predictive objectives in deep RL, we investigate whether integrating self-supervised predictive coding modules into meta-RL can facilitate learning of Bayes-optimal representations. Through state machine simulation, we show that meta-RL with predictive modules consistently generates more interpretable representations that better approximate Bayes-optimal belief states compared to conventional meta-RL across a wide variety of tasks, even when both achieve optimal policies. In challenging tasks requiring active information seeking, only meta-RL with predictive modules successfully learns optimal representations and policies, whereas conventional meta-RL struggles with inadequate representation learning. Finally, we demonstrate that better representation learning leads to improved generalization. Our results strongly suggest the role of predictive learning as a guiding principle for effective representation learning in agents navigating partial observability.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.88)
Law > Litigation (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
(2 more...)

Add feedback

Provably Efficient Multi-Task Meta Bandit Learning via Shared Representations

Neural Information Processing SystemsJun-13-2026, 09:27:10 GMT

Learning-to-learn or meta-learning focuses on developing algorithms that leverage prior experience to quickly acquire new skills or adapt to novel environments. A crucial component of meta-learning is representation learning, which aims to construct data representations capable of transferring knowledge across multiple tasks--a critical advantage in data-scarce settings. We study how representation learning can improve the efficiency of bandit problems. We consider $T$ $d$-dimensional linear bandits that share a common low-dimensional linear representation. We provide provably fast, sample-efficient algorithms to address the two key problems in meta-learning: (1) learning a common set of features from multiple related bandit tasks and (2) transferring this knowledge to new, unseen bandit tasks.

artificial intelligence, machine learning, proceedings, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

In-Context Impersonation Reveals Large Language Models' Strengths and Biases

Leonard Salewski, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, Zeynep Akata

Neural Information Processing SystemsApr-30-2026, 02:27:55 GMT

In everyday conversations, humans can take on different roles and adapt their vocabulary to their chosen roles. We explore whether LLMs can take on, that is impersonate, different roles when they generate text in-context. We ask LLMs to assume different personas before solving vision and language tasks. We do this by prefixing the prompt with a persona that is associated either with a social identity or domain expertise. In a multi-armed bandit task, we find that LLMs pretending to be children of different ages recover human-like developmental stages of exploration. In a language-based reasoning task, we find that LLMs impersonating domain experts perform better than LLMs impersonating non-domain experts.

large language model, natural language, persona, (18 more...)

Neural Information Processing Systems

Country: North America (0.47)

Genre: Research Report > New Finding (1.00)

Industry:

Automobiles & Trucks (0.93)
Education (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Why so gloomy? A Bayesian explanation of human pessimism bias in the multi-armed bandit task

Neural Information Processing SystemsMar-17-2026, 02:37:20 GMT

How humans make repeated choices among options with imperfectly known reward outcomes is an important problem in psychology and neuroscience. This is often studied using multi-armed bandits, which is also frequently studied in machine learning. We present data from a human stationary bandit experiment, in which we vary the average abundance and variability of reward availability (mean and variance of reward rate distributions). Surprisingly, we find subjects significantly underestimate prior mean of reward rates -- based on their self-report, at the end of a game, on their reward expectation of non-chosen arms. Previously, human learning in the bandit task was found to be well captured by a Bayesian ideal learning model, the Dynamic Belief Model (DBM), albeit under an incorrect generative assumption of the temporal structure - humans assume reward rates can change over time even though they are actually fixed. We find that the pessimism bias in the bandit task is well captured by the prior mean of DBM when fitted to human choices; but it is poorly captured by the prior mean of the Fixed Belief Model (FBM), an alternative Bayesian model that (correctly) assumes reward rates to be constants. This pessimism bias is also incompletely captured by a simple reinforcement learning model (RL) commonly used in neuroscience and psychology, in terms of fitted initial Q-values. While it seems sub-optimal, and thus mysterious, that humans have an underestimated prior reward expectation, our simulations show that an underestimated prior mean helps to maximize long-term gain, if the observer assumes volatility when reward rates are stable and utilizes a softmax decision policy instead of the optimal one (obtainable by dynamic programming). This raises the intriguing possibility that the brain underestimates reward rates to compensate for the incorrect non-stationarity assumption in the generative model and a simplified decision policy.

artificial intelligence, machine learning, proceedings, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

In-Context Impersonation Reveals Large Language Models' Strengths and Biases

Leonard Salewski, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, Zeynep Akata

Neural Information Processing SystemsFeb-17-2026, 15:43:37 GMT

In everyday conversations, humans can take on different roles and adapt their vocabulary to their chosen roles. We explore whether LLMs can take on, that is impersonate, different roles when they generate text in-context.

large language model, machine learning, persona, (20 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.47)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Improved Bayes Regret Bounds for Multi-Task Hierarchical Bayesian Bandit Algorithms Jiechao Guan 1 Hui Xiong 1,2, 1 AI Thrust, The Hong Kong University of Science and Technology (Guangzhou), China

Neural Information Processing SystemsFeb-16-2026, 08:13:58 GMT

Hierarchical Bayesian bandit refers to the multi-task bandit problem in which bandit tasks are assumed to be drawn from the same distribution.

bayes regret, data mining, machine learning, (20 more...)

Neural Information Processing Systems

Country:

Asia > China > Guangdong Province > Guangzhou (0.40)
Asia > China > Hong Kong (0.40)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > Experimental Study (0.92)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Why so gloomy? A Bayesian explanation of human pessimism bias in the multi-armed bandit task

Dalin Guo, Angela J. Yu

Neural Information Processing SystemsFeb-15-2026, 07:07:36 GMT

Neural Information Processing Systems http://nips.cc/

dbm, probability, reward rate, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Diego County > San Diego (0.04)
North America > United States > California > San Diego County > La Jolla (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.47)

Industry: Health & Medicine (0.94)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)
Information Technology > Data Science > Data Mining > Big Data (0.65)

Add feedback

d902c3ce47124c66ce615d5ad9ba304f-Supplemental.pdf

Neural Information Processing SystemsFeb-10-2026, 16:00:36 GMT

agent, dirichlet, prediction task, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

d902c3ce47124c66ce615d5ad9ba304f-Paper.pdf

Neural Information Processing SystemsFeb-10-2026, 16:00:28 GMT

agent, arxiv preprint arxiv, bayes-optimal agent, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Middlesex County > Reading (0.04)
North America > Canada (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Education (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)

Add feedback

Why so gloomy? A Bayesian explanation of human pessimism bias in the multi-armed bandit task

Neural Information Processing SystemsNov-20-2025, 23:16:00 GMT

How humans make repeated choices among options with imperfectly known reward outcomes is an important problem in psychology and neuroscience. This is often studied using multi-armed bandits, which is also frequently studied in machine learning. We present data from a human stationary bandit experiment, in which we vary the average abundance and variability of reward availability (mean and variance of reward rate distributions). Surprisingly, we find subjects significantly underestimate prior mean of reward rates -- based on their self-report, at the end of a game, on their reward expectation of non-chosen arms. Previously, human learning in the bandit task was found to be well captured by a Bayesian ideal learning model, the Dynamic Belief Model (DBM), albeit under an incorrect generative assumption of the temporal structure - humans assume reward rates can change over time even though they are actually fixed. We find that the pessimism bias in the bandit task is well captured by the prior mean of DBM when fitted to human choices; but it is poorly captured by the prior mean of the Fixed Belief Model (FBM), an alternative Bayesian model that (correctly) assumes reward rates to be constants. This pessimism bias is also incompletely captured by a simple reinforcement learning model (RL) commonly used in neuroscience and psychology, in terms of fitted initial Q-values. While it seems sub-optimal, and thus mysterious, that humans have an underestimated prior reward expectation, our simulations show that an underestimated prior mean helps to maximize long-term gain, if the observer assumes volatility when reward rates are stable and utilizes a softmax decision policy instead of the optimal one (obtainable by dynamic programming). This raises the intriguing possibility that the brain underestimates reward rates to compensate for the incorrect non-stationarity assumption in the generative model and a simplified decision policy.

bandit task, pessimism bias, reward rate, (10 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Filters

Collaborating Authors

bandit task

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Predictive Coding Enhances Meta-RLTo Achieve Interpretable Bayes-Optimal Belief Representation Under Partial Observability

Provably Efficient Multi-Task Meta Bandit Learning via Shared Representations

In-Context Impersonation Reveals Large Language Models' Strengths and Biases

Why so gloomy? A Bayesian explanation of human pessimism bias in the multi-armed bandit task

In-Context Impersonation Reveals Large Language Models' Strengths and Biases

Improved Bayes Regret Bounds for Multi-Task Hierarchical Bayesian Bandit Algorithms Jiechao Guan 1 Hui Xiong 1,2, 1 AI Thrust, The Hong Kong University of Science and Technology (Guangzhou), China

Why so gloomy? A Bayesian explanation of human pessimism bias in the multi-armed bandit task

d902c3ce47124c66ce615d5ad9ba304f-Supplemental.pdf

d902c3ce47124c66ce615d5ad9ba304f-Paper.pdf

Why so gloomy? A Bayesian explanation of human pessimism bias in the multi-armed bandit task