AITopics | belief model

Why so gloomy? A Bayesian explanation of human pessimism bias in the multi-armed bandit task

Neural Information Processing Systems, Mar-17-2026, 02:37:20 GMT

How humans make repeated choices among options with imperfectly known reward outcomes is an important problem in psychology and neuroscience. This is often studied using multi-armed bandits, which is also frequently studied in machine learning. We present data from a human stationary bandit experiment, in which we vary the average abundance and variability of reward availability (mean and variance of reward rate distributions). Surprisingly, we find subjects significantly underestimate prior mean of reward rates -- based on their self-report, at the end of a game, on their reward expectation of non-chosen arms. Previously, human learning in the bandit task was found to be well captured by a Bayesian ideal learning model, the Dynamic Belief Model (DBM), albeit under an incorrect generative assumption of the temporal structure - humans assume reward rates can change over time even though they are actually fixed. We find that the pessimism bias in the bandit task is well captured by the prior mean of DBM when fitted to human choices; but it is poorly captured by the prior mean of the Fixed Belief Model (FBM), an alternative Bayesian model that (correctly) assumes reward rates to be constants. This pessimism bias is also incompletely captured by a simple reinforcement learning model (RL) commonly used in neuroscience and psychology, in terms of fitted initial Q-values. While it seems sub-optimal, and thus mysterious, that humans have an underestimated prior reward expectation, our simulations show that an underestimated prior mean helps to maximize long-term gain, if the observer assumes volatility when reward rates are stable and utilizes a softmax decision policy instead of the optimal one (obtainable by dynamic programming). This raises the intriguing possibility that the brain underestimates reward rates to compensate for the incorrect non-stationarity assumption in the generative model and a simplified decision policy.
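The abstract above contrasts a learner that assumes changing reward rates (DBM) with one that assumes fixed rates (FBM), both paired with a softmax choice rule. The sketch below is only an illustration of that contrast, not the authors' code. It assumes binary rewards, a discretized grid of reward rates, a Beta-shaped prior, and the commonly described DBM form in which each arm's rate is kept with probability `gamma` and otherwise redrawn from the prior; setting `gamma = 1.0` gives the fixed-rate learner. All function names, the grid size, and the parameter values are hypothetical.

```python
import numpy as np

def make_prior(grid, mean, strength):
    """Discretized Beta-shaped prior over reward rates (illustrative choice)."""
    a, b = mean * strength, (1.0 - mean) * strength
    w = grid ** (a - 1.0) * (1.0 - grid) ** (b - 1.0)
    return w / w.sum()

def predict(belief, prior, gamma):
    """Assumed DBM prediction step: keep the rate with prob. gamma, else redraw
    from the prior. gamma = 1.0 leaves the belief unchanged (FBM)."""
    return gamma * belief + (1.0 - gamma) * prior

def update(belief, grid, reward):
    """Bayes update of one arm's belief after a Bernoulli outcome (0 or 1)."""
    likelihood = grid if reward == 1 else 1.0 - grid
    posterior = belief * likelihood
    return posterior / posterior.sum()

def softmax(values, beta):
    """Softmax choice probabilities with inverse temperature beta."""
    z = beta * (values - values.max())  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def simulate(true_rates, prior_mean, gamma, beta, n_trials, rng):
    """Play one stationary bandit game; return total reward and final belief means."""
    grid = np.linspace(0.005, 0.995, 100)
    prior = make_prior(grid, prior_mean, strength=4.0)
    beliefs = np.tile(prior, (len(true_rates), 1))
    total = 0
    for _ in range(n_trials):
        # Every arm drifts toward the prior, chosen or not (no-op when gamma = 1).
        beliefs = predict(beliefs, prior, gamma)
        means = beliefs @ grid
        arm = rng.choice(len(true_rates), p=softmax(means, beta))
        reward = int(rng.random() < true_rates[arm])
        beliefs[arm] = update(beliefs[arm], grid, reward)
        total += reward
    return total, beliefs @ grid

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rates = np.array([0.2, 0.4, 0.6, 0.8])
    for label, gamma in [("FBM", 1.0), ("DBM", 0.8)]:
        total, means = simulate(rates, prior_mean=0.3, gamma=gamma,
                                beta=5.0, n_trials=200, rng=rng)
        print(label, total, np.round(means, 2))
```

Under these assumptions, sweeping `prior_mean` for a fixed `gamma < 1` and averaging `total` over many games is one way to explore the kind of simulation the abstract describes; the sketch itself makes no claim about which prior mean does best.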

artificial intelligence, machine learning, proceedings, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

A Deep Bayesian Policy Reuse Approach Against Non-Stationary Agents

YAN ZHENG, Zhaopeng Meng, Jianye Hao, Zongzhang Zhang, Tianpei Yang, Changjie Fan

Neural Information Processing Systems, Feb-13-2026, 13:17:08 GMT

In multiagent domains, coping with non-stationary agents that change behaviors from time to time is a challenging problem, where an agent is usually required to be able to quickly detect the other agent's policy during online interaction, and then adapt its own policy accordingly.

artificial intelligence, machine learning, opponent, (17 more...)

Neural Information Processing Systems

Country:

Asia > China > Tianjin Province > Tianjin (0.05)
North America > Canada > Quebec > Montreal (0.04)
Asia > Middle East > Jordan (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Demystifying excessively volatile human learning: A Bayesian persistent prior and a neural approximation

Chaitanya Ryali, Gautam Reddy, Angela J. Yu

Neural Information Processing Systems, Feb-13-2026, 08:19:42 GMT

Neural Information Processing Systems http://nips.cc/

approximation, artificial intelligence, machine learning, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Diego County > San Diego (0.04)
North America > United States > California > San Diego County > La Jolla (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

631f99d8e860054410c239fc90d18270-Supplemental-Conference.pdf

Neural Information Processing Systems, Feb-9-2026, 10:23:52 GMT

agent, belief model, replay buffer, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.42)

Off-Team Learning

Neural Information Processing Systems, Feb-9-2026, 10:23:48 GMT

Second, OBL policies may be brittle when paired with a novel test-time partner, e.g.

artificial intelligence, belief model, machine learning, (18 more...)

Neural Information Processing Systems

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.68)

Off-Team Learning

Neural Information Processing Systems, Dec-24-2025, 08:22:11 GMT

Zero-shot coordination (ZSC) evaluates an algorithm by the performance of a team of agents that were trained independently under that algorithm. Off-belief learning (OBL) is a recent method that achieves state-of-the-art results in ZSC in the game Hanabi. However, the implementation of OBL relies on a belief model that experiences covariate shift. Moreover, during ad-hoc coordination, OBL or any other neural policy may experience test-time covariate shift.

electronic proceedings, name change, off-team learning, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.78)

Why so gloomy? A Bayesian explanation of human pessimism bias in the multi-armed bandit task

Neural Information Processing Systems, Nov-20-2025, 23:16:00 GMT

How humans make repeated choices among options with imperfectly known reward outcomes is an important problem in psychology and neuroscience. This is often studied using multi-armed bandits, which is also frequently studied in machine learning. We present data from a human stationary bandit experiment, in which we vary the average abundance and variability of reward availability (mean and variance of reward rate distributions). Surprisingly, we find subjects significantly underestimate prior mean of reward rates -- based on their self-report, at the end of a game, on their reward expectation of non-chosen arms. Previously, human learning in the bandit task was found to be well captured by a Bayesian ideal learning model, the Dynamic Belief Model (DBM), albeit under an incorrect generative assumption of the temporal structure - humans assume reward rates can change over time even though they are actually fixed. We find that the pessimism bias in the bandit task is well captured by the prior mean of DBM when fitted to human choices; but it is poorly captured by the prior mean of the Fixed Belief Model (FBM), an alternative Bayesian model that (correctly) assumes reward rates to be constants. This pessimism bias is also incompletely captured by a simple reinforcement learning model (RL) commonly used in neuroscience and psychology, in terms of fitted initial Q-values. While it seems sub-optimal, and thus mysterious, that humans have an underestimated prior reward expectation, our simulations show that an underestimated prior mean helps to maximize long-term gain, if the observer assumes volatility when reward rates are stable and utilizes a softmax decision policy instead of the optimal one (obtainable by dynamic programming). This raises the intriguing possibility that the brain underestimates reward rates to compensate for the incorrect non-stationarity assumption in the generative model and a simplified decision policy.

bandit task, pessimism bias, reward rate, (10 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

A Deep Bayesian Policy Reuse Approach Against Non-Stationary Agents

YAN ZHENG, Zhaopeng Meng, Jianye Hao, Zongzhang Zhang, Tianpei Yang, Changjie Fan

Neural Information Processing Systems, Nov-20-2025, 17:57:46 GMT

There also exist many application scenarios involving multiagent interactions, commonly known as multiagent systems (MAS).

artificial intelligence, opponent, response policy, (14 more...)

Neural Information Processing Systems

Country:

Asia > China > Tianjin Province > Tianjin (0.05)
North America > Canada > Quebec > Montreal (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
(2 more...)

Industry: Leisure & Entertainment (0.93)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

Demystifying excessively volatile human learning: A Bayesian persistent prior and a neural approximation

Chaitanya Ryali, Gautam Reddy, Angela J. Yu

Neural Information Processing Systems, Nov-20-2025, 17:33:51 GMT

It has been found that DBM captures human behavior better than FBM, even though the latter more veridically captures experimental design in a variety of tasks, e.g.

approximation, artificial intelligence, machine learning, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Diego County > San Diego (0.04)
North America > United States > California > San Diego County > La Jolla (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

A Experimental Details

Neural Information Processing Systems, Aug-15-2025, 07:54:35 GMT

We dynamically batch model calls onto the GPU in order to increase inference speed. For OBL, there are dependencies between policy and belief training. The entire inference and training infrastructure for a single policy or belief model uses a machine with 30 CPU cores and 2 GPUs, one GPU for training and one for simulation. We use their public-lstm architecture design. We use a 3-layer feedforward neural network to encode the entire private observation.

agent, belief model, replay buffer, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
