 preference reversal


Reinforcement Learning with Non-Exponential Discounting

Neural Information Processing Systems

Commonly in reinforcement learning (RL), rewards are discounted over time using an exponential function to model time preference, thereby bounding the expected long-term reward. In contrast, in economics and psychology, it has been shown that humans often adopt a hyperbolic discounting scheme, which is optimal when a specific task termination time distribution is assumed. In this work, we propose a theory for continuous-time model-based reinforcement learning generalized to arbitrary discount functions. This formulation covers the case in which there is a non-exponential random termination time. We derive a Hamilton-Jacobi-Bellman (HJB) equation characterizing the optimal policy and describe how it can be solved using a collocation method, which uses deep learning for function approximation. Further, we show how the inverse RL problem can be approached, in which one tries to recover properties of the discount function given decision data. We validate the applicability of our proposed approach on two simulated problems. Our approach opens the way for the analysis of human discounting in sequential decision-making tasks.
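
To make the contrast between the two discounting schemes concrete, here is a minimal sketch that is not taken from the paper. It compares an exponential discount function exp(-rate * t) with a hyperbolic one 1 / (1 + k * t) on a choice between a small early reward and a larger later one. The reward sizes, times, and the parameters `rate` and `k` are illustrative assumptions. Under the exponential scheme the preferred option stays the same when both rewards are pushed further into the future, whereas under the hyperbolic scheme the preference can flip.

```python
import math

def exponential_discount(t, rate):
    """Exponential discount factor exp(-rate * t)."""
    return math.exp(-rate * t)

def hyperbolic_discount(t, k):
    """Hyperbolic discount factor 1 / (1 + k * t)."""
    return 1.0 / (1.0 + k * t)

def preferred(discount, small, large, delay):
    """Name the option with the higher discounted value.

    `small` and `large` are (reward, time) pairs; `delay` shifts both
    options further into the future by the same amount.
    """
    (r_small, t_small), (r_large, t_large) = small, large
    v_small = r_small * discount(t_small + delay)
    v_large = r_large * discount(t_large + delay)
    return "small-soon" if v_small > v_large else "large-late"

# Illustrative (assumed) options: reward 1 at t=1 versus reward 2 at t=3.
small = (1.0, 1.0)
large = (2.0, 3.0)
delays = (0.0, 10.0)

exp_choices = [
    preferred(lambda t: exponential_discount(t, 0.5), small, large, d)
    for d in delays
]
hyp_choices = [
    preferred(lambda t: hyperbolic_discount(t, 2.0), small, large, d)
    for d in delays
]

print("exponential:", exp_choices)
print("hyperbolic: ", hyp_choices)
```

With these assumed numbers, the exponential agent makes the same choice at both delays because the ratio of the two discounted values does not depend on the shift, while the hyperbolic agent switches from the early reward to the later one once both are far enough away.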



Classifying Inconsistency in AHP Pairwise Comparison Matrices Using Machine Learning

arXiv.org Machine Learning

Assessing consistency in Pairwise Comparison Matrices (PCMs) within the Analytical Hierarchy Process (AHP) poses significant challenges when using the traditional Consistency Ratio (CR) method. This study introduces a novel alternative that leverages triadic preference reversals (PR) to provide a more robust and interpretable assessment of consistency. Triadic preference reversals capture inconsistencies between a pair of elements by comparing the direction of preference derived from the global eigenvector with that from a 3x3 submatrix (triad) containing the same pair, highlighting local-global preference conflicts. This method detects a reversal when one eigen ratio exceeds one while another falls below one, signaling inconsistency. We identify two key features: the proportion of preference reversals and the maximum reversal, which mediate the impact of a PCM's order on its consistency. Using these features, simulated PCMs are clustered into consistent and inconsistent classes through k-means clustering, followed by training a logistic classifier for consistency evaluation. The PR method achieves 97% accuracy, significantly surpassing the Consistency Ratio (CR) method's 50%, with a false negative rate of only 2.6% compared to 5.5%. These findings demonstrate the PR method's superior accuracy in assessing AHP consistency, thereby enabling more reliable decision-making. The proposed triadic preference reversal (PR) approach is implemented in the R package AHPtools, publicly available on the Comprehensive R Archive Network (CRAN).
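
The reversal check described in this abstract can be sketched roughly as follows. This is an illustration written from the abstract's description only, not the AHPtools implementation: the function names, the use of power iteration to approximate the principal eigenvector, and the two example matrices are all assumptions made here. For each pair of elements, the sketch compares the ratio of their weights in the global eigenvector with the ratio in the eigenvector of each triad containing the pair, and flags a reversal when one ratio is above one and the other is below one.

```python
from itertools import combinations

def principal_eigenvector(matrix, iterations=200):
    """Approximate the principal eigenvector of a positive matrix by
    power iteration, normalized so that the entries sum to one."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        w = [x / total for x in w]
    return w

def triadic_reversals(pcm):
    """Return (i, j, triad) entries where the triad's eigenvector ranks
    elements i and j in the opposite order to the global eigenvector."""
    global_w = principal_eigenvector(pcm)
    reversals = []
    for triad in combinations(range(len(pcm)), 3):
        sub = [[pcm[a][b] for b in triad] for a in triad]
        local_w = principal_eigenvector(sub)
        for p, q in combinations(range(3), 2):
            i, j = triad[p], triad[q]
            global_ratio = global_w[i] / global_w[j]
            local_ratio = local_w[p] / local_w[q]
            # One ratio above one and the other below one: a reversal.
            if (global_ratio - 1.0) * (local_ratio - 1.0) < 0:
                reversals.append((i, j, triad))
    return reversals

# Assumed example 1: a perfectly consistent PCM built from weights 8:4:2:1.
weights = [8.0, 4.0, 2.0, 1.0]
consistent = [[wi / wj for wj in weights] for wi in weights]

# Assumed example 2: a reciprocal PCM with conflicting judgments.
inconsistent = [
    [1.0,       1.2,       1.0, 9.0],
    [1.0 / 1.2, 1.0,       3.0, 1.0],
    [1.0,       1.0 / 3.0, 1.0, 1.0],
    [1.0 / 9.0, 1.0,       1.0, 1.0],
]

print(triadic_reversals(consistent))
print(triadic_reversals(inconsistent))
```

In the second matrix, element 0 outweighs element 1 globally, largely because of its strong judgment against element 3, yet within the triad (0, 1, 2) element 1 comes out ahead, so that pair is flagged. Summary features such as the proportion of reversals could be derived from the returned list, though the paper's exact definitions are not given in the abstract.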


Pacos: Modeling Users' Interpretable and Context-Dependent Choices in Preference Reversals

arXiv.org Artificial Intelligence

Choice problems refer to the problem of selecting the best choices from several available items, and learning users' preferences in choice problems is of great importance in understanding users' decision-making mechanisms and providing personalized services. Existing works typically assume that people evaluate items independently. In practice, however, users' preferences depend on the market in which items are placed, which is known as context effects; the order of users' preferences for two items may even be reversed, which is called a preference reversal. In this work, we identify three factors contributing to context effects: users' adaptive weights, inter-item comparison, and display positions. We propose a context-dependent preference model named Pacos as a unified framework to address these three factors simultaneously, and consider two design methods: an additive method with high interpretability and an ANN-based method with high accuracy. We study the conditions for preference reversals to occur and provide a theoretical proof of the effectiveness of Pacos in predicting when preference reversals would occur. Experimental results show that the proposed method outperforms prior works in predicting users' choices and is highly interpretable, which helps explain the cause of preference reversals.

Choice problems, such as purchasing a festival gift or picking a restaurant, involve comparing several available items. Previous works on preference modeling and analysis typically assume that people evaluate items independently, and that the relative preference between two items is fixed regardless of other competing options [1]. However, numerous studies show that this independence assumption is frequently violated in reality [2], [3]. It is essential to model how the relative preference is influenced by competing options and to figure out how people select their best choices. This study can help understand users' decision-making mechanisms and offer personalized services, and can provide important guidelines on pricing strategies and sales forecasts. To show this independence violation, we conduct a real user test. In our test, we set up two markets for a Xiaomi scale, as shown in Figure 1 (a) and (b). In these two markets, we consider sellers described by two attributes: price (¥) and seller reputation (REP).


Reinforcement Learning with Non-Exponential Discounting

arXiv.org Artificial Intelligence

Commonly in reinforcement learning (RL), rewards are discounted over time using an exponential function to model time preference, thereby bounding the expected long-term reward. In contrast, in economics and psychology, it has been shown that humans often adopt a hyperbolic discounting scheme, which is optimal when a specific task termination time distribution is assumed. In this work, we propose a theory for continuous-time model-based reinforcement learning generalized to arbitrary discount functions. This formulation covers the case in which there is a non-exponential random termination time. We derive a Hamilton-Jacobi-Bellman (HJB) equation characterizing the optimal policy and describe how it can be solved using a collocation method, which uses deep learning for function approximation. Further, we show how the inverse RL problem can be approached, in which one tries to recover properties of the discount function given decision data. We validate the applicability of our proposed approach on two simulated problems. Our approach opens the way for the analysis of human discounting in sequential decision-making tasks.


How can we make sure that algorithms are fair?

#artificialintelligence

Using machines to augment human activity is nothing new. Egyptian hieroglyphs show the use of horse-drawn carriages even before 300 B.C. Ancient Indian literature such as "Silapadikaram" has described animals being used for farming. And one glance outside shows that today people use motorized vehicles to get around. Where in the past human beings augmented themselves in physical ways, the nature of augmentation is now also more intelligent. Again, all one needs to do is look to cars – engineers are seemingly on the cusp of self-driving cars guided by artificial intelligence.