regret value
Contextual Bandit with Herding Effects: Algorithms and Recommendation Applications
Xu, Luyue, Wang, Liming, Xie, Hong, Zhou, Mingqiang
Contextual bandits serve as a fundamental algorithmic framework for optimizing recommendation decisions online. Though extensive attention has been paid to tailoring contextual bandits for recommendation applications, the "herding effects" in user feedback have been ignored. These herding effects bias user feedback toward historical ratings, breaking down the assumption of unbiased feedback inherent in contextual bandits. This paper develops a novel variant of the contextual bandit that is tailored to address the feedback bias caused by the herding effects. A user feedback model is formulated to capture this feedback bias. We design the TS-Conf (Thompson Sampling under Conformity) algorithm, which employs posterior sampling to balance the exploration and exploitation tradeoff. We prove an upper bound for the regret of the algorithm, revealing the impact of herding effects on learning speed. Extensive experiments on datasets demonstrate that TS-Conf outperforms four benchmark algorithms. Analysis reveals that TS-Conf effectively mitigates the negative impact of herding effects, resulting in faster learning and improved recommendation accuracy.
NNCFR: Minimize Counterfactual Regret with Neural Networks
Li, Huale, Wang, Xuan, Guo, Zengyue, Zhang, Jiajia, Qi, Shuhan
Counterfactual Regret Minimization (CFR)} is the popular method for finding approximate Nash equilibrium in two-player zero-sum games with imperfect information. CFR solves games by travsersing the full game tree iteratively, which limits its scalability in larger games. When applying CFR to solve large-scale games in previously, large-scale games are abstracted into small-scale games firstly. Secondly, CFR is used to solve the abstract game. And finally, the solution strategy is mapped back to the original large-scale game. However, this process requires considerable expert knowledge, and the accuracy of abstraction is closely related to expert knowledge. In addition, the abstraction also loses certain information, which will eventually affect the accuracy of the solution strategy. Towards this problem, a recent method, \textit{Deep CFR} alleviates the need for abstraction and expert knowledge by applying deep neural networks directly to CFR in full games. In this paper, we introduces \textit{Neural Network Counterfactual Regret Minimization (NNCFR)}, an improved variant of \textit{Deep CFR} that has a faster convergence by constructing a dueling netwok as the value network. Moreover, an evaluation module is designed by combining the value network and Monte Carlo, which reduces the approximation error of the value network. In addition, a new loss function is designed in the procedure of training policy network in the proposed \textit{NNCFR}, which can be good to make the policy network more stable. The extensive experimental tests are conducted to show that the \textit{NNCFR} converges faster and performs more stable than \textit{Deep CFR}, and outperforms \textit{Deep CFR} with respect to exploitability and head-to-head performance on test games.
- North America > United States > Texas (0.05)
- Asia > China > Heilongjiang Province > Harbin (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
CFR-MIX: Solving Imperfect Information Extensive-Form Games with Combinatorial Action Space
Li, Shuxin, Zhang, Youzhi, Wang, Xinrun, Xue, Wanqi, An, Bo
In many real-world scenarios, a team of agents coordinate with each other to compete against an opponent. The challenge of solving this type of game is that the team's joint action space grows exponentially with the number of agents, which results in the inefficiency of the existing algorithms, e.g., Counterfactual Regret Minimization (CFR). To address this problem, we propose a new framework of CFR: CFR-MIX. Firstly, we propose a new strategy representation that represents a joint action strategy using individual strategies of all agents and a consistency relationship to maintain the cooperation between agents. To compute the equilibrium with individual strategies under the CFR framework, we transform the consistency relationship between strategies to the consistency relationship between the cumulative regret values. Furthermore, we propose a novel decomposition method over cumulative regret values to guarantee the consistency relationship between the cumulative regret values. Finally, we introduce our new algorithm CFR-MIX which employs a mixing layer to estimate cumulative regret values of joint actions as a non-linear combination of cumulative regret values of individual actions. Experimental results show that CFR-MIX outperforms existing algorithms on various games significantly.
- Asia > Singapore (0.14)
- North America > Canada > Alberta (0.14)
- North America > United States > Texas (0.04)
Algorithms for Average Regret Minimization
Storandt, Sabine (University of Konstanz) | Funke, Stefan (University of Stuttgart)
In this paper, we study a problem from the realm of multi-criteria decision making in which the goal is to select from a given set S of d-dimensional objects a minimum sized subset S' with bounded regret. Thereby, regret measures the unhappiness of users which would like to select their favorite object from set S but now can only select their favorite object from the subset S'. Previous work focused on bounding the maximum regret which is determined by the most unhappy user. We propose to consider the average regret instead which is determined by the sum of (un)happiness of all possible users. We show that this regret measure comes with desirable properties as supermodularity which allows to construct approximation algorithms. Furthermore, we introduce the regret minimizing permutation problem and discuss extensions of our algorithms to the recently proposed k-regret measure. Our theoretical results are accompanied with experiments on a variety of inputs with d up to 7.
- North America > United States > California > San Francisco County > San Francisco (0.15)
- Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.05)
- North America > United States > Hawaii (0.05)
The Impatient May Use Limited Optimism to Minimize Regret
Cadilhac, Michaël, Pérez, Guillermo A., Bogaard, Marie van den
Discounted-sum games provide a formal model for the study of reinforcement learning, where the agent is enticed to get rewards early since later rewards are discounted. When the agent interacts with the environment, she may regret her actions, realizing that a previous choice was suboptimal given the behavior of the environment. The main contribution of this paper is a PSPACE algorithm for computing the minimum possible regret of a given game. To this end, several results of independent interest are shown. (1) We identify a class of regret-minimizing and admissible strategies that first assume that the environment is collaborating, then assume it is adversarial---the precise timing of the switch is key here. (2) Disregarding the computational cost of numerical analysis, we provide an NP algorithm that checks that the regret entailed by a given time-switching strategy exceeds a given value. (3) We show that determining whether a strategy minimizes regret is decidable in PSPACE.
- North America > United States (0.14)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Netherlands > North Brabant > Eindhoven (0.04)
- (7 more...)
Sampling Based Approaches for Minimizing Regret in Uncertain Markov Decision Processes (MDPs)
Ahmed, Asrar, Varakantham, Pradeep, Lowalekar, Meghna, Adulyasak, Yossiri, Jaillet, Patrick
Markov Decision Processes (MDPs) are an effective model to represent decision processes in the presence of transitional uncertainty and reward tradeoffs. However, due to the difficulty in exactly specifying the transition and reward functions in MDPs, researchers have proposed uncertain MDP models and robustness objectives in solving those models. Most approaches for computing robust policies have focused on the computation of maximin policies which maximize the value in the worst case amongst all realisations of uncertainty. Given the overly conservative nature of maximin policies, recent work has proposed minimax regret as an ideal alternative to the maximin objective for robust optimization. However, existing algorithms for handling minimax regret are restricted to models with uncertainty over rewards only and they are also limited in their scalability. Therefore, we provide a general model of uncertain MDPs that considers uncertainty over both transition and reward functions. Furthermore, we also consider dependence of the uncertainty across different states and decision epochs. We also provide a mixed integer linear program formulation for minimizing regret given a set of samples of the transition and reward functions in the uncertain MDP. In addition, we provide two myopic variants of regret, namely Cumulative Expected Myopic Regret (CEMR) and One Step Regret (OSR) that can be optimized in a scalable manner. Specifically, we provide dynamic programming and policy iteration based algorithms to optimize CEMR and OSR respectively. Finally, to demonstrate the effectiveness of our approaches, we provide comparisons on two benchmark problems from literature. We observe that optimizing the myopic variants of regret, OSR and CEMR are better than directly optimizing the regret.
- Asia > Singapore (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Indiana (0.04)
- North America > Canada > Quebec > Montreal (0.04)