Contextual bandits with entropy-based human feedback
Raihan Seraj, Lili Meng, Tristan Sylvain
In recent years, preference-based human feedback mechanisms have become essential for enhancing model performance across diverse applications, including conversational AI systems such as ChatGPT. However, existing approaches often neglect critical aspects, such as model uncertainty and the variability in feedback quality. To address these challenges, we introduce an entropy-based human feedback framework for contextual bandits, which dynamically balances exploration and exploitation by soliciting expert feedback only when model entropy exceeds a predefined threshold.

This work investigates how explicit human feedback can enhance CB performance. Building on successful integrations of human guidance in reinforcement learning (Christiano et al., 2017; MacGlashan et al., 2017) and conversational AI (Achiam et al., 2023), we distinguish two primary feedback paradigms: (1) action-based feedback, where experts directly prescribe optimal actions for specific contexts (Osa et al., 2018; Li et al., 2023), and (2) preference-based feedback, where humans compare pairs of learner-generated actions to express relative preferences (Christiano et al., 2017; Saha et al., 2023). While action-based methods require precise expert knowledge, we focus on preference feedback for its practical advantages in scalable data collection, notably ...
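As a rough illustration of the mechanism described in the abstract, the sketch below implements an entropy-gated feedback loop for a softmax contextual bandit: the learner queries a (simulated) pairwise preference only when the entropy of its action distribution exceeds a threshold, and otherwise exploits and learns from the bandit reward. The policy parameterization, update rules, and names such as `entropy_threshold` and `query_preference` are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

# Minimal sketch (assumptions throughout): softmax contextual bandit with
# linear per-action scores; preference feedback is requested only when the
# action-distribution entropy exceeds a threshold.
rng = np.random.default_rng(0)

n_actions, dim = 4, 5
true_theta = rng.normal(size=(n_actions, dim))  # hidden per-action reward weights
theta_hat = np.zeros((n_actions, dim))          # learner's estimates
lr = 0.1
entropy_threshold = 1.0                         # assumed hyperparameter (nats)

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def query_preference(context, a, b):
    """Simulated human: prefers the action with higher true expected reward."""
    return a if true_theta[a] @ context >= true_theta[b] @ context else b

for t in range(2000):
    context = rng.normal(size=dim)
    scores = theta_hat @ context
    probs = softmax(scores)

    if entropy(probs) > entropy_threshold:
        # Uncertain: sample two candidate actions and ask for a preference.
        a, b = rng.choice(n_actions, size=2, replace=False, p=probs)
        winner = query_preference(context, a, b)
        loser = b if winner == a else a
        # Push the preferred action's score up and the other down.
        margin = (theta_hat[winner] - theta_hat[loser]) @ context
        grad = 1.0 / (1.0 + np.exp(margin))  # sigmoid(-margin): logistic-loss gradient magnitude
        theta_hat[winner] += lr * grad * context
        theta_hat[loser] -= lr * grad * context
    else:
        # Confident: exploit and learn from the bandit reward alone.
        action = int(np.argmax(probs))
        reward = true_theta[action] @ context + rng.normal(scale=0.1)
        theta_hat[action] += lr * (reward - scores[action]) * context
```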
arXiv.org Artificial Intelligence
Feb-12-2025
- Country: North America > Canada > Quebec > Montreal (0.14)
- Genre: Research Report > New Finding (0.67)
- Technology: