Continuous Mean-Covariance Bandits
Neural Information Processing Systems
In continuous mean-covariance bandits (CMCB), a learner sequentially chooses weight vectors on given options and observes random feedback according to its decisions. The learner's objective is to achieve the best trade-off between reward and risk, where risk is measured by the option covariance. To capture different reward observation scenarios in practice, we consider three feedback settings, i.e., full-information, semi-bandit and full-bandit feedback. We propose novel algorithms with optimal regrets (within logarithmic factors), and provide matching lower bounds to validate their optimality. The experimental results also demonstrate the superiority of our algorithms.
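As a rough illustration of the setting described above, the following Python sketch evaluates a weight vector and simulates the three feedback types. Everything specific here is an assumption for illustration and is not stated in the abstract: the objective form (expected reward minus `rho` times the covariance-weighted risk), the risk-aversion parameter `rho`, the function names, and the exact content of each feedback type.

```python
def mean_covariance_objective(w, theta, sigma, rho):
    """Assumed objective: expected reward w^T theta minus rho * w^T Sigma w.

    w     -- weight vector over the options (the learner's decision)
    theta -- assumed expected rewards of the options
    sigma -- assumed covariance matrix of the option rewards (list of lists)
    rho   -- hypothetical risk-aversion parameter
    """
    d = len(w)
    reward = sum(w[i] * theta[i] for i in range(d))
    risk = sum(w[i] * sigma[i][j] * w[j] for i in range(d) for j in range(d))
    return reward - rho * risk


def observe(w, rewards, setting):
    """Return what the learner sees under each (assumed) feedback setting."""
    if setting == "full-information":
        # Assumption: rewards of all options are revealed.
        return list(rewards)
    if setting == "semi-bandit":
        # Assumption: only options with positive weight are revealed.
        return [r if wi > 0 else None for wi, r in zip(w, rewards)]
    if setting == "full-bandit":
        # Assumption: only the weighted sum of rewards is revealed.
        return sum(wi * r for wi, r in zip(w, rewards))
    raise ValueError(f"unknown feedback setting: {setting}")


if __name__ == "__main__":
    w = [0.5, 0.5, 0.0]
    theta = [1.0, 2.0, 3.0]
    sigma = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(mean_covariance_objective(w, theta, sigma, rho=2.0))
    for setting in ("full-information", "semi-bandit", "full-bandit"):
        print(setting, observe(w, [1.0, 2.0, 3.0], setting))
```

Under these assumptions, a larger `rho` penalizes weight vectors that concentrate on high-variance or positively correlated options, which is one way to picture the reward-risk trade-off.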
Feb-7-2026, 08:55:12 GMT