Continuous Mean-Covariance Bandits

Neural Information Processing Systems 

Specifically, in CMCB, there is a learner who sequentially chooses weight vectors over given options and observes random feedback according to these decisions. The learner's objective is to achieve the best trade-off between reward and risk, where risk is measured with the option covariance. To capture different reward observation scenarios in practice, we consider three feedback settings, i.e., full-information, semi-bandit, and full-bandit feedback. We propose novel algorithms with optimal regrets (within logarithmic factors), and provide matching lower bounds to validate their optimality. The experimental results also demonstrate the superiority of our algorithms.
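The reward-risk trade-off described above can be illustrated with a minimal sketch. The abstract does not give the paper's exact objective, so this assumes a standard mean-variance formulation: a weight vector `w` over `d` options is scored by its expected reward minus a risk term based on the option covariance. The function name, the risk-aversion coefficient `rho`, and the specific form of the objective are all illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def mean_variance_objective(w, mu, sigma, rho=1.0):
    """Score a weight vector w by reward minus covariance-based risk.

    Assumed (illustrative) objective: f(w) = w^T mu - rho * w^T Sigma w,
    where mu is the mean reward vector, Sigma the option covariance matrix,
    and rho > 0 a risk-aversion coefficient.
    """
    reward = w @ mu            # expected reward of the weighted portfolio
    risk = w @ sigma @ w       # variance contributed by option covariance
    return reward - rho * risk

# Usage: two options; the second has a higher mean but higher variance,
# so splitting weight can beat concentrating on either option alone.
mu = np.array([0.5, 0.8])
sigma = np.array([[0.1, 0.0],
                  [0.0, 0.4]])
w = np.array([0.5, 0.5])
score = mean_variance_objective(w, mu, sigma, rho=1.0)
```

Under full-information feedback the learner would observe enough to estimate `mu` and `sigma` directly, while semi-bandit and full-bandit feedback reveal progressively less, which is what drives the different regret guarantees.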
