Review for NeurIPS paper: Differentiable Meta-Learning of Bandit Policies


As in standard policy-gradient methods, two key parameters appear to be the batch size m and the horizon n. It would be good to provide a sensitivity analysis on these parameters to better assess how the approach scales to complex problems. In particular, what is the effect of the horizon on the gradient estimation? Does the variance blow up as n grows, or is the baseline sufficient to keep it under control? In this sense, it would be valuable to have differentiable strategies that are provably efficient (e.g., with sub-linear regret) across a range of parameter values, so that whatever value of \theta is encountered during optimization does not perform poorly.
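To make the variance concern concrete, here is a minimal sketch (my own illustration, not the paper's method) of a REINFORCE-style score-function gradient estimator for a two-armed Bernoulli bandit with a sigmoid policy: it compares the empirical variance of the batch gradient estimate, at a fixed horizon n and batch size m, with and without a batch-mean baseline. All names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_episode(theta, means, n, rng):
    """One episode of horizon n: per-step score terms d/dtheta log pi(a_t) and rewards."""
    p = sigmoid(theta)                         # probability of pulling arm 1
    arms = (rng.random(n) < p).astype(int)
    rewards = (rng.random(n) < means[arms]).astype(float)
    scores = np.where(arms == 1, 1.0 - p, -p)  # gradient of log-policy for each chosen action
    return scores, rewards

def grad_estimate(theta, means, m, n, rng, use_baseline):
    """REINFORCE estimate of d/dtheta E[return] from a batch of m episodes."""
    batch = [run_episode(theta, means, n, rng) for _ in range(m)]
    returns = np.array([rewards.sum() for _, rewards in batch])
    # Batch-mean baseline (slightly biased within a batch; fine for illustration).
    b = returns.mean() if use_baseline else 0.0
    grads = np.array([scores.sum() * (R - b)
                      for (scores, _), R in zip(batch, returns)])
    return grads.mean()

theta, means, m, n = 0.0, np.array([0.1, 0.9]), 16, 50
reps = 200
var_no_baseline = np.var([grad_estimate(theta, means, m, n, rng, False) for _ in range(reps)])
var_baseline = np.var([grad_estimate(theta, means, m, n, rng, True) for _ in range(reps)])
print(var_no_baseline, var_baseline)
```

In this toy setting the un-baselined estimator's variance grows with the return scale (roughly with n), while the baseline removes the large constant offset; this is the kind of sensitivity-to-n experiment the paper could report directly.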