A Simpler Alternative to Variational Regularized Counterfactual Risk Minimization
Bakker, Hua Chang; Gupta, Shashank; Oosterhuis, Harrie
arXiv.org Artificial Intelligence
Variational regularized counterfactual risk minimization (VRCRM) has been proposed as an alternative off-policy learning (OPL) method. The VRCRM method uses a lower bound on the $f$-divergence between the logging policy and the target policy as regularization during learning, and was shown to improve performance over existing OPL alternatives on multi-label classification tasks. In this work, we revisit the original experimental setting of VRCRM. Surprisingly, we were unable to reproduce the results reported in the original setting. In response, we propose a novel, simpler alternative for $f$-divergence optimization: minimizing a direct approximation of the $f$-divergence instead of an $f$-GAN-based lower bound. Experiments showed that minimizing the divergence using $f$-GANs did not work as expected, whereas our proposed simpler alternative works better empirically.
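As a rough illustration of what regularizing with a directly approximated $f$-divergence could look like, the sketch below estimates $D_f(\pi \,\|\, \pi_0) = \mathbb{E}_{a \sim \pi_0}[f(\pi(a|x)/\pi_0(a|x))]$ by averaging over logged samples and adds it to an importance-weighted risk. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the choice of the KL generator $f(t) = t \log t$, the weight `lam`, and the plain sample-mean estimator are all hypothetical choices made for illustration.

```python
import numpy as np

def f_kl(t):
    """Generator f(t) = t * log(t); plugging it in yields KL(target || logging)."""
    t = np.asarray(t, dtype=float)
    # Use the limit 0 * log(0) = 0 so zero ratios do not produce NaN.
    safe_t = np.where(t > 0, t, 1.0)
    return np.where(t > 0, t * np.log(safe_t), 0.0)

def direct_f_divergence(target_probs, logging_probs, f=f_kl):
    """Sample-mean estimate of D_f(target || logging) from logged actions.

    Both arrays hold the probability each policy assigns to the action
    that was actually logged (assumed to be sampled from the logging policy).
    """
    ratio = np.asarray(target_probs, dtype=float) / np.asarray(logging_probs, dtype=float)
    return float(np.mean(f(ratio)))

def regularized_risk(losses, target_probs, logging_probs, lam=0.1, f=f_kl):
    """Importance-weighted risk plus lam times the direct divergence estimate.

    `lam` is a hypothetical regularization weight, not a value from the paper.
    """
    losses = np.asarray(losses, dtype=float)
    ratio = np.asarray(target_probs, dtype=float) / np.asarray(logging_probs, dtype=float)
    ips_risk = float(np.mean(ratio * losses))
    return ips_risk + lam * direct_f_divergence(target_probs, logging_probs, f)

# Toy logged data: three interactions with made-up propensities and losses.
logging_probs = np.array([0.5, 0.25, 0.25])
target_probs = np.array([0.5, 0.5, 0.125])
losses = np.array([1.0, 0.0, 1.0])
objective = regularized_risk(losses, target_probs, logging_probs, lam=0.1)
```

When the target policy equals the logging policy, every ratio is 1, the divergence term vanishes, and the objective reduces to the average logged loss. In a learning setting the same expression would be written in an automatic-differentiation framework so that it can be minimized with respect to the target policy's parameters.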
Sep-15-2024
- Country:
  - Europe
    - Finland (0.04)
    - Netherlands
      - North Holland > Amsterdam (0.06)
      - Gelderland > Nijmegen (0.04)
- Genre:
  - Research Report > New Finding (0.49)
- Technology: