Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems

Neural Information Processing Systems 

In the stochastic contextual low-rank matrix bandit problem, the expected reward of an action is given by the inner product between the action's feature matrix and some fixed, but initially unknown, d_1 by d_2 matrix \Theta^* with rank r \ll \min\{d_1, d_2\}, and an agent sequentially takes actions based on past experience to maximize the cumulative reward. In this paper, we study the generalized low-rank matrix bandit problem, which was recently proposed in \cite{lu2021low} under the Generalized Linear Model (GLM) framework. To overcome the computational infeasibility and theoretical limitations of existing algorithms for this problem, we first propose the G-ESTT framework, which modifies the idea from \cite{jun2019bilinear} by using Stein's method for the subspace estimation and then leverages the estimated subspaces via a regularization idea. Furthermore, we substantially improve the efficiency of G-ESTT by instead using a novel exclusion idea on the estimated subspace, and propose the G-ESTS framework. We also show that our methods are the first algorithms to achieve, up to logarithmic factors and under some mild conditions, the optimal \tilde{O}((d_1+d_2)r\sqrt{T}) regret bound presented in \cite{lu2021low}, which improves upon the current regret bound of \tilde{O}((d_1+d_2)^{3/2}\sqrt{rT}) \citep{lu2021low}.
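The reward model described above can be illustrated with a minimal sketch (not the authors' code; the dimensions, link function, and noise scale below are illustrative assumptions), showing a rank-r parameter matrix and the inner-product reward a single round would produce:

```python
import numpy as np

# Illustrative dimensions: d1 x d2 parameter of rank r << min{d1, d2}.
rng = np.random.default_rng(0)
d1, d2, r = 6, 5, 2

# Hidden rank-r parameter Theta* = U V^T, with U in R^{d1 x r}, V in R^{d2 x r}.
U = rng.standard_normal((d1, r))
V = rng.standard_normal((d2, r))
Theta_star = U @ V.T

def expected_reward(X, link=lambda s: s):
    """Expected reward <X, Theta*> = trace(X^T Theta*); a GLM applies a link
    function (identity here, e.g. a sigmoid for logistic rewards)."""
    return link(np.sum(X * Theta_star))

# One round: the agent picks an action feature matrix X and observes a noisy reward.
X = rng.standard_normal((d1, d2))
mu = expected_reward(X)
reward = mu + 0.1 * rng.standard_normal()  # sub-Gaussian observation noise
```

The bandit problem then consists of choosing X from the action set over T rounds so that the cumulative reward is close to that of always playing the best action under the unknown Theta*.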