Online Policy Learning via a Self-Normalized Maximal Inequality