Soft Policy Optimization: Online Off-Policy RL for Sequence Models