Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs