Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning