ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents Training