Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards
