Towards Widening The Distillation Bottleneck for Reasoning Models

Huifeng Yin, Yu Zhao, Minghao Wu, Xuanfan Ni, Bo Zeng, Hao Wang, Tianqi Shi, Liangying Shao, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang

arXiv.org Artificial Intelligence 

Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown remarkable reasoning capabilities by scaling test-time compute and generating long Chains-of-Thought (CoT). Distillation, i.e., post-training on LRM-generated data, is a straightforward yet effective way to enhance the reasoning abilities of smaller models, but it faces a critical bottleneck: we find that distilled long CoT data is difficult for small models to learn from and leads them to inherit biases (i.e., over-thinking) under both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL). To alleviate this bottleneck, we propose constructing tree-based CoT data from scratch via Monte Carlo Tree Search (MCTS). We then exploit a set of CoT-aware approaches, including Thoughts Length Balance, Fine-grained DPO, and a Joint Post-training Objective, to enhance SFT and RL on the constructed data.
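To make the MCTS-based data construction concrete, below is a minimal sketch of how a reasoning tree might be grown and a single CoT trace extracted from it. This is not the paper's implementation: `propose_steps` (an LLM proposing candidate next steps) and `rollout_reward` (a verifier scoring a completed chain) are hypothetical placeholders, and the selection rule is standard UCT.

```python
import math
import random

class Node:
    """A node in the reasoning tree: one partial chain of thought."""
    def __init__(self, steps, parent=None):
        self.steps = steps          # reasoning-step strings so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0            # accumulated rollout reward

    def ucb(self, c=1.4):
        # Standard UCT score; unvisited children are explored first.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )

def propose_steps(steps):
    """Hypothetical stand-in for an LLM proposing candidate next steps."""
    return [steps + [f"step-{len(steps)}-{i}"] for i in range(3)]

def rollout_reward(steps):
    """Hypothetical stand-in for a verifier scoring a (partial) chain."""
    return random.random()

def mcts_build_tree(max_depth=4, iterations=100):
    root = Node(steps=[])
    for _ in range(iterations):
        # 1. Selection: descend by UCT until a leaf is reached.
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb)
        # 2. Expansion: grow the tree with candidate next steps.
        if len(node.steps) < max_depth:
            node.children = [Node(s, parent=node) for s in propose_steps(node.steps)]
            node = random.choice(node.children)
        # 3. Simulation: score the chain with a rollout.
        reward = rollout_reward(node.steps)
        # 4. Backpropagation: propagate the reward up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return root

def best_chain(root):
    """Follow the most-visited path: one distilled CoT trace."""
    node = root
    while node.children:
        node = max(node.children, key=lambda n: n.visits)
    return node.steps

if __name__ == "__main__":
    tree = mcts_build_tree()
    print(best_chain(tree))
```

Because the tree retains every explored branch with visit counts and values, it naturally yields the ingredients the paper's CoT-aware methods would need: paths of varying lengths (for length balancing) and higher- vs. lower-valued sibling steps (for fine-grained preference pairs).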