From Correction to Mastery: Reinforced Distillation of Large Language Model Agents

Lyu, Yuanjie, Wang, Chengyu, Huang, Jun, Xu, Tong

arXiv.org Artificial Intelligence 

Large Language Model agents excel at solving complex tasks through iterative reasoning and tool use, but typically depend on ultra-large, costly backbones. Existing distillation approaches train smaller students to imitate full teacher trajectories, yet reasoning and knowledge gaps between the teacher and student can cause compounding errors. We propose SCoRe, a student-centered framework in which the student generates training trajectories and the teacher corrects only the earliest error, producing training data matched to the student's ability and exposing specific weaknesses. The student is first fine-tuned on corrected trajectories. Subsequently, short-horizon reinforcement learning starts from the verified prefix preceding the earliest error, with target rewards assigned at that step. This design encourages autonomous problem-solving beyond imitation and enhances training stability. On 12 challenging benchmarks, a 7B-parameter student distilled with SCoRe matches the agentic performance of a 72B-parameter teacher.

Recent advances in Large Language Models (LLMs) have led to the rise of "agents" (Xi et al., 2025). Unlike traditional single-pass generation, LLM agents solve complex problems through an iterative reasoning-action-observation loop, using frameworks such as ReAct (Yao et al., 2023). Specifically, LLM agents decompose tasks into sub-goals (Reasoning), execute them via external tools such as code interpreters (Action) (Gao et al., 2023), and then refine their plans based on feedback from tool execution (Observation). By combining LLM planning with the precision of external tools, agents mitigate flaws of LLMs such as hallucinations, outdated knowledge, and weak numerical reasoning, achieving strong performance on real-world interactive tasks (Liu et al., 2023).
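The earliest-error correction step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `correct_earliest_error`, `is_step_ok`, and `teacher_fix` are hypothetical, each step is simplified to a plain string, and the teacher and verifier (LLM calls in practice) are replaced by a toy lookup against a reference solution.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Step = str  # one reasoning-action-observation step, simplified to a string


@dataclass
class CorrectedTrajectory:
    prefix: List[Step]          # student steps verified before the earliest error
    correction: Optional[Step]  # teacher's replacement for the first bad step, or None


def correct_earliest_error(
    student_steps: List[Step],
    is_step_ok: Callable[[List[Step], Step], bool],  # hypothetical verifier
    teacher_fix: Callable[[List[Step]], Step],       # hypothetical teacher call
) -> CorrectedTrajectory:
    """Scan the student's trajectory and correct only the first erroneous step."""
    for i, step in enumerate(student_steps):
        prefix = student_steps[:i]
        if not is_step_ok(prefix, step):
            # Stop at the earliest error; later student steps are discarded.
            return CorrectedTrajectory(prefix=prefix, correction=teacher_fix(prefix))
    # No error found: the whole trajectory is verified, nothing to correct.
    return CorrectedTrajectory(prefix=list(student_steps), correction=None)


# Toy stand-ins (assumptions): a reference solution plays both verifier and teacher.
REFERENCE = ["plan: add 2 and 3", "tool: 2+3 -> 5", "answer: 5"]


def toy_is_step_ok(prefix: List[Step], step: Step) -> bool:
    return len(prefix) < len(REFERENCE) and step == REFERENCE[len(prefix)]


def toy_teacher_fix(prefix: List[Step]) -> Step:
    return REFERENCE[len(prefix)]


student = ["plan: add 2 and 3", "tool: 2+3 -> 6", "answer: 6"]
result = correct_earliest_error(student, toy_is_step_ok, toy_teacher_fix)
print(result.prefix)      # verified student prefix
print(result.correction)  # teacher's fix for the earliest wrong step
```

Under these assumptions, `result.prefix` plus `result.correction` would form a fine-tuning example, and `result.prefix` alone would be the starting state for the short-horizon reinforcement learning stage the abstract mentions.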
