 luffy


Learning to Reason under Off-Policy Guidance

Yan, Jianhao, Li, Yafu, Hu, Zican, Wang, Zhi, Cui, Ganqu, Qu, Xiaoye, Cheng, Yu, Zhang, Yue

arXiv.org Artificial Intelligence

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning with verifiable rewards (RLVR). However, existing RLVR approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. To address this issue, we introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments RLVR with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Specifically, LUFFY combines the Mixed-Policy GRPO framework, which has a theoretically guaranteed convergence rate, with policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Compared with previous RLVR methods, LUFFY achieves an average gain of over +6.4 across six math benchmarks and an advantage of over +6.2 points on out-of-distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on-policy RLVR completely fails. These results provide compelling evidence that LUFFY transcends the fundamental limitations of on-policy RLVR and demonstrates the great potential of utilizing off-policy guidance in RLVR.
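The abstract does not spell out how off-policy demonstrations and on-policy rollouts are combined, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that rewards from both sources are pooled into one group and normalized GRPO-style (reward minus group mean, divided by group standard deviation). The function name `mixed_group_advantages` and the `eps` parameter are hypothetical, and the policy-shaping step is omitted because its form is not given here.

```python
from statistics import mean, pstdev


def mixed_group_advantages(on_policy_rewards, off_policy_rewards, eps=1e-6):
    """Group-normalized advantages over the union of both reward lists.

    Assumption (not stated in the abstract): on-policy and off-policy
    samples for the same prompt share a single normalization group.
    """
    rewards = list(on_policy_rewards) + list(off_policy_rewards)
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    n_on = len(on_policy_rewards)
    # Return advantages split back into (on-policy, off-policy) parts.
    return advantages[:n_on], advantages[n_on:]


# Three on-policy rollouts (one correct) plus one correct off-policy trace.
on_adv, off_adv = mixed_group_advantages([0.0, 0.0, 1.0], [1.0])
```

Under this assumption, a group whose on-policy rollouts all fail still yields nonzero advantages as long as the off-policy trace succeeds, whereas a group of identical rewards normalizes to all zeros. That would be one plausible reading of why off-policy guidance can help where on-policy training stalls.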


One Piece: 10 Best Finishing Moves, Ranked

#artificialintelligence

As a good battle shonen should, One Piece has a variety of iconic fights that are complemented by some flashy and exciting finishing moves. A staple of any anime fight, the finishing move fuses brutal power with explicit branding to create some of the most exciting and recognizable attacks ever seen in fiction. In One Piece's world, Devil Fruits, Haki, martial arts, robotic enhancements, and some very loose interpretations of physics all contribute to the series' own colorful gallery of finishing moves. And while rating each one's power and effectiveness is a large discussion in its own right, it's also really fun to look at which finishing moves are simply the coolest and most memorable. Monkey D. Luffy's Gum-Gum Gatling doesn't have the awe-inspiring, simplistic appeal of a one-hit attack, but what it lacks in brevity, it more than makes up for with raw, visceral spectacle.