AITopics | luffy

Collaborating Authors

luffy

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Learning to Reason under Off-Policy Guidance

Neural Information Processing SystemsJun-21-2026, 13:02:39 GMT

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning with verifiable rewards (RLVR). However, existing RLVR approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. To address this issue, we introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments RLVR with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Specifically, LUFFY combines the MixedPolicy GRPO framework, which has a theoretically guaranteed convergence rate, alongside policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Compared with previous RLVR methods, LUFFY achieves an over +6.4 average gain across six math benchmarks and an advantage of over +6.2 points in out-of-distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on-policy RLVR completely fails. These results provide compelling evidence that LUFFY transcends the fundamental limitations of on-policy RLVR and demonstrates the great potential of utilizing off-policy guidance in RLVR.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

Neural Information Processing Systems

Country:

Asia > China (0.46)
North America > United States (0.28)

Genre:

Research Report > Experimental Study (1.00)
Overview (0.67)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
(2 more...)

Add feedback

Learning to Reason under Off-Policy Guidance

Yan, Jianhao, Li, Yafu, Hu, Zican, Wang, Zhi, Cui, Ganqu, Qu, Xiaoye, Cheng, Yu, Zhang, Yue

arXiv.org Artificial IntelligenceJun-24-2025

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning with verifiable rewards~(\textit{RLVR}). However, existing \textit{RLVR} approaches are inherently ``on-policy'', limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. To address this issue, we introduce \textbf{LUFFY} (\textbf{L}earning to reason \textbf{U}nder o\textbf{FF}-polic\textbf{Y} guidance), a framework that augments \textit{RLVR} with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Specifically, LUFFY combines the Mixed-Policy GRPO framework, which has a theoretically guaranteed convergence rate, alongside policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Compared with previous RLVR methods, LUFFY achieves an over \textbf{+6.4} average gain across six math benchmarks and an advantage of over \textbf{+6.2} points in out-of-distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on-policy RLVR completely fails. These results provide compelling evidence that LUFFY transcends the fundamental limitations of on-policy RLVR and demonstrates the great potential of utilizing off-policy guidance in RLVR.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2504.14945

Country:

Asia > China (0.28)
North America > United States (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
(3 more...)

Add feedback

One Piece: 10 Best Finishing Moves, Ranked

#artificialintelligenceOct-12-2021, 09:15:58 GMT

As a good battle shonen should, One Piece has a variety of iconic fights that are complemented by some flashy and exciting finishing moves. A staple of any anime fight, the idea of the finishing move fuses both brutal power and explicit branding to create some of the most exciting and recognizable attacks ever seen in fiction. In One Piece's world, Devil Fruits, Haki, martial arts, robotic enhancements, and some very loose interpretations of physics all contribute to One Piece's own, colorful gallery of finishing moves. And while rating each one's power and effectiveness is a large discussion within its own right, it's also really fun just looking at which finishing moves are just the coolest and most memorable. Monkey D. Luffy's Gum-Gum Gatling doesn't have the awe-inspiring, simplistic appeal of a one-hit attack; but what it lacks in brevity, it more than makes up for with raw, visceral spectacle.

intensity, luffy, zoro, (15 more...)

#artificialintelligence

Industry: Leisure & Entertainment (0.50)

Technology: Information Technology > Artificial Intelligence (0.35)

Add feedback