Yan, Lin
Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback
Shen, Wei, Liu, Guanlin, Wu, Zheng, Zhu, Ruofei, Yang, Qingping, Xin, Chao, Yue, Yu, Yan, Lin
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models with human preferences. While recent research has focused on algorithmic improvements, the importance of prompt-data construction has been overlooked. This paper addresses this gap by exploring data-driven bottlenecks in RLHF performance scaling, particularly reward hacking and decreasing response diversity. We introduce a hybrid reward system combining reasoning task verifiers (RTV) and a generative reward model (GenRM) to mitigate reward hacking. We also propose a novel prompt-selection method, Pre-PPO, to maintain response diversity and enhance learning effectiveness. Additionally, we find that prioritizing mathematical and coding tasks early in RLHF training significantly improves performance. Experiments across two model sizes validate our methods' effectiveness and scalability. Results show that RTV is most resistant to reward hacking, followed by GenRM with ground truth, and then GenRM with SFT Best-of-N responses. Our strategies enable rapid capture of subtle task-specific distinctions, leading to substantial improvements in overall RLHF performance. This work highlights the importance of careful data construction and provides practical methods to overcome performance barriers in RLHF.
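A minimal sketch of how such a hybrid reward might be routed, assuming prompts with verifiable ground truth go to a rule-based reasoning-task verifier (RTV) and all other prompts fall back to the generative reward model (GenRM); the function names and routing rule are illustrative, not the paper's implementation.

    # Hypothetical routing for a hybrid RTV + GenRM reward; names and the
    # routing rule are assumptions for illustration, not the paper's code.
    from typing import Callable, Optional

    def hybrid_reward(
        prompt: str,
        response: str,
        ground_truth: Optional[str],
        verifier: Callable[[str, str], bool],                     # sandbox/math checker (RTV)
        genrm_score: Callable[[str, str, Optional[str]], float],  # generative RM score in [0, 1]
    ) -> float:
        """Scalar reward for one (prompt, response) pair."""
        if ground_truth is not None:
            # Reasoning task with a checkable answer: trust the rule-based verifier.
            return 1.0 if verifier(response, ground_truth) else 0.0
        # Otherwise fall back to the generative reward model's judgment.
        return genrm_score(prompt, response, ground_truth)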
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Yu, Qiying, Zhang, Zheng, Zhu, Ruofei, Yuan, Yufeng, Zuo, Xiaochen, Yue, Yu, Fan, Tiantian, Liu, Gaohong, Liu, Lingjun, Liu, Xin, Lin, Haibin, Lin, Zhiqi, Ma, Bole, Sheng, Guangming, Tong, Yuxuan, Zhang, Chi, Zhang, Mofan, Zhang, Wang, Zhu, Hang, Zhu, Jinhua, Chen, Jiaze, Chen, Jiangjie, Wang, Chengyi, Yu, Hongli, Dai, Weinan, Song, Yuxuan, Wei, Xiangpeng, Zhou, Hao, Liu, Jingjing, Ma, Wei-Ying, Zhang, Ya-Qin, Yan, Lin, Qiao, Mu, Wu, Yonghui, Wang, Mingxuan
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in the OpenAI o1 blog and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecoupled Clip and $\textbf{D}$ynamic s$\textbf{A}$mpling $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{DAPO}$) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
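As a rough illustration of the first two techniques named in the acronym, the sketch below writes the decoupled clip as a PPO-style objective with separate lower and upper clipping ranges, and dynamic sampling as a filter that drops prompt groups whose sampled responses are all correct or all wrong; the clipping values and the binary-reward assumption are illustrative, not taken from the released code.

    # Sketch of decoupled clipping and dynamic sampling; the eps values and the
    # binary-reward assumption are illustrative, not the released settings.
    import torch

    def decoupled_clip_loss(log_ratio, advantages, eps_low=0.2, eps_high=0.3):
        """Token-level clipped policy-gradient loss with asymmetric clip bounds."""
        ratio = torch.exp(log_ratio)                     # pi_theta / pi_old, per token
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
        return -torch.mean(torch.minimum(unclipped, clipped))

    def dynamic_sampling_filter(groups):
        """Keep only prompt groups whose binary rewards are mixed, so every
        retained group contributes a nonzero advantage signal."""
        return [g for g in groups if 0 < sum(r["reward"] for r in g) < len(g)]

Raising the upper clip range relative to the lower one leaves more headroom for low-probability tokens to grow, which is the intuition usually given for decoupling the two bounds.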
Flaming-hot Initiation with Regular Execution Sampling for Large Language Models
Chen, Weizhe, Zhang, Zhicheng, Liu, Guanlin, Zheng, Renjie, Shi, Wenlei, Dun, Chen, Wu, Zheng, Jin, Xing, Yan, Lin
Since the release of ChatGPT, large language models (LLMs) have demonstrated remarkable capabilities across various domains. A key challenge in developing these general capabilities is efficiently sourcing diverse, high-quality data. This becomes especially critical in reasoning-related tasks with sandbox checkers, such as math or code, where the goal is to raise the probability of generating correct solutions to specific problems. In this work, we introduce Flaming-hot Initiation with Regular Execution (FIRE) sampling, a simple yet highly effective method for efficiently finding good responses. Our empirical findings show that FIRE sampling enhances inference-time generation quality and also benefits training in the alignment stage. Furthermore, we explore how FIRE sampling improves performance by promoting diversity and analyze the impact of employing FIRE at different positions within a response.
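A minimal sketch of what the name suggests, assuming the "flaming-hot initiation" amounts to drawing the opening token(s) at a much higher temperature before reverting to the regular decoding temperature; the temperatures, switch point, and helper names are illustrative assumptions.

    # Temperature-switched decoding in the spirit of FIRE; the temperatures,
    # switch point, and helper names are assumptions for illustration.
    import torch

    def sample_token(logits: torch.Tensor, temperature: float) -> int:
        probs = torch.softmax(logits / temperature, dim=-1)
        return int(torch.multinomial(probs, num_samples=1))

    def hot_then_regular_generate(step_logits, max_new_tokens=256,
                                  hot_temperature=3.0, regular_temperature=1.0,
                                  hot_steps=1, eos_id=0):
        """step_logits(prefix) returns next-token logits for the current prefix."""
        prefix = []
        for t in range(max_new_tokens):
            temperature = hot_temperature if t < hot_steps else regular_temperature
            token = sample_token(step_logits(prefix), temperature)
            if token == eos_id:
                break
            prefix.append(token)
        return prefix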
Process Supervision-Guided Policy Optimization for Code Generation
Dai, Ning, Wu, Zheng, Zheng, Renjie, Wei, Ziyun, Shi, Wenlei, Jin, Xing, Liu, Guanlin, Dun, Chen, Huang, Liang, Yan, Lin
Reinforcement Learning (RL) with unit-test feedback has enhanced code generation in large language models (LLMs), but it relies on sparse rewards provided only after complete code evaluation, limiting learning efficiency and incremental improvement. When generated code fails all unit tests, no learning signal is received, hindering progress on complex tasks. To address this, we propose a Process Reward Model (PRM) that delivers dense, line-level feedback on code correctness during generation, mimicking human code refinement and providing immediate guidance. We explore various strategies for training PRMs and integrating them into the RL framework, finding that using PRMs both as dense rewards and for value-function initialization significantly boosts performance. Our approach increases our in-house LLM's pass rate from 28.2% to 29.8% on LiveCodeBench and from 31.8% to 35.8% on our internal benchmark. Our experimental results highlight the effectiveness of PRMs in enhancing RL-driven code generation, especially in long-horizon scenarios.
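A hedged sketch of one way dense line-level PRM scores could be blended with the sparse unit-test outcome when building per-token rewards; attaching each line score to the last token of its line and the blending weight are assumptions, not the paper's recipe.

    # Blending dense line-level PRM scores with the sparse unit-test reward;
    # the weighting and line-end placement are illustrative assumptions.
    from typing import List

    def blended_rewards(num_tokens: int,
                        line_end_indices: List[int],   # token index that closes each line
                        prm_line_scores: List[float],  # PRM correctness score per line
                        unit_test_reward: float,       # sparse outcome reward, e.g. 0 or 1
                        prm_weight: float = 0.1) -> List[float]:
        """Per-token rewards: a weighted PRM score at each line end, unit-test outcome at the final token."""
        rewards = [0.0] * num_tokens
        for idx, score in zip(line_end_indices, prm_line_scores):
            rewards[idx] += prm_weight * score
        rewards[-1] += unit_test_reward
        return rewards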
Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization
Liu, Guanlin, Ji, Kaixuan, Zheng, Renjie, Wu, Zheng, Dun, Chen, Gu, Quanquan, Yan, Lin
Reinforcement Learning (RL) plays a crucial role in aligning large language models (LLMs) with human preferences and improving their ability to perform complex tasks. However, current approaches either require significant computational resources due to the use of multiple models and extensive online sampling for training (e.g., PPO) or are framed as bandit problems (e.g., DPO, DRO), which often struggle with multi-step reasoning tasks such as math problem-solving and other complex reasoning that involves long chains of thought. To overcome these limitations, we introduce Direct Q-function Optimization (DQO), which formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the language model. The MDP formulation of DQO offers structural advantages over bandit-based methods, enabling more effective process supervision. Experimental results on two math problem-solving datasets, GSM8K and MATH, demonstrate that DQO outperforms previous methods, establishing it as a promising offline reinforcement learning approach for aligning language models.
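For reference, these are the standard maximum-entropy (soft) Bellman relations that the soft actor-critic framework builds on, with temperature $\beta$; in a token-level MDP the state $s_t$ is the prompt plus the tokens generated so far and the action $a_t$ is the next token. This is a generic sketch of the framework named above, not a reproduction of DQO's exact objective.

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\big[V(s_{t+1})\big], \qquad V(s_t) = \beta \log \sum_{a} \exp\!\big(Q(s_t, a)/\beta\big),$$
$$\pi(a \mid s_t) = \exp\!\big(\big(Q(s_t, a) - V(s_t)\big)/\beta\big).$$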