Zhu, Ruofei
Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback
Shen, Wei; Liu, Guanlin; Wu, Zheng; Zhu, Ruofei; Yang, Qingping; Xin, Chao; Yue, Yu; Yan, Lin
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models with human preferences. While recent research has focused on algorithmic improvements, the importance of prompt-data construction has been overlooked. This paper addresses this gap by exploring data-driven bottlenecks in RLHF performance scaling, particularly reward hacking and decreasing response diversity. We introduce a hybrid reward system combining reasoning task verifiers (RTV) and a generative reward model (GenRM) to mitigate reward hacking. We also propose a novel prompt-selection method, Pre-PPO, to maintain response diversity and enhance learning effectiveness. Additionally, we find that prioritizing mathematical and coding tasks early in RLHF training significantly improves performance. Experiments across two model sizes validate our methods' effectiveness and scalability. Results show that RTV is most resistant to reward hacking, followed by GenRM with ground truth, and then GenRM with SFT Best-of-N responses. Our strategies enable rapid capture of subtle task-specific distinctions, leading to substantial improvements in overall RLHF performance. This work highlights the importance of careful data construction and provides practical methods to overcome performance barriers in RLHF.
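The abstract describes a hybrid reward system that routes verifiable reasoning tasks to rule-based verifiers (RTV) and other prompts to a generative reward model (GenRM), but gives no implementation. The sketch below is a minimal illustration of that routing idea only; all function names, fields, and the exact-match check are assumptions for illustration, not the paper's method.

```python
import random

# Illustrative sketch of hybrid reward routing (names and logic are assumed,
# not taken from the paper): verifiable reasoning tasks get a rule-based
# verifier (RTV) score; open-ended prompts fall back to a GenRM score.

def rtv_reward(prompt: str, response: str, reference_answer: str) -> float:
    """Rule-based check for verifiable tasks (here: exact-match on the final answer)."""
    return 1.0 if reference_answer.strip() in response else 0.0

def genrm_reward(prompt: str, response: str) -> float:
    """Placeholder for a generative reward model score in [0, 1]."""
    return random.random()  # stand-in for an actual model call

def hybrid_reward(sample: dict) -> float:
    """Route each training sample to the appropriate reward source."""
    if sample.get("reference_answer") is not None:  # verifiable reasoning task
        return rtv_reward(sample["prompt"], sample["response"], sample["reference_answer"])
    return genrm_reward(sample["prompt"], sample["response"])  # open-ended task

if __name__ == "__main__":
    math_sample = {"prompt": "What is 2+2?", "response": "The answer is 4", "reference_answer": "4"}
    chat_sample = {"prompt": "Write a haiku.", "response": "Autumn moonlight...", "reference_answer": None}
    print(hybrid_reward(math_sample), hybrid_reward(chat_sample))
```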
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Yu, Qiying; Zhang, Zheng; Zhu, Ruofei; Yuan, Yufeng; Zuo, Xiaochen; Yue, Yu; Fan, Tiantian; Liu, Gaohong; Liu, Lingjun; Liu, Xin; Lin, Haibin; Lin, Zhiqi; Ma, Bole; Sheng, Guangming; Tong, Yuxuan; Zhang, Chi; Zhang, Mofan; Zhang, Wang; Zhu, Hang; Zhu, Jinhua; Chen, Jiaze; Chen, Jiangjie; Wang, Chengyi; Yu, Hongli; Dai, Weinan; Song, Yuxuan; Wei, Xiangpeng; Zhou, Hao; Liu, Jingjing; Ma, Wei-Ying; Zhang, Ya-Qin; Yan, Lin; Qiao, Mu; Wu, Yonghui; Wang, Mingxuan
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in the OpenAI o1 blog and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecoupled Clip and $\textbf{D}$ynamic s$\textbf{A}$mpling $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{DAPO}$) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 with the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
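The abstract names the algorithm but defers the four key techniques to the paper. As a rough orientation only, the sketch below illustrates the two ideas in the acronym, decoupled (asymmetric) clipping and dynamic sampling, under stated assumptions: the clip ranges, function names, and filtering rule are illustrative choices, not the paper's reference implementation.

```python
import torch

# Illustrative sketch of two ideas named in the DAPO acronym (values and names
# are assumptions, not the paper's code): a PPO-style surrogate with separate
# lower/upper clip ranges, and a filter that drops prompt groups whose sampled
# rewards are all identical (they contribute no advantage signal).

EPS_LOW, EPS_HIGH = 0.2, 0.28  # decoupled clip ranges (example values)

def decoupled_clip_loss(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        advantages: torch.Tensor) -> torch.Tensor:
    """Token-level surrogate loss with asymmetric clipping of the policy ratio."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - EPS_LOW, 1.0 + EPS_HIGH)
    return -torch.mean(torch.minimum(ratio * advantages, clipped * advantages))

def dynamic_sampling_filter(groups: list[dict]) -> list[dict]:
    """Keep only prompt groups whose sampled responses do not all share one reward."""
    return [g for g in groups if len(set(g["rewards"])) > 1]

if __name__ == "__main__":
    logp_new = torch.log(torch.tensor([0.30, 0.05, 0.60]))
    logp_old = torch.log(torch.tensor([0.25, 0.10, 0.55]))
    adv = torch.tensor([1.0, -0.5, 0.2])
    print(decoupled_clip_loss(logp_new, logp_old, adv).item())

    groups = [{"prompt": "p1", "rewards": [1, 1, 1, 1]},   # filtered out (no signal)
              {"prompt": "p2", "rewards": [0, 1, 0, 1]}]   # kept
    print(dynamic_sampling_filter(groups))
```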