AITopics | Zhai, Yuanzhao

Collaborating Authors

Zhai, Yuanzhao

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Correcting Large Language Model Behavior via Influence Function

Zhang, Han, Zhang, Zhuo, Zhang, Yi, Zhai, Yuanzhao, Peng, Hanyang, Lei, Yu, Yu, Yue, Wang, Hui, Liang, Bin, Gui, Lin, Xu, Ruifeng

arXiv.org Artificial IntelligenceDec-20-2024

Recent advancements in AI alignment techniques have significantly improved the alignment of large language models (LLMs) with static human preferences. However, the dynamic nature of human preferences can render some prior training data outdated or even erroneous, ultimately causing LLMs to deviate from contemporary human preferences and societal norms. Existing methodologies, whether they involve the curation of new data for continual alignment or the manual correction of outdated data for re-alignment, demand costly human resources. To address this challenge, we propose a novel approach, Large Language Model Behavior Correction with Influence Function Recall and Post-Training (LANCET), which requires no human involvement. LANCET consists of two phases: (1) using influence functions to identify the training data that significantly impact undesirable model outputs, and (2) applying an Influence function-driven Bregman Optimization (IBO) technique to adjust the model's behavior based on these influence distributions. Our experiments demonstrate that LANCET effectively and efficiently correct inappropriate behaviors of LLMs. Furthermore, LANCET can outperform methods that rely on collecting human preferences, and it enhances the interpretability of learning human preferences within LLMs.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2412.16451

Country:

Asia > China (0.46)
North America (0.46)

Genre:

Research Report > Promising Solution (0.66)
Research Report > New Finding (0.46)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.46)
Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Online Self-Preferring Language Models

Zhai, Yuanzhao, Zhang, Zhuo, Xu, Kele, Peng, Hanyang, Yu, Yue, Feng, Dawei, Yang, Cheng, Ding, Bo, Wang, Huaimin

arXiv.org Artificial IntelligenceMay-22-2024

Aligning with human preference datasets has been critical to the success of large language models (LLMs). Reinforcement learning from human feedback (RLHF) employs a costly reward model to provide feedback for on-policy sampling responses. Recently, offline methods that directly fit responses with binary preferences in the dataset have emerged as alternatives. However, existing methods do not explicitly model preference strength information, which is crucial for distinguishing different response pairs. To overcome this limitation, we propose Online Self-Preferring (OSP) language models to learn from self-generated response pairs and self-judged preference strengths. For each prompt and corresponding self-generated responses, we introduce a ranked pairing method to construct multiple response pairs with preference strength information. We then propose the soft-preference cross-entropy loss to leverage such information. Empirically, we demonstrate that leveraging preference strength is crucial for avoiding overfitting and enhancing alignment performance. OSP achieves state-of-the-art alignment performance across various metrics in two widely used human preference datasets. OSP is parameter-efficient and more robust than the dominant online method, RLHF when limited offline data are available and generalizing to out-of-domain tasks. Moreover, OSP language models established by LLMs with proficiency in self-preferring can efficiently self-improve without external supervision.

large language model, machine learning, preference strength, (18 more...)

arXiv.org Artificial Intelligence

2405.14103

Country: Asia > China (0.46)

Genre: Research Report (1.00)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

COPR: Continual Human Preference Learning via Optimal Policy Regularization

Zhang, Han, Gui, Lin, Lei, Yu, Zhai, Yuanzhao, Zhang, Yehong, He, Yulan, Wang, Hui, Yu, Yue, Wong, Kam-Fai, Liang, Bin, Xu, Ruifeng

arXiv.org Artificial IntelligenceFeb-27-2024

Reinforcement Learning from Human Feedback (RLHF) is commonly utilized to improve the alignment of Large Language Models (LLMs) with human preferences. Given the evolving nature of human preferences, continual alignment becomes more crucial and practical in comparison to traditional static alignment. Nevertheless, making RLHF compatible with Continual Learning (CL) is challenging due to its complex process. Meanwhile, directly learning new human preferences may lead to Catastrophic Forgetting (CF) of historical preferences, resulting in helpless or harmful outputs. To overcome these challenges, we propose the Continual Optimal Policy Regularization (COPR) method, which draws inspiration from the optimal policy theory. COPR utilizes a sampling distribution as a demonstration and regularization constraints for CL. It adopts the Lagrangian Duality (LD) method to dynamically regularize the current policy based on the historically optimal policy, which prevents CF and avoids over-emphasizing unbalanced objectives. We also provide formal proof for the learnability of COPR. The experimental results show that COPR outperforms strong CL baselines on our proposed benchmark, in terms of reward-based, GPT-4 evaluations and human assessment. Furthermore, we validate the robustness of COPR under various CL settings, including different backbones, replay memory sizes, and learning orders.

continual human preference learning, large language model, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2402.14228

Country:

Asia > China (0.46)
North America > United States > Oregon (0.14)

Genre: Research Report > New Finding (0.48)

Industry:

Education (1.00)
Media > Film (0.94)
Leisure & Entertainment (0.94)
Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Optimistic Model Rollouts for Pessimistic Offline Policy Optimization

Zhai, Yuanzhao, Li, Yiying, Gao, Zijian, Gong, Xudong, Xu, Kele, Feng, Dawei, Bo, Ding, Wang, Huaimin

arXiv.org Artificial IntelligenceJan-11-2024

Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing works primarily focus on incorporating pessimism for policy optimization, usually via constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages the policies from learning in out-of-distribution (OOD) regions beyond the support of offline datasets, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We initially observed the potential benefits of optimism brought by encouraging more OOD rollouts. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework. ORPO generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. Then we relabel the sampled state-action pairs with penalized rewards and optimize the output policy in the P-MDP. Theoretically, we demonstrate that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on the widely-used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

arXiv.org Artificial Intelligence

2401.05899

Country:

North America > United States (0.29)
Asia > China (0.28)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles

Zhai, Yuanzhao, Zhang, Han, Lei, Yu, Yu, Yue, Xu, Kele, Feng, Dawei, Ding, Bo, Wang, Huaimin

arXiv.org Artificial IntelligenceDec-30-2023

Reinforcement learning from human feedback (RLHF) emerges as a promising paradigm for aligning large language models (LLMs). However, a notable challenge in RLHF is overoptimization, where beyond a certain threshold, the pursuit of higher rewards leads to a decline in human preferences. In this paper, we observe the weakness of KL regularization which is commonly employed in existing RLHF methods to address overoptimization. To mitigate this limitation, we scrutinize the RLHF objective in the offline dataset and propose uncertainty-penalized RLHF (UP-RLHF), which incorporates uncertainty regularization during RL-finetuning. To enhance the uncertainty quantification abilities for reward models, we first propose a diverse low-rank adaptation (LoRA) ensemble by maximizing the nuclear norm of LoRA matrix concatenations. Then we optimize policy models utilizing penalized rewards, determined by both rewards and uncertainties provided by the diverse reward LoRA ensembles. Our experimental results, based on two real human preference datasets, showcase the effectiveness of diverse reward LoRA ensembles in quantifying reward uncertainty. Additionally, uncertainty regularization in UP-RLHF proves to be pivotal in mitigating overoptimization, thereby contributing to the overall performance.

artificial intelligence, machine learning, reward model, (13 more...)

arXiv.org Artificial Intelligence

2401.00243

Country: Asia > China (0.46)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Dynamic Memory-based Curiosity: A Bootstrap Approach for Exploration

Gao, Zijian, Li, YiYing, Xu, Kele, Zhai, Yuanzhao, Feng, Dawei, Ding, Bo, Mao, XinJun, Wang, Huaimin

arXiv.org Artificial IntelligenceNov-19-2023

The sparsity of extrinsic rewards poses a serious challenge for reinforcement learning (RL). Currently, many efforts have been made on curiosity which can provide a representative intrinsic reward for effective exploration. However, the challenge is still far from being solved. In this paper, we present a novel curiosity for RL, named DyMeCu, which stands for Dynamic Memory-based Curiosity. Inspired by human curiosity and information theory, DyMeCu consists of a dynamic memory and dual online learners. The curiosity arouses if memorized information can not deal with the current state, and the information gap between dual learners can be formulated as the intrinsic reward for agents, and then such state information can be consolidated into the dynamic memory. Compared with previous curiosity methods, DyMeCu can better mimic human curiosity with dynamic memory, and the memory module can be dynamically grown based on a bootstrap paradigm with dual learners. On multiple benchmarks including DeepMind Control Suite and Atari Suite, large-scale empirical experiments are conducted and the results demonstrate that DyMeCu outperforms competitive curiosity-based methods with or without extrinsic rewards. We will release the code to enhance reproducibility.

curiosity, machine learning, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

2208.11349

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.34)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

COPF: Continual Learning Human Preference through Optimal Policy Fitting

Zhang, Han, Gui, Lin, Zhai, Yuanzhao, Wang, Hui, Lei, Yu, Xu, Ruifeng

arXiv.org Artificial IntelligenceOct-27-2023

In the realm of natural language processing (NLP), large language models (LLMs) are vital tools with the potential to bridge human language and machine understanding. Learning human preferences is a crucial step towards ensuring that language models not only generate responses that are useful to users but also adhere to ethical and societal norms, namely helpful and harmless responses [1]. However, they face a fundamental challenge in aligning with human preferences and values, hindering their full potential. Traditional alignment methods, namely Reinforcement Learning from Human Feedback (RLHF) [2, 3], involve supervised fine-tuning (SFT), reward model (RM) training, and policy model training. This complex pipeline lacks flexibility for continual learning (CL) of human preferences, hence existing work [1] often necessitates retraining models to adapt to dynamic preferences. Hence, there is a pressing need for research into continual alignment methods to address this limitation, enabling LLMs to better adhere to evolving human preferences and values while generating helpful responses. In this paper, we propose an innovative approach to address these challenges by enhancing the utility of the Deterministic Policy Optimization (DPO) [4] algorithm, a non-reinforcement learning, and a non-continual learning method. DPO, rooted in rigorous reinforcement learning theory, offers promising advantages but suffers from three critical limitations: 1. DPO is not supported for evolving human preferences which is common in real-world applications.

large language model, machine learning, reinforcement learning, (20 more...)

arXiv.org Artificial Intelligence

2310.15694

Country:

Europe (0.67)
North America > United States > Oregon (0.14)
Asia > Middle East > UAE (0.14)

Genre:

Overview > Innovation (0.48)
Research Report > Promising Solution (0.34)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Self-Supervised Exploration via Temporal Inconsistency in Reinforcement Learning

Gao, Zijian, Xu, Kele, Zhai, Yuanzhao, Feng, Dawei, Ding, Bo, Mao, XinJun, Wang, Huaimin

arXiv.org Artificial IntelligenceJun-26-2023

Under sparse extrinsic reward settings, reinforcement learning has remained challenging, despite surging interests in this field. Previous attempts suggest that intrinsic reward can alleviate the issue caused by sparsity. In this article, we present a novel intrinsic reward that is inspired by human learning, as humans evaluate curiosity by comparing current observations with historical knowledge. Our method involves training a self-supervised prediction model, saving snapshots of the model parameters, and using nuclear norm to evaluate the temporal inconsistency between the predictions of different snapshots as intrinsic rewards. We also propose a variational weighting mechanism to assign weight to different snapshots in an adaptive manner. Our experimental results on various benchmark environments demonstrate the efficacy of our method, which outperforms other intrinsic reward-based methods without additional training costs and with higher noise tolerance. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

intrinsic reward, machine learning, reinforcement learning, (15 more...)

arXiv.org Artificial Intelligence

2208.11361

Country: Asia (0.28)

Genre: Research Report (0.50)

Industry:

Leisure & Entertainment > Games (0.48)
Education (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback