Lei, Yu
Dual-Granularity Medication Recommendation Based on Causal Inference
Liang, Shunpan, Li, Xiang, Li, Chen, Lei, Yu, Hou, Yulei, Ma, Tengfei
As medical demands grow and machine learning technology advances, AI-based diagnostic and treatment systems are garnering increasing attention. Medication recommendation aims to integrate patients' long-term health records with medical knowledge, recommending accurate and safe medication combinations for specific conditions. However, most existing research treats medication recommendation systems merely as variants of traditional recommendation systems, overlooking the heterogeneity between medications and diseases. To address this challenge, we propose DGMed, a framework for medication recommendation. DGMed utilizes causal inference to uncover the connections among medical entities and presents an innovative feature alignment method to tackle heterogeneity issues. Specifically, this study first applies causal inference to analyze the quantified therapeutic effects of medications on specific diseases from historical records, uncovering potential links between medical entities. Subsequently, we integrate molecular-level knowledge, aligning the embeddings of medications and diseases within the molecular space to effectively tackle their heterogeneity. Ultimately, based on these entity-level relationships, we adaptively adjust the recommendation probabilities of medications and recommend medication combinations according to the patient's current health condition. Experimental results on a real-world dataset show that our method surpasses existing state-of-the-art baselines on four evaluation metrics, demonstrating superior performance in both accuracy and safety. Compared to the sub-optimal model, our approach improves accuracy by 4.40%, reduces the risk of side effects by 6.14%, and increases time efficiency by 47.15%.
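To make the final adjustment step concrete, the following is a minimal sketch (not the authors' implementation) of how entity-level causal-effect estimates could modulate per-medication recommendation probabilities; the additive logit shift, the `alpha` weight, and all variable names are illustrative assumptions.

```python
import numpy as np

def adjust_medication_probs(base_logits, causal_effects, alpha=0.5):
    """Shift each medication's recommendation logit by a scaled causal-effect
    score before the sigmoid, so medications with stronger estimated
    therapeutic effects on the patient's current diseases rank higher."""
    adjusted = base_logits + alpha * causal_effects
    return 1.0 / (1.0 + np.exp(-adjusted))  # per-medication probabilities

# Toy usage: 4 candidate medications.
base_logits = np.array([0.2, -0.5, 1.1, 0.0])     # scores from the recommender
causal_effects = np.array([0.8, -0.3, 0.1, 0.5])  # estimated therapeutic effects
print(adjust_medication_probs(base_logits, causal_effects))
```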
COPR: Continual Human Preference Learning via Optimal Policy Regularization
Zhang, Han, Gui, Lin, Lei, Yu, Zhai, Yuanzhao, Zhang, Yehong, He, Yulan, Wang, Hui, Yu, Yue, Wong, Kam-Fai, Liang, Bin, Xu, Ruifeng
Reinforcement Learning from Human Feedback (RLHF) is commonly utilized to improve the alignment of Large Language Models (LLMs) with human preferences. Given the evolving nature of human preferences, continual alignment becomes more crucial and practical than traditional static alignment. Nevertheless, making RLHF compatible with Continual Learning (CL) is challenging due to its complex process. Meanwhile, directly learning new human preferences may lead to Catastrophic Forgetting (CF) of historical preferences, resulting in unhelpful or harmful outputs. To overcome these challenges, we propose the Continual Optimal Policy Regularization (COPR) method, which draws inspiration from optimal policy theory. COPR utilizes a sampling distribution as both a demonstration and a regularization constraint for CL. It adopts the Lagrangian Duality (LD) method to dynamically regularize the current policy based on the historically optimal policy, which prevents CF and avoids over-emphasizing unbalanced objectives. We also provide a formal proof of the learnability of COPR. Experimental results show that COPR outperforms strong CL baselines on our proposed benchmark in terms of reward-based metrics, GPT-4 evaluations, and human assessment. Furthermore, we validate the robustness of COPR under various CL settings, including different backbones, replay memory sizes, and learning orders.
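As a rough illustration of the Lagrangian Duality mechanism described above, here is a minimal sketch in PyTorch. It is not COPR itself; the KL budget, the dual learning rate, and the variable names are assumptions made for the example.

```python
import torch

def copr_style_step(new_pref_loss, kl_to_old_policy, lam, kl_budget=0.1, lr_dual=0.05):
    """One step of a Lagrangian-dual-regularized objective: fit the new
    preferences while keeping the policy close (in KL) to the previously
    optimal policy. The multiplier `lam` is updated by dual ascent."""
    loss = new_pref_loss + lam * (kl_to_old_policy - kl_budget)
    with torch.no_grad():  # dual ascent: grow lam when the KL budget is exceeded
        new_lam = torch.clamp(lam + lr_dual * (kl_to_old_policy - kl_budget), min=0.0)
    return loss, new_lam

# Toy usage with scalar placeholders for the two terms.
lam = torch.tensor(1.0)
loss, lam = copr_style_step(torch.tensor(0.7), torch.tensor(0.25), lam)
print(loss.item(), lam.item())
```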
Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles
Zhai, Yuanzhao, Zhang, Han, Lei, Yu, Yu, Yue, Xu, Kele, Feng, Dawei, Ding, Bo, Wang, Huaimin
Reinforcement learning from human feedback (RLHF) emerges as a promising paradigm for aligning large language models (LLMs). However, a notable challenge in RLHF is overoptimization, where beyond a certain threshold, the pursuit of higher rewards leads to a decline in human preferences. In this paper, we observe the weakness of the KL regularization commonly employed in existing RLHF methods to address overoptimization. To mitigate this limitation, we scrutinize the RLHF objective on the offline dataset and propose uncertainty-penalized RLHF (UP-RLHF), which incorporates uncertainty regularization during RL fine-tuning. To enhance the uncertainty quantification abilities of reward models, we first propose a diverse low-rank adaptation (LoRA) ensemble built by maximizing the nuclear norm of LoRA matrix concatenations. We then optimize policy models using penalized rewards, determined by both the rewards and the uncertainties provided by the diverse reward LoRA ensembles. Our experimental results on two real human preference datasets showcase the effectiveness of diverse reward LoRA ensembles in quantifying reward uncertainty. Additionally, the uncertainty regularization in UP-RLHF proves pivotal in mitigating overoptimization, thereby contributing to the overall performance.
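Two of the ingredients above lend themselves to a short sketch: an uncertainty-penalized reward (ensemble mean minus a scaled ensemble standard deviation) and a nuclear-norm diversity term over concatenated LoRA matrices. This is an illustrative reading of the abstract, not the paper's code; `beta`, the concatenation axis, and all names are assumptions.

```python
import torch

def penalized_reward(ensemble_rewards, beta=0.5):
    """Uncertainty-penalized reward: mean ensemble reward minus a multiple of
    the ensemble's standard deviation (more disagreement -> larger penalty)."""
    return ensemble_rewards.mean(dim=0) - beta * ensemble_rewards.std(dim=0)

def nuclear_norm_diversity(lora_updates):
    """Diversity bonus: nuclear norm of the concatenated LoRA update matrices;
    maximizing it pushes ensemble members to span different directions."""
    stacked = torch.cat(lora_updates, dim=1)  # (d, r * n_members)
    return torch.linalg.matrix_norm(stacked, ord="nuc")

# Toy usage: 3 reward heads scoring a batch of 4 responses.
rewards = torch.randn(3, 4)
print(penalized_reward(rewards, beta=0.5))
print(nuclear_norm_diversity([torch.randn(16, 4) for _ in range(3)]))
```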
COPF: Continual Learning Human Preference through Optimal Policy Fitting
Zhang, Han, Gui, Lin, Zhai, Yuanzhao, Wang, Hui, Lei, Yu, Xu, Ruifeng
In the realm of natural language processing (NLP), large language models (LLMs) are vital tools with the potential to bridge human language and machine understanding. Learning human preferences is a crucial step towards ensuring that language models not only generate responses that are useful to users but also adhere to ethical and societal norms, namely helpful and harmless responses [1]. However, they face a fundamental challenge in aligning with human preferences and values, hindering their full potential. Traditional alignment methods, namely Reinforcement Learning from Human Feedback (RLHF) [2, 3], involve supervised fine-tuning (SFT), reward model (RM) training, and policy model training. This complex pipeline lacks flexibility for continual learning (CL) of human preferences, so existing work [1] often necessitates retraining models to adapt to dynamic preferences. There is therefore a pressing need for research into continual alignment methods that address this limitation, enabling LLMs to better adhere to evolving human preferences and values while generating helpful responses. In this paper, we propose an innovative approach to address these challenges by enhancing the utility of the Direct Preference Optimization (DPO) [4] algorithm, a non-reinforcement-learning and non-continual-learning method. DPO, rooted in rigorous reinforcement learning theory, offers promising advantages but suffers from three critical limitations: 1. DPO does not support evolving human preferences, which are common in real-world applications.
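For context, the standard DPO objective the paper builds on can be written in a few lines. This sketch shows the textbook loss, not COPF's continual extension, and the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective: widen the policy's margin on the preferred
    response (w) over the dispreferred one (l), measured relative to a frozen
    reference model. `beta` controls the implicit KL strength."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# Toy usage: per-sequence log-probabilities for a batch of 2 preference pairs.
loss = dpo_loss(torch.tensor([-5.0, -4.2]), torch.tensor([-6.1, -5.9]),
                torch.tensor([-5.5, -4.8]), torch.tensor([-5.8, -5.2]))
print(loss.item())
```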
FocalDreamer: Text-driven 3D Editing via Focal-fusion Assembly
Li, Yuhan, Dou, Yishun, Shi, Yue, Lei, Yu, Chen, Xuanhong, Zhang, Yi, Zhou, Peng, Ni, Bingbing
While text-3D editing has made significant strides in leveraging score distillation sampling, emerging approaches still fall short in delivering the separable, precise, and consistent outcomes that are vital to content creation. In response, we introduce FocalDreamer, a framework that merges a base shape with editable parts according to text prompts for fine-grained editing within desired regions. Specifically, equipped with geometry union and dual-path rendering, FocalDreamer assembles independent 3D parts into a complete object, tailored for convenient instance reuse and part-wise control. We propose a geometric focal loss and style consistency regularization, which encourage focal fusion and a congruent overall appearance. Furthermore, FocalDreamer generates high-fidelity geometry and PBR textures that are compatible with widely used graphics engines. Extensive experiments have highlighted the superior editing capabilities of FocalDreamer in both quantitative and qualitative evaluations.
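The abstract does not define the geometric focal loss, but the underlying idea of confining edits to a focal region can be illustrated with a toy, region-weighted penalty. This is speculative: it only shows the notion of penalizing geometry changes outside the editable region, and the mask-based mesh representation and `w_out` weight are assumptions.

```python
import torch

def focal_region_penalty(displacements, focal_mask, w_out=10.0):
    """Toy region-weighted geometry penalty: per-vertex displacements outside
    the focal (editable) region are penalized heavily, keeping edits local.
    The paper's geometric focal loss is defined over its own 3D representation;
    this only conveys the locality idea."""
    sq = displacements.pow(2).sum(dim=-1)  # per-vertex squared motion
    weights = torch.where(focal_mask, torch.ones_like(sq), w_out * torch.ones_like(sq))
    return (weights * sq).mean()

# Toy usage: 5 vertices, only the first two are editable.
disp = torch.randn(5, 3) * 0.01
mask = torch.tensor([True, True, False, False, False])
print(focal_region_penalty(disp, mask))
```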
Self-adaptive Multi-task Particle Swarm Optimization
Zheng, Xiaolong, Zhou, Deyun, Li, Na, Lei, Yu, Wu, Tao, Gong, Maoguo
Multi-task optimization (MTO) studies how to solve multiple optimization problems simultaneously in order to obtain better performance on each problem. Over the past few years, evolutionary MTO (EMTO) has been proposed to handle MTO problems via evolutionary algorithms. Many EMTO algorithms have since been developed and have demonstrated good performance on real-world problems. However, much work remains to be done in adapting knowledge transfer to task relatedness in EMTO. Unlike existing works, we develop a self-adaptive multi-task particle swarm optimization (SaMTPSO) algorithm built on three strategies: a knowledge transfer adaptation strategy, a focus search strategy, and a knowledge incorporation strategy. In the knowledge transfer adaptation strategy, each task has a knowledge source pool consisting of all knowledge sources, and each source (task) outputs knowledge to the task. Knowledge transfer adapts to task relatedness via individuals' choices among the sources in a pool, where the selection probability of each source is computed from the task's success rate in generating improved solutions via that source. In the focus search strategy, if no knowledge source benefits the optimization of a task, then all knowledge sources in the task's pool except the task itself are forbidden, which helps to improve the performance of the proposed algorithm. Note that each task serves as a knowledge source for itself. In the knowledge incorporation strategy, two different forms are developed to help SaMTPSO explore and exploit the transferred knowledge from a chosen source, each leading to a version of SaMTPSO. Experiments are conducted on two test suites, comparing SaMTPSO against three popular EMTO algorithms and a particle swarm optimization algorithm; the results demonstrate the superiority of SaMTPSO.
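The success-rate-based source selection can be sketched in a few lines. This is an illustrative reading of the strategy, not the authors' code; the Laplace smoothing and variable names are assumptions.

```python
import random

def choose_knowledge_source(success, attempts, pool):
    """Pick a knowledge source for a task with probability proportional to its
    success rate (improved solutions / uses). Laplace smoothing (+1/+2) avoids
    zero probabilities for sources that have not yet been tried."""
    rates = [(success[s] + 1) / (attempts[s] + 2) for s in pool]
    return random.choices(pool, weights=rates, k=1)[0]

# Toy usage: task 0 chooses among itself (source 0) and two other tasks.
pool = [0, 1, 2]
success = {0: 5, 1: 2, 2: 0}
attempts = {0: 10, 1: 8, 2: 6}
print(choose_knowledge_source(success, attempts, pool))
```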
When Collaborative Filtering Meets Reinforcement Learning
Lei, Yu, Li, Wenjie
In this paper, we study a multi-step interactive recommendation problem, where the item recommended at the current step may affect the quality of future recommendations. To address this problem, we develop a novel and effective approach, named CFRL, which seamlessly integrates the ideas of collaborative filtering (CF) and reinforcement learning (RL). More specifically, we first model the recommender-user interactive recommendation problem as an agent-environment RL task, mathematically described by a Markov decision process (MDP). Further, to achieve collaborative recommendations for the entire user community, we propose a novel CF-based MDP that encodes the states of all users into a shared latent vector space. Finally, we propose an effective Q-network learning method to learn the agent's optimal policy based on the CF-based MDP. The capability of CFRL is demonstrated by comparing its performance against a variety of existing methods on real-world datasets.
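A minimal sketch of the Q-network idea: the state is a user's shared latent vector, the action is an item's latent vector, and a small MLP scores the pair. The architecture, dimensions, and the simplified TD target (omitting the max over candidate actions) are assumptions for illustration, not CFRL's actual design.

```python
import torch
import torch.nn as nn

class CFQNetwork(nn.Module):
    """Q-network over a CF-style latent space: scores a (user state, item
    action) pair of latent vectors with a small MLP."""
    def __init__(self, dim=32, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, user_state, item_action):
        return self.mlp(torch.cat([user_state, item_action], dim=-1)).squeeze(-1)

# One TD-style update on a toy transition (state s, action a, reward r, next state s').
q = CFQNetwork()
opt = torch.optim.Adam(q.parameters(), lr=1e-3)
s, a, s_next = torch.randn(1, 32), torch.randn(1, 32), torch.randn(1, 32)
r, gamma = torch.tensor([1.0]), 0.9
with torch.no_grad():
    target = r + gamma * q(s_next, a)  # max over candidate actions omitted for brevity
loss = (q(s, a) - target).pow(2).mean()
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```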