Wu, Zifan
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
Sun, Xingwu, Chen, Yanfeng, Huang, Yiqing, Xie, Ruobing, Zhu, Jiaqi, Zhang, Kai, Li, Shuaipeng, Yang, Zhen, Han, Jonny, Shu, Xiaobo, Bu, Jiahao, Chen, Zhongzhi, Huang, Xuemeng, Lian, Fengzong, Yang, Saiyong, Yan, Jianfeng, Zeng, Yuyuan, Ren, Xiaoqin, Yu, Chao, Wu, Lulu, Mao, Yue, Xia, Jun, Yang, Tao, Zheng, Suncong, Wu, Kan, Jiao, Dian, Xue, Jinbao, Zhang, Xipeng, Wu, Decheng, Liu, Kai, Wu, Dengpeng, Xu, Guanghui, Chen, Shaohua, Chen, Shuang, Feng, Xiao, Hong, Yigeng, Zheng, Junqiang, Xu, Chengcheng, Li, Zongwei, Kuang, Xiong, Hu, Jianglu, Chen, Yiqi, Deng, Yuchi, Li, Guiyang, Liu, Ao, Zhang, Chenchen, Hu, Shihui, Zhao, Zilong, Wu, Zifan, Ding, Yao, Wang, Weichao, Liu, Han, Wang, Roberts, Fei, Hao, Yu, Peijie, Zhao, Ze, Cao, Xun, Wang, Hai, Xiang, Fusheng, Huang, Mengyuan, Xiong, Zhiyuan, Hu, Bin, Hou, Xuebin, Jiang, Lei, Ma, Jianqiang, Wu, Jiajia, Deng, Yaping, Shen, Yi, Wang, Qian, Liu, Weijie, Liu, Jie, Chen, Meng, Dong, Liang, Jia, Weiwen, Chen, Hu, Liu, Feifei, Yuan, Rui, Xu, Huilin, Yan, Zhenxiang, Cao, Tengfei, Hu, Zhichao, Feng, Xinhua, Du, Dong, Yu, Tinghao, Tao, Yangyu, Zhang, Feng, Zhu, Jianchen, Xu, Chengzhong, Li, Xirui, Zha, Chong, Ouyang, Wen, Xia, Yinben, Li, Xiang, He, Zekun, Chen, Rongpeng, Song, Jiawei, Chen, Ruibin, Jiang, Fan, Zhao, Chongqing, Wang, Bo, Gong, Hao, Gan, Rong, Hu, Winston, Kang, Zhanhui, Yang, Yong, Liu, Yuhong, Wang, Di, Jiang, Jie
In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture-of-experts model, with a total of 389 billion parameters and 52 billion activated parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms Llama3.1-70B and exhibits performance comparable to the significantly larger Llama3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we investigate the scaling laws and learning rate schedule of mixture-of-experts models, providing valuable insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications. Code: https://github.com/Tencent/Hunyuan-Large Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large
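The abstract does not come with reference code, but as a rough illustration of what a mixed expert-routing layer of this general kind can look like (a shared expert that processes every token combined with top-k routed experts), the following PyTorch sketch may help. The layer sizes, the value of k, and all class and parameter names are illustrative assumptions, not the released Hunyuan-Large configuration.

```python
# Minimal sketch of a mixed expert-routing layer: one shared expert that is
# always active plus top-k routed experts. Sizes, k, and naming are
# illustrative assumptions, not the released Hunyuan-Large configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):                       # x: (batch, seq, d_model)
        out = self.shared_expert(x)             # shared expert sees every token
        gate = F.softmax(self.router(x), dim=-1)        # routing probabilities
        topv, topi = gate.topk(self.k, dim=-1)          # top-k routed experts per token
        for slot in range(self.k):
            idx = topi[..., slot]               # (batch, seq) expert indices
            w = topv[..., slot].unsqueeze(-1)   # gate weights (renormalization omitted)
            for e, expert in enumerate(self.experts):
                mask = (idx == e).unsqueeze(-1)
                if mask.any():
                    out = out + mask * w * expert(x)
        return out

if __name__ == "__main__":
    layer = MixedMoELayer()
    print(layer(torch.randn(2, 16, 512)).shape)   # torch.Size([2, 16, 512])
```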
Off-Policy Primal-Dual Safe Reinforcement Learning
Wu, Zifan, Tang, Bo, Lin, Qian, Yu, Chao, Mao, Shangqin, Xie, Qianlong, Wang, Xingxing, Wang, Dong
Primal-dual safe RL methods commonly alternate between the primal update of the policy and the dual update of the Lagrange multiplier. Such a training paradigm is highly susceptible to error in cumulative cost estimation, since this estimation serves as the key bond connecting the primal and dual update processes. We show that this problem causes significant underestimation of cost when using off-policy methods, leading to failure to satisfy the safety constraint. To address this issue, we propose \textit{conservative policy optimization}, which learns a policy in a constraint-satisfying area by considering the uncertainty in cost estimation. This improves constraint satisfaction but can also hinder reward maximization. We then introduce \textit{local policy convexification} to help eliminate such suboptimality by gradually reducing the estimation uncertainty. We provide theoretical interpretations of the coupling effect of these two ingredients and further verify them with extensive experiments. Results on benchmark tasks show that our method not only achieves asymptotic performance comparable to state-of-the-art on-policy methods while using far fewer samples, but also significantly reduces constraint violation during training. Our code is available at https://github.com/ZifanWu/CAL.
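As a hedged illustration of the conservative primal-dual idea sketched above (not the authors' CAL implementation), the following PyTorch snippet penalizes the policy with an upper-confidence estimate from a cost-critic ensemble and updates the Lagrange multiplier from the same conservative estimate. The function signatures, the ensemble-based uncertainty measure, and the coefficient kappa are assumptions, and local policy convexification is not shown.

```python
# Illustrative sketch of a primal-dual update with an uncertainty-aware,
# conservative cost estimate (assumed interfaces; not the CAL code).
import torch

def conservative_cost(q_cost_ensemble, states, actions, kappa=1.0):
    """Mean + kappa * std over a cost-critic ensemble (kappa is an assumed knob)."""
    qs = torch.stack([q(states, actions) for q in q_cost_ensemble])  # (E, B)
    return qs.mean(0) + kappa * qs.std(0)

def primal_dual_step(actor, q_reward, q_cost_ensemble, lagrange_log, batch,
                     cost_limit, actor_opt, lambda_opt, kappa=1.0):
    states = batch["states"]
    actions = actor(states)                 # assumed reparameterized policy output
    lam = lagrange_log.exp()

    # Primal update: maximize reward while penalizing the conservative cost.
    cost_ub = conservative_cost(q_cost_ensemble, states, actions, kappa)
    actor_loss = (-q_reward(states, actions) + lam.detach() * cost_ub).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Dual update: lambda grows when the conservative cost exceeds the budget.
    lambda_loss = -(lagrange_log.exp() * (cost_ub.detach().mean() - cost_limit))
    lambda_opt.zero_grad()
    lambda_loss.backward()
    lambda_opt.step()
```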
Policy-regularized Offline Multi-objective Reinforcement Learning
Lin, Qian, Yu, Chao, Liu, Zongkai, Wu, Zifan
In this paper, we aim to utilize only offline trajectory data to train a policy for multi-objective RL. To achieve this goal, we extend the offline policy-regularized method, a widely adopted approach for single-objective offline RL problems, to the multi-objective setting. However, such methods face a new challenge in offline MORL settings, namely the preference-inconsistent demonstration problem. We propose two solutions to this problem: 1) filtering out preference-inconsistent demonstrations via approximating behavior preferences, and 2) adopting regularization techniques with high policy expressiveness. Moreover, we integrate the preference-conditioned scalarized update method into policy-regularized offline RL in order to simultaneously learn a set of policies using a single policy network, thus reducing the computational cost induced by training a large number of individual policies for various preferences. Finally, we introduce Regularization Weight Adaptation to dynamically determine appropriate regularization weights for arbitrary target preferences during deployment. Empirical results on various multi-objective datasets demonstrate the capability of our approach in solving offline MORL problems.
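The following sketch illustrates, under assumed interfaces, how a preference-conditioned scalarized update can be combined with a policy-regularization term in an offline setting. The Dirichlet preference sampling, the BC-style regularizer, and the reg_weight(w) callable (standing in for Regularization Weight Adaptation) are illustrative choices rather than the paper's exact formulation.

```python
# Hypothetical sketch of a preference-conditioned, policy-regularized offline
# update: one network conditioned on a sampled preference vector w, per-objective
# critics scalarized by w, and a behavior-cloning term toward dataset actions.
import torch
import torch.nn.functional as F

def scalarized_regularized_loss(actor, critics, batch, reg_weight, n_objectives=2):
    states, data_actions = batch["states"], batch["actions"]
    # Sample a preference from the simplex for each state in the batch.
    w = torch.distributions.Dirichlet(torch.ones(n_objectives)).sample((states.shape[0],))
    actions = actor(states, w)                       # preference-conditioned policy
    # Scalarize the per-objective critic values with the sampled preference.
    q_values = torch.stack([q(states, actions) for q in critics], dim=-1)  # (B, M)
    scalarized_q = (w * q_values).sum(-1)
    # Policy regularization toward the behavior data (BC-style, one simple choice).
    bc_term = F.mse_loss(actions, data_actions, reduction="none").mean(-1)
    # reg_weight(w) stands in for an adaptive, preference-dependent weight.
    return (-scalarized_q + reg_weight(w) * bc_term).mean()
```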
Safe Offline Reinforcement Learning with Real-Time Budget Constraints
Lin, Qian, Tang, Bo, Wu, Zifan, Yu, Chao, Mao, Shangqin, Xie, Qianlong, Wang, Xingxing, Wang, Dong
Aiming at promoting the safe real-world deployment of Reinforcement Learning (RL), research on safe RL has made significant progress in recent years. However, most existing works in the literature still focus on the online setting, where risky violations of the safety budget are likely to be incurred during training. Besides, in many real-world applications, the learned policy is required to respond to dynamically determined safety budgets (i.e., constraint thresholds) in real time. In this paper, we target the above real-time budget constraint problem under the offline setting.
Many safe RL approaches have been proposed in the past few years (Achiam et al., 2017; Zhang et al., 2020; Sootla et al., 2022; Liu et al., 2022a). Unfortunately, most existing approaches only target the online setting, where potentially risky constraint violations can be incurred during interactions with the real environment. As a kind of data-driven method, offline RL (Levine et al., 2020) aims to derive a policy from offline data without further real-world exploration, and is thus particularly suitable for safety-critical applications. Despite the recent progress in the offline RL literature (Fujimoto et al., 2019; Kumar et al., 2020; Fujimoto & Gu, 2021), however, there are still limited works focusing on attaining a safe policy under the offline setting.
Models as Agents: Optimizing Multi-Step Predictions of Interactive Local Models in Model-Based Multi-Agent Reinforcement Learning
Wu, Zifan, Yu, Chao, Chen, Chen, Hao, Jianye, Zhuo, Hankz Hankui
Research in model-based reinforcement learning has made significant progress in recent years. Compared to single-agent settings, the exponential growth of the joint state-action space dimension in multi-agent systems dramatically increases the complexity of the environment dynamics, which makes it infeasible to learn an accurate global model and thus necessitates the use of agent-wise local models. However, during multi-step model rollouts, the prediction of one local model can affect the predictions of other local models in the next step. As a result, local prediction errors can propagate to other localities and eventually give rise to considerable global errors. Furthermore, since the models are generally used to predict for multiple steps, simply minimizing one-step prediction errors regardless of their long-term effect on other models may further aggravate the propagation of local errors. To this end, we propose Models as AGents (MAG), a multi-agent model optimization framework that reversely treats the local models as multi-step decision-making agents and the current policies as the dynamics during the model rollout process. In this way, the local models are able to account for their multi-step mutual influence before making predictions. Theoretically, we show that the objective of MAG is approximately equivalent to maximizing a lower bound of the true environment return. Experiments on the challenging StarCraft II benchmark demonstrate the effectiveness of MAG.
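As a rough sketch of the role reversal the abstract describes, the snippet below rolls out agent-wise local models while the frozen current policies supply the joint actions, so that each local prediction feeds into the others at the next step. The model and policy interfaces are assumptions, and the optimization of the models as multi-step decision makers is not shown.

```python
# Rough sketch of an interactive rollout with agent-wise local models: the local
# models produce the predictions while the current policies supply the "dynamics"
# by generating the next joint action. Interfaces are assumptions, not MAG's code.
import torch

def interactive_model_rollout(local_models, policies, joint_obs, horizon=5):
    """joint_obs: list of per-agent observation tensors; returns the rollout trajectory."""
    trajectory = []
    for _ in range(horizon):
        joint_action = [pi(o) for pi, o in zip(policies, joint_obs)]  # policies as dynamics
        # Each local model predicts its agent's next observation conditioned on the
        # full joint action, so local predictions influence one another next step.
        next_obs = [m(o, torch.cat(joint_action, dim=-1))
                    for m, o in zip(local_models, joint_obs)]
        trajectory.append((joint_obs, joint_action, next_obs))
        joint_obs = next_obs
    return trajectory
```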
Plan To Predict: Learning an Uncertainty-Foreseeing Model for Model-Based Reinforcement Learning
Wu, Zifan, Yu, Chao, Chen, Chen, Hao, Jianye, Zhuo, Hankz Hankui
In Model-based Reinforcement Learning (MBRL), model learning is critical since an inaccurate model can bias policy learning by generating misleading samples. However, learning an accurate model can be difficult since the policy is continually updated and the induced distribution over visited states used for model learning shifts accordingly. Prior methods alleviate this issue by quantifying the uncertainty of model-generated samples. However, these methods only quantify the uncertainty passively after the samples have been generated, rather than foreseeing the uncertainty before model trajectories fall into highly uncertain regions. The resulting low-quality samples can induce unstable learning targets and hinder the optimization of the policy. Moreover, while the model is trained to minimize one-step prediction errors, it is generally used to predict for multiple steps, leading to a mismatch between the objectives of model learning and model usage. To this end, we propose \emph{Plan To Predict} (P2P), an MBRL framework that treats the model rollout process as a sequential decision-making problem by reversely considering the model as a decision maker and the current policy as the dynamics. In this way, the model can quickly adapt to the current policy and foresee the multi-step future uncertainty when generating trajectories. Theoretically, we show that the performance of P2P can be guaranteed by approximately optimizing a lower bound of the true environment return. Empirical results demonstrate that P2P achieves state-of-the-art performance on several challenging benchmark tasks.
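The following minimal sketch, assuming a deterministic model and trajectories collected by the current policy, shows a multi-step prediction objective in which the model is unrolled along a logged trajectory and the accumulated error is minimized. It is only a simplified stand-in for the one-step/multi-step mismatch discussed above, not the exact P2P objective.

```python
# Minimal sketch of a multi-step model objective (assumes a deterministic model):
# the model is unrolled along a logged trajectory and the accumulated prediction
# error is minimized, in contrast to the usual one-step loss.
import torch

def multi_step_model_loss(model, trajectory):
    """trajectory: list of (state, action, next_state) tensors from the current policy."""
    loss = 0.0
    pred_state = trajectory[0][0]                  # start the rollout from the real state
    for _, action, next_state in trajectory:
        pred_state = model(pred_state, action)     # gradient flows through the whole rollout
        loss = loss + ((pred_state - next_state) ** 2).mean()
    return loss / len(trajectory)
```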
Coordinated Proximal Policy Optimization
Wu, Zifan, Yu, Chao, Ye, Deheng, Zhang, Junge, Piao, Haiyin, Zhuo, Hankz Hankui
We present Coordinated Proximal Policy Optimization (CoPPO), an algorithm that extends the original Proximal Policy Optimization (PPO) to the multi-agent setting. The key idea lies in the coordinated adaptation of step size during the policy update process among multiple agents. We prove the monotonicity of policy improvement when optimizing a theoretically grounded joint objective, and derive a simplified optimization objective based on a set of approximations. We then interpret such an objective in CoPPO as achieving dynamic credit assignment among agents, thereby alleviating the high variance issue during the concurrent update of agent policies. Finally, we demonstrate that CoPPO outperforms several strong baselines and is competitive with the latest multi-agent PPO method (i.e., MAPPO).
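As a hedged illustration of the kind of coordinated surrogate objective described above, the snippet below combines per-agent importance ratios into a joint ratio that is clipped once, so each agent's effective step size depends on the concurrent updates of the others. This is an illustrative PPO-style variant with assumed tensor shapes, not the exact CoPPO objective.

```python
# Sketch of a coordinated PPO-style surrogate: the joint ratio is the product of
# per-agent importance ratios and is clipped jointly rather than per agent.
import torch

def coordinated_ppo_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """log_probs_*: (n_agents, batch) per-agent action log-probabilities;
    advantages: (batch,) joint advantage estimates."""
    joint_ratio = (log_probs_new - log_probs_old).sum(0).exp()   # product of agent ratios
    clipped = torch.clamp(joint_ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(joint_ratio * advantages, clipped * advantages).mean()
```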