Pang, Xianghe
SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
Yin, Sheng, Pang, Xianghe, Ding, Yuanzhuo, Chen, Menglan, Bi, Yutong, Xiong, Yichen, Huang, Wenhao, Xiang, Zhen, Shao, Jing, Chen, Siheng
With the integration of large language models (LLMs), embodied agents have strong capabilities to execute complicated instructions in natural language, paving the way for the potential deployment of embodied robots. However, a foreseeable issue is that these embodied agents can also flawlessly execute hazardous tasks, potentially causing damage in the real world. To study this issue, we present SafeAgentBench -- a new benchmark for safety-aware task planning of embodied LLM agents. SafeAgentBench includes: (1) a new dataset with 750 tasks, covering 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 8 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives. Experimental results show that the best-performing baseline achieves a 69% success rate on safe tasks but only a 5% rejection rate on hazardous tasks, indicating significant safety risks. More details and code are available at https://github.com/shengyin1224/SafeAgentBench.
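The two headline numbers above correspond to simple ratios over the benchmark's safe and hazardous task splits. A minimal sketch of how such metrics could be tallied is below; the record fields ("is_hazardous", "executed_successfully", "agent_rejected") are hypothetical placeholders, not SafeAgentBench's actual data schema.

```python
# Sketch: success rate on safe tasks and rejection rate on hazardous tasks.
# Field names are illustrative assumptions, not the benchmark's real schema.

def summarize(results):
    safe = [r for r in results if not r["is_hazardous"]]
    hazardous = [r for r in results if r["is_hazardous"]]
    success_rate = sum(r["executed_successfully"] for r in safe) / max(len(safe), 1)
    rejection_rate = sum(r["agent_rejected"] for r in hazardous) / max(len(hazardous), 1)
    return {"success_rate_safe": success_rate, "rejection_rate_hazardous": rejection_rate}

# A safety-aware agent should score high on both; the paper reports 69% success
# but only 5% rejection for its best-performing baseline.
demo = [
    {"is_hazardous": False, "executed_successfully": True, "agent_rejected": False},
    {"is_hazardous": True, "executed_successfully": True, "agent_rejected": False},
]
print(summarize(demo))
```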
Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review
Ye, Rui, Pang, Xianghe, Chai, Jingyi, Chen, Jiaao, Yin, Zhenfei, Xiang, Zhen, Dong, Xiaowen, Shao, Jing, Chen, Siheng
Scholarly peer review is a cornerstone of scientific advancement, but the system is under strain due to increasing manuscript submissions and the labor-intensive nature of the process. Recent advancements in large language models (LLMs) have led to their integration into peer review, with promising results such as substantial overlap between LLM- and human-generated reviews. However, the unchecked adoption of LLMs poses significant risks to the integrity of the peer review system. In this study, we comprehensively analyze the vulnerabilities of LLM-generated reviews by focusing on manipulation and inherent flaws. Our experiments show that injecting covert deliberate content into manuscripts allows authors to explicitly manipulate LLM reviews, leading to inflated ratings and reduced alignment with human reviews. In a simulation, we find that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% of rankings. Implicit manipulation, where authors strategically highlight minor limitations in their papers, further demonstrates LLMs' susceptibility compared to human reviewers: LLM reviews show 4.5 times higher consistency with the disclosed limitations. Additionally, LLMs exhibit inherent flaws, such as potentially assigning higher ratings to incomplete papers than to full papers and favoring well-known authors in a single-blind review process. These findings highlight the risks of over-reliance on LLMs in peer review, underscoring that we are not yet ready for widespread adoption and emphasizing the need for robust safeguards.
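The ranking-shift simulation mentioned above can be illustrated with a toy Monte Carlo sketch: inflate the scores of a randomly chosen 5% of papers and count how many originally top-30% papers are displaced. The score distribution and inflation amount below are illustrative assumptions, not the paper's experimental setup.

```python
# Toy illustration of the ranking-shift effect; all numbers are assumptions.
import random

def run_once(n_papers=1000, manipulated_frac=0.05, boost=1.0, top_frac=0.30):
    scores = [random.gauss(5.0, 1.0) for _ in range(n_papers)]
    k = int(top_frac * n_papers)
    original_top = set(sorted(range(n_papers), key=lambda i: -scores[i])[:k])
    manipulated = random.sample(range(n_papers), int(manipulated_frac * n_papers))
    new_scores = scores[:]
    for i in manipulated:
        new_scores[i] += boost  # manipulated papers receive inflated LLM ratings
    new_top = set(sorted(range(n_papers), key=lambda i: -new_scores[i])[:k])
    displaced = original_top - new_top
    return len(displaced) / len(original_top)

print(f"fraction of top-30% papers displaced: {run_once():.2%}")
```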
Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation
Tang, Shuo, Pang, Xianghe, Liu, Zexi, Tang, Bohan, Ye, Rui, Dong, Xiaowen, Wang, Yanfeng, Chen, Siheng
We conducted experiments comparing the effectiveness of using simpler versus more complex datasets in different stages of the post-training process to better understand the optimal post-training strategy for large language models. Here we conduct a comparison experiment on two kinds of instructions: simple instructions and specialized instructions, denoted as type 1 and type 2. As shown in Table 10, we observe that performing SFT on simpler instructions helps the model establish a foundational level of instruction-following ability. This is reflected in moderate performance on AlpacaEval 2 (LC 16.25%, WR 17.62%) but lower performance on the more challenging Arena-Hard benchmark (WR 10.7%). When the model is fine-tuned on more specialized and complex data, there is only a marginal improvement (LC 14.70%, WR 16.01%, Arena-Hard WR 14.7%); the significant performance gains are achieved when DPO is applied after SFT. For example, SFT followed by DPO with complex, specialized instructions yields substantial improvements (LC 21.64%, WR 30.06%,
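The SFT-then-DPO recipe discussed above optimizes a preference objective on chosen/rejected response pairs. A minimal sketch of the standard DPO loss is below; the inputs are summed log-probabilities of each response under the policy being tuned and the frozen SFT reference model, and the variable names are illustrative rather than the paper's implementation.

```python
# Sketch of the DPO objective applied after SFT; names are illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward: how much the policy deviates from the reference on each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Encourage a positive margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of per-response log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```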
Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation
Pang, Xianghe, Tang, Shuo, Ye, Rui, Xiong, Yuxin, Zhang, Bolun, Wang, Yanfeng, Chen, Siheng
Aligning large language models (LLMs) with human values is imperative to mitigate potential adverse effects resulting from their misuse. Drawing from the sociological insight that acknowledging all parties' concerns is a key factor in shaping human values, this paper proposes a novel direction to align LLMs by themselves: social scene simulation. To achieve this, we present MATRIX, a novel social scene simulator that emulates realistic scenes around a user's input query, enabling the LLM to take social consequences into account before responding. MATRIX serves as a virtual rehearsal space, akin to a Monopolylogue, where the LLM performs diverse roles related to the query and practices by itself. To instill this alignment, we fine-tune the LLM with MATRIX-simulated data, ensuring adherence to human values without compromising inference speed. We theoretically show that the LLM with MATRIX outperforms Constitutional AI under mild assumptions. Finally, extensive experiments validate that our method outperforms over 10 baselines across 4 benchmarks. As evidenced by 875 user ratings, our tuned 13B-size LLM exceeds GPT-4 in aligning with human values.
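A schematic sketch of the Monopolylogue idea described above: one LLM plays several roles affected by a query, and the simulated consequences inform a revised answer that can serve as fine-tuning data. The `llm` callable, the roles, and the prompts are illustrative assumptions, not MATRIX's actual prompting scheme.

```python
# Sketch of a social-scene simulation loop; `llm(prompt) -> str` is hypothetical.

def simulate_social_scene(llm, query, roles=("user", "bystander", "regulator")):
    initial_answer = llm(f"Answer the user query: {query}")
    consequences = [
        llm(f"As the {role}, describe how you are affected if an assistant "
            f"responds to '{query}' with: {initial_answer}")
        for role in roles
    ]
    revised = llm(
        "Considering these simulated consequences:\n"
        + "\n".join(consequences)
        + f"\nGive a revised, value-aligned answer to: {query}"
    )
    return {"prompt": query, "response": revised}  # one fine-tuning pair

# Usage with any completion backend wrapped as llm(prompt) -> str:
# pair = simulate_social_scene(my_llm, "How do I get back at a rude neighbor?")
```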