
Collaborating Authors

 Kong, Shuqi


Reflection-Bench: probing AI intelligence with reflection

arXiv.org Artificial Intelligence

The ability to adapt beliefs or behaviors in response to unexpected outcomes, reflection, is fundamental to intelligent systems' interaction with the world. From a cognitive science perspective, this serves as a core principle of intelligence applicable to both human and AI systems. To address the debate on the intelligence of large language models (LLMs), we propose Reflection-Bench, a comprehensive benchmark comprising 7 tasks spanning core cognitive functions crucial for reflection, including perception, memory, belief updating, decision-making, prediction, counterfactual thinking, and meta-reflection. We evaluate the performances of 13 prominent LLMs such as OpenAI o1, GPT-4, and Claude 3.5 Sonnet.

Figure 1: Reflection, a fundamental process of intelligence, integrates various cognitive components. To achieve desired outcomes, an intelligent agent must predict the external world states and behavioral consequences based on prior beliefs. Post-action, discrepancies between prediction and observation are perceived, prompting an update of prior belief.
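To make the predict-observe-update loop in Figure 1 concrete, here is a minimal, self-contained Python sketch of that cycle. It is not the paper's code: the toy environment, learning rate, and prediction-error update rule are illustrative assumptions, not details of the benchmark's tasks.

    import random

    def reflection_loop(steps=20, true_reward_prob=0.8, learning_rate=0.3):
        belief = 0.5                          # prior belief: P(action yields reward)
        for _ in range(steps):
            prediction = belief               # predict the outcome from the prior belief
            observation = 1.0 if random.random() < true_reward_prob else 0.0
            error = observation - prediction  # perceived discrepancy after acting
            belief += learning_rate * error   # update the prior belief (reflection)
        return belief

    if __name__ == "__main__":
        print(f"belief after interaction: {reflection_loop():.2f}")

Under this toy setup the belief drifts toward the environment's true reward probability, which is the adaptive behavior the benchmark's tasks are designed to probe.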


ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models

arXiv.org Artificial Intelligence

Emotion Support Conversation (ESC) is a crucial application that aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as ESC models. However, how to evaluate these LLM-based ESC systems remains an open question. Inspired by the rapid development of role-playing agents, we propose an ESC evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by manual evaluation of the interactive dialogues. In detail, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a specific role-playing model, ESC-Role, which behaves more like a confused help-seeker than GPT-4 does. Third, using ESC-Role and the organized role cards, we systematically conduct experiments with 14 LLMs as ESC models, including general AI-assistant LLMs (e.g., ChatGPT) and ESC-oriented LLMs (e.g., ExTES-Llama). We conduct comprehensive human annotation on the interactive multi-turn dialogues of the different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but both still fall short of human performance. Moreover, to automate the scoring of future ESC models, we develop ESC-RANK, which is trained on the annotated data and surpasses the scoring performance of GPT-4 by 35 points. Our data and code are available at https://github.com/haidequanbu/ESC-Eval.
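The evaluation flow described above can be sketched as a short Python function: a role-playing "help seeker" agent converses with the ESC model under test for several turns, and the finished dialogue is then scored. This is an assumption-laden sketch, not the released ESC-Eval code; seeker_reply, esc_reply, and score_dialogue are hypothetical stand-ins for ESC-Role, the evaluated model, and the human or ESC-RANK scoring step.

    def evaluate_esc_model(esc_reply, seeker_reply, score_dialogue, role_cards, turns=8):
        """Average dialogue score of one ESC model across all role cards."""
        scores = []
        for card in role_cards:
            # The role card defines the seeker's persona; the seeker opens the conversation.
            dialogue = [("seeker", seeker_reply(card, []))]
            for _ in range(turns):
                dialogue.append(("supporter", esc_reply(card, dialogue)))  # ESC model turn
                dialogue.append(("seeker", seeker_reply(card, dialogue)))  # role-playing agent turn
            scores.append(score_dialogue(dialogue))  # manual annotation or ESC-RANK scoring
        return sum(scores) / len(scores)

Passing the three components as callables mirrors the framework's separation of the role-playing agent, the model under evaluation, and the scorer, so any of them can be swapped without changing the loop.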