Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench

Jen-tse Huang, Man Ho Lam, Eric John Li, Shujie Ren, Wenxuan Wang, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu

arXiv.org Artificial Intelligence 

Figure 1: LLMs' emotions can be affected by situations, which further affect their behaviors.

Evaluating Large Language Models' (LLMs) anthropomorphic capabilities has become increasingly important in contemporary discourse. Drawing on emotion appraisal theory from psychology, we propose to evaluate the empathy of LLMs, i.e., how their feelings change when presented with specific situations. After a careful and comprehensive survey, we collect a dataset containing over 400 situations that have proven effective in eliciting the eight emotions central to our study. Categorizing the situations into 36 factors, we conduct a human evaluation involving more than 1,200 subjects worldwide. With the human evaluation results as references, we evaluate five LLMs, covering both commercial and open-source models of various sizes and featuring the latest iterations, such as GPT-4 and LLaMA-2. We find that, despite several misalignments, LLMs can generally respond appropriately to certain situations. Nevertheless, they fall short of aligning with the emotional behaviors of human beings and cannot establish connections between similar situations.

Large Language Models (LLMs) have recently made significant strides in artificial intelligence, representing a noteworthy milestone in computer science. LLMs have showcased their capabilities across various tasks, including sentence revision (Wu et al., 2023), text translation (Jiao et al., 2023), program repair (Fan et al., 2023), and program testing (Deng et al., 2023; Kang et al., 2023). With the rapid advancement of LLMs, an increasing number of users are eager to embrace them as a more comprehensive and integrated software solution. However, LLMs are more than just tools; they are also lifelike assistants. Consequently, we need not only to evaluate their performance but also to understand the communicative dynamics between LLMs and humans, with human behavior as the reference. This paper delves into an unexplored aspect of robustness in LLMs, explicitly addressing the concept of emotional robustness.
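To make the evaluation protocol concrete, below is a minimal sketch of the situation-based probe described above: the model self-reports its affect before and after imagining a situation, and the shift between the two ratings is recorded. The single-item 1-5 rating, the `rate_affect` and `emotion_shift` helpers, and the choice of the OpenAI Python client are illustrative assumptions, not the paper's released implementation, which relies on validated psychological scales.

```python
# A minimal sketch of the before/after emotion probe, under stated
# assumptions: the OpenAI Python client (openai>=1.0) as the backend and
# a hypothetical single-item 1-5 negative-affect rating. The benchmark
# itself uses validated psychological scales; this only illustrates the
# measurement loop.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCALE_PROMPT = (
    "On a scale from 1 (not at all) to 5 (extremely), how negative do "
    "you feel right now? Reply with a single number."
)

def chat(messages: list[dict]) -> str:
    """One chat-completion turn; the model choice is an assumption."""
    resp = client.chat.completions.create(model="gpt-4", messages=messages)
    return resp.choices[0].message.content

def rate_affect(history: list[dict]) -> int:
    """Ask the model to self-report its current negative affect."""
    reply = chat(history + [{"role": "user", "content": SCALE_PROMPT}])
    return int(reply.strip()[0])  # naive parse; real code should validate

def emotion_shift(situation: str) -> int:
    """Rating after imagining the situation minus the baseline rating."""
    baseline = rate_affect([])
    history = [{"role": "user",
                "content": f"Imagine you are in the situation: {situation}"}]
    history.append({"role": "assistant", "content": chat(history)})
    return rate_affect(history) - baseline

if __name__ == "__main__":
    # Example situation from Figure 1.
    print(emotion_shift("A boy kicks a ball at you on purpose "
                        "and everybody laughs."))
```

Aggregating such shifts over the 400+ situations and the 36 factors, then comparing them against the human-subject references, would follow the comparison the paper describes.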