Chen, Yuyan
Open-Set Recognition of Novel Species in Biodiversity Monitoring
Chen, Yuyan, Lang, Nico, Schmidt, B. Christian, Jain, Aditya, Basset, Yves, Beery, Sara, Larrivée, Maxim, Rolnick, David
Machine learning is increasingly being applied to facilitate long-term, large-scale biodiversity monitoring. With most species on Earth still undiscovered or poorly documented, species-recognition models are expected to encounter new species during deployment. We introduce Open-Insects, a fine-grained image recognition benchmark dataset for open-set recognition and out-of-distribution detection in biodiversity monitoring. Open-Insects makes it possible to evaluate algorithms for new species detection on several geographical open-set splits with varying difficulty. Furthermore, we present a test set recently collected in the wild with 59 species that are likely new to science. We evaluate a variety of open-set recognition algorithms, including post-hoc methods, training-time regularization, and training with auxiliary data, finding that the simple post-hoc approach of utilizing softmax scores remains a strong baseline. We also demonstrate how to leverage auxiliary data to improve the detection performance when the training dataset is limited. Our results provide timely insights to guide the development of computer vision methods for biodiversity monitoring and species discovery.
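A minimal sketch of the post-hoc softmax baseline mentioned above: the maximum softmax probability (MSP) of a trained classifier is used as a confidence score, and images scoring below a threshold are flagged as candidate novel species. The model, batch shapes, and threshold here are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def msp_scores(logits: torch.Tensor) -> torch.Tensor:
    """Maximum softmax probability per sample (higher = more likely a known species)."""
    return F.softmax(logits, dim=-1).max(dim=-1).values

@torch.no_grad()
def flag_novel(model: torch.nn.Module, images: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Boolean mask that is True where the classifier is uncertain enough to flag novelty."""
    model.eval()
    logits = model(images)                  # (batch, num_known_species)
    return msp_scores(logits) < threshold   # (batch,)
```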
Do Large Language Models Have Problem-Solving Capability under Incomplete Information Scenarios?
Chen, Yuyan, Yu, Tianhao, Li, Yueze, Yan, Songzhou, Liu, Sijia, Liang, Jiaqing, Xiao, Yanghua
The evaluation of the problem-solving capability of Large Language Models (LLMs) under incomplete information scenarios is increasingly important, encompassing capabilities such as questioning, knowledge search, error detection, and path planning. Current research mainly focuses on LLMs' problem-solving capability in games such as ``Twenty Questions''. However, these games do not require recognizing misleading cues, which is necessary in incomplete information scenarios. Moreover, existing games such as ``Who is undercover'' are highly subjective, making them challenging to evaluate. Therefore, in this paper, we introduce a novel game named BrainKing, based on ``Who is undercover'' and ``Twenty Questions'', for evaluating LLM capabilities under incomplete information scenarios. It requires LLMs to identify target entities with limited yes-or-no questions and potentially misleading answers. By setting up easy, medium, and hard difficulty modes, we comprehensively assess the performance of LLMs across various aspects. Our results reveal the capabilities and limitations of LLMs in BrainKing, providing significant insights into LLMs' problem-solving levels.
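As a rough illustration of the interaction protocol described above, the sketch below plays one BrainKing-style round: the evaluated LLM asks a limited number of yes-or-no questions about a hidden target while the oracle occasionally flips its answer to inject a misleading cue. `ask_llm`, `guess_llm`, and `oracle` are hypothetical callables standing in for the benchmark's actual components.

```python
import random

def play_round(ask_llm, guess_llm, oracle, target: str,
               max_questions: int = 20, mislead_prob: float = 0.2, seed: int = 0) -> bool:
    """Return True if the LLM names the target within the question budget."""
    rng = random.Random(seed)
    history = []
    for _ in range(max_questions):
        question = ask_llm(history)         # e.g. "Is it an animal?"
        answer = oracle(target, question)   # truthful "yes" / "no"
        if rng.random() < mislead_prob:     # inject a misleading answer
            answer = "no" if answer == "yes" else "yes"
        history.append((question, answer))
    guess = guess_llm(history)              # final identification attempt
    return guess.strip().lower() == target.lower()
```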
HOTVCOM: Generating Buzzworthy Comments for Videos
Chen, Yuyan, Qian, Yiwen, Yan, Songzhou, Jia, Jiyuan, Li, Zhixu, Xiao, Yanghua, Li, Xiaobo, Yang, Ming, Guo, Qingpei
In the era of social media video platforms, popular ``hot-comments'' play a crucial role in attracting user impressions of short-form videos, making them vital for marketing and branding purposes. However, existing research predominantly focuses on generating descriptive comments or ``danmaku'' in English, which offer immediate reactions to specific video moments. Addressing this gap, our study introduces \textsc{HotVCom}, the largest Chinese video hot-comment dataset, comprising 94k diverse videos and 137 million comments. We also present the \texttt{ComHeat} framework, which synergistically integrates visual, auditory, and textual data to generate influential hot-comments for Chinese videos. Empirical evaluations highlight the effectiveness of our framework, demonstrating its excellence on both the newly constructed and existing datasets.
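To make the multimodal claim concrete, here is a minimal late-fusion sketch in which visual, auditory, and textual embeddings are concatenated and projected into a shared space that a comment decoder could condition on. The module and dimensions are assumptions for illustration; the actual ComHeat framework is considerably more involved.

```python
import torch
import torch.nn as nn

class SimpleFusion(nn.Module):
    """Concatenate per-modality embeddings and project them to one conditioning vector."""
    def __init__(self, vis_dim: int = 512, aud_dim: int = 256,
                 txt_dim: int = 768, out_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(vis_dim + aud_dim + txt_dim, out_dim)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([vis, aud, txt], dim=-1))
```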
EmotionQueen: A Benchmark for Evaluating Empathy of Large Language Models
Chen, Yuyan, Wang, Hao, Yan, Songzhou, Liu, Sijia, Li, Yueze, Zhao, Yi, Xiao, Yanghua
Emotional intelligence in large language models (LLMs) is of great importance in Natural Language Processing. However, previous research mainly focuses on basic sentiment analysis tasks, such as emotion recognition, which is not enough to evaluate LLMs' overall emotional intelligence. Therefore, this paper presents a novel framework named EmotionQueen for evaluating the emotional intelligence of LLMs. The framework includes four distinctive tasks: Key Event Recognition, Mixed Event Recognition, Implicit Emotional Recognition, and Intention Recognition. LLMs are requested to recognize important events or implicit emotions and to generate empathetic responses. We also design two metrics to evaluate LLMs' capabilities in recognition and response for emotion-related statements. Experiments yield significant conclusions about LLMs' capabilities and limitations in emotional intelligence.
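A hedged sketch of how the two evaluation dimensions could be computed over a set of emotion-laden statements: one rate for whether the key or implicit emotion was recognized, and one for whether the generated reply is empathetic. The judge functions are placeholders; the paper's actual metric definitions may differ.

```python
from typing import Callable, List, Tuple

def evaluate(items: List[Tuple[str, str]],
             recognized: Callable[[str, str], bool],
             empathetic: Callable[[str, str], bool]) -> Tuple[float, float]:
    """items holds (statement, llm_response) pairs; returns (recognition rate, response rate)."""
    rec_hits = sum(recognized(stmt, resp) for stmt, resp in items)
    emp_hits = sum(empathetic(stmt, resp) for stmt, resp in items)
    n = max(len(items), 1)
    return rec_hits / n, emp_hits / n
```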
Recent Advancement of Emotion Cognition in Large Language Models
Chen, Yuyan, Xiao, Yanghua
Emotion cognition in large language models (LLMs) is crucial for enhancing performance across various applications, such as social media, human-computer interaction, and mental health assessment. In this paper, we present a detailed survey of recent progress in LLMs for emotion cognition. We explore the current landscape of research, which primarily revolves around emotion classification, emotionally rich response generation, and Theory of Mind assessments, while acknowledging challenges such as dependence on annotated data and the complexity of emotion processing. We review key research studies, methodologies, outcomes, and resources, aligning them with Ulric Neisser's cognitive stages. We also discuss advanced methods, such as contrastive learning, used to improve LLMs' emotion cognition capabilities. Additionally, we outline potential future directions for research in this evolving field, including unsupervised learning approaches and the development of more complex and interpretable emotion cognition LLMs.
Can Pre-trained Language Models Understand Chinese Humor?
Chen, Yuyan, Li, Zhixu, Liang, Jiaqing, Xiao, Yanghua, Liu, Bang, Chen, Yunwen
Humor understanding is an important and challenging research problem in natural language processing. With the growing popularity of pre-trained language models (PLMs), some recent work has made preliminary attempts to adopt PLMs for humor recognition and generation. However, these simple attempts do not substantially answer the question: {\em are PLMs capable of humor understanding?} This paper is the first work to systematically investigate the humor understanding ability of PLMs. For this purpose, we design a comprehensive framework with three evaluation steps and four evaluation tasks. We also construct a comprehensive Chinese humor dataset that fully meets all the data requirements of the proposed evaluation framework. Our empirical study on the Chinese humor dataset yields several valuable observations, which provide important guidance for the future optimization of PLMs in humor understanding and generation.
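As a small illustration of adopting a PLM for humor recognition (one of the tasks discussed above), the snippet below scores a Chinese sentence with a sequence classification head. The checkpoint and the binary label convention are assumptions for illustration, not the paper's evaluation setup; in practice the classifier head would first be fine-tuned on labeled humor data.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

def humor_probability(text: str) -> float:
    """Probability that the input text is humorous (label 1 assumed to mean 'humorous')."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
```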
Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models
Chen, Yuyan, Fu, Qiang, Yuan, Yichen, Wen, Zhihao, Fan, Ge, Liu, Dayiheng, Zhang, Dongmei, Li, Zhixu, Xiao, Yanghua
Large Language Models (LLMs) have gained widespread adoption in various natural language processing tasks, including question answering and dialogue systems. However, a major drawback of LLMs is hallucination, where they generate unfaithful or inconsistent content that deviates from the input source, which can lead to severe consequences. In this paper, we propose a robust discriminator named RelD to effectively detect hallucination in LLMs' generated answers. RelD is trained on RelQA, a newly constructed bilingual question-answering dialogue dataset with answers generated by LLMs and a comprehensive set of metrics. Our experimental results demonstrate that RelD successfully detects hallucination in the answers generated by diverse LLMs, and that it performs well in distinguishing hallucination in LLMs' generated answers on both in-distribution and out-of-distribution datasets. Additionally, we conduct a thorough analysis of the types of hallucinations that occur and present valuable insights. This research contributes significantly to the detection of reliable answers generated by LLMs and holds noteworthy implications for mitigating hallucination in future work.
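A hedged sketch of how a trained discriminator of this kind could be applied at inference time: each (question, generated answer) pair is scored, and pairs above a threshold are flagged as likely hallucinations. The checkpoint and label convention are stand-ins, not the released RelD model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
detector = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)  # placeholder for a trained discriminator

@torch.no_grad()
def is_hallucinated(question: str, answer: str, threshold: float = 0.5) -> bool:
    """True if the pair is scored as hallucinated (label 1 assumed to mean 'hallucinated')."""
    inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
    probs = torch.softmax(detector(**inputs).logits, dim=-1)
    return probs[0, 1].item() > threshold
```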
MAPO: Boosting Large Language Model Performance with Model-Adaptive Prompt Optimization
Chen, Yuyan, Wen, Zhihao, Fan, Ge, Chen, Zhengyu, Wu, Wei, Liu, Dayiheng, Li, Zhixu, Liu, Bang, Xiao, Yanghua
Prompt engineering, as an efficient and effective way to leverage Large Language Models (LLMs), has drawn a lot of attention from the research community. Existing research primarily emphasizes adapting prompts to specific tasks, rather than to specific LLMs. However, a good prompt is not solely defined by its wording; it is also bound to the nature of the LLM in question. In this work, we first quantitatively demonstrate that different prompts should be adapted to different LLMs to enhance their capabilities across various downstream tasks in NLP. We then propose a model-adaptive prompt optimizer (MAPO) that optimizes the original prompts for each specific LLM in downstream tasks. Extensive experiments indicate that the proposed method can effectively refine prompts for an LLM, leading to significant improvements across various downstream tasks.
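A simplified sketch of the model-adaptive idea: the same task prompt is rewritten into several candidates, and the candidate that scores best for a specific LLM on a small development set is kept. The plain argmax search below is an illustrative stand-in for the paper's optimizer.

```python
from typing import Callable, List, Sequence, Tuple

def select_prompt(candidates: Sequence[str],
                  llm: Callable[[str, str], str],          # (prompt, task input) -> output
                  dev_set: List[Tuple[str, str]],          # (task input, reference) pairs
                  metric: Callable[[str, str], float]) -> str:
    """Pick the candidate prompt with the highest average metric for this particular LLM."""
    def score(prompt: str) -> float:
        return sum(metric(llm(prompt, x), y) for x, y in dev_set) / max(len(dev_set), 1)
    return max(candidates, key=score)
```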
AICoderEval: Improving AI Domain Code Generation of Large Language Models
Xia, Yinghui, Chen, Yuyan, Shi, Tianyu, Wang, Jun, Yang, Jinsong
Automated code generation is a pivotal capability of large language models (LLMs). However, assessing this capability in real-world scenarios remains challenging. Previous methods focus more on low-level code generation, such as model loading, rather than on generating high-level code that caters to real-world tasks, such as image-to-text and text classification, across various domains. Therefore, we construct AICoderEval, a dataset of real-world tasks across various domains based on HuggingFace, PyTorch, and TensorFlow, along with comprehensive metrics for evaluating and enhancing LLMs' task-specific code generation capability. AICoderEval contains test cases and complete programs for the automated evaluation of these tasks, covering domains such as natural language processing, computer vision, and multimodal learning. To facilitate research in this area, we open-source the AICoderEval dataset at \url{https://huggingface.co/datasets/vixuowis/AICoderEval}. We then propose CoderGen, an agent-based framework that helps LLMs generate code for the real-world tasks in AICoderEval. Moreover, we train a more powerful task-specific code generation model, named AICoder, which is fine-tuned from Llama-3 on AICoderEval. Our experiments demonstrate the effectiveness of CoderGen in improving LLMs' task-specific code generation capability (by 12.00\% on pass@1 for the original model and 9.50\% on pass@1 for the ReAct agent). AICoder also outperforms current code generation LLMs, indicating the high quality of the AICoderEval benchmark.
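For reference, the pass@1 numbers above follow the usual convention: per task, the fraction of generated programs that pass all test cases, averaged over tasks. The sketch below computes this standard estimator; the benchmark's own harness may differ in details.

```python
from typing import List

def pass_at_1(results: List[List[bool]]) -> float:
    """results[i][j] is True if the j-th sample for task i passed all of its tests."""
    per_task = [sum(samples) / len(samples) for samples in results if samples]
    return sum(per_task) / max(len(per_task), 1)
```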
ProSwitch: Knowledge-Guided Instruction Tuning to Generate Professional and Non-Professional Styled Text
Zong, Chang, Chen, Yuyan, Lu, Weiming, Shao, Jian, Zhuang, Yueting
Large Language Models (LLMs) have demonstrated efficacy in various linguistic applications, including text summarization and controlled text generation. However, their capacity to switch between styles via fine-tuning remains underexplored. This study concentrates on textual professionalism and introduces a novel methodology, named ProSwitch, which equips a language model with the ability to produce both professional and non-professional responses through knowledge-guided instruction tuning. ProSwitch unfolds across three phases: data preparation, for gathering domain knowledge and the training corpus; instruction tuning, for optimizing language models with multiple levels of instruction formats; and comprehensive evaluation, for assessing the professionalism discrimination and reference-based quality of the generated text. Comparative analysis of ProSwitch against both general and specialized language models reveals that our approach outperforms baselines in switching between professional and non-professional text generation.
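A hypothetical example of the kind of style-controlled training instance that knowledge-guided instruction tuning could be built on: the same question paired with a professional-style and a non-professional-style target response. Field names and content are illustrative only, not the paper's released data schema.

```python
example = {
    "question": "What causes type 2 diabetes?",
    "professional_instruction": "Answer in a professional, clinical style.",
    "professional_response": ("Type 2 diabetes arises from insulin resistance combined "
                              "with progressive beta-cell dysfunction."),
    "casual_instruction": "Answer casually, as if talking to a friend.",
    "casual_response": ("Mostly it's your body not using insulin well anymore, "
                        "so blood sugar stays high."),
}
```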