Personal
Do as We Do, Not as You Think: the Conformity of Large Language Models
Weng, Zhiyuan, Chen, Guikun, Wang, Wenguan
Recent advancements in large language models (LLMs) revolutionize the field of intelligent agents, enabling collaborative multi-agent systems capable of tackling complex problems across various domains. However, the potential of conformity within these systems, analogous to phenomena like conformity bias and groupthink in human group dynamics, remains largely unexplored, raising concerns about their collective problem-solving capabilities and possible ethical implications. This paper presents a comprehensive study on conformity in LLM-driven multi-agent systems, focusing on three aspects: the existence of conformity, the factors influencing conformity, and potential mitigation strategies. In particular, we introduce BenchForm, a new conformity-oriented benchmark, featuring reasoning-intensive tasks and five distinct interaction protocols designed to probe LLMs' behavior in collaborative scenarios. Several representative LLMs are evaluated on BenchForm, using metrics such as conformity rate and independence rate to quantify conformity's impact. Our analysis delves into factors influencing conformity, including interaction time and majority size, and examines how the subject agent rationalizes its conforming behavior. Furthermore, we explore two strategies to mitigate conformity effects, i.e., developing enhanced personas and implementing a reflection mechanism. Several interesting findings regarding LLMs' conformity are derived from empirical results and case studies. We hope that these insights can pave the way for more robust and ethically-aligned collaborative AI systems. Our benchmark and code are available at BenchForm.
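The abstract names conformity rate and independence rate without defining them. As a rough illustration only, the sketch below assumes one plausible reading: conformity rate is the share of questions the agent answers correctly alone but then switches to a wrong majority answer, and independence rate is the share of disagreeing-majority cases where the agent keeps its own answer. The `Trial` record and both function definitions are hypothetical and may differ from the metrics BenchForm actually uses.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    solo_answer: str      # answer given with no other agents present
    group_answer: str     # answer given after seeing the other agents' answers
    majority_answer: str  # answer backed by the majority of the other agents
    correct_answer: str


def conformity_rate(trials):
    """Assumed definition: among trials answered correctly alone while the
    majority is wrong, the fraction where the agent adopts the majority answer."""
    eligible = [
        t for t in trials
        if t.solo_answer == t.correct_answer
        and t.majority_answer != t.correct_answer
    ]
    if not eligible:
        return 0.0
    flipped = sum(t.group_answer == t.majority_answer for t in eligible)
    return flipped / len(eligible)


def independence_rate(trials):
    """Assumed definition: among trials where the majority disagrees with the
    agent's solo answer, the fraction where the agent keeps that solo answer."""
    eligible = [t for t in trials if t.majority_answer != t.solo_answer]
    if not eligible:
        return 0.0
    held = sum(t.group_answer == t.solo_answer for t in eligible)
    return held / len(eligible)


trials = [
    Trial("A", "B", "B", "A"),  # correct alone, then follows the wrong majority
    Trial("A", "A", "B", "A"),  # correct alone, resists the wrong majority
    Trial("C", "C", "C", "C"),  # majority agrees, so no pressure to conform
    Trial("A", "C", "B", "A"),  # changes answer, but not to the majority's
]
print(conformity_rate(trials), independence_rate(trials))
```

Under these assumed definitions the two rates need not sum to one: the last trial counts as neither conforming nor independent, because the agent abandons its answer without adopting the majority's.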
GuideLLM: Exploring LLM-Guided Conversation with Applications in Autobiography Interviewing
Duan, Jinhao, Zhao, Xinyu, Zhang, Zhuoxuan, Ko, Eunhye, Boddy, Lily, Wang, Chenan, Li, Tianhao, Rasgon, Alexander, Hong, Junyuan, Lee, Min Kyung, Yuan, Chenxi, Long, Qi, Ding, Ying, Chen, Tianlong, Xu, Kaidi
Although Large Language Models (LLMs) succeed in human-guided conversations such as instruction following and question answering, the potential of LLM-guided conversations, where LLMs direct the discourse and steer the conversation's objectives, remains under-explored. In this study, we first characterize LLM-guided conversation into three fundamental components: (i) Goal Navigation; (ii) Context Management; (iii) Empathetic Engagement, and propose GuideLLM as an installation. We then implement an interviewing environment for the evaluation of LLM-guided conversation. Specifically, various topics are involved in this environment for comprehensive interviewing evaluation, resulting in around 1.4k turns of utterances, 184k tokens, and over 200 events mentioned during the interviewing for each chatbot evaluation. We compare GuideLLM with 6 state-of-the-art LLMs such as GPT-4o and Llama-3-70b-Instruct, from the perspectives of interviewing quality and autobiography generation quality. For automatic evaluation, we derive user proxies from multiple autobiographies and employ LLM-as-a-judge to score LLM behaviors. We further conduct a human-involved experiment by employing 45 human participants to chat with GuideLLM and baselines. We then collect human feedback, preferences, and ratings regarding the qualities of conversation and autobiography. Experimental results indicate that GuideLLM significantly outperforms baseline LLMs in automatic evaluation and achieves consistent leading performances in human ratings.
HamRaz: A Culture-Based Persian Conversation Dataset for Person-Centered Therapy Using LLM Agents
Abbasi, Mohammad Amin, Mirnezami, Farnaz Sadat, Naderi, Hassan
This paper presents HamRaz, a novel Persian-language mental health dataset designed for Person-Centered Therapy (PCT) using Large Language Models (LLMs). Despite the growing application of LLMs in AI-driven psychological counseling, existing datasets predominantly focus on Western and East Asian contexts, overlooking cultural and linguistic nuances essential for effective Persian-language therapy. To address this gap, HamRaz combines script-based dialogues with adaptive LLM role-playing, ensuring coherent and dynamic therapy interactions. We also introduce HamRazEval, a dual evaluation framework that measures conversational quality and therapeutic effectiveness using General Dialogue Metrics and the Barrett-Lennard Relationship Inventory (BLRI). Experimental results show HamRaz outperforms conventional Script Mode and Two-Agent Mode, producing more empathetic, context-aware, and realistic therapy sessions. By releasing HamRaz, we contribute a culturally adapted, LLM-driven resource to advance AI-powered psychotherapy research in diverse communities.
Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs
Meng, Han, Zhang, Renwen, Wang, Ganyi, Yang, Yitian, Qin, Peinuan, Lee, Jungup, Lee, Yi-Chieh
Mental-illness stigma is a persistent social problem, hampering both treatment-seeking and recovery. Accordingly, there is a pressing need to understand it more clearly, but analyzing the relevant data is highly labor-intensive. Therefore, we designed a chatbot to engage participants in conversations; coded those conversations qualitatively with AI assistance; and, based on those coding results, built causal knowledge graphs to decode stigma. The results we obtained from 1,002 participants demonstrate that conversation with our chatbot can elicit rich information about people's attitudes toward depression, while our AI-assisted coding was strongly consistent with human-expert coding. Our novel approach combining large language models (LLMs) and causal knowledge graphs uncovered patterns in individual responses and illustrated the interrelationships of psychological constructs in the dataset as a whole. The paper also discusses these findings' implications for HCI researchers in developing digital interventions, decomposing human psychological constructs, and fostering inclusive attitudes.
Review for NeurIPS paper: Model Class Reliance for Random Forests
This is a relevant and timely paper that has been reviewed by four knowledgeable referees, who also thoroughly considered the authors' response to their initial reviews. Three of these reviewers recommend acceptance, providing detailed suggestions on how to improve this work before its final submission. The dissenting opinion was upheld by R3 after discussion with the other referees. R3, in my opinion, correctly brings up that if the proposed approach aims to improve runtime with an approximate algorithm, this must be sufficiently demonstrated in experiments vs. straightforward alternatives (such as retraining-based methods). That has been done neither in the original submission nor in the rebuttal.
Review for NeurIPS paper: Curriculum By Smoothing
Weaknesses: - The authors compared their method to the baseline approach only. However, there are plenty of curriculum learning methods that could have been used as relevant state-of-the-art competing methods to compare with, e.g. Comparison with such competing methods is mandatory, in my opinion. I believe that the non-linearity is typically applied before the pooling operation. Even so, it is not clear why it works so well.
RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care
Yang, Ziqi, Lu, Yuxuan, Bagdasarian, Jennifer, Swain, Vedant Das, Agarwal, Ritu, Campbell, Collin, Al-Refaire, Waddah, El-Bayoumi, Jehan, Gao, Guodong, Wang, Dakuo, Yao, Bingsheng, Shara, Nawar
Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group of cancers that account for more than 35% of cancer-related deaths worldwide, but postoperative complications are unpredictable and can be life-threatening. In this paper, we investigate how recent advancements in large language models (LLMs) can benefit remote patient monitoring (RPM) systems through clinical integration by designing RECOVER, an LLM-powered RPM system for postoperative GI cancer care. To closely engage stakeholders in the design process, we first conducted seven participatory design sessions with five clinical staff and interviewed five cancer patients to derive six major design strategies for integrating clinical guidelines and information needs into LLM-based RPM systems. We then designed and implemented RECOVER, which features an LLM-powered conversational agent for cancer patients and an interactive dashboard for clinical staff to enable efficient postoperative RPM. Finally, we used RECOVER as a pilot system to assess the implementation of our design strategies with four clinical staff and five patients, providing design implications by identifying crucial design elements, offering insights on responsible AI, and outlining opportunities for future LLM-powered RPM systems.
KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy
Kim, Hyunjong, Lee, Suyeon, Cho, Yeongjae, Ryu, Eunseo, Jo, Yohan, Seong, Suran, Cho, Sungzoon
The increasing demand for mental health services has led to the rise of AI-driven mental health chatbots, though challenges related to privacy, data collection, and expertise persist. Motivational Interviewing (MI) is gaining attention as a theoretical basis for boosting expertise in the development of these chatbots. However, existing datasets are showing limitations for training chatbots, leading to a substantial demand for publicly available resources in the field of MI and psychotherapy. These challenges are even more pronounced in non-English languages, where they receive less attention. In this paper, we propose a novel framework that simulates MI sessions enriched with the expertise of professional therapists. We train an MI forecaster model that mimics the behavioral choices of professional therapists and employ Large Language Models (LLMs) to generate utterances through prompt engineering. Then, we present KMI, the first synthetic dataset theoretically grounded in MI, containing 1,000 high-quality Korean Motivational Interviewing dialogues. Through an extensive expert evaluation of the generated dataset and the dialogue model trained on it, we demonstrate the quality, expertise, and practicality of KMI. We also introduce novel metrics derived from MI theory in order to evaluate dialogues from the perspective of MI.
Inside France's Effort to Shape the Global AI Conversation
One evening early last year, Anne Bouverot was putting the finishing touches on a report when she received an urgent phone call. It was one of French President Emmanuel Macron's aides offering her the role as his special envoy on artificial intelligence. The unpaid position would entail leading the preparations for the France AI Action Summit, a gathering where heads of state, technology CEOs, and civil society representatives will seek to chart a course for AI's future. Set to take place on Feb. 10 and 11 at the presidential Élysée Palace in Paris, it will be the first such gathering since the virtual Seoul AI Summit in May, and the first in-person meeting since November 2023, when world leaders descended on Bletchley Park for the U.K.'s inaugural AI Safety Summit. After weighing the offer, Bouverot, who was at the time the co-chair of France's AI Commission, accepted. But France's Summit won't be like the others.
Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning
Zheng, Bokeng, Rao, Bo, Zhu, Tianxiang, Tan, Chee Wei, Duan, Jingpu, Zhou, Zhi, Chen, Xu, Zhang, Xiaoxi
Advances in artificial intelligence (AI), including foundation models (FMs), are increasingly transforming human society, with smart city driving the evolution of urban living. Meanwhile, vehicle crowdsensing (VCS) has emerged as a key enabler, leveraging vehicles' mobility and sensor-equipped capabilities. In particular, ride-hailing vehicles can effectively facilitate flexible data collection and contribute towards urban intelligence, despite resource limitations. Therefore, this work explores a promising scenario, where edge-assisted vehicles perform joint tasks of order serving and the emerging foundation model fine-tuning using various urban data. However, integrating the VCS AI task with the conventional order serving task is challenging, due to their inconsistent spatio-temporal characteristics: (i) the distributions of ride orders and data points of interest (PoIs) may not coincide in geography, both following a priori unknown patterns; (ii) they have distinct forms of temporal effects, i.e., prolonged waiting makes orders become instantly invalid while data with increased staleness gradually reduces its utility for model fine-tuning. To overcome these obstacles, we propose an online framework based on multi-agent reinforcement learning (MARL) with careful augmentation. A new quality-of-service (QoS) metric is designed to characterize and balance the utility of the two joint tasks, under the effects of varying data volumes and staleness.
[...] X. Zhang are with the School of Computer Science and [...] A previous version appears at IWQoS 2024 as a short paper.
[...] the government agencies in better city management. Notably, ride-hailing vehicles are particularly advantageous for VCS tasks, due to their centralized ride-hailing platform management, which reduces the cost of deploying and executing crowdsensing tasks, and utilizes the data and computing resources from ride-hailing vehicles to maximize the VCS task utilities. [...] model (FM)-powered AI applications have revolutionized numerous aspects of human lives, including healthcare, education, industry, etc. FMs, e.g., BERT, GPT-4, ViT, serve [...]
Each RSU, equipped with a server, stores a complete base model, enabling vehicles to perform real-time fine-tuning as they collect data and transfer the [...] Due to the large volume, data stored in the RSU server can be discarded in a certain period of time. In practice, these data can be descriptive features and feedback (labels) of recommendation or generative AR applications, generated by nearby visitors or residents. They can also be traffic/environment monitoring data with labels generated by [...] The government or any company that collaborates with the ride-hailing vehicle company has multiple types of VCS tasks to fulfill, each of which needs certain locations of data for fine-tuning UFMs.
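The abstract of this entry contrasts two temporal effects: an order becomes invalid at once after a long enough wait, while collected data loses utility gradually as it grows stale. A minimal sketch of that contrast follows. The function names, the hard deadline parameter, and the exponential half-life form are all assumptions made for illustration; the paper's actual QoS metric is not given in this excerpt.

```python
def order_value(wait_minutes, max_wait_minutes, fare):
    """Hard cutoff (assumed form): an order keeps its full value up to the
    deadline and is worth nothing afterwards."""
    return fare if wait_minutes <= max_wait_minutes else 0.0


def data_value(staleness_minutes, base_utility, half_life_minutes):
    """Gradual decay (assumed exponential form): utility for fine-tuning
    halves every half_life_minutes of staleness."""
    return base_utility * 0.5 ** (staleness_minutes / half_life_minutes)


# An order just past its deadline is worthless, while data of the same age
# still retains part of its utility.
for minutes in (0, 10, 11, 30):
    print(minutes, order_value(minutes, 10, 8.0), data_value(minutes, 4.0, 30))
```

The step function and the smooth decay are why a single waiting-time penalty cannot serve both tasks, which is the tension the abstract says its QoS metric is designed to balance.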