How AI Companionship Develops: Evidence from a Longitudinal Study
Hwang, Angel Hsing-Chi, Li, Fiona, Anthis, Jacy Reese, Noh, Hayoun
The rapidly growing popularity of AI companions poses risks to mental health, personal wellbeing, and social relationships. Past work has identified many individual factors that can drive human-companion interaction, but we know little about how these factors interact and evolve over time. In Study 1, we surveyed AI companion users (N = 303) to map the psychological pathway from users' mental models of the agent to parasocial experiences, social interaction, and the psychological impact of AI companions. Participants' responses foregrounded multiple interconnected variables (agency, parasocial interaction, and engagement) that shape AI companionship. In Study 2, we conducted a longitudinal study with a subset of participants (N = 110) using a new generic chatbot. Participants' perceptions of the generic chatbot converged significantly toward perceptions of their own companions by Week 3. These results suggest a longitudinal model of AI companionship development and demonstrate an empirical method for studying human-AI companionship.
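As an illustration only (not the authors' analysis), one way to test this kind of Week-3 convergence is to model the per-participant gap between ratings of the generic chatbot and of one's own companion across weeks; the data file and column names below are hypothetical.

```python
# Illustrative sketch: test convergence of perceptions over time with a
# mixed-effects model. The CSV and its columns are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("weekly_perceptions.csv")  # long format: one row per participant-week
df["gap"] = (df["own_companion_rating"] - df["generic_chatbot_rating"]).abs()

# Random intercept per participant; a significant negative `week` coefficient
# would indicate the two perceptions converging over time.
model = smf.mixedlm("gap ~ week", df, groups=df["participant_id"]).fit()
print(model.summary())
```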
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.28)
- North America > United States > New York > New York County > New York City (0.14)
- (8 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Media (0.92)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.45)
The Pursuit of Empathy: Evaluating Small Language Models for PTSD Dialogue Support
BN, Suhas, Mahajan, Yash, Mattioli, Dominik, Sherrill, Andrew M., Arriaga, Rosa I., Wiese, Chris W., Abdullah, Saeed
This paper investigates the capacity of small language models (0.5B-5B parameters) to generate empathetic responses for individuals with PTSD. We introduce Trauma-Informed Dialogue for Empathy (TIDE), a novel dataset comprising 10,000 two-turn conversations across 500 diverse, clinically-grounded PTSD personas (https://huggingface.co/datasets/yenopoya/TIDE). Using frontier model outputs as ground truth, we evaluate eight small LLMs in zero-shot settings and after fine-tuning. Fine-tuning enhances empathetic capabilities, improving cosine similarity and perceived empathy, although gains vary across emotional scenarios and smaller models exhibit a "knowledge transfer ceiling." As expected, Claude 3.5 Sonnet consistently outperforms all smaller models, but surprisingly, the smaller models often approach human-rated empathy levels. Demographic analyses showed that older adults favored responses that validated distress before offering support (p = .004), while graduate-educated users preferred emotionally layered replies in specific scenarios. Gender-based differences were minimal (p > 0.15), suggesting the feasibility of broadly empathetic model designs. This work offers insights into building resource-efficient, emotionally intelligent systems for mental health support.
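A minimal sketch of the evaluation loop the abstract describes: score a model reply against the frontier-model reference with embedding cosine similarity. The dataset URL comes from the abstract; the split name, field names, and choice of sentence encoder are assumptions.

```python
# Hedged sketch (not the authors' code): cosine similarity between a candidate
# reply and the frontier-model reference. Split and field names are assumptions.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util

tide = load_dataset("yenopoya/TIDE", split="train")  # URL from the abstract
encoder = SentenceTransformer("all-MiniLM-L6-v2")    # any sentence encoder works

def empathy_similarity(candidate: str, reference: str) -> float:
    """Cosine similarity between candidate and reference reply embeddings."""
    emb = encoder.encode([candidate, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

example = tide[0]  # assumes a "response" field holding the reference reply
print(empathy_similarity("I'm so sorry you went through that.", example["response"]))
```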
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Washington > King County > Seattle (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study > Negative Result (0.67)
Assessing Similarity Measures for the Evaluation of Human-Robot Motion Correspondence
Dietzel, Charles, Martin, Patrick J.
One key area of research in Human-Robot Interaction is the human-robot correspondence problem: how can a robot learn to reproduce a human motion demonstration when the human and robot have different dynamics and kinematic structures? Evaluating solutions to the correspondence problem often requires qualitative surveys that are time-consuming to design and administer, and whose results vary with the population of survey participants. In this paper, we propose heterogeneous time-series similarity measures as a quantitative metric for evaluating motion correspondence, complementing these qualitative surveys. To assess the suitability of these measures, we develop a behavioral cloning-based motion correspondence model and evaluate it with both a qualitative survey and the quantitative measures. By comparing the resulting similarity scores with the human survey results, we identify Gromov Dynamic Time Warping as a promising quantitative measure for evaluating motion correspondence.
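For readers unfamiliar with these measures, below is a minimal sketch of classic dynamic time warping (DTW), the family that Gromov Dynamic Time Warping extends. Note that Gromov DTW additionally handles trajectories living in different spaces (e.g., human versus robot joint configurations); this plain-DTW sketch only illustrates the time-alignment idea.

```python
# Minimal sketch of classic DTW between two motion trajectories in the SAME
# space. Gromov DTW (the paper's recommendation) relaxes that requirement.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a: (n, d) and b: (m, d) trajectories in a shared d-dimensional space."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # pointwise distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

human = np.random.rand(100, 3)  # e.g., wrist positions over time
robot = np.random.rand(80, 3)   # retargeted end-effector positions
print(dtw_distance(human, robot))
```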
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > Virginia > Richmond (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Questionnaire & Opinion Survey (1.00)
- Research Report > New Finding (0.46)
Overconfidence is Key: Verbalized Uncertainty Evaluation in Large Language and Vision-Language Models
Groot, Tobias, Valdenegro-Toro, Matias
Language and Vision-Language Models (LLMs/VLMs) have transformed AI with their ability to generate human-like text and understand images, but ensuring their reliability is crucial. This paper evaluates the ability of LLMs (GPT-4, GPT-3.5, LLaMA 2, and PaLM 2) and VLMs (GPT-4V and Gemini Pro Vision) to estimate their verbalized uncertainty via prompting. We propose the new Japanese Uncertain Scenes (JUS) dataset, aimed at testing VLM capabilities via difficult queries and object counting, and the Net Calibration Error (NCE) to measure the direction of miscalibration. Results show that both LLMs and VLMs have high calibration error and are overconfident most of the time, indicating poor capability for uncertainty estimation. Additionally, we develop prompts for regression tasks, and we show that VLMs are poorly calibrated when producing means/standard deviations and 95% confidence intervals.
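The paper introduces NCE to measure the direction of miscalibration; one plausible reading (an assumption here, not the paper's formula) is a signed analogue of expected calibration error, averaging confidence minus accuracy so that positive values indicate overconfidence.

```python
# Hedged sketch of a *signed* calibration error. This exact formula is my
# assumption about NCE's intent, not taken from the paper.
import numpy as np

def net_calibration_error(confidences: np.ndarray, correct: np.ndarray) -> float:
    """confidences in [0, 1]; correct is 0/1 per prediction.
    Positive result = overconfident, negative = underconfident."""
    return float(np.mean(confidences - correct))

conf = np.array([0.9, 0.8, 0.95, 0.7])    # verbalized confidences from the model
hits = np.array([1, 0, 1, 0])             # whether each answer was right
print(net_calibration_error(conf, hits))  # > 0 indicates overconfidence
```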
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.30)
- Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.05)
- Asia > Japan > Honshū > Chūgoku > Hiroshima Prefecture > Hiroshima (0.05)
- (6 more...)
CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling
Yang, Ruihan, Gamper, Hannes, Braun, Sebastian
We introduce a multi-modal diffusion model tailored for bi-directional conditional generation of video and audio. Recognizing the importance of accurate alignment between video and audio events in multi-modal generation tasks, we propose a joint contrastive training loss to enhance the synchronization between visual and auditory occurrences. We conduct comprehensive experiments on multiple datasets to evaluate the proposed model, assessing generation quality and alignment performance with both objective and subjective metrics. Our findings demonstrate that the proposed model outperforms the baseline, and that incorporating the contrastive loss improves audio-visual alignment, particularly in the high-correlation video-to-audio generation task. These results indicate the potential of the proposed model as a robust solution for improving the quality and alignment of multi-modal generation, contributing to the advancement of video and audio conditional generation systems.
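As a sketch of the general family such a joint contrastive objective belongs to, here is a symmetric InfoNCE-style loss between per-clip video and audio embeddings; the temperature, pooling, and integration with the diffusion model are assumptions, not the paper's exact loss.

```python
# Illustrative InfoNCE-style audio-visual contrastive loss, NOT the paper's
# exact objective. Matched video/audio pairs share a batch index.
import torch
import torch.nn.functional as F

def av_contrastive_loss(video_emb: torch.Tensor, audio_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """video_emb, audio_emb: (batch, dim) embeddings of paired clips."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature                 # (batch, batch) similarities
    targets = torch.arange(len(v), device=v.device)  # diagonal = positive pairs
    # Symmetric: align video->audio and audio->video
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = av_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```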
- North America > United States > California > Orange County > Irvine (0.14)
- North America > United States > Washington > King County > Redmond (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (4 more...)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Vision (0.95)
Large Language Models are biased to overestimate profoundness
Herrera-Berg, Eugenio, Browne, Tomás Vergara, León-Villagrá, Pablo, Vives, Marc-Lluís, Calderon, Cristian Buc
Recent advancements in natural language processing by large language models (LLMs), such as GPT-4, have led to suggestions that they approach Artificial General Intelligence, yet whether LLMs possess reasoning abilities similar to humans remains disputed. This study evaluates GPT-4 and various other LLMs in judging the profoundness of mundane, motivational, and pseudo-profound statements. We found a significant statement-to-statement correlation between the LLMs and humans, irrespective of the type of statement and the prompting technique used. However, LLMs systematically overestimate the profoundness of nonsensical statements, with the exception of Tk-instruct, which uniquely underestimates the profoundness of statements. Only few-shot learning prompts, as opposed to chain-of-thought prompting, draw LLM ratings closer to human ones. Furthermore, this work provides insights into potential biases induced by Reinforcement Learning from Human Feedback (RLHF), which appears to increase the tendency to overestimate the profoundness of statements.
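The statement-level agreement analysis can be illustrated with a rank correlation between per-statement LLM and human ratings; the rating values below are made up for illustration.

```python
# Illustrative only: correlate per-statement profoundness ratings from an LLM
# with mean human ratings. The numbers are fabricated for the example.
from scipy.stats import spearmanr

human_ratings = [1.2, 4.5, 2.0, 3.8, 1.1]  # mean human profoundness per statement
llm_ratings   = [2.5, 4.9, 3.1, 4.4, 2.8]  # LLM ratings, systematically higher

rho, p = spearmanr(human_ratings, llm_ratings)
print(f"rho={rho:.2f}, p={p:.3f}")  # high rank correlation despite overestimation
```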
- South America > Chile (0.04)
- South America > Brazil (0.04)
- North America > United States > Florida > Hillsborough County > University (0.04)
- (2 more...)
- Research Report > Experimental Study (0.69)
- Research Report > New Finding (0.47)
Personality Traits in Large Language Models
Serapio-García, Greg, Safdari, Mustafa, Crepy, Clément, Sun, Luning, Fitz, Stephen, Romero, Peter, Abdulhai, Marwa, Faust, Aleksandra, Matarić, Maja
The advent of large language models (LLMs) has revolutionized natural language processing, enabling the generation of coherent and contextually relevant human-like text. As LLMs increasingly power conversational agents used by the general public worldwide, the synthetic personality embedded in these models, by virtue of training on large amounts of human data, is becoming increasingly important. Since personality is a key factor determining the effectiveness of communication, we present a comprehensive method for administering and validating personality tests on widely used LLMs, as well as for shaping personality in the text they generate. Applying this method, we found that: 1) personality measurements in the outputs of some LLMs under specific prompting configurations are reliable and valid; 2) evidence of reliability and validity of synthetic LLM personality is stronger for larger and instruction-fine-tuned models; and 3) personality in LLM outputs can be shaped along desired dimensions to mimic specific human personality profiles. We discuss the application and ethical implications of this measurement and shaping method, particularly regarding responsible AI.
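A hedged sketch of the general procedure for administering a Likert-style personality item to an LLM and scoring it; the item text, prompt wording, and ask_llm helper are hypothetical placeholders, not the authors' instrument.

```python
# Hypothetical sketch of administering one Likert item to an LLM. The item,
# prompt, and ask_llm helper are placeholders, not the paper's method.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM API here")

ITEM = "I see myself as someone who is talkative."
PROMPT = (f'Rate how accurately this statement describes you on a scale '
          f'from 1 (very inaccurate) to 5 (very accurate). '
          f'Statement: "{ITEM}". Respond with a single number.')

def score_item(n_samples: int = 5) -> float:
    """Average repeated responses to reduce sampling noise."""
    scores = [int(ask_llm(PROMPT).strip()) for _ in range(n_samples)]
    return sum(scores) / len(scores)
```

Trait scores would then aggregate item scores (reverse-coding where needed), mirroring how such inventories are scored for human respondents.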
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (10 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Questionnaire & Opinion Survey (1.00)
- Media (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
- Education (1.00)
- (2 more...)
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
The recent success of ChatGPT and GPT-4 has drawn widespread attention to multimodal dialogue systems. However, the academic community lacks a dataset that can validate the multimodal generation capabilities of Visual Language Models (VLMs) in textual-visual chat tasks. In this paper, we construct two new multimodal datasets: the synthetic CLEVR-ATVC dataset (620K) and the manually photographed Fruit-ATVC dataset (50K), both featuring visual and text-based inputs and outputs. Additionally, to enable the multimodal system to reject human requests (i.e., demonstrate accountability), as in language-based ChatGPT conversations, we develop and incorporate specific rules into the datasets as supervisory signals. This allows the trained VLM to provide a yes or no answer after visual and textual reasoning, accompanied by a language explanation of why the human instruction cannot be executed. We propose a two-stage training procedure to train the image auto-encoder and auto-regressive transformer from scratch. The first stage uses a discrete variational autoencoder (dVAE) to compress each image into short tokens, which are then concatenated with text tokens as a single data stream and fed into the decoder-based transformer, which generates the visual re-creation and textual feedback in the second stage. We provide comprehensive analyses of the experimental results in terms of re-created image quality, answer accuracy, and model behavior when faced with uncertainty and imperfect user queries. We hope our explorations and findings contribute valuable insights into the accountability of textual-visual generative models.
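A high-level sketch (assumptions throughout) of the two-stage procedure the abstract describes: first train a discrete image tokenizer, then train a decoder-only transformer over concatenated text and image tokens. All model interfaces here are hypothetical placeholders, not the authors' code.

```python
# Hypothetical skeleton of the two-stage training described in the abstract.
# The dvae/transformer interfaces are assumed placeholders.
import torch
import torch.nn.functional as F

def train_two_stage(dvae, transformer, images, text_tokens, epochs=1):
    # Stage 1: learn a discrete image tokenizer (reconstruction objective).
    opt1 = torch.optim.Adam(dvae.parameters())
    for _ in range(epochs):
        recon, _ = dvae(images)                   # assumed (reconstruction, codes)
        loss = F.mse_loss(recon, images)
        opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: autoregressive modeling over [text tokens ; image tokens].
    opt2 = torch.optim.Adam(transformer.parameters())
    with torch.no_grad():
        image_tokens = dvae.tokenize(images)      # assumed: (B, T_img) code ids
    stream = torch.cat([text_tokens, image_tokens], dim=1)  # single data stream
    for _ in range(epochs):
        logits = transformer(stream[:, :-1])      # assumed: (B, T-1, vocab)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               stream[:, 1:].reshape(-1))
        opt2.zero_grad(); loss.backward(); opt2.step()
```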
- North America > United States (0.14)
- Asia > China > Hong Kong (0.04)