SocialEval: Evaluating Social Intelligence of Large Language Models

arXiv.org Artificial Intelligence

LLMs exhibit promising Social Intelligence (SI) in modeling human behavior, raising the need to evaluate LLMs' SI and their discrepancy with humans. SI equips humans with interpersonal abilities to behave wisely in navigating social interactions to achieve social goals. This presents an operational evaluation paradigm: outcome-oriented goal achievement evaluation and process-oriented interpersonal ability evaluation, which existing work fails to address. To this end, we propose SocialEval, a script-based bilingual SI benchmark, integrating outcome- and process-oriented evaluation by manually crafting narrative scripts. Each script is structured as a world tree that contains plot lines driven by interpersonal ability, providing a comprehensive view of how LLMs navigate social interactions. Experiments show that LLMs fall behind humans on both SI evaluations, exhibit prosociality, and prefer more positive social behaviors, even if they lead to goal failure. Analysis of LLMs' formed representation space and neuronal activations reveals that LLMs have developed ability-specific functional partitions akin to the human brain.


Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions

arXiv.org Artificial Intelligence

As chatbots continue to evolve toward human-like, real-world interactions, multimodality remains an active area of research and exploration. So far, efforts to integrate multimodality into chatbots have primarily focused on image-centric tasks, such as visual dialogue and image-based instructions, placing emphasis on the "eyes" of human perception while neglecting the "ears", namely auditory aspects. Moreover, these studies often center around static interactions that focus on discussing the modality rather than naturally incorporating it into the conversation, which limits the richness of simultaneous, dynamic engagement. Furthermore, while multimodality has been explored in multi-party and multi-session conversations, task-specific constraints have hindered its seamless integration into dynamic, natural conversations. To address these challenges, this study aims to equip chatbots with "eyes and ears" capable of more immersive interactions with humans. As part of this effort, we introduce a new multimodal conversation dataset, Multimodal Multi-Session Multi-Party Conversation ($M^3C$), and propose a novel multimodal conversation model featuring multimodal memory retrieval. Our model, trained on $M^3C$, demonstrates the ability to seamlessly engage in long-term conversations with multiple speakers in complex, real-world-like settings, effectively processing visual and auditory inputs to understand and respond appropriately. Human evaluations highlight the model's strong performance in maintaining coherent and dynamic interactions, demonstrating its potential for advanced multimodal conversational agents.


Designing AI Tools for Clinical Care Teams to Support Serious Illness Conversations with Older Adults in the Emergency Department

arXiv.org Artificial Intelligence

Serious illness conversations (SICs), discussions between clinical care teams and patients with serious, life-limiting illnesses about their values, goals, and care preferences, are critical for patient-centered care. Without these conversations, patients often receive aggressive interventions that may not align with their goals. Clinical care teams face significant barriers when conducting serious illness conversations with older adult patients in Emergency Department (ED) settings, where most older adult patients lack documented treatment goals. To understand current practices and identify AI support opportunities, we conducted interviews with two domain experts and nine ED clinical care team members. Through thematic analysis, we characterized a four-phase serious illness conversation workflow (identification, preparation, conduction, documentation) and identified key needs and challenges at each stage. Clinical care teams struggle with fragmented EHR data access, time constraints, emotional preparation demands, and documentation burdens. While participants expressed interest in AI tools for information synthesis, conversational support, and automated documentation, they emphasized preserving human connection and clinical autonomy. We present design guidelines for AI tools supporting SIC workflows that fit within existing clinical practices. This work contributes empirical understanding of ED-based serious illness conversations and provides design considerations for AI in high-stakes clinical environments.


Reinforcement Learning for Better Verbalized Confidence in Long-Form Generation

arXiv.org Artificial Intelligence

Hallucination remains a major challenge for the safe and trustworthy deployment of large language models (LLMs) in factual content generation. Prior work has explored confidence estimation as an effective approach to hallucination detection, but often relies on post-hoc self-consistency methods that require computationally expensive sampling. Verbalized confidence offers a more efficient alternative, but existing approaches are largely limited to short-form question answering (QA) tasks and do not generalize well to open-ended generation. In this paper, we propose LoVeC (Long-form Verbalized Confidence), an on-the-fly verbalized confidence estimation method for long-form generation. Specifically, we use reinforcement learning (RL) to train LLMs to append numerical confidence scores to each generated statement, serving as a direct and interpretable signal of the factuality of generation. Our experiments consider both on-policy and off-policy RL methods, including DPO, ORPO, and GRPO, to enhance the model calibration. We introduce two novel evaluation settings, free-form tagging and iterative tagging, to assess different verbalized confidence estimation methods. Experiments on three long-form QA datasets show that our RL-trained models achieve better calibration and generalize robustly across domains. Also, our method is highly efficient, as it only requires adding a few tokens to the output being decoded.
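To make the idea of per-statement verbalized confidence concrete, the following is a minimal sketch and not the LoVeC implementation. It assumes a hypothetical tag format (`<confidence>N</confidence>` with N from 0 to 10 after each statement) and uses the Brier score as one common calibration measure; the abstract does not specify the output format or the metrics used, and the names `parse_tagged` and `brier_score` are illustrative.

```python
import re

# Hypothetical tag format (an assumption, not taken from the paper):
# each statement is followed by "<confidence>N</confidence>", N in 0..10.
TAG = re.compile(r"(.*?)\s*<confidence>(\d+)</confidence>", re.S)


def parse_tagged(text):
    """Return (statement, confidence in [0, 1]) pairs from tagged output."""
    return [(s.strip(), int(c) / 10) for s, c in TAG.findall(text)]


def brier_score(confidences, labels):
    """Mean squared gap between stated confidence and 0/1 factuality labels.

    Lower is better; 0.0 means every statement was labeled with certainty
    and every label was right.
    """
    return sum((c - y) ** 2 for c, y in zip(confidences, labels)) / len(labels)


if __name__ == "__main__":
    output = (
        "Paris is in France. <confidence>9</confidence> "
        "The moon is cheese. <confidence>2</confidence>"
    )
    pairs = parse_tagged(output)
    # Factuality labels would come from a separate fact-checking step.
    labels = [1, 0]
    print(pairs)
    print(brier_score([c for _, c in pairs], labels))
```

Because the confidence is read directly from the decoded text, scoring needs no extra sampling passes, which is the efficiency argument the abstract makes for verbalized confidence.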


How to Make AI Faster and Smarter--With a Little Help from Physics

WIRED

The original version of this story appeared in Quanta Magazine. When she was 10 years old, Rose Yu got a birthday present that would change her life--and, potentially, the way we study physics. Her uncle got her a computer. That was a rare commodity in China 25 years ago, and the gift did not go unused. At first, Yu mainly played computer games, but in middle school she won an award for web design.


The Creator of Succession Is Back With a Movie. There's a Reason He Rushed to Make It Right Away.

Slate

Outside an opulent retreat in the mountains of Utah, the world is going to hell. Thanks to disinformation-spreading tools on the world's largest social media platform, people are being executed by bloodthirsty mobs and machine-gunned by their neighbors, politicians assassinated and governments crumbling. But inside Mountainhead, the billionaire tech moguls responsible for the chaos are smoking cigars and shooting the breeze, debating whether the eruption of global chaos is a crisis to be managed or a surge of "creative destruction" that will help usher humanity into a brighter future. If the fictional setting of Mountainhead, the debut feature by Jesse Armstrong, seems a little too close to reality, that's because it's meant to be. The movie, which stars Steve Carell, Jason Schwartzman, Ramy Youssef, and Cory Michael Smith, was conceived, written, cast, shot, edited, and released in about six months, an astonishingly short timeline for any director, let alone a first-timer.


The Real Life Tech Execs That Inspired Jesse Armstrong's Mountainhead

TIME - Tech

Jesse Armstrong loves to pull fictional stories out of reality. His universally acclaimed TV show Succession, for instance, was inspired by real-life media dynasties like the Murdochs and the Hearsts. Mountainhead, which releases on HBO on May 31 at 8 p.m. ET, portrays four top tech executives who retreat to a Utah hideaway as the AI deepfake tools newly released by one of their companies wreak havoc across the world. As the believable deepfakes inflame hatred on social media and real-world violence, the comfortably-appointed quartet mulls a global governmental takeover, intergalactic conquest and immortality, before interpersonal conflict derails their plans. Armstrong tells TIME in a Zoom interview that he first became interested in writing a story about tech titans after reading books like Michael Lewis' Going Infinite (about Sam Bankman-Fried) and Ashlee Vance's Elon Musk: Tesla, SpaceX, and the Quest for a Fantastic Future, as well as journalistic profiles of Peter Thiel, Marc Andreessen, and others. He then built the story around the interplay between four character archetypes--the father, the dynamo, the usurper, and the hanger-on--and conducted extensive research so that his fictional executives reflected real ones.


AIhub monthly digest: May 2025 – materials design, object state classification, and real-time monitoring for healthcare data

AIHub

Welcome to our monthly digest, where you can catch up with any AIhub stories you may have missed, peruse the latest news, recap recent events, and more. This month, we learn about drug and material design using generative models and Bayesian optimization, find out about a system for real-time monitoring for healthcare data, and explore domain-specific distribution shifts in volunteer-collected biodiversity datasets. Ananya Joshi recently completed her PhD, where she developed a system that experts have used for the past two years to identify respiratory outbreaks (like COVID-19) in large-scale healthcare streams across the United States. In this interview, she tells us more about this project, how healthcare applications inspire basic AI research, and her future plans. Onur Boyar is a PhD student at Nagoya University, working on generative models and Bayesian methods for materials and drug design.


Conversational Alignment with Artificial Intelligence in Context

arXiv.org Artificial Intelligence

The development of sophisticated artificial intelligence (AI) conversational agents based on large language models raises important questions about the relationship between human norms, values, and practices and AI design and performance. This article explores what it means for AI agents to be conversationally aligned to human communicative norms and practices for handling context and common ground and proposes a new framework for evaluating developers' design choices. We begin by drawing on the philosophical and linguistic literature on conversational pragmatics to motivate a set of desiderata, which we call the CONTEXT-ALIGN framework, for conversational alignment with human communicative practices. We then suggest that current large language model (LLM) architectures, constraints, and affordances may impose fundamental limitations on achieving full conversational alignment.


Security Benefits and Side Effects of Labeling AI-Generated Images

arXiv.org Artificial Intelligence

Generative artificial intelligence is developing rapidly, impacting humans' interaction with information and digital media. It is increasingly used to create deceptively realistic misinformation, so lawmakers have imposed regulations requiring the disclosure of AI-generated content. However, little is known about whether these labels reduce the risks of AI-generated misinformation. Our work addresses this research gap. Focusing on AI-generated images, we study the implications of labels, including the possibility of mislabeling. Assuming that simplicity, transparency, and trust are likely to impact the successful adoption of such labels, we first qualitatively explore users' opinions and expectations of AI labeling using five focus groups. Second, we conduct a pre-registered online survey with over 1300 U.S. and EU participants to quantitatively assess the effect of AI labels on users' ability to recognize misinformation containing either human-made or AI-generated images. Our focus groups illustrate that, while participants have concerns about the practical implementation of labeling, they consider it helpful in identifying AI-generated images and avoiding deception. However, considering security benefits, our survey revealed an ambiguous picture, suggesting that users might over-rely on labels. While inaccurate claims supported by labeled AI-generated images were rated less credible than those with unlabeled AI-generated images, the belief in accurate claims also decreased when accompanied by a labeled AI-generated image. Moreover, we find the undesired side effect that human-made images conveying inaccurate claims were perceived as more credible in the presence of labels.