
Preliminary Prototyping of Avoidance Behaviors Triggered by a User's Physical Approach to a Robot

Yonezawa, Tomoko, Yamazoe, Hirotake, Fujino, Atsuo, Suhara, Daigo, Tamamoto, Takaya, Nishiguchi, Yuto

arXiv.org Artificial Intelligence

Human-robot interaction frequently involves physical proximity or contact. In human-human settings, people flexibly accept, reject, or tolerate such approaches depending on the relationship and context. We explore the design of a robot's rejective internal state and corresponding avoidance behaviors, such as withdrawing or pushing away, when a person approaches. We model the accumulation and decay of discomfort as a function of interpersonal distance, and implement tolerance (endurance) and limit-exceeding avoidance driven by the Dominance axis of the PAD affect model. The behaviors and their intensities are realized on an arm robot. Results illustrate a coherent pipeline from internal state parameters to graded endurance motions and, once a limit is crossed, to avoidance actions.
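The accumulation/decay dynamic and the limit-triggered switch from endurance to avoidance can be sketched in a few lines. All parameter names and values below (PERSONAL_SPACE, GAIN, DECAY, LIMIT) are illustrative assumptions, not figures from the paper.

```python
# Hypothetical sketch of the discomfort dynamics described above.
PERSONAL_SPACE = 0.5   # metres; approaches closer than this raise discomfort
GAIN = 1.0             # accumulation rate while the user is too close
DECAY = 0.2            # per-second decay toward zero otherwise
LIMIT = 1.0            # crossing this triggers avoidance

def step(discomfort: float, distance: float, dt: float = 0.1) -> float:
    """Accumulate discomfort inside personal space, decay outside it."""
    if distance < PERSONAL_SPACE:
        discomfort += GAIN * (PERSONAL_SPACE - distance) * dt
    else:
        discomfort = max(0.0, discomfort - DECAY * dt)
    return discomfort

def behavior(discomfort: float) -> str:
    """Map the internal state to graded endurance, then avoidance past the limit."""
    if discomfort >= LIMIT:
        return "avoid"       # withdraw or push away
    if discomfort > 0.0:
        return "endure"      # graded tolerance motion
    return "idle"
```

Simulating a user approach is then a loop that feeds the measured interpersonal distance through `step` and dispatches on `behavior` each control tick.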


GPT-5 Doesn't Dislike You--It Might Just Need a Benchmark for Emotional Intelligence

WIRED

Since the all-new ChatGPT launched on Thursday, some users have mourned the disappearance of a peppy and encouraging personality in favor of a colder, more businesslike one (a move seemingly designed to reduce unhealthy user behavior). The backlash shows the challenge of building artificial intelligence systems that exhibit anything like real emotional intelligence. Researchers at MIT have proposed a new kind of AI benchmark to measure how AI systems can manipulate and influence their users--in both positive and negative ways--a move that could help AI builders avoid similar backlashes in the future while also keeping vulnerable users safe. Most benchmarks try to gauge intelligence by testing a model's ability to answer exam questions, solve logic puzzles, or come up with novel answers to knotty math problems. As the psychological impact of AI use becomes more apparent, we may see MIT propose more benchmarks aimed at measuring these subtler aspects of intelligence and of machine-to-human interaction.


'House of the Dragon' Actor's New Horror Game Skewers Hollywood

WIRED

Abubakar Salim has a lot of beef with Hollywood--and he's getting it off his chest in his latest video game. The actor, known for his roles as Alyn of Hull on House of the Dragon and Father in Raised By Wolves, has been balancing his time between the big screen and gaming, two industries that have been affected by a slew of similar issues: long hours, shrinking jobs, abuse of power, and, more recently, the rapid rise of artificial intelligence use and generative AI. Salim's sophomore game, Dead Take, is a story of Hollywood, ambition, and exploitation, dressed up as a horror game that takes aim at his industry's problems, from corruption to AI use. "Hollywood is pure horror," Salim says. Dead Take is a firm departure from his debut game, Tales of Kenzera: Zau.


PersonaX: A Recommendation Agent Oriented User Modeling Framework for Long Behavior Sequence

Shi, Yunxiao, Xu, Wujiang, Zhang, Zeqi, Zi, Xing, Wu, Qiang, Xu, Min

arXiv.org Artificial Intelligence

Recommendation agents leverage large language models for user modeling (LLM-UM) to construct textual personas that guide alignment with real users. However, existing LLM-UM methods struggle with long user-generated content (UGC) due to context limitations and performance degradation. To address this, sampling strategies that prioritize relevance or recency are often applied, yet they inevitably neglect the diverse user interests embedded in the discarded behaviors, resulting in incomplete modeling and degraded profiling quality. Furthermore, relevance-based sampling requires real-time retrieval, forcing the user modeling process to operate online, which introduces significant latency overhead. In this paper, we propose PersonaX, an agent-agnostic LLM-UM framework that tackles these challenges through sub-behavior-sequence (SBS) selection and offline multi-persona construction. PersonaX extracts compact SBS segments offline to capture diverse user interests, generating fine-grained textual personas that are cached for efficient online retrieval. This approach ensures that the user persona used for prompting remains highly relevant to the current context while eliminating the need for online user modeling. For SBS selection, we ensure both efficiency (length less than five) and high representational quality by balancing prototypicality and diversity within the sampled data. Extensive experiments validate the effectiveness and versatility of PersonaX for high-quality user profiling. Using only 30 to 50 percent of the behavioral data, with a sequence length of 480, integrating PersonaX with AgentCF yields an absolute performance improvement of 3 to 11 percent, while integration with Agent4Rec results in a gain of 10 to 50 percent. As an agent-agnostic framework, PersonaX sets a new benchmark for scalable user modeling, paving the way for more accurate and efficient LLM-driven recommendation agents.
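One plausible reading of the prototypicality-versus-diversity trade-off in SBS selection is a greedy pick over embedded behaviors: at most four items (length less than five), each scored by closeness to the centroid of all behavior embeddings plus distance to items already chosen. The 0.5 trade-off weight and the Euclidean scoring are illustrative assumptions, not PersonaX's actual algorithm.

```python
import numpy as np

def select_sbs(emb: np.ndarray, k: int = 4, alpha: float = 0.5) -> list[int]:
    """Greedy SBS selection sketch: balance prototypicality and diversity."""
    centroid = emb.mean(axis=0)
    proto = -np.linalg.norm(emb - centroid, axis=1)   # higher = more typical
    chosen: list[int] = []
    for _ in range(min(k, len(emb))):
        best, best_score = -1, -np.inf
        for i in range(len(emb)):
            if i in chosen:
                continue
            # Diversity: distance to the nearest already-selected behavior.
            div = min((np.linalg.norm(emb[i] - emb[j]) for j in chosen),
                      default=0.0)
            score = alpha * proto[i] + (1 - alpha) * div
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen
```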


Disentangling Likes and Dislikes in Personalized Generative Explainable Recommendation

Shimizu, Ryotaro, Wada, Takashi, Wang, Yu, Kruse, Johannes, O'Brien, Sean, HtaungKham, Sai, Song, Linxin, Yoshikawa, Yuya, Saito, Yuki, Tsung, Fugee, Goto, Masayuki, McAuley, Julian

arXiv.org Artificial Intelligence

Recent research on explainable recommendation generally frames the task as a standard text generation problem, and evaluates models simply based on the textual similarity between the predicted and ground-truth explanations. However, this approach fails to consider one crucial aspect of the systems: whether their outputs accurately reflect the users' (post-purchase) sentiments, i.e., whether and why they would like and/or dislike the recommended items. To shed light on this issue, we introduce new datasets and evaluation methods that focus on the users' sentiments. Specifically, we construct the datasets by explicitly extracting users' positive and negative opinions from their post-purchase reviews using an LLM, and propose to evaluate systems based on whether the generated explanations 1) align well with the users' sentiments, and 2) accurately identify both positive and negative opinions of users on the target items. We benchmark several recent models on our datasets and demonstrate that achieving strong performance on existing metrics does not ensure that the generated explanations align well with the users' sentiments. Lastly, we find that existing models can provide more sentiment-aware explanations when the users' (predicted) ratings for the target items are directly fed into the models as input. We will release our code and datasets upon acceptance.
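The second evaluation criterion--whether an explanation identifies both positive and negative opinions--can be framed as set overlap between the opinions mentioned in the generated explanation and those extracted from the review. The exact-string matching below is a stand-in assumption; the paper's actual matching procedure may differ.

```python
def opinion_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 over opinion phrases; defined as 1.0 when both sets are empty."""
    if not predicted and not gold:
        return 1.0
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # opinions the explanation got right
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Scoring positive and negative opinion sets separately, then averaging, would reward explanations that cover both sides of the user's sentiment rather than only the praise.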


FuseChat: Knowledge Fusion of Chat Models

Wan, Fanqi, Zhong, Longguang, Yang, Ziyi, Chen, Ruijun, Quan, Xiaojun

arXiv.org Artificial Intelligence

While training large language models (LLMs) from scratch can indeed lead to models with distinct capabilities and strengths, it incurs substantial costs and may lead to redundancy in competencies. Knowledge fusion aims to integrate existing LLMs of diverse architectures and capabilities into a more potent LLM through lightweight continual training, thereby reducing the need for costly LLM development. In this work, we propose a new framework for the knowledge fusion of chat LLMs through two main stages, resulting in FuseChat. Firstly, we conduct pairwise knowledge fusion on source chat LLMs of varying structures and scales to create multiple target LLMs with identical structure and size via lightweight fine-tuning. During this process, a statistics-based token alignment approach is introduced as the cornerstone for fusing LLMs with different structures. Secondly, we merge these target LLMs within the parameter space, where we propose a novel method for determining the merging coefficients based on the magnitude of parameter updates before and after fine-tuning. We implement and validate FuseChat using six prominent chat LLMs with diverse architectures and scales, including OpenChat-3.5-7B, Starling-LM-7B-alpha, NH2-SOLAR-10.7B, InternLM2-Chat-20B, Mixtral-8x7B-Instruct, and Qwen-1.5-Chat-72B. Experimental results on two instruction-following benchmarks, AlpacaEval 2.0 and MT-Bench, demonstrate the superiority of FuseChat-7B over baselines of various sizes. Our model is even comparable to the larger Mixtral-8x7B-Instruct and approaches GPT-3.5-Turbo-1106 on MT-Bench. Our code, model weights, and data are public at \url{https://github.com/fanqiwan/FuseAI}.
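The second-stage merging idea--coefficients driven by how far each target model's parameters moved during fine-tuning--can be sketched as follows. Normalizing update magnitudes into weights is an illustrative reading of the approach, not FuseChat's exact formula.

```python
import numpy as np

def merge(before: list[np.ndarray], after: list[np.ndarray]) -> np.ndarray:
    """Merge fine-tuned models, weighting each by its parameter-update magnitude."""
    # Magnitude of each model's update: ||theta_after - theta_before||.
    mags = np.array([np.linalg.norm(a - b) for a, b in zip(after, before)])
    coeffs = mags / mags.sum()                  # normalize to sum to 1
    # Weighted average of the fine-tuned parameters.
    return sum(c * a for c, a in zip(coeffs, after))
```

In practice this would run per tensor across the identically-shaped target LLMs produced by the pairwise-fusion stage.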


Interpretable User Satisfaction Estimation for Conversational Systems with Large Language Models

Lin, Ying-Chun, Neville, Jennifer, Stokes, Jack W., Yang, Longqi, Safavi, Tara, Wan, Mengting, Counts, Scott, Suri, Siddharth, Andersen, Reid, Xu, Xiaofeng, Gupta, Deepak, Jauhar, Sujay Kumar, Song, Xia, Buscher, Georg, Tiwary, Saurabh, Hecht, Brent, Teevan, Jaime

arXiv.org Artificial Intelligence

Accurate and interpretable user satisfaction estimation (USE) is critical for understanding, evaluating, and continuously improving conversational systems. Users express their satisfaction or dissatisfaction with diverse conversational patterns in both general-purpose (ChatGPT and Bing Copilot) and task-oriented (customer service chatbot) conversational systems. Existing approaches based on featurized ML models or text embeddings fall short in extracting generalizable patterns and are hard to interpret. In this work, we show that LLMs can extract interpretable signals of user satisfaction from their natural language utterances more effectively than embedding-based approaches. Moreover, an LLM can be tailored for USE via an iterative prompting framework using supervision from labeled examples. The resulting method, Supervised Prompting for User satisfaction Rubrics (SPUR), not only has higher accuracy but is more interpretable as it scores user satisfaction via learned rubrics with a detailed breakdown.


Embedding-Aligned Language Models

Tennenholtz, Guy, Chow, Yinlam, Hsu, Chih-Wei, Shani, Lior, Liang, Ethan, Boutilier, Craig

arXiv.org Artificial Intelligence

We propose a novel approach for training large language models (LLMs) to adhere to objectives defined within a latent embedding space. Our method leverages reinforcement learning (RL), treating a pre-trained LLM as an environment. Our embedding-aligned guided language (EAGLE) agent is trained to iteratively steer the LLM's generation towards optimal regions of the latent embedding space, w.r.t. some predefined criterion. We demonstrate the effectiveness of the EAGLE agent using the MovieLens 25M dataset to surface content gaps that satisfy latent user demand. We also demonstrate the benefit of using an optimal design of a state-dependent action set to improve EAGLE's efficiency. Our work paves the way for controlled and grounded text generation using LLMs, ensuring consistency with domain-specific knowledge and data representations.


Measuring Strategization in Recommendation: Users Adapt Their Behavior to Shape Future Content

Cen, Sarah H., Ilyas, Andrew, Allen, Jennifer, Li, Hannah, Madry, Aleksander

arXiv.org Artificial Intelligence

Most modern recommendation algorithms are data-driven: they generate personalized recommendations by observing users' past behaviors. A common assumption in recommendation is that how a user interacts with a piece of content (e.g., whether they choose to "like" it) is a reflection of the content, but not of the algorithm that generated it. Although this assumption is convenient, it fails to capture user strategization: that users may attempt to shape their future recommendations by adapting their behavior to the recommendation algorithm. In this work, we test for user strategization by conducting a lab experiment and survey. To capture strategization, we adopt a model in which strategic users select their engagement behavior based not only on the content, but also on how their behavior affects downstream recommendations. Using a custom music player that we built, we study how users respond to different information about their recommendation algorithm as well as to different incentives about how their actions affect downstream outcomes. We find strong evidence of strategization across outcome metrics, including participants' dwell time and use of "likes." For example, participants who are told that the algorithm mainly pays attention to "likes" and "dislikes" use those functions 1.9x more than participants told that the algorithm mainly pays attention to dwell time. A close analysis of participant behavior (e.g., in response to our incentive conditions) rules out experimenter demand as the main driver of these trends. Further, in our post-experiment survey, nearly half of participants self-report strategizing "in the wild," with some stating that they ignore content they actually like to avoid over-recommendation of that content in the future. Together, our findings suggest that user strategization is common and that platforms cannot ignore the effect of their algorithms on user behavior.


Dynamic pricing with Bayesian updates from online reviews

Correa, José, Mari, Mathieu, Xia, Andrew

arXiv.org Artificial Intelligence

As a key part of modern online platforms, online decision-making plays a crucial role in a variety of settings, particularly on the Internet. Two landmark examples that have been widely studied are dynamic pricing and online reviews. Online review systems constitute powerful platforms for users to get informed about a product and for the firm to understand how a given market is receiving it. The study of these systems has been vast over the last two decades [6, 10], and more recently, modeling simple like/dislike reviews as bandit problems has become standard [1, 2, 3, 13, 16, 18]. Dynamic pricing, on the other hand, is an active area of research in economics, computer science, and operations research [12, 14], and has become common practice in several industries such as transportation and retail. There has been growing interest in combining the two areas as a way to design more effective pricing mechanisms that gather information from current reviews to update prices and make the product more attractive [5, 11, 17]. In particular, [5] considers social learning with non-Bayesian agents in a market with like/dislike reviews, and the resulting pricing decision of a monopolist.
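The Bayesian-update-from-reviews idea can be illustrated with a seller who maintains a Beta posterior over the probability that a buyer likes the product and reprices from the posterior mean. The pricing rule (price proportional to estimated quality) is an assumption for illustration, not the mechanism analyzed in the cited papers.

```python
class Seller:
    """Tracks a Beta posterior over product quality from like/dislike reviews."""

    def __init__(self, base_price: float = 10.0):
        self.likes, self.dislikes = 1, 1   # Beta(1, 1) uniform prior
        self.base_price = base_price

    def observe(self, liked: bool) -> None:
        """Conjugate update: add one pseudo-count per review."""
        if liked:
            self.likes += 1
        else:
            self.dislikes += 1

    def quality_estimate(self) -> float:
        return self.likes / (self.likes + self.dislikes)  # posterior mean

    def price(self) -> float:
        return self.base_price * self.quality_estimate()
```

After three "like" reviews under the uniform prior, the posterior is Beta(4, 1), so the seller's quality estimate rises to 0.8 and the price adjusts accordingly.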