Spiral of Silence in Large Language Model Agents

Zhong, Mingze, Fang, Meng, Shi, Zijing, Huang, Yuxuan, Zheng, Shunfeng, Du, Yali, Chen, Ling, Wang, Jun

arXiv.org Artificial Intelligence

The Spiral of Silence (SoS) theory holds that individuals with minority views often refrain from speaking out for fear of social isolation, enabling majority positions to dominate public discourse. When the 'agents' are large language models (LLMs), however, the classical psychological explanation is not directly applicable, since SoS was developed for human societies. This raises a central question: can SoS-like dynamics nevertheless emerge from purely statistical language generation in LLM collectives? We propose an evaluation framework for examining SoS in LLM agents. Specifically, we consider four controlled conditions that systematically vary the availability of 'History' and 'Persona' signals. Opinion dynamics are assessed using trend tests such as Mann-Kendall and Spearman's rank, along with concentration measures including kurtosis and interquartile range. Experiments across open-source and closed-source models show that history and persona together produce strong majority dominance and replicate SoS patterns; history signals alone induce strong anchoring; and persona signals alone foster diverse but uncorrelated opinions, indicating that without historical anchoring, SoS dynamics cannot emerge. The work bridges computational sociology and responsible AI design, highlighting the need to monitor and mitigate emergent conformity in LLM-agent systems.
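The trend and concentration measures named above can be sketched as follows on a synthetic opinion series; the 1-to-5 opinion scale and the toy data are assumptions for illustration, not the paper's setup. The Mann-Kendall S statistic is implemented directly from its pairwise-sign definition.

```python
# Sketch of the trend and concentration measures named in the abstract,
# applied to a synthetic opinion series (the 1-5 scale is an assumption).
import numpy as np
from scipy import stats

def mann_kendall_s(x):
    """Mann-Kendall S statistic: sum of signs of all pairwise differences.
    S >> 0 suggests an increasing trend, S << 0 a decreasing one."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return sum(np.sign(x[j] - x[i]) for i in range(n) for j in range(i + 1, n))

def opinion_stats(series):
    t = np.arange(len(series))
    rho, p = stats.spearmanr(t, series)       # monotonic trend over time
    return {
        "mann_kendall_S": mann_kendall_s(series),
        "spearman_rho": rho,
        "spearman_p": p,
        "kurtosis": stats.kurtosis(series),   # excess kurtosis (concentration)
        "iqr": stats.iqr(series),             # interquartile range (spread)
    }

# A series drifting toward a majority position shows a positive trend:
drifting = [2, 2, 3, 3, 3, 4, 4, 4, 4, 4]
print(opinion_stats(drifting))
```

Under this kind of analysis, majority dominance would show up as a significant monotonic trend combined with shrinking dispersion of expressed opinions.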



Datasets for Navigating Sensitive Topics in Recommendation Systems

Kovacs, Amelia, Chee, Jerry, Kazemian, Kimia, Dean, Sarah

arXiv.org Artificial Intelligence

Personalized AI systems, from recommendation systems to chatbots, are a prevalent method for distributing content to users based on their learned preferences. However, there is growing concern about the adverse effects of these systems, including their potential tendency to expose users to sensitive or harmful material, negatively impacting overall well-being. To address this concern quantitatively, it is necessary to create datasets with relevant sensitivity labels for content, enabling researchers to evaluate personalized systems beyond mere engagement metrics. To this end, we introduce two novel datasets that include a taxonomy of sensitivity labels alongside user-content ratings: one that integrates MovieLens rating data with content warnings from the Does the Dog Die? community ratings website, and another that combines fan-fiction interaction data and user-generated warnings from Archive of Our Own.
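A join of this kind, attaching sensitivity labels to rating events, could look like the following sketch; the column names (`user_id`, `item_id`, `rating`, `warning`) and values are illustrative assumptions, not the actual schema of either dataset.

```python
# Hypothetical sketch of joining user-content ratings with sensitivity
# labels; the schema below is illustrative, not the datasets' actual one.
import pandas as pd

ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "item_id": [10, 11, 10],
    "rating":  [4.0, 3.5, 5.0],
})
warnings = pd.DataFrame({
    "item_id": [10, 11],
    "warning": [["violence"], []],   # list of content warnings per item
})

# Attach sensitivity labels to each rating event, so evaluation can go
# beyond engagement (e.g., share of interactions with flagged content).
labeled = ratings.merge(warnings, on="item_id", how="left")
exposure_rate = labeled["warning"].map(bool).mean()
print(labeled)
print(f"share of interactions with flagged content: {exposure_rate:.2f}")
```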


Social Influence Distorts Ratings in Online Interfaces

Kontalexi, Marina, Gelastopoulos, Alexandros, Analytis, Pantelis P.

arXiv.org Artificial Intelligence

Theoretical work on sequential choice and large-scale experiments in online ranking and voting systems have demonstrated that social influence can have a drastic impact on social and technological systems. Yet, the effect of social influence on online rating systems remains understudied, and the few existing contributions suggest that online ratings would self-correct given enough users. Here, we propose a new framework for studying the effect of social influence on online ratings. We start from the assumption that people are influenced linearly by the observed average rating, but postulate that their propensity to be influenced varies. When the weight people assign to the observed average depends only on their own latent rating, the resulting system is linear, but the long-term rating may substantially deviate from the true mean rating. When the weight people put on the observed average depends on both their own latent rating and the observed average rating, the resulting system is non-linear and may support multiple equilibria, suggesting that ratings might be path-dependent and deviations dramatic. Our results highlight potential limitations in crowdsourced information aggregation and can inform the design of more robust online rating systems.
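The linear case described above admits a minimal simulation: each arriving rater reports a blend of their latent rating and the running average they observe, with a conformity weight that depends on their own latent rating. The specific weight function and normal latent distribution below are illustrative assumptions, not the paper's model.

```python
# Minimal simulation of the linear influence case: the conformity weight
# depends only on the rater's own latent rating. Weight function and
# latent distribution are illustrative assumptions.
import random

def influence_weight(latent):
    # Assumed: raters with below-average latent opinions conform more.
    return 0.7 if latent < 3.0 else 0.1

def simulate(n=50000, true_mean=3.0, sd=1.0, seed=1):
    rng = random.Random(seed)
    total, count, avg = 0.0, 0, true_mean
    for _ in range(n):
        latent = rng.gauss(true_mean, sd)
        w = influence_weight(latent)
        total += (1 - w) * latent + w * avg   # reported = blend of the two
        count += 1
        avg = total / count                   # average shown to the next rater
    return avg

# The long-run average drifts above the true mean of 3.0, because
# low-rating users partially suppress their own opinion:
print(simulate())
```

With this asymmetric weighting, the fixed point of the average lies near 3.4 rather than the true mean 3.0, illustrating how a purely linear system can still produce a persistent distortion.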


A Measure of the System Dependence of Automated Metrics

von Däniken, Pius, Deriu, Jan, Cieliebak, Mark

arXiv.org Artificial Intelligence

Automated metrics for Machine Translation have made significant progress, with the goal of replacing expensive and time-consuming human evaluations. These metrics are typically assessed by their correlation with human judgments, which captures the monotonic relationship between human and metric scores. However, we argue that it is equally important to ensure that metrics treat all systems fairly and consistently. In this paper, we introduce a method to evaluate this aspect.
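The gap the abstract points at can be seen in a toy example: a metric can agree perfectly with human judgments in rank order while systematically under-scoring one system. The numbers below are fabricated purely for illustration.

```python
# Toy illustration: perfect rank correlation with human judgments can
# coexist with a large per-system bias. All numbers are fabricated.
from scipy.stats import spearmanr

human  = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
metric = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]  # last three: outputs of "system B"
systems = ["A", "A", "A", "B", "B", "B"]

rho, _ = spearmanr(human, metric)
print(f"overall Spearman rho = {rho:.2f}")   # perfect monotonic agreement

# Per-system bias (metric minus human), invisible to the correlation:
for s in ("A", "B"):
    diffs = [m - h for m, h, sy in zip(metric, human, systems) if sy == s]
    print(f"system {s}: mean bias = {sum(diffs) / len(diffs):+.2f}")
```

Here the overall rank correlation is 1.0, yet system B is under-scored by 0.3 on average, which is exactly the kind of unequal treatment a correlation-only assessment cannot detect.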


End-to-end Training for Recommendation with Language-based User Profiles

Gao, Zhaolin, Zhou, Joyce, Dai, Yijia, Joachims, Thorsten

arXiv.org Artificial Intelligence

Many online platforms maintain user profiles for personalization. Unfortunately, these profiles are typically not interpretable or easily modifiable by the user. To remedy this shortcoming, we explore natural language-based user profiles, as they promise enhanced transparency and scrutability of recommender systems. While existing work has shown that language-based profiles from standard LLMs can be effective, such generalist LLMs are unlikely to be optimal for this task. In this paper, we introduce LangPTune, the first end-to-end learning method for training LLMs to produce language-based user profiles that optimize recommendation effectiveness. Through comprehensive evaluations of LangPTune across various training configurations and benchmarks, we demonstrate that our approach significantly outperforms existing profile-based methods. In addition, it approaches performance levels comparable to state-of-the-art, less transparent recommender systems, providing a robust and interpretable alternative to conventional systems. Finally, we validate the relative interpretability of these language-based user profiles through user studies involving crowdworkers and GPT-4-based evaluations. Implementation of LangPTune can be found at https://github.com/ZhaolinGao/LangPTune.


Performance of Recent Large Language Models for a Low-Resourced Language

Jayakody, Ravindu, Dias, Gihan

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have shown significant advances in the past year. In addition to new versions of GPT and Llama, several other LLMs have been introduced recently. Some of these are open models available for download and modification. Although multilingual large language models have been available for some time, their performance on low-resourced languages such as Sinhala has been poor. We evaluated four recent LLMs on their performance directly in the Sinhala language, and by translation to and from English. We also evaluated their fine-tunability with a small amount of fine-tuning data. Claude and GPT-4o perform well out-of-the-box and do significantly better than previous versions. Llama and Mistral perform poorly but show some promise of improvement with fine-tuning.


Understanding Subjectivity through the Lens of Motivational Context in Model-Generated Image Satisfaction

Dutta, Senjuti, Chen, Sherol, Mak, Sunny, Ahmad, Amnah, Collins, Katherine, Butryna, Alena, Ramachandran, Deepak, Dvijotham, Krishnamurthy, Pavlick, Ellie, Rajakumar, Ravi

arXiv.org Artificial Intelligence

Image generation models are poised to become ubiquitous in a range of applications. These models are often fine-tuned and evaluated using human quality judgments that assume a universal standard, failing to consider the subjectivity of such tasks. To investigate how to quantify subjectivity, and the scale of its impact, we measure how assessments differ among human annotators across different use cases. Simulating the effects of ordinarily latent elements of annotators' subjectivity, we contrive a set of motivations (t-shirt graphics, presentation visuals, and phone background images) to contextualize a set of crowdsourcing tasks. Our results show that human evaluations of images vary within individual contexts and across combinations of contexts. Three key factors affecting this subjectivity are image appearance, image alignment with text, and representation of objects mentioned in the text. Our study highlights the importance of taking individual users and contexts into account, both when building and evaluating generative models.


Performance rating in chess, tennis, and other contexts

Ismail, Mehmet S.

arXiv.org Artificial Intelligence

In this note, I introduce Estimated Performance Rating (PR$^e$), a novel system for evaluating player performance in sports and games. PR$^e$ addresses a key limitation of the Tournament Performance Rating (TPR) system, which is undefined for zero or perfect scores in a series of games. PR$^e$ is defined as the rating that solves an optimization problem related to scoring probability, making it applicable for any performance level. The main theorem establishes that the PR$^e$ of a player is equivalent to the TPR whenever the latter is defined. I then apply this system to historically significant win-streaks in association football, tennis, and chess. Beyond sports, PR$^e$ has broad applicability in domains where Elo ratings are used, from college rankings to the evaluation of large language models.
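The limitation PR$^e$ addresses can be made concrete with a sketch of the classical TPR: the rating at which the Elo-expected total score against the given opponents equals the achieved score, found here by bisection. The standard Elo expected-score formula is used; the PR$^e$ optimization itself is not reproduced here, since the abstract does not specify it.

```python
# Sketch of classical Tournament Performance Rating (TPR) via bisection.
# For a zero or perfect score no finite rating solves the equation,
# which is the gap PR^e is designed to fill.

def elo_expected(r, opp):
    """Standard Elo expected score of a player rated r against opp."""
    return 1.0 / (1.0 + 10 ** ((opp - r) / 400))

def tpr(opponents, score, lo=-4000.0, hi=8000.0, iters=100):
    """Rating at which expected total score equals the achieved score;
    returns None for zero or perfect scores (no finite solution)."""
    if score <= 0 or score >= len(opponents):
        return None
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sum(elo_expected(mid, o) for o in opponents) < score:
            lo = mid   # expected score too low: rating must be higher
        else:
            hi = mid
    return (lo + hi) / 2

opps = [2700, 2650, 2600]
print(round(tpr(opps, 1.5)))   # 2650: a 50% score against a 2650-average field
print(tpr(opps, 3.0))          # None: perfect score, TPR undefined
```

The `None` branch is exactly the undefined case the abstract describes: as the score approaches zero or perfect, the solving rating diverges to minus or plus infinity.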