 persona


Unsupervised Identification and Removal of Spurious Correlations During Fine-Tuning

arXiv.org Machine Learning

Fine-tuning a pretrained language model on a curated dataset can produce spurious correlations between the fine-tuning task and unintended latent factors -- such as misaligned personas or political slant -- that the curation procedure has entangled with the task. The model can latch onto these spurious correlations, leading to bias and reduced out-of-distribution generalisation. We prove that under reasonable assumptions on task complexity and the spurious correlation, such latent factors can be identified, without supervision, from the weights of a naive LoRA fine-tune. Existing approaches to removing bias, such as activation steering, remove identified factors from residual-stream activations, either at inference or during training. We argue, however, that the goal should be to remove the spurious correlation, not the latent factor itself, as the pretrained model may rely on it for genuine task signal. To enable this, we propose GRASP (GRadient projection of Associated Spurious Patterns), which prevents the model from acquiring new reliance on the identified latent factor while preserving any pretrained content along it. We validate on three fine-tuning tasks. The first two involve emergent misalignment, where fine-tuning on a narrow task -- in our case, writing insecure code and giving bad medical advice -- leads to misaligned responses on unrelated topics. Here our method completely removes misalignment in the insecure code case and reduces it by ~5x in the bad medical advice case, beating all baselines in the trade-off between misalignment reduction and task preservation. The last is a novel political-bias experiment, where fine-tuning on right-skewed Reddit financial-advice data causes political-lean drift on unrelated topics. Here our method reduces drift by more than half while improving financial task performance, beating all baselines.
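The abstract above does not spell out how the projection is computed, so the following is only a generic sketch of orthogonal gradient projection, not the authors' GRASP implementation. It assumes a single, already identified direction in parameter space; the names `project_out`, `spurious`, and `lr` are hypothetical. It illustrates why a projected update adds nothing along that direction while leaving whatever the weights already contain along it untouched.

```python
import numpy as np

def project_out(grad, direction):
    """Return `grad` with its component along `direction` removed."""
    d = direction / np.linalg.norm(direction)  # unit vector along the direction
    return grad - np.dot(grad, d) * d

# Toy values, chosen only for illustration (assumptions, not from the paper).
weights = np.array([1.0, 5.0, -2.0])   # stand-in for pretrained parameters
grad = np.array([3.0, 4.0, 0.0])       # stand-in for a fine-tuning gradient
spurious = np.array([0.0, 2.0, 0.0])   # stand-in for an identified latent direction
lr = 0.1

projected = project_out(grad, spurious)
new_weights = weights - lr * projected

# The update has no component along the direction, so the weights' existing
# content along it is unchanged by the step.
before = np.dot(weights, spurious)
after = np.dot(new_weights, spurious)
```

In this toy setting `before` and `after` agree, whereas an unprojected step `weights - lr * grad` would shift the weights along `spurious`.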


When Can Digital Personas Reliably Approximate Human Survey Findings?

arXiv.org Machine Learning

Digital personas powered by Large Language Models (LLMs) are increasingly proposed as substitutes for human survey respondents, yet it remains unclear when they can reliably approximate human survey findings. We answer this question using the LISS panel, constructing personas from respondents' background variables and pre-2023 survey histories, then testing them against the same respondents' held-out post-cutoff answers. Across four persona architectures, three LLMs, and two prediction tasks, we assess performance at the question, respondent, distributional, equity, and clustering levels. Digital personas improve alignment with human response distributions, especially in domains tied to stable attributes and values, but remain limited for individual prediction and fail to recover multivariate respondent structure. Retrieval-augmented architectures provide the clearest gains, but performance depends more on human response structure than on model choice: personas perform best for low-variability questions and common respondent patterns, and worst for subjective, heterogeneous, or rare responses. Our results provide practical guidance on when digital personas could be appropriate for survey research and when human validation remains necessary.


MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks

Neural Information Processing Systems

Human commonsense understanding of the physical and social world is organized around intuitive theories. These theories support making causal and moral judgments. When something bad happens, we naturally ask: who did what, and why? A rich literature in cognitive science has studied people's causal and moral intuitions. This work has revealed a number of factors that systematically influence people's judgments, such as the violation of norms and whether the harm is avoidable or inevitable.


In-Context Impersonation Reveals Large Language Models' Strengths and Biases

Neural Information Processing Systems

In everyday conversations, humans can take on different roles and adapt their vocabulary to their chosen roles. We explore whether LLMs can take on, that is impersonate, different roles when they generate text in-context. We ask LLMs to assume different personas before solving vision and language tasks. We do this by prefixing the prompt with a persona that is associated either with a social identity or domain expertise. In a multi-armed bandit task, we find that LLMs pretending to be children of different ages recover human-like developmental stages of exploration. In a language-based reasoning task, we find that LLMs impersonating domain experts perform better than LLMs impersonating non-domain experts.
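As a rough illustration of the persona-prefixing idea described above, the sketch below builds prompts by prepending a persona line to a task. The template wording, the function name `build_prompt`, and the example personas are assumptions made for illustration; the abstract does not give the exact prompts used.

```python
def build_prompt(persona: str, task: str) -> str:
    """Prefix a task prompt with a persona instruction (hypothetical template)."""
    return f"You are {persona}.\n{task}"

# Example personas covering an age-based identity and domain expertise.
personas = ["a 4 year old", "an ornithologist"]
task = "Describe a hummingbird in one sentence."

prompts = [build_prompt(p, task) for p in personas]
```

Each resulting string would be sent to the model as-is, so the only thing varied between conditions is the persona line.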


IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering

Neural Information Processing Systems

To evaluate Large Language Models (LLMs) for question answering (QA), traditional methods typically focus on directly assessing the immediate responses generated by the models based on the given question and context. In the common use case of humans seeking an AI assistant's help in finding information, these non-interactive evaluations do not account for the dynamic nature of human-model conversations, and interaction-aware evaluations have shown that accurate models are not necessarily preferred by humans (Lee et al.). Recent works in human-computer interaction (HCI) have employed human evaluators to conduct interactions and evaluations, but they are often prohibitively expensive and time-consuming to scale. In this work, we introduce IQA-EVAL, an automated evaluation framework for Interactive Question Answering. More specifically, we introduce an LLM-based Evaluation Agent (LEA) that can: (1) simulate human behaviors to generate interactions with IQA models; (2) automatically evaluate the generated interactions. Moreover, we propose assigning personas to LEAs to better simulate groups of real human evaluators. We show that: (1) our evaluation framework with GPT-4 (or Claude) as the backbone model achieves a high correlation with human evaluations on the IQA task; (2) assigning personas to LEAs to better represent the crowd further significantly improves correlations. Finally, we use our automated metric to evaluate five recent LLMs with over 1000 questions from complex and ambiguous question answering tasks, which would cost $5k if evaluated by humans.


Grammarly pulls AI author-impersonation tool after backlash

BBC News

Writing tool Grammarly has disabled an AI feature which mimicked personas of prominent writers, including Stephen King and scientist Carl Sagan, following a backlash from people impersonated. The Expert Review function, which offered writing feedback inspired by the styles of famous authors and academics, was taken down this week by Superhuman, the tech firm which runs Grammarly. The feature was met with resistance, including a multi-million dollar lawsuit, from writers who found their names and reputations used as AI personas without their consent. Shishir Mehrotra, the firm's chief executive, apologised on LinkedIn, acknowledging the tool had misrepresented the voices of experts. Investigative journalist Julia Angwin, a New York Times contributing opinion writer, is the lead plaintiff in a class-action lawsuit filed against Superhuman and Grammarly in the Southern District of New York.





Supplementary Materials: In-Context Impersonation Reveals Large Language Models' Strengths and Biases

Neural Information Processing Systems

In these supplementary materials we show additional results mentioned in the main paper. First, we give experimental details in Section A. Next, we show results for Llama 2 on the bandit task in Section B. Afterwards, we show in Section C.1 additional quantitative results for the expertise-based [...] Section D provides additional details about the vision and language tasks. For more details on the code please refer to the README.md. [...] (Section A.1) and the amount of compute required to reproduce our experiments (Section A.2). A.1 Prompt variations generated by meta-prompting [...] For all Vicuna-13B based experiments (bandit, reasoning and vision) we used a single Nvidia A100-40GB GPU. Work done whilst visiting University of Tübingen. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).