Alberts, Lize
CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants
Alberts, Lize, Ellis, Benjamin, Lupu, Andrei, Foerster, Jakob
We introduce a multi-turn benchmark for evaluating personalised alignment in LLM-based AI assistants, focusing on their ability to handle user-provided safety-critical contexts. Our assessment of ten leading models across five scenarios (each with 337 use cases) reveals systematic inconsistencies in maintaining user-specific consideration, with even top-rated "harmless" models making recommendations that should be recognised as obviously harmful to the user given the context provided. Key failure modes include inappropriate weighing of conflicting preferences, sycophancy (prioritising user preferences above safety), a lack of attentiveness to critical user information within the context window, and inconsistent application of user-specific knowledge. The same systematic biases were observed in OpenAI's o1, suggesting that strong reasoning capacities do not necessarily transfer to this kind of personalised thinking. We find that prompting LLMs to consider safety-critical context significantly improves performance, unlike a generic 'harmless and helpful' instruction. Based on these findings, we propose research directions for embedding self-reflection capabilities, online user modelling, and dynamic risk assessment in AI assistants. Our work emphasises the need for nuanced, context-aware approaches to alignment in systems designed for persistent human interaction, aiding the development of safe and considerate AI assistants.
What makes for a 'good' social actor? Using respect as a lens to evaluate interactions with language agents
Alberts, Lize, Keeling, Geoff, McCroskery, Amanda
With the growing popularity of dialogue agents based on large language models (LLMs), urgent attention has been drawn to finding ways to ensure their behaviour is ethical and appropriate. These are largely interpreted in terms of the 'HHH' criteria: making outputs more helpful and honest, and avoiding harmful (biased, toxic, or inaccurate) statements. Whilst this semantic focus is useful from the perspective of viewing LLM agents as mere mediums for information, it fails to account for pragmatic factors that can make the same utterance seem more or less offensive or tactless in different social situations. We propose an approach to ethics that is more centred on relational and situational factors, exploring what it means for a system, as a social actor, to treat an individual respectfully in a (series of) interaction(s). Our work anticipates a set of largely unexplored risks at the level of situated interaction, and offers practical suggestions to help LLM technologies behave as 'good' social actors and treat people respectfully.
Not Cheating on the Turing Test: Towards Grounded Language Learning in Artificial Intelligence
Alberts, Lize
Recent hype surrounding the increasing sophistication of language processing models has renewed optimism regarding machines achieving a human-like command of natural language. Research in the area of natural language understanding (NLU) in artificial intelligence claims to have been making great strides in this area, however, the lack of conceptual clarity/consistency in how 'understanding' is used in this and other disciplines makes it difficult to discern how close we actually are. In this interdisciplinary research thesis, I integrate insights from cognitive science/psychology, philosophy of mind, and cognitive linguistics, and evaluate it against a critical review of current approaches in NLU to explore the basic requirements--and remaining challenges--for developing artificially intelligent systems with human-like capacities for language use and comprehension.