AITopics | Röttger, Paul

Collaborating Authors

Röttger, Paul

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

Röttger, Paul, Kirk, Hannah Rose, Vidgen, Bertie, Attanasio, Giuseppe, Bianchi, Federico, Hovy, Dirk

arXiv.org Artificial Intelligence, Oct-17-2023

Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models, for most applications, should refuse. We describe XSTest's creation and composition, and then use the test suite to highlight systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models.
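
As a rough illustration of how a test suite of this shape could be scored, the sketch below computes refusal rates separately for safe and unsafe prompts. The record layout, the `refusal_rates` helper, and the toy labels are assumptions made for illustration; they are not taken from the XSTest release or from the paper's evaluation code.

```python
from collections import Counter


def refusal_rates(records):
    """Return the fraction of refused prompts per prompt label.

    Each record is a (label, refused) pair, where label is "safe" or
    "unsafe" and refused is a bool. This layout is hypothetical.
    """
    totals = Counter()
    refusals = Counter()
    for label, refused in records:
        totals[label] += 1
        if refused:
            refusals[label] += 1
    return {label: refusals[label] / totals[label] for label in totals}


# Toy labels, invented for illustration only.
records = [
    ("safe", False), ("safe", True), ("safe", False), ("safe", False),
    ("unsafe", True), ("unsafe", True),
]
rates = refusal_rates(records)
print(rates)
```

Under this reading, a high refusal rate on the safe prompts would point to exaggerated safety, while a low refusal rate on the unsafe prompts would point to the opposite failure.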

large language model, machine learning, natural language, (19 more...)

arXiv:2308.01263

Country:

Europe > United Kingdom (0.93)
Asia > Middle East > UAE (0.14)
North America > United States > Washington > King County > Seattle (0.14)

Genre: Research Report (0.82)

Industry:

Retail (1.00)
Law (1.00)
Transportation (0.93)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values

Kirk, Hannah Rose, Bean, Andrew M., Vidgen, Bertie, Röttger, Paul, Hale, Scott A.

arXiv.org Artificial Intelligence, Oct-11-2023

Human feedback is increasingly used to steer the behaviours of Large Language Models (LLMs). However, it is unclear how to collect and incorporate feedback in a way that is efficient, effective and unbiased, especially for highly subjective human preferences and values. In this paper, we survey existing approaches for learning from human feedback, drawing on 95 papers primarily from the ACL and arXiv repositories. First, we summarise the past, pre-LLM trends for integrating human feedback into language models. Second, we give an overview of present techniques and practices, as well as the motivations for using feedback; conceptual frameworks for defining values and preferences; and how feedback is collected and from whom. Finally, we encourage a better future of feedback learning in LLMs by raising five unresolved conceptual and practical challenges.

large language model, natural language, subjective human preference and value, (3 more...)

arXiv:2310.07629

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions

Bianchi, Federico, Suzgun, Mirac, Attanasio, Giuseppe, Röttger, Paul, Jurafsky, Dan, Hashimoto, Tatsunori, Zou, James

arXiv.org Artificial Intelligence, Sep-25-2023

Training large language models to follow instructions makes them perform better on a wide range of tasks, generally becoming more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. In this paper, we raise concerns over the safety of models that only emphasize helpfulness, not safety, in their instruction-tuning. We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just 3% safety examples (a few hundred demonstrations) in the training set when fine-tuning a model like LLaMA can substantially improve their safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do find a behavior of exaggerated safety, where too much safety-tuning makes models refuse to respond to reasonable prompts that superficially resemble unsafe ones. Our study sheds light on trade-offs in training LLMs to follow instructions and exhibit safe behavior.

large language model, machine learning, natural language, (15 more...)

arXiv:2309.07875

Country:

North America > Canada (0.14)
North America > United States (0.14)
Asia > Middle East > UAE (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (0.68)
Government (0.46)
Media > News (0.46)
Law Enforcement & Public Safety (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

The Ecological Fallacy in Annotation: Modelling Human Label Variation goes beyond Sociodemographics

Orlikowski, Matthias, Röttger, Paul, Cimiano, Philipp, Hovy, Dirk

arXiv.org Artificial Intelligence, Jun-20-2023

Many NLP tasks exhibit human label variation, where different annotators give different labels to the same texts. This variation is known to depend, at least in part, on the sociodemographics of annotators. Recent research aims to model individual annotator behaviour rather than predicting aggregated labels, and we would expect that sociodemographic information is useful for these models. On the other hand, the ecological fallacy states that aggregate group behaviour, such as the behaviour of the average female annotator, does not necessarily explain individual behaviour. To account for sociodemographics in models of individual annotator behaviour, we introduce group-specific layers to multi-annotator models. In a series of experiments for toxic content detection, we find that explicitly accounting for sociodemographic attributes in this way does not significantly improve model performance. This result shows that individual annotation behaviour depends on much more than just sociodemographics.
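
The ecological fallacy mentioned above can be made concrete with a small numeric sketch. It does not implement the paper's group-specific layers; it only shows how a group average can fail to describe any individual in the group. The annotator names, group labels, and ratings are invented for illustration.

```python
from statistics import mean

# Hypothetical toxicity labels (1 = toxic, 0 = not toxic) that each
# annotator gave to the same five texts. Names and groups are invented.
annotations = {
    "annotator_a": {"group": "g1", "labels": [1, 1, 1, 1, 0]},
    "annotator_b": {"group": "g1", "labels": [0, 0, 0, 1, 0]},
    "annotator_c": {"group": "g2", "labels": [1, 0, 1, 0, 0]},
}


def individual_rates(data):
    """Share of texts each annotator labelled as toxic."""
    return {name: mean(a["labels"]) for name, a in data.items()}


def group_rates(data):
    """Share of labels marked toxic, pooled over each group's annotators."""
    by_group = {}
    for a in data.values():
        by_group.setdefault(a["group"], []).extend(a["labels"])
    return {g: mean(labels) for g, labels in by_group.items()}


ind = individual_rates(annotations)
grp = group_rates(annotations)
print(ind)
print(grp)
```

In this toy data the pooled rate for group g1 sits halfway between its two annotators, so predicting either annotator from the group average alone would be off by a wide margin.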

annotator, machine learning, natural language, (18 more...)

arXiv:2306.11559

Country:

Europe (1.00)
North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Education > Educational Setting (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

SemEval-2023 Task 10: Explainable Detection of Online Sexism

Kirk, Hannah Rose, Yin, Wenjie, Vidgen, Bertie, Röttger, Paul

arXiv.org Artificial Intelligence, May-8-2023

Online sexism is a widespread and harmful phenomenon. Automated tools can assist the detection of sexism at scale. Binary detection, however, disregards the diversity of sexist content, and fails to provide clear explanations for why something is sexist. To address this issue, we introduce SemEval Task 10 on the Explainable Detection of Online Sexism (EDOS). We make three main contributions: i) a novel hierarchical taxonomy of sexist content, which includes granular vectors of sexism to aid explainability; ii) a new dataset of 20,000 social media comments with fine-grained labels, along with larger unlabelled datasets for model adaptation; and iii) baseline models as well as an analysis of the methods, results and errors for participant submissions to our task.

computational linguistic, machine learning, natural language, (19 more...)

arXiv:2303.04222

Country:

Asia (0.67)
North America > United States > Minnesota (0.28)
Europe > United Kingdom > England (0.28)

Genre:

Overview (1.00)
Research Report (0.81)

Industry:

Law > Civil Rights & Constitutional Law (1.00)
Health & Medicine (0.68)
Information Technology > Services (0.67)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.67)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback

Kirk, Hannah Rose, Vidgen, Bertie, Röttger, Paul, Hale, Scott A.

arXiv.org Artificial Intelligence, Mar-9-2023

Large language models (LLMs) are used to generate content for a wide range of tasks, and are set to reach a growing audience in coming years due to integration in product interfaces like ChatGPT or search engines like Bing. This intensifies the need to ensure that models are aligned with human preferences and do not produce unsafe, inaccurate or toxic outputs. While alignment techniques like reinforcement learning with human feedback (RLHF) and red-teaming can mitigate some safety concerns and improve model capabilities, it is unlikely that an aggregate fine-tuning process can adequately represent the full range of users' preferences and values. Different people may legitimately disagree on their preferences for language and conversational norms, as well as on values or ideologies which guide their communication. Personalising LLMs through micro-level preference learning processes may result in models that are better aligned with each user. However, there are several normative challenges in defining the bounds of a societally-acceptable and safe degree of personalisation. In this paper, we ask how, and in what ways, LLMs should be personalised. First, we review literature on current paradigms for aligning LLMs with human feedback, and identify issues including (i) a lack of clarity regarding what alignment means; (ii) a tendency of technology providers to prescribe definitions of inherently subjective preferences and values; and (iii) a 'tyranny of the crowdworker', exacerbated by a lack of documentation in who we are really aligning to. Second, we present a taxonomy of benefits and risks associated with personalised LLMs, for individuals and society at large. Finally, we propose a three-tiered policy framework that allows users to experience the benefits of personalised alignment, while restraining unsafe and undesirable LLM-behaviours within (supra-)national and organisational bounds.

computational linguistic, information retrieval, machine learning, (18 more...)

arXiv:2303.05453

Country:

Europe > United Kingdom (1.00)
Asia (1.00)
North America > United States > Minnesota (0.27)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Law (1.00)
Information Technology > Security & Privacy (1.00)
(6 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
