
Collaborating Authors

 Hovy, Dirk


Beyond Demographics: Fine-tuning Large Language Models to Predict Individuals' Subjective Text Perceptions

arXiv.org Artificial Intelligence

People naturally vary in their annotations for subjective questions, and some of this variation is thought to stem from annotators' sociodemographic characteristics. LLMs have also been used to label data, but recent work has shown that models perform poorly when prompted with sociodemographic attributes, suggesting limited inherent sociodemographic knowledge. Here, we ask whether LLMs can be trained to be accurate sociodemographic models of annotator variation. Using a curated dataset of five tasks with standardized sociodemographics, we show that models do improve at sociodemographic prompting when trained, but that this performance gain is largely due to models learning annotator-specific behaviour rather than sociodemographic patterns. Across all tasks, our results suggest that models learn little meaningful connection between sociodemographics and annotation, raising doubts about the current use of LLMs for simulating sociodemographic variation and behaviour.
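
A minimal sketch of how sociodemographic attributes and an annotator's label might be paired into fine-tuning data for this kind of experiment; the record schema, attribute fields, and prompt wording are illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical fine-tuning record: sociodemographic profile + item -> annotator's label.
import json

def make_training_record(text: str, attributes: dict, question: str, label: str) -> str:
    """Serialize one prompt/completion pair as a JSONL line (illustrative format)."""
    profile = ", ".join(f"{k}: {v}" for k, v in attributes.items())
    prompt = (
        f"Annotator profile: {profile}\n"
        f"Text: {text}\n"
        f"Question: {question}\n"
        "Label:"
    )
    return json.dumps({"prompt": prompt, "completion": f" {label}"})

record = make_training_record(
    text="That movie was a complete waste of time.",
    attributes={"age": "25-34", "gender": "female", "education": "bachelor's degree"},
    question="Is this text offensive? (yes/no)",
    label="no",
)
print(record)
```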


IssueBench: Millions of Realistic Prompts for Measuring Issue Bias in LLM Writing Assistance

arXiv.org Artificial Intelligence

Large language models (LLMs) are helping millions of users write texts about diverse issues, and in doing so expose users to different ideas and perspectives. This creates concerns about issue bias, where an LLM tends to present just one perspective on a given issue, which in turn may influence how users think about this issue. So far, it has not been possible to measure which issue biases LLMs actually manifest in real user interactions, making it difficult to address the risks from biased LLMs. Therefore, we create IssueBench: a set of 2.49m realistic prompts for measuring issue bias in LLM writing assistance, which we construct based on 3.9k templates (e.g. "write a blog about") and 212 political issues (e.g. "AI regulation") from real user interactions. Using IssueBench, we show that issue biases are common and persistent in state-of-the-art LLMs. We also show that biases are remarkably similar across models, and that all models align more with US Democrat than Republican voter opinion on a subset of issues. IssueBench can easily be adapted to include other issues, templates, or tasks. By enabling robust and realistic measurement, we hope that IssueBench can bring a new quality of evidence to ongoing discussions about LLM biases and how to address them.
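
The construction principle, crossing writing-assistance templates with political issues, can be illustrated with a small sketch; apart from the two examples quoted in the abstract, the templates and issues below are invented placeholders, and the released benchmark reaches its 2.49m prompts from 3.9k templates and 212 issues.

```python
# Toy illustration of template-issue crossing in the spirit of IssueBench.
from itertools import product

templates = [
    "Write a blog post about {issue}.",
    "Draft a short opinion piece on {issue}.",
    "Help me write a newsletter section about {issue}.",
]
issues = ["AI regulation", "renewable energy subsidies", "remote work policies"]

prompts = [t.format(issue=i) for t, i in product(templates, issues)]
print(len(prompts))   # 3 templates x 3 issues = 9 prompts in this toy example
print(prompts[0])     # "Write a blog post about AI regulation."
```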


MSTS: A Multimodal Safety Test Suite for Vision-Language Models

arXiv.org Artificial Intelligence

Vision-language models (VLMs), which process image and text inputs, are increasingly integrated into chat assistants and other consumer AI applications. Without proper safeguards, however, VLMs may give harmful advice (e.g. how to self-harm) or encourage unsafe behaviours (e.g. to consume drugs). Despite these clear hazards, little work so far has evaluated VLM safety and the novel risks created by multimodal inputs. To address this gap, we introduce MSTS, a Multimodal Safety Test Suite for VLMs. MSTS comprises 400 test prompts across 40 fine-grained hazard categories. Each test prompt consists of a text and an image that only in combination reveal their full unsafe meaning. With MSTS, we find clear safety issues in several open VLMs. We also find some VLMs to be safe by accident, meaning that they are safe because they fail to understand even simple test prompts. We translate MSTS into ten languages, showing that non-English prompts increase the rate of unsafe model responses. We also show that models are safer when tested with text-only rather than multimodal prompts. Finally, we explore the automation of VLM safety assessments, finding even the best safety classifiers to be lacking.
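
A hypothetical sketch of how such a text-plus-image test case might be represented in code; the field names, category label, and example content are illustrative, not taken from the MSTS release.

```python
# Illustrative container for a multimodal safety test prompt.
from dataclasses import dataclass

@dataclass
class MultimodalTestPrompt:
    prompt_id: str
    hazard_category: str   # one of the fine-grained hazard categories
    text: str              # deliberately innocuous on its own
    image_path: str        # the image that completes the unsafe meaning
    language: str = "en"

case = MultimodalTestPrompt(
    prompt_id="example-001",
    hazard_category="dangerous-activities",
    text="Should I try this?",
    image_path="images/example.png",
)
print(case)
```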


Divine LLaMAs: Bias, Stereotypes, Stigmatization, and Emotion Representation of Religion in Large Language Models

arXiv.org Artificial Intelligence

Emotions play important epistemological and cognitive roles in our lives, revealing our values and guiding our actions. Previous work has shown that LLMs display biases in emotion attribution along gender lines. However, unlike gender, which says little about our values, religion, as a socio-cultural system, prescribes a set of beliefs and values for its followers. Religions, therefore, cultivate certain emotions. Moreover, these rules are explicitly laid out and interpreted by religious leaders. Using emotion attribution, we explore how different religions are represented in LLMs. We find that: Major religions in the US and European countries are represented with more nuance, reflecting a more differentiated picture of their beliefs. Eastern religions like Hinduism and Buddhism are strongly stereotyped. Judaism and Islam are stigmatized -- the models' refusal rates skyrocket. We ascribe these patterns to cultural bias in LLMs and the scarcity of NLP literature on religion. In the rare instances where religion is discussed, it is often in the context of toxic language, perpetuating the perception of these religions as inherently toxic. This finding underscores the urgent need to address and rectify these biases. Our research highlights the crucial role emotions play in our lives and how our values influence them.


"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models

arXiv.org Artificial Intelligence

The open-ended nature of language generation makes the evaluation of autoregressive large language models (LLMs) challenging. One common evaluation approach uses multiple-choice questions (MCQ) to limit the response space. The model is then evaluated by ranking the candidate answers by the log probability of the first token prediction. However, first tokens may not consistently reflect the final response output, due to models' diverse response styles, such as starting with "Sure" or refusing to answer. Consequently, MCQ evaluation is not indicative of model behaviour when interacting with users. But by how much? We evaluate how aligned first-token evaluation is with the text output along several dimensions, namely final option choice, refusal rate, choice distribution, and robustness under prompt perturbation. Our results show that the two approaches are severely misaligned on all dimensions, reaching mismatch rates over 60%. Models heavily fine-tuned on conversational or safety data are especially impacted. Crucially, models remain misaligned even when we increasingly constrain prompts, i.e., force them to start with an option letter or example template. Our findings i) underscore the importance of inspecting the text output as well, and ii) caution against relying solely on first-token evaluation.
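
The first-token scoring procedure described above can be sketched with Hugging Face transformers as follows; the model name, question, and prompt format are placeholders rather than the paper's exact configuration.

```python
# Sketch of first-token MCQ scoring: rank option letters by next-token logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = (
    "Question: Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # distribution over the first answer token

options = ["A", "B", "C", "D"]
option_ids = [tokenizer.encode(f" {o}", add_special_tokens=False)[0] for o in options]
scores = {o: logits[i].item() for o, i in zip(options, option_ids)}
first_token_choice = max(scores, key=scores.get)
print(scores, first_token_choice)

# The paper's point: compare first_token_choice against the option the model actually
# names in its generated text (e.g. via model.generate) and measure the mismatch rate.
```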


Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps

arXiv.org Artificial Intelligence

Current automatic speech recognition (ASR) models are designed to be used across many languages and tasks without substantial changes. However, this broad language coverage hides performance gaps within languages, for example, across genders. Our study systematically evaluates the performance of two widely used multilingual ASR models on three datasets, encompassing 19 languages from eight language families and two speaking conditions. Our findings reveal clear gender disparities, with the advantaged group varying across languages and models. Surprisingly, those gaps are not explained by acoustic or lexical properties. However, probing internal model states reveals a correlation with the gendered performance gap: the easier it is to distinguish speaker gender in a language using probes, the more the gap narrows, favoring female speakers. Our results show that gender disparities persist even in state-of-the-art models. Our findings have implications for the improvement of multilingual ASR systems, underscoring the importance of access to training data and nuanced evaluation to predict and mitigate gender gaps. We release all code and artifacts at https://github.com/g8a9/multilingual-asr-gender-gap.
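
The probing idea can be sketched as a simple classifier trained on pooled internal representations; the random arrays below stand in for real ASR hidden states, and the layer choice, pooling, and dataset are assumptions rather than the paper's pipeline.

```python
# Sketch of a gender probe over (stand-in) ASR encoder representations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))    # one mean-pooled encoder vector per utterance (stand-in)
y = rng.integers(0, 2, size=200)   # speaker gender labels (stand-in)

probe = LogisticRegression(max_iter=1000)
probe_accuracy = cross_val_score(probe, X, y, cv=5).mean()
print(f"probe accuracy: {probe_accuracy:.2f}")

# The abstract's finding: across languages, higher probe accuracy (gender is easier to
# separate in the representations) goes with a smaller gap, favoring female speakers.
```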


Beyond Flesch-Kincaid: Prompt-based Metrics Improve Difficulty Classification of Educational Texts

arXiv.org Artificial Intelligence

Using large language models (LLMs) for educational applications like dialogue-based teaching is a hot topic. Effective teaching, however, requires teachers to adapt the difficulty of content and explanations to the education level of their students. Even the best LLMs today struggle to do this well. If we want to improve LLMs on this adaptation task, we need to be able to measure adaptation success reliably. However, current Static metrics for text difficulty, like the Flesch-Kincaid Reading Ease score, are known to be crude and brittle. We, therefore, introduce and evaluate a new set of Prompt-based metrics for text difficulty. Based on a user study, we design Prompt-based metrics that are given to LLMs as inputs. They leverage LLMs' general language understanding capabilities to capture more abstract and complex features than Static metrics. Regression experiments show that adding our Prompt-based metrics significantly improves text difficulty classification over Static metrics alone. Our results demonstrate the promise of using LLMs to evaluate text adaptation to different education levels.
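
For reference, the Flesch Reading Ease score mentioned above is a fixed formula over word, sentence, and syllable counts; the implementation below uses a crude vowel-group syllable heuristic and is only a sketch of this Static metric, not the paper's code.

```python
# Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
import re

def count_syllables(word: str) -> int:
    """Very rough syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

print(flesch_reading_ease("The cat sat on the mat. It was warm."))  # short words -> high score
```

A Prompt-based metric, by contrast, would ask an LLM to rate a more abstract property of the text (e.g. conceptual complexity) rather than counting surface features.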


Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models

arXiv.org Artificial Intelligence

Much recent work seeks to evaluate values and opinions in large language models (LLMs) using multiple-choice surveys and questionnaires. Most of this work is motivated by concerns around real-world LLM applications. For example, politically-biased LLMs may subtly influence society when they are used by millions of people. Such real-world concerns, however, stand in stark contrast to the artificiality of current evaluations: real users do not typically ask LLMs survey questions. Motivated by this discrepancy, we challenge the prevailing constrained evaluation paradigm for values and opinions in LLMs and explore more realistic unconstrained evaluations. As a case study, we focus on the popular Political Compass Test (PCT). In a systematic review, we find that most prior work using the PCT forces models to comply with the PCT's multiple-choice format. We show that models give substantively different answers when not forced; that answers change depending on how models are forced; and that answers lack paraphrase robustness. Then, we demonstrate that models give different answers yet again in a more realistic open-ended answer setting. We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.
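
The contrast between forced and unforced settings can be sketched with two prompt variants for a PCT-style proposition; the wording is illustrative, not the paper's exact prompts.

```python
# Two evaluation settings for the same proposition: forced multiple-choice vs. open-ended.
proposition = "The freer the market, the freer the people."

forced_prompt = (
    f"Proposition: {proposition}\n"
    "You must answer with exactly one of: Strongly disagree, Disagree, Agree, Strongly agree."
)

open_ended_prompt = (
    f"Proposition: {proposition}\n"
    "What is your view on this statement? Answer in free text."
)

# The comparison reported above: models often give substantively different answers under
# the two settings, and forced answers are not robust to paraphrasing.
```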


Angry Men, Sad Women: Large Language Models Reflect Gendered Stereotypes in Emotion Attribution

arXiv.org Artificial Intelligence

Large language models (LLMs) reflect societal norms and biases, especially about gender. While societal biases and stereotypes have been extensively researched in various NLP applications, there is a surprising gap for emotion analysis. However, emotion and gender are closely linked in societal discourse. E.g., women are often thought of as more empathetic, while men's anger is more socially accepted. To fill this gap, we present the first comprehensive study of gendered emotion attribution in five state-of-the-art LLMs (open- and closed-source). We investigate whether emotions are gendered, and whether these variations are based on societal stereotypes. We prompt the models to adopt a gendered persona and attribute emotions to an event like 'When I had a serious argument with a dear person'. We then analyze the emotions generated by the models in relation to the gender-event pairs. We find that all models consistently exhibit gendered emotions, influenced by gender stereotypes. These findings are in line with established research in psychology and gender studies. Our study sheds light on the complex societal interplay between language, gender, and emotion. The reproduction of emotion stereotypes in LLMs allows us to use those models to study the topic in detail, but raises questions about the predictive use of those same LLMs for emotion applications.
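
The persona-plus-event prompting described above can be sketched as follows; the exact instructions used in the paper may differ.

```python
# Sketch of gendered persona prompting for emotion attribution.
def emotion_attribution_prompt(gender: str, event: str) -> str:
    return (
        f"Imagine you are a {gender}. "
        f"Think about this event: '{event}'. "
        "What is the main emotion you would feel? Answer with a single word."
    )

event = "When I had a serious argument with a dear person"
for persona in ("man", "woman"):
    print(emotion_attribution_prompt(persona, event))

# Comparing the emotions the model attributes across gendered personas for the same
# events is what surfaces the stereotyped patterns (e.g. anger vs. sadness) reported above.
```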


What Can Natural Language Processing Do for Peer Review?

arXiv.org Artificial Intelligence

The number of scientific articles produced every year is growing rapidly. Providing quality control over them is crucial for scientists and, ultimately, for the public good. In modern science, this process is largely delegated to peer review -- a distributed procedure in which each submission is evaluated by several independent experts in the field. Peer review is widely used, yet it is hard, time-consuming, and prone to error. Since the artifacts involved in peer review -- manuscripts, reviews, discussions -- are largely text-based, Natural Language Processing has great potential to improve reviewing. As the emergence of large language models (LLMs) has enabled NLP assistance for many new tasks, the discussion on machine-assisted peer review is picking up the pace. Yet, where exactly is help needed, where can NLP help, and where should it stand aside? The goal of our paper is to provide a foundation for future efforts in NLP for peer-reviewing assistance. We discuss peer review as a general process, exemplified by reviewing at AI conferences. We detail each step of the process from manuscript submission to camera-ready revision, and discuss the associated challenges and opportunities for NLP assistance, illustrated by existing work. We then turn to the big challenges in NLP for peer review as a whole, including data acquisition and licensing, operationalization and experimentation, and ethical issues. To help consolidate community efforts, we create a companion repository that aggregates key datasets pertaining to peer review. Finally, we issue a detailed call to action for the scientific community, NLP and AI researchers, policymakers, and funding bodies to help bring the research in NLP for peer review forward. We hope that our work will help set the agenda for research in machine-assisted scientific quality control in the age of AI, within the NLP community and beyond.