Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems

arXiv.org Artificial Intelligence

The NLP research community has made publicly available numerous instruments for measuring representational harms caused by large language model (LLM)-based systems. These instruments have taken the form of datasets, metrics, tools, and more. In this paper, we examine the extent to which such instruments meet the needs of practitioners tasked with evaluating LLM-based systems. Via semi-structured interviews with 12 such practitioners, we find that practitioners are often unable to use publicly available instruments for measuring representational harms. We identify two types of challenges. In some cases, instruments are not useful because they do not meaningfully measure what practitioners seek to measure or are otherwise misaligned with practitioner needs. In other cases, instruments - even useful instruments - are not used by practitioners due to practical and institutional barriers impeding their uptake. Drawing on measurement theory and pragmatic measurement, we provide recommendations for addressing these challenges to better meet practitioner needs.


A conclusive remark on linguistic theorizing and language modeling

arXiv.org Artificial Intelligence

Considering the proliferation of responses to Piantadosi's original paper and the ongoing debate sparked by this special issue of the Italian Journal of Linguistics, it is clear that the discussion has touched a raw nerve in linguistic theorizing. In the original target paper (Chesi, this issue), I illustrated three prototypical (and in many respects, extreme) positions -- the computational, theoretical, and experimental perspectives -- without explicitly endorsing any of them. Instead, I attempted to highlight what I believe are the key weaknesses of each of these prototypical stances, ultimately concluding that formal (i.e., 'generative') linguistics -- more specifically, Minimalism, my theoretical comfort zone -- must adopt practices and tools that are common in both computational and experimental fields. As noted by most respondents, the title and some of the more extreme statements were intended as mild provocations to draw attention to core issues affecting linguistic theorizing. My position -- somehow obscured behind the 'three-body problem' -- is that any relevant scientific progress is driven by theoretical insight, not by trawling using experimental or computational methods that are cost-inefficient, energy-intensive, and ultimately unsustainable. Moreover, in full agreement with most of the replies, I believe that the success of certain large language models (LLMs), which are based on specific architectural assumptions, does not constitute a refutation of the generative paradigm. On the contrary, it strongly supports several key intuitions that have emerged within the generative linguistic tradition (Rizzi, this issue). However, a concrete problem of 'incommensurability' arises (Hao, this issue), as differing methodologies and specialized jargon (Butt, this issue) often result in circular, unresolved discussions.


Evaluating Apple Intelligence's Writing Tools for Privacy Against Large Language Model-Based Inference Attacks: Insights from Early Datasets

arXiv.org Artificial Intelligence

The misuse of Large Language Models (LLMs) to infer emotions from text for malicious purposes, known as emotion inference attacks, poses a significant threat to user privacy. In this paper, we investigate the potential of Apple Intelligence's writing tools, integrated across iPhone, iPad, and MacBook, to mitigate these risks through text modifications such as rewriting and tone adjustment. By developing early novel datasets specifically for this purpose, we empirically assess how different text modifications influence LLM-based detection. This capability suggests strong potential for Apple Intelligence's writing tools as privacy-preserving mechanisms. Our findings lay the groundwork for future adaptive rewriting systems capable of dynamically neutralizing sensitive emotional content to enhance user privacy. To the best of our knowledge, this research provides the first empirical analysis of Apple Intelligence's text-modification tools within a privacy-preservation context, with the broader goal of developing on-device, user-centric privacy-preserving mechanisms to protect against advanced LLM-based inference attacks on deployed systems.


ROSA: Addressing text understanding challenges in photographs via ROtated SAmpling

arXiv.org Artificial Intelligence

Visually impaired people could benefit from Visual Question Answering (VQA) systems to interpret text in their surroundings. However, current models often struggle with recognizing text in the photos taken by this population. Through in-depth interviews with visually impaired individuals, we identified common framing conventions that frequently result in misaligned text. Existing VQA benchmarks primarily feature well-oriented text captured by sighted users, under-representing these challenges. To address this gap, we introduce ROtated SAmpling (ROSA), a decoding strategy that enhances VQA performance in text-rich images with incorrectly oriented text. ROSA outperforms Greedy decoding by 11.7 absolute points in the best-performing model.


MultiHoax: A Dataset of Multi-hop False-Premise Questions

arXiv.org Artificial Intelligence

As Large Language Models are increasingly deployed in high-stakes domains, their ability to detect false assumptions and reason critically is crucial for ensuring reliable outputs. False-premise questions (FPQs) serve as an important evaluation method by exposing cases where flawed assumptions lead to incorrect responses. While existing benchmarks focus on single-hop FPQs, real-world reasoning often requires multi-hop inference, where models must verify consistency across multiple reasoning steps rather than relying on surface-level cues. To address this gap, we introduce MultiHoax, a benchmark for evaluating LLMs' ability to handle false premises in complex, multi-step reasoning tasks. Our dataset spans seven countries and ten diverse knowledge categories, using Wikipedia as the primary knowledge source to enable factual reasoning across regions. Experiments reveal that state-of-the-art LLMs struggle to detect false premises across different countries, knowledge categories, and multi-hop reasoning types, highlighting the need for improved false premise detection and more robust multi-hop reasoning capabilities in LLMs.


The Machine Ethics podcast – DeepDive: AI and the environment

AIHub

Hosted by Ben Byford, The Machine Ethics Podcast brings together interviews with academics, authors, business leaders, designers and engineers on the subject of autonomous algorithms, artificial intelligence, machine learning, and technology's impact on society. This is our 100th episode! A super special look at AI and the environment, we interviewed four experts for this DeepDive episode. We chatted about water stress, the energy usage of AI systems and data centres, using AI for fossil fuel discovery, the geo-political nature of AI, GenAI vs other ML algorithms for energy use, demanding transparency on energy usage for training and operating AI, more AI regulation for carbon consumption, things we can change today like picking renewable hosting solutions, publishing your data, when doing "responsible AI" you must include the environment, considering who are the controllers of the technology and what do they want, and more… Hannah Smith is Director of Operations for Green Web Foundation and co-founder of Green Tech South West. She has a background in Computer Science.


Google DeepMind's CEO Thinks AI Will Make Humans Less Selfish

WIRED

If you buy that artificial intelligence is a once-in-a-species disruption, then what Demis Hassabis thinks should be of vital interest to you. Hassabis leads the AI charge for Google, arguably the best-equipped of the companies spending many billions of dollars to bring about that upheaval. He's among those powerful leaders gunning to build artificial general intelligence, the technology that will supposedly have machines do everything humans do, but better. None of his competitors, however, have earned a Nobel Prize and a knighthood for their achievements. Sir Demis is the exception--and he did it all through games.


Sensitivity-Aware Density Estimation in Multiple Dimensions

arXiv.org Artificial Intelligence

We formulate an optimization problem to estimate probability densities in the context of multidimensional problems that are sampled with uneven probability. It considers detector sensitivity as a heterogeneous density and takes advantage of the computational speed and flexible boundary conditions offered by splines on a grid. We choose to regularize the Hessian of the spline via the nuclear norm to promote sparsity. As a result, the method is spatially adaptive and stable against the choice of the regularization parameter, which plays the role of the bandwidth. We test our computational pipeline on standard densities and provide software. We also present a new approach to PET rebinning as an application of our framework.


A Fully Generative Motivational Interviewing Counsellor Chatbot for Moving Smokers Towards the Decision to Quit

arXiv.org Artificial Intelligence

The conversational capabilities of Large Language Models (LLMs) suggest that they may be able to perform as automated talk therapists. It is crucial to know if these systems would be effective and adhere to known standards. We present a counsellor chatbot that focuses on motivating tobacco smokers to quit smoking. It uses a state-of-the-art LLM and a widely applied therapeutic approach called Motivational Interviewing (MI), and was evolved in collaboration with clinician-scientists with expertise in MI. We also describe and validate an automated assessment of both the chatbot's adherence to MI and client responses. The chatbot was tested on 106 participants, and their confidence that they could succeed in quitting smoking was measured before the conversation and one week later. Participants' confidence increased by an average of 1.7 on a 0-10 scale. The automated assessment of the chatbot showed adherence to MI standards in 98% of utterances, higher than human counsellors. The chatbot scored well on a participant-reported metric of perceived empathy but lower than typical human counsellors. Furthermore, participants' language indicated a good level of motivation to change, a key goal in MI. These results suggest that the automation of talk therapy with a modern LLM has promise.


What is Stigma Attributed to? A Theory-Grounded, Expert-Annotated Interview Corpus for Demystifying Mental-Health Stigma

arXiv.org Artificial Intelligence

Mental-health stigma remains a pervasive social problem that hampers treatment-seeking and recovery. Existing resources for training neural models to finely classify such stigma are limited, relying primarily on social-media or synthetic data without theoretical underpinnings. To remedy this gap, we present an expert-annotated, theory-informed corpus of human-chatbot interviews, comprising 4,141 snippets from 684 participants with documented socio-cultural backgrounds. Our experiments benchmark state-of-the-art neural models and empirically unpack the challenges of stigma detection. This dataset can facilitate research on computationally detecting, neutralizing, and counteracting mental-health stigma. Our corpus is openly available at https://github.com/HanMeng2004/Mental-Health-Stigma-Interview-Corpus.