bergen
Measuring, Modeling, and Helping People Account for Privacy Risks in Online Self-Disclosures with AI
Krsek, Isadora, Kabra, Anubha, Dou, Yao, Naous, Tarek, Dabbish, Laura A., Ritter, Alan, Xu, Wei, Das, Sauvik
In pseudonymous online fora like Reddit, the benefits of self-disclosure are often apparent to users (e.g., I can vent about my in-laws to understanding strangers), but the privacy risks are more abstract (e.g., will my partner be able to tell that this is me?). Prior work has sought to develop natural language processing (NLP) tools that help users identify potentially risky self-disclosures in their text, but none have been designed for or evaluated with the users they hope to protect. Absent this assessment, these tools will be limited by the social-technical gap: users need assistive tools that help them make informed decisions, not paternalistic tools that tell them to avoid self-disclosure altogether. To bridge this gap, we conducted a study with N = 21 Reddit users; we had them use a state-of-the-art NLP disclosure detection model on two of their authored posts and asked them questions to understand if and how the model helped, where it fell short, and how it could be improved to help them make more informed decisions. Despite its imperfections, users responded positively to the model and highlighted its use as a tool that can help them catch mistakes, inform them of risks they were unaware of, and encourage self-reflection. However, our work also shows how, to be useful and usable, AI for supporting privacy decision-making must account for posting context, disclosure norms, and users' lived threat models, and provide explanations that help contextualize detected risks.
AI-Driven Agents with Prompts Designed for High Agreeableness Increase the Likelihood of Being Mistaken for a Human in the Turing Test
León-Domínguez, U., Flores-Flores, E. D., García-Jasso, A. J., Gómez-Cuellar, M. K., Torres-Sánchez, D., Basora-Marimon, A.
Large Language Models based on transformer algorithms have revolutionized Artificial Intelligence by enabling verbal interaction with machines akin to human conversation. These AI agents have surpassed the Turing Test, achieving confusion rates up to 50%. However, challenges persist, especially with the advent of robots and the need to humanize machines for improved Human-AI collaboration. In this experiment, three GPT agents with varying levels of agreeableness (disagreeable, neutral, agreeable) based on the Big Five Inventory were tested in a Turing Test. All exceeded a 50% confusion rate, with the highly agreeable AI agent surpassing 60%. This agent was also recognized as exhibiting the most human-like traits. Various explanations in the literature address why these GPT agents were perceived as human, including psychological frameworks for understanding anthropomorphism. These findings highlight the importance of personality engineering as an emerging discipline in artificial intelligence, calling for collaboration with psychology to develop ergonomic psychological models that enhance system adaptability in collaborative activities.
"It was 80% me, 20% AI": Seeking Authenticity in Co-Writing with Large Language Models
Hwang, Angel Hsing-Chi, Liao, Q. Vera, Blodgett, Su Lin, Olteanu, Alexandra, Trischler, Adam
Given the rising proliferation and diversity of AI writing assistance tools, especially those powered by large language models (LLMs), both writers and readers may have concerns about the impact of these tools on the authenticity of writing work. We examine whether and how writers want to preserve their authentic voice when co-writing with AI tools and whether personalization of AI writing support could help achieve this goal. We conducted semi-structured interviews with 19 professional writers, during which they co-wrote with both personalized and non-personalized AI writing-support tools. We supplemented writers' perspectives with opinions from 30 avid readers about the written work co-produced with AI collected through an online survey. Our findings illuminate conceptions of authenticity in human-AI co-creation, which focus more on the process and experience of constructing creators' authentic selves. While writers reacted positively to personalized AI writing tools, they believed the form of personalization needs to target writers' growth and go beyond the phase of text production. Overall, readers' responses showed less concern about human-AI co-writing. Readers could not distinguish AI-assisted work, personalized or not, from writers' solo-written work and showed positive attitudes toward writers experimenting with new technology for creative writing.
BERGEN: A Benchmarking Library for Retrieval-Augmented Generation
Rau, David, Déjean, Hervé, Chirkova, Nadezhda, Formal, Thibault, Wang, Shuai, Nikoulina, Vassilina, Clinchant, Stéphane
Retrieval-Augmented Generation (RAG) enhances Large Language Models with external knowledge. In response to the recent popularity of generative LLMs, many RAG approaches have been proposed, which involve a large number of different configurations such as evaluation datasets, collections, metrics, retrievers, and LLMs. Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline. In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research standardizing RAG experiments. In an extensive study focusing on QA, we benchmark different state-of-the-art retrievers, rerankers, and LLMs. Additionally, we analyze existing RAG metrics and datasets. Our open-source library BERGEN is available at https://github.com/naver/bergen.
People cannot distinguish GPT-4 from a human in a Turing test
Jones, Cameron R., Bergen, Benjamin K.
We evaluated 3 systems (ELIZA, GPT-3.5 and GPT-4) in a randomized, controlled, and preregistered Turing test. Human participants had a 5-minute conversation with either a human or an AI, and judged whether or not they thought their interlocutor was human. GPT-4 was judged to be a human 54% of the time, outperforming ELIZA (22%) but lagging behind actual humans (67%). The results provide the first robust empirical demonstration that any artificial system passes an interactive 2-player Turing test. The results have implications for debates around machine intelligence and, more urgently, suggest that deception by current AI systems may go undetected. Analysis of participants' strategies and reasoning suggests that stylistic and socio-emotional factors play a larger role in passing the Turing test than traditional notions of intelligence.
Physics-based deep learning reveals rising heating demand heightens air pollution in Norwegian cities
Cao, Cong, Debnath, Ramit, Alvarez, R. Michael
Policymakers frequently analyze air quality and climate change in isolation, disregarding their interactions. This study explores the influence of specific climate factors on air quality by contrasting a regression model with K-Means Clustering, Hierarchical Clustering, and Random Forest techniques. We employ Physics-based Deep Learning (PBDL) and Long Short-Term Memory (LSTM) to examine the air pollution predictions. Our analysis utilizes ten years (2009-2018) of daily traffic, weather, and air pollution data from three major cities in Norway. Findings from feature selection reveal a correlation between rising heating degree days and heightened air pollution levels, suggesting increased heating activities in Norway are a contributing factor to worsening air quality. PBDL demonstrates superior accuracy in air pollution predictions compared to LSTM. This paper contributes to the growing literature on PBDL methods for more accurate air pollution predictions using environmental variables, aiding policymakers in formulating effective data-driven climate policies.
Thirsty Fabs
This year, Samsung is planning to open a semiconductor chip manufacturing plant in Taylor, TX, that will cost the company an estimated $17 billion. Intel is building a $20 billion facility in Columbus, OH, and industry leaders GlobalFoundries, TSMC, and Texas Instruments are building their own so-called chip fabs in the U.S. as well. This construction boom has been spurred in part by increasing demand for the smartphones, personal electronic devices, and Artificial Intelligence (AI) services that depend on chips, and the $50 billion in funding that the 2022 CHIPS and Science Act allocated to American semiconductor manufacturing has proven to be a strong incentive. Yet the boom is global, with new plants being developed all over the world. As companies plan these new chip fabs, one of the first questions they need to answer is where they are going to get their water.
Can Peanuts Fall in Love with Distributional Semantics?
Michaelov, James A., Coulson, Seana, Bergen, Benjamin K.
Context changes expectations about upcoming words: following a story involving an anthropomorphic peanut, comprehenders expect the sentence "the peanut was in love" more than "the peanut was salted", as indexed by N400 amplitude (Nieuwland & van Berkum, 2006). This updating of expectations has been explained using Situation Models, mental representations of a described event. However, recent work showing that N400 amplitude is predictable from distributional information alone raises the question of whether situation models are necessary for these contextual effects. We model the results of Nieuwland and van Berkum (2006) using six computational language models and three sets of word vectors, none of which have explicit situation models or semantic grounding. We find that a subset of these can fully model the effect found by Nieuwland and van Berkum (2006). Thus, at least some processing effects normally explained through situation models may not in fact require explicit situation models.
AIwriting: Relations Between Image Generation and Digital Writing
Rettberg, Scott, Memmott, Talan, Rettberg, Jill Walker, Nelson, Jason, Lichty, Patrick
During 2022, both transformer-based AI text generation systems such as GPT-3 and AI text-to-image generation systems such as DALL-E 2 and Stable Diffusion made exponential leaps forward and are unquestionably altering the fields of digital art and electronic literature. In this panel a group of electronic literature authors and theorists consider new opportunities for human creativity presented by these systems and present new works they have produced during the past year that specifically address these systems as environments for literary expressions that are translated through iterative interlocutive processes into visual representations. The premise that binds these presentations is that these systems and the works generated must be considered from a literary perspective, as they originate in human writing. In works ranging from a visual memoir of the personal experience of a health crisis, to interactive web comics, to architectures based on abstract poetic language, to political satire, four artists explore the capabilities of these writing environments for new genres of literary artist practice, while a digital culture theorist considers the origins and effects of the particular training datasets of human language and images on which these new hybrid forms are based.
Underwater autonomous mapping and characterization of marine debris in urban water bodies
Fossum, Trygve Olav, Sture, Øystein, Norgren-Aamot, Petter, Hansen, Ingrid Myrnes, Kvisvik, Bjørn Christian, Knag, Anne Christine
Marine debris originating from human activity has been accumulating in underwater environments such as oceans, lakes, and rivers for decades. The extent, type, and amount of waste are hard to assess as the exact mechanisms for spread are not understood, yielding unknown consequences for the marine environment and human health. Methods for detecting and mapping marine debris are therefore vital in order to gain insight into pollution dynamics, which in turn can be used to effectively plan and execute physical removal. Using an autonomous underwater vehicle (AUV), equipped with an underwater hyperspectral imager (UHI) and stereo-camera, marine debris was autonomously detected, mapped and quantified in the sheltered bay Store Lungegaardsvann in Bergen, Norway.