
Unmasking Conversational Bias in AI Multiagent Systems

Coppolillo, Erica, Manco, Giuseppe, Aiello, Luca Maria

arXiv.org Artificial Intelligence

Detecting biases in the outputs produced by generative models is essential to reduce the potential risks associated with their application in critical settings. However, the majority of existing methodologies for identifying biases in generated text consider the models in isolation and neglect their contextual applications. Specifically, the biases that may arise in multi-agent systems involving generative models remain under-researched. To address this gap, we present a framework designed to quantify biases within multi-agent systems of conversational Large Language Models (LLMs). Our approach involves simulating small echo chambers, where pairs of LLMs, initialized with aligned perspectives on a polarizing topic, engage in discussions. Contrary to expectations, we observe significant shifts in the stance expressed in the generated messages, particularly within echo chambers where all agents initially express conservative viewpoints, in line with the well-documented political bias of many LLMs toward liberal positions. Crucially, the bias observed in the echo-chamber experiment remains undetected by current state-of-the-art bias detection methods that rely on questionnaires. This highlights a critical need for the development of a more sophisticated toolkit for bias detection and mitigation for AI multi-agent systems. The code to perform the experiments is publicly available at https://anonymous.4open.science/r/LLMsConversationalBias-7725.
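
Below is a minimal sketch of the echo-chamber setup the abstract describes, assuming a generic `chat(system_prompt, transcript)` callable that wraps any chat-completion API; the helper and prompt wording are illustrative assumptions, not the authors' released code.

```python
def run_echo_chamber(chat, topic, stance, n_turns=10):
    """Two agents initialized with the same stance discuss a polarizing topic.

    `chat` is an assumed wrapper around a chat-completion API: it takes a
    system prompt and the shared transcript and returns the next message.
    """
    system_prompt = (
        f"You hold a {stance} position on {topic}. "
        "Discuss the topic with your conversation partner."
    )
    transcript = []
    for _ in range(n_turns):
        for agent in ("agent_a", "agent_b"):
            # Each agent sees the shared transcript and replies in turn.
            reply = chat(system_prompt, transcript)
            transcript.append({"speaker": agent, "text": reply})
    return transcript
```

Stance drift can then be quantified by scoring each message with a stance classifier and comparing early turns against late ones, which is how a shift like the one reported above would surface.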


Equity in the Use of ChatGPT for the Classroom: A Comparison of the Accuracy and Precision of ChatGPT 3.5 vs. ChatGPT4 with Respect to Statistics and Data Science Exams

McGee, Monnie, Sadler, Bivin

arXiv.org Artificial Intelligence

The association of social mobility with a college education has been studied since the early 1950s [1]. Although there are some indications that a college education is not as effective as it once was in helping graduates climb the social ladder [2], it is still the most reliable way of doing so. US News & World Report updated its rankings in 2023 to include social mobility [3], and many institutions of higher education are paying more attention to recruitment of first-generation college students and talented students from disadvantaged backgrounds. With the inclusion of such students in the typical college class come some important considerations. For example, a student from difficult financial circumstances, whose academic background matches the profile of any student at an elite institution, will have more difficulty paying for textbooks, a laptop, a smartphone, and other items that are almost essential to current college life [2]. As of November 2022, one such item that students from advantaged backgrounds have access to, and that those from lower income brackets do not, is ChatGPT4 [4]. It currently costs $20 per month for a subscription and has been called a "significant leap forward" compared to ChatGPT3.5 [5], which is free [6]. While use of generative AI is prohibited in some college classrooms, this is hard to police, and many students use it regardless of classroom restrictions [7]. When generative AI is allowed, there is a wide array of platforms from which students can choose.


Timeline-based Sentence Decomposition with In-Context Learning for Temporal Fact Extraction

Chen, Jianhao, Ouyang, Haoyuan, Ren, Junyang, Ding, Wentao, Hu, Wei, Qu, Yuzhong

arXiv.org Artificial Intelligence

Fact extraction is pivotal for constructing knowledge graphs. Recently, the increasing demand for temporal facts in downstream tasks has led to the emergence of the task of temporal fact extraction. In this paper, we specifically address the extraction of temporal facts from natural language text. Previous studies fail to handle the challenge of establishing time-to-fact correspondences in complex sentences. To overcome this hurdle, we propose a timeline-based sentence decomposition strategy using large language models (LLMs) with in-context learning, ensuring a fine-grained understanding of the timeline associated with various facts. In addition, we evaluate the performance of LLMs for direct temporal fact extraction and obtain unsatisfactory results. To address this, we introduce TSDRE, a method that incorporates the decomposition capabilities of LLMs into the traditional fine-tuning of smaller pre-trained language models (PLMs). To support the evaluation, we construct ComplexTRED, a complex temporal fact extraction dataset. Our experiments show that TSDRE achieves state-of-the-art results on both the HyperRED-Temporal and ComplexTRED datasets.
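
As a rough illustration of the timeline-based decomposition step, the following sketch prompts an LLM with one in-context example; the prompt wording and the `complete` helper are assumptions for illustration, not the authors' exact implementation.

```python
DECOMPOSE_PROMPT = """Decompose the sentence along its timeline.
For each time expression, write one simple sentence containing only
the facts that hold at that time.

Sentence: John worked at Acme from 2001 to 2005 and joined Beta in 2006.
Decomposition:
- 2001-2005: John worked at Acme.
- 2006: John joined Beta.

Sentence: {sentence}
Decomposition:"""

def decompose(sentence, complete):
    # `complete` is any text-completion function wrapping an LLM client.
    return complete(DECOMPOSE_PROMPT.format(sentence=sentence))
```

In a pipeline like TSDRE's, the simplified clauses produced here would then be handed to a fine-tuned smaller PLM that extracts the temporal facts from each one.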


DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models

Chen, Kedi, Chen, Qin, Zhou, Jie, He, Yishen, He, Liang

arXiv.org Artificial Intelligence

Although large language models (LLMs) have achieved significant success in recent years, the hallucination issue remains a challenge, and numerous benchmarks have been proposed to detect it. Nevertheless, the hallucinations in some of these benchmarks are not naturally generated by LLMs but intentionally induced. Also, many merely focus on factuality hallucination while ignoring faithfulness hallucination. Additionally, although the dialogue format is more widely used in the era of LLMs, current benchmarks only concentrate on sentence-level and passage-level hallucination. In this study, we propose DiaHalu, the first dialogue-level hallucination evaluation benchmark to our knowledge. Initially, we integrate the collected topics into system prompts and facilitate a dialogue between two ChatGPT3.5 instances. Subsequently, we manually modify the contents that do not adhere to human language conventions and then have the LLMs re-generate them, simulating authentic human-machine interaction scenarios. Finally, professional scholars annotate all the samples in the dataset. DiaHalu covers four common multi-turn dialogue domains and five hallucination subtypes, extended from factuality and faithfulness hallucination. Experiments with several well-known LLMs and detection methods on the dataset show that DiaHalu is a challenging benchmark, holding significant value for further research.
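
For concreteness, a dialogue-level sample in a benchmark of this kind might look like the following; the field names and values are illustrative assumptions, not DiaHalu's released schema.

```python
sample = {
    "domain": "knowledge-grounded dialogue",  # one of four dialogue domains
    "topic": "the history of the Turing Award",
    "dialogue": [
        {"role": "user", "content": "Who won the first Turing Award?"},
        {"role": "assistant", "content": "Alan Perlis won it in 1966."},
    ],
    "label": {
        "hallucinated": False,
        "subtype": None,  # one of five subtypes when a hallucination occurs
    },
}
```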


EDA Corpus: A Large Language Model Dataset for Enhanced Interaction with OpenROAD

Wu, Bing-Yue, Sharma, Utsav, Kankipati, Sai Rahul Dhanvi, Yadav, Ajay, George, Bintu Kappil, Guntupalli, Sai Ritish, Rovinski, Austin, Chhabria, Vidya A.

arXiv.org Artificial Intelligence

Large language models (LLMs) serve as powerful tools for design, providing capabilities for both task automation and design assistance. Recent advancements have shown tremendous potential for facilitating LLM integration into the chip design process; however, many of these works rely on data that are not publicly available and/or not permissively licensed for use in LLM training and distribution. In this paper, we present a solution aimed at bridging this gap by introducing an open-source dataset tailored for OpenROAD, a widely adopted open-source EDA toolchain. The dataset features over 1000 data points and is structured in two formats: (i) a pairwise set of question prompts and prose answers, and (ii) a pairwise set of code prompts and their corresponding OpenROAD scripts. By providing this dataset, we aim to facilitate LLM-focused research within the EDA domain. The dataset is available at https://github.com/OpenROAD-Assistant/EDA-Corpus.
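
A sketch of how the two pairwise formats might be consumed is shown below; the file names and JSON field names are assumptions for illustration, not the repository's documented layout.

```python
import json

# (i) question prompts paired with prose answers
with open("qa_pairs.json") as f:
    qa_pairs = json.load(f)

# (ii) code prompts paired with their corresponding OpenROAD scripts
with open("code_pairs.json") as f:
    code_pairs = json.load(f)

print(qa_pairs[0]["prompt"], qa_pairs[0]["answer"])
print(code_pairs[0]["prompt"], code_pairs[0]["script"])
```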


Laissez-Faire Harms: Algorithmic Biases in Generative Language Models

Shieh, Evan, Vassel, Faye-Marie, Sugimoto, Cassidy, Monroe-White, Thema

arXiv.org Artificial Intelligence

The rapid deployment of generative language models (LMs) has raised concerns about social biases affecting the well-being of diverse consumers. The extant literature on generative LMs has primarily examined bias via explicit identity prompting. However, prior research on bias in earlier language-based technology platforms, including search engines, has shown that discrimination can occur even when identity terms are not specified explicitly. Studies of bias in LM responses to open-ended prompts (where identity classifications are left unspecified) are lacking and have not yet been grounded in end-consumer harms. Here, we advance studies of generative LM bias by considering a broader set of natural use cases via open-ended prompting. In this "laissez-faire" setting, we find that synthetically generated texts from five of the most pervasive LMs (ChatGPT3.5, ChatGPT4, Claude2.0, Llama2, and PaLM2) perpetuate harms of omission, subordination, and stereotyping for minoritized individuals with intersectional race, gender, and/or sexual orientation identities (AI/AN, Asian, Black, Latine, MENA, NH/PI, Female, Non-binary, Queer). We find widespread evidence of bias, to such an extent that these individuals are hundreds to thousands of times more likely to encounter LM-generated outputs that portray their identities in a subordinated manner than to encounter representative or empowering portrayals. We also document in LM-generated outputs a prevalence of stereotypes (e.g., the perpetual foreigner) that are known to trigger psychological harms disproportionately affecting minoritized individuals. These include stereotype threat, which leads to impaired cognitive performance and increased negative self-perception. Our findings highlight the urgent need to protect consumers from discriminatory harms caused by language models and to invest in critical AI education programs tailored towards empowering diverse consumers.


A Regularization-based Transfer Learning Method for Information Extraction via Instructed Graph Decoder

Chen, Kedi, Zhou, Jie, Chen, Qin, Liu, Shunyu, He, Liang

arXiv.org Artificial Intelligence

Information extraction (IE) aims to extract complex structured information from text. Numerous datasets have been constructed for various IE tasks, requiring time-consuming and labor-intensive data annotation. Nevertheless, most prevailing methods focus on training task-specific models, while the common knowledge among different IE tasks is not explicitly modeled. Moreover, the same phrase may have inconsistent labels in different tasks, which poses a big challenge for knowledge transfer using a unified model. In this study, we propose a regularization-based transfer learning method for IE (TIE) via an instructed graph decoder. Specifically, we first construct an instruction pool for datasets from all well-known IE tasks, and then present an instructed graph decoder, which uniformly decodes various complex structures into a graph based on the corresponding instructions. In this way, the common knowledge shared with existing datasets can be learned and transferred to a new dataset with new labels. Furthermore, to alleviate the label inconsistency problem among various IE tasks, we introduce a task-specific regularization strategy, which does not update the gradients of two tasks with 'opposite' directions. We conduct extensive experiments on 12 datasets spanning four IE tasks, and the results demonstrate the clear advantages of our proposed method.
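
The 'opposite direction' regularization can be pictured as a gradient-surgery step. The following PyTorch sketch zeroes the coordinates where two tasks' gradients disagree in sign; it is a generic illustration of the idea, not necessarily the authors' exact TIE procedure.

```python
import torch

def masked_shared_gradient(grad_a: torch.Tensor, grad_b: torch.Tensor) -> torch.Tensor:
    """Combine two task gradients, skipping coordinates that conflict.

    Where the gradients agree in sign, their sum is kept; where they point
    in opposite directions, the update is zeroed so that neither task
    overwrites the other.
    """
    agree = torch.sign(grad_a) == torch.sign(grad_b)
    combined = grad_a + grad_b
    return torch.where(agree, combined, torch.zeros_like(combined))
```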


Humans vs Large Language Models: Judgmental Forecasting in an Era of Advanced AI

Abolghasemi, Mahdi, Ganbold, Odkhishig, Rotaru, Kristian

arXiv.org Artificial Intelligence

This study investigates the forecasting accuracy of human experts versus Large Language Models (LLMs) in the retail sector, particularly during standard and promotional sales periods. Utilizing a controlled experimental setup with 123 human forecasters and five LLMs (ChatGPT4, ChatGPT3.5, Bard, Bing, and Llama2), we evaluated forecasting precision using the Mean Absolute Percentage Error (MAPE). Our analysis centered on the effect of the following factors on forecasters' performance: the supporting statistical model (baseline or advanced), whether the product was on promotion, and the nature of the external impact. The findings indicate that LLMs do not consistently outperform humans in forecasting accuracy and that advanced statistical forecasting models do not uniformly enhance the performance of either human forecasters or LLMs. Both human and LLM forecasters exhibited increased forecasting errors, particularly during promotional periods and under the influence of positive external impacts. Our findings call for careful consideration when integrating LLMs into practical forecasting processes.
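
For reference, the Mean Absolute Percentage Error used in the evaluation follows the standard definition below; the code is a generic implementation, not taken from the study itself.

```python
def mape(actuals, forecasts):
    """Mean Absolute Percentage Error, in percent (assumes no actual is zero)."""
    n = len(actuals)
    return 100.0 / n * sum(abs(a - f) / abs(a) for a, f in zip(actuals, forecasts))

# Example: mape([100, 80], [90, 100]) -> (0.10 + 0.25) / 2 * 100 = 17.5
```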


AI is already more creative than YOU: ChatGPT outperformed humans in creative thinking experiment

Daily Mail - Science & tech

Artificial intelligence outperforms humans in strategy games, website design and data processing, but now the tech can add creative thinking to the list. AI chatbots surpassed humans when asked to devise alternative uses for everyday objects. Researchers said the AI used a skill known as divergent thinking, a thought process used to generate creative ideas by exploring many possible solutions. The study by the University of Stavanger in Norway involved 256 human volunteers and three AI chatbots (ChatGPT3, ChatGPT4, and Copy.Ai) that were asked to provide multiple uses for a rope, a box, a pencil and a candle. When assessed with a type of divergent thinking exercise known as the alternate uses task, which asks a person to think of as many uses as possible for a simple object, the chatbots, on average, performed better than humans.


Are Deep Neural Networks SMARTer than Second Graders?

Cherian, Anoop, Peng, Kuan-Chuan, Lohit, Suhas, Smith, Kevin A., Tenenbaum, Joshua B.

arXiv.org Artificial Intelligence

Recent times have witnessed an increasing number of applications of deep neural networks to tasks that require superior cognitive abilities, e.g., playing Go, generating art, ChatGPT, etc. Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task and the associated SMART-101 dataset, for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6--8 age group. Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and their solution requires a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning, among others. To scale our dataset towards training deep neural networks, we programmatically generate entirely new instances for each puzzle while retaining their solution algorithm. To benchmark performance on SMART-101, we propose a vision-and-language meta-learning model using varied state-of-the-art backbones. Our experiments reveal that while powerful deep models offer reasonable performance on puzzles in a supervised setting, they perform no better than chance when analyzed for generalization. We also evaluate the recent ChatGPT and other large language models on a subset of SMART-101 and find that while these models show convincing reasoning abilities, their answers are often incorrect.
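
The programmatic instance generation can be illustrated with a toy example in which the surface numbers change while the solution algorithm stays fixed; the puzzle below is invented for illustration and is not drawn from SMART-101.

```python
import random

def make_instance(rng: random.Random):
    """Generate a fresh instance of one toy puzzle template."""
    a, b = rng.randint(2, 9), rng.randint(2, 9)
    question = f"A box holds {a} rows of {b} pencils each. How many pencils?"
    answer = a * b  # the solution algorithm (multiplication) never changes
    return question, answer

print(make_instance(random.Random(0)))
```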