Goto

Collaborating Authors

 chatgpt response


EmoXpt: Analyzing Emotional Variances in Human Comments and LLM-Generated Responses

Pyreddy, Shireesh Reddy, Zaman, Tarannum Shaila

arXiv.org Artificial Intelligence

The widespread adoption of generative AI has generated diverse opinions, with individuals expressing both support and criticism of its applications. This study investigates the emotional dynamics surrounding generative AI by analyzing human tweets referencing terms such as ChatGPT, OpenAI, Copilot, and LLMs. To further understand the emotional intelligence of ChatGPT, we examine its responses to selected tweets, highlighting differences in sentiment between human comments and LLM-generated responses. We introduce EmoXpt, a sentiment analysis framework designed to assess both human perspectives on generative AI and the sentiment embedded in ChatGPT's responses. Unlike prior studies that focus exclusively on human sentiment, EmoXpt uniquely evaluates the emotional expression of ChatGPT. Experimental results demonstrate that LLM-generated responses are notably more efficient, cohesive, and consistently positive than human responses.


Fool Me, Fool Me: User Attitudes Toward LLM Falsehoods

Nirman, Diana Bar-Or, Weizman, Ariel, Azaria, Amos

arXiv.org Artificial Intelligence

While Large Language Models (LLMs) have become central tools in various fields, they often provide inaccurate or false information. This study examines user preferences regarding falsehood responses from LLMs. Specifically, we evaluate preferences for LLM responses where false statements are explicitly marked versus unmarked responses and preferences for confident falsehoods compared to LLM disclaimers acknowledging a lack of knowledge. Additionally, we investigate how requiring users to assess the truthfulness of statements influences these preferences. Surprisingly, 61\% of users prefer unmarked falsehood responses over marked ones, and 69\% prefer confident falsehoods over LLMs admitting lack of knowledge. In all our experiments, a total of 300 users participated, contributing valuable data to our analysis and conclusions. When users are required to evaluate the truthfulness of statements, preferences for unmarked and falsehood responses decrease slightly but remain high. These findings suggest that user preferences, which influence LLM training via feedback mechanisms, may inadvertently encourage the generation of falsehoods. Future research should address the ethical and practical implications of aligning LLM behavior with such preferences.


Are Generative Language Models Multicultural? A Study on Hausa Culture and Emotions using ChatGPT

Ahmad, Ibrahim Said, Dudy, Shiran, Ramachandranpillai, Resmi, Church, Kenneth

arXiv.org Artificial Intelligence

Large Language Models (LLMs), such as ChatGPT, are widely used to generate content for various purposes and audiences. However, these models may not reflect the cultural and emotional diversity of their users, especially for low-resource languages. In this paper, we investigate how ChatGPT represents Hausa's culture and emotions. We compare responses generated by ChatGPT with those provided by native Hausa speakers on 37 culturally relevant questions. We conducted experiments using emotion analysis and applied two similarity metrics to measure the alignment between human and ChatGPT responses. We also collected human participants ratings and feedback on ChatGPT responses. Our results show that ChatGPT has some level of similarity to human responses, but also exhibits some gaps and biases in its knowledge and awareness of the Hausa culture and emotions. We discuss the implications and limitations of our methodology and analysis and suggest ways to improve the performance and evaluation of LLMs for low-resource languages.


Effectiveness of ChatGPT in explaining complex medical reports to patients

Sun, Mengxuan, Reiter, Ehud, Kiltie, Anne E, Ramsay, George, Duncan, Lisa, Murchie, Peter, Adam, Rosalind

arXiv.org Artificial Intelligence

Electronic health records contain detailed information about the medical condition of patients, but they are difficult for patients to understand even if they have access to them. We explore whether ChatGPT (GPT 4) can help explain multidisciplinary team (MDT) reports to colorectal and prostate cancer patients. These reports are written in dense medical language and assume clinical knowledge, so they are a good test of the ability of ChatGPT to explain complex medical reports to patients. We asked clinicians and lay people (not patients) to review explanations and responses of ChatGPT. We also ran three focus groups (including cancer patients, caregivers, computer scientists, and clinicians) to discuss output of ChatGPT. Our studies highlighted issues with inaccurate information, inappropriate language, limited personalization, AI distrust, and challenges integrating large language models (LLMs) into clinical workflow. These issues will need to be resolved before LLMs can be used to explain complex personal medical information to patients.


The AI Companion in Education: Analyzing the Pedagogical Potential of ChatGPT in Computer Science and Engineering

He, Zhangying, Nguyen, Thomas, Miari, Tahereh, Aliasgari, Mehrdad, Rafatirad, Setareh, Sayadi, Hossein

arXiv.org Artificial Intelligence

Artificial Intelligence (AI), with ChatGPT as a prominent example, has recently taken center stage in various domains including higher education, particularly in Computer Science and Engineering (CSE). The AI revolution brings both convenience and controversy, offering substantial benefits while lacking formal guidance on their application. The primary objective of this work is to comprehensively analyze the pedagogical potential of ChatGPT in CSE education, understanding its strengths and limitations from the perspectives of educators and learners. We employ a systematic approach, creating a diverse range of educational practice problems within CSE field, focusing on various subjects such as data science, programming, AI, machine learning, networks, and more. According to our examinations, certain question types, like conceptual knowledge queries, typically do not pose significant challenges to ChatGPT, and thus, are excluded from our analysis. Alternatively, we focus our efforts on developing more in-depth and personalized questions and project-based tasks. These questions are presented to ChatGPT, followed by interactions to assess its effectiveness in delivering complete and meaningful responses. To this end, we propose a comprehensive five-factor reliability analysis framework to evaluate the responses. This assessment aims to identify when ChatGPT excels and when it faces challenges. Our study concludes with a correlation analysis, delving into the relationships among subjects, task types, and limiting factors. This analysis offers valuable insights to enhance ChatGPT's utility in CSE education, providing guidance to educators and students regarding its reliability and efficacy.


ChatGPT Incorrectness Detection in Software Reviews

Tanzil, Minaoar Hossain, Khan, Junaed Younus, Uddin, Gias

arXiv.org Artificial Intelligence

We conducted a survey of 135 software engineering (SE) practitioners to understand how they use Generative AI-based chatbots like ChatGPT for SE tasks. We find that they want to use ChatGPT for SE tasks like software library selection but often worry about the truthfulness of ChatGPT responses. We developed a suite of techniques and a tool called CID (ChatGPT Incorrectness Detector) to automatically test and detect the incorrectness in ChatGPT responses. CID is based on the iterative prompting to ChatGPT by asking it contextually similar but textually divergent questions (using an approach that utilizes metamorphic relationships in texts). The underlying principle in CID is that for a given question, a response that is different from other responses (across multiple incarnations of the question) is likely an incorrect response. In a benchmark study of library selection, we show that CID can detect incorrect responses from ChatGPT with an F1-score of 0.74 - 0.75.


Data-Efficient Alignment of Large Language Models with Human Feedback Through Natural Language

Jin, Di, Mehri, Shikib, Hazarika, Devamanyu, Padmakumar, Aishwarya, Lee, Sungjin, Liu, Yang, Namazifar, Mahdi

arXiv.org Artificial Intelligence

Learning from human feedback is a prominent technique to align the output of large language models (LLMs) with human expectations. Reinforcement learning from human feedback (RLHF) leverages human preference signals that are in the form of ranking of response pairs to perform this alignment. However, human preference on LLM outputs can come in much richer forms including natural language, which may provide detailed feedback on strengths and weaknesses of a given response. In this work we investigate data efficiency of modeling human feedback that is in natural language. Specifically, we fine-tune an open-source LLM, e.g., Falcon-40B-Instruct, on a relatively small amount (1000 records or even less) of human feedback in natural language in the form of critiques and revisions of responses. We show that this model is able to improve the quality of responses from even some of the strongest LLMs such as ChatGPT, BARD, and Vicuna, through critique and revision of those responses. For instance, through one iteration of revision of ChatGPT responses, the revised responses have 56.6% win rate over the original ones, and this win rate can be further improved to 65.9% after applying the revision for five iterations.


Validating ChatGPT Facts through RDF Knowledge Graphs and Sentence Similarity

Mountantonakis, Michalis, Tzitzikas, Yannis

arXiv.org Artificial Intelligence

Since ChatGPT offers detailed responses without justifications, and erroneous facts even for popular persons, events and places, in this paper we present a novel pipeline that retrieves the response of ChatGPT in RDF and tries to validate the ChatGPT facts using one or more RDF Knowledge Graphs (KGs). To this end we leverage DBpedia and LODsyndesis (an aggregated Knowledge Graph that contains 2 billion triples from 400 RDF KGs of many domains) and short sentence embeddings, and introduce an algorithm that returns the more relevant triple(s) accompanied by their provenance and a confidence score. This enables the validation of ChatGPT responses and their enrichment with justifications and provenance. To evaluate this service (such services in general), we create an evaluation benchmark that includes 2,000 ChatGPT facts; specifically 1,000 facts for famous Greek Persons, 500 facts for popular Greek Places, and 500 facts for Events related to Greece. The facts were manually labelled (approximately 73% of ChatGPT facts were correct and 27% of facts were erroneous). The results are promising; indicatively for the whole benchmark, we managed to verify the 85.3% of the correct facts of ChatGPT and to find the correct answer for the 58% of the erroneous ChatGPT facts.


HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

Li, Junyi, Cheng, Xiaoxue, Zhao, Wayne Xin, Nie, Jian-Yun, Wen, Ji-Rong

arXiv.org Artificial Intelligence

Large language models (LLMs), such as ChatGPT, are prone to generate hallucinations, i.e., content that conflicts with the source or cannot be verified by the factual knowledge. To understand what types of content and to which extent LLMs are apt to hallucinate, we introduce the Hallucination Evaluation benchmark for Large Language Models (HaluEval), a large collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognizing hallucination. To generate these samples, we propose a ChatGPT-based two-step framework, i.e., sampling-then-filtering. Besides, we also hire some human labelers to annotate the hallucinations in ChatGPT responses. The empirical results suggest that ChatGPT is likely to generate hallucinated content in specific topics by fabricating unverifiable information (i.e., about $19.5\%$ responses). Moreover, existing LLMs face great challenges in recognizing the hallucinations in texts. However, our experiments also prove that providing external knowledge or adding reasoning steps can help LLMs recognize hallucinations. Our benchmark can be accessed at https://github.com/RUCAIBox/HaluEval.


Understanding Place Identity with Generative AI

Jang, Kee Moon, Chen, Junda, Kang, Yuhao, Kim, Junghwan, Lee, Jinhyung, Duarte, Fábio

arXiv.org Artificial Intelligence

Researchers are constantly leveraging new forms of data with the goal of understanding how people perceive the built environment and build the collective place identity of cities. Latest advancements in generative artificial intelligence (AI) models have enabled the production of realistic representations learned from vast amounts of data. In this study, we aim to test the potential of generative AI as the source of textual and visual information in capturing the place identity of cities assessed by filtered descriptions and images. We asked questions on the place identity of a set of 31 global cities to two generative AI models, ChatGPT and DALL-E2. Since generative AI has raised ethical concerns regarding its trustworthiness, we performed cross-validation to examine whether the results show similar patterns to real urban settings. In particular, we compared the outputs with Wikipedia data for text and images searched from Google for image. Our results indicate that generative AI models have the potential to capture the collective image of cities that can make them distinguishable. This study is among the first attempts to explore the capabilities of generative AI in understanding human perceptions of the built environment. It contributes to urban design literature by discussing future research opportunities and potential limitations.