 chatgpt answer


SAFE: Improving LLM Systems using Sentence-Level In-generation Attribution

Batista, João Eduardo, Vatai, Emil, Wahib, Mohamed

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly applied in various science domains, yet their broader adoption remains constrained by a critical challenge: the lack of trustworthy, verifiable outputs. Current LLMs often generate answers without reliable source attribution, or worse, with incorrect attributions, posing a barrier to their use in scientific and high-stakes settings, where traceability and accountability are paramount. To be reliable, attribution systems require high accuracy for short-length attribution on retrieved data, i.e., attribution to a sentence within a document rather than the entire document. We propose SAFE, a Sentence-level Attribution FramEwork for Retrieval-Augmented Generation (RAG) systems that attributes generated sentences during generation. This allows users to verify sentences as they read them and correct the model when the attribution indicates the generated text is not grounded in the documents, increasing the safety of LLM systems. This framework consists of two steps: predicting the required number of references for a sentence, and attributing the sentence. Our approach achieved 95% accuracy in the first step, which translated to 2.1~6.0% improvements in the accuracy (normalized for maximum possible accuracy) of all attribution algorithms in our clean dataset, when compared to their top-1 accuracy. We also applied SAFE in real-world scenarios with documents containing hundreds to thousands of sentences. In these settings, SAFE reliably attributed sentences to their source documents, demonstrating that the method generalizes beyond controlled benchmarks. The SAFE framework and the training dataset are publicly available on GitHub.
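The two-step idea in this abstract (first decide how many references a sentence needs, then attribute it) can be illustrated with a rough sketch. This is not the SAFE implementation: the token-overlap (Jaccard) score stands in for whatever attribution algorithm is used, the reference count `k` is assumed to come from a separate predictor that is not shown, and the sentences are invented toy data.

```python
import re

def tokens(text):
    # Lowercased alphanumeric tokens; a deliberately crude tokenizer.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    # Token-overlap similarity, used here only as a stand-in attribution score.
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def attribute(generated, source_sentences, k):
    # Step 2: return indices of the k source sentences most similar to the
    # generated sentence. k is assumed to be the output of step 1
    # (a hypothetical "number of references" predictor).
    ranked = sorted(
        range(len(source_sentences)),
        key=lambda i: jaccard(generated, source_sentences[i]),
        reverse=True,
    )
    return ranked[:k]

# Invented toy data for illustration.
sources = [
    "The cache stores recent query results.",
    "Entries expire after ten minutes.",
    "The service runs on three nodes.",
]
generated = "Cache entries expire after ten minutes."

print(attribute(generated, sources, k=1))
```

With `k=0` the function returns an empty list, which corresponds to a sentence that the first step judges to need no reference.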


Employing Label Models on ChatGPT Answers Improves Legal Text Entailment Performance

Nguyen, Chau, Nguyen, Le-Minh

arXiv.org Artificial Intelligence

The objective of legal text entailment is to ascertain whether the assertions in a legal query logically follow from the information provided in one or multiple legal articles. ChatGPT, a large language model, is robust in many natural language processing tasks, including legal text entailment: when we set the temperature = 0 (the ChatGPT answers are deterministic) and prompt the model, it achieves 70.64% accuracy on the COLIEE 2022 dataset, which outperforms the previous SOTA of 67.89%. On the other hand, if the temperature is larger than zero, ChatGPT answers are not deterministic, leading to inconsistent answers and fluctuating results. We propose to leverage label models (a fundamental component of weak supervision techniques) to integrate the provisional answers by ChatGPT into consolidated labels. In this way, we treat ChatGPT provisional answers as noisy predictions which can be consolidated by label models. The experimental results demonstrate that this approach can attain an accuracy of 76.15%, marking a significant improvement of 8.26% over the prior state-of-the-art benchmark. Additionally, we perform an analysis of the instances where ChatGPT produces incorrect answers, then we classify the errors, offering insights that could guide potential enhancements for future research endeavors.
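As a rough illustration of consolidating noisy provisional answers into one label, the sketch below uses a plain majority vote. Majority vote is only the simplest possible stand-in: the abstract does not say which label models the authors used, and the function name, the "Y"/"N" labels, and the sample runs are all assumptions made for this example.

```python
from collections import Counter

def consolidate(answers):
    # Majority vote over provisional answers from repeated runs.
    # Returns None for an empty list or a tie, leaving the case undecided.
    if not answers:
        return None
    counts = Counter(answers).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

# Hypothetical answers from five runs at temperature > 0 on one query.
runs = ["Y", "N", "Y", "Y", "N"]
print(consolidate(runs))
```

A weak-supervision label model would typically go further than this, for example by weighting each source of answers by an estimated accuracy instead of counting every vote equally.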


Who Answers It Better? An In-Depth Analysis of ChatGPT and Stack Overflow Answers to Software Engineering Questions

Kabir, Samia, Udo-Imeh, David N., Kou, Bonan, Zhang, Tianyi

arXiv.org Artificial Intelligence

Over the last decade, Q&A platforms have played a crucial role in how programmers seek help online. The emergence of ChatGPT, however, is causing a shift in this pattern. Despite ChatGPT's popularity, there hasn't been a thorough investigation into the quality and usability of its responses to software engineering queries. To address this gap, we undertook a comprehensive analysis of ChatGPT's replies to 517 questions from Stack Overflow (SO). We assessed the correctness, consistency, comprehensiveness, and conciseness of these responses. Additionally, we conducted an extensive linguistic analysis and a user study to gain insights into the linguistic and human aspects of ChatGPT's answers. Our examination revealed that 52% of ChatGPT's answers contain inaccuracies and 77% are verbose. Nevertheless, users still prefer ChatGPT's responses 39.34% of the time due to their comprehensiveness and articulate language style. These findings underscore the need for meticulous error correction in ChatGPT while also raising awareness among users about the potential risks associated with seemingly accurate answers.


ChatGPT: Jack of all trades, master of none

Kocoń, Jan, Cichecki, Igor, Kaszyca, Oliwier, Kochanek, Mateusz, Szydło, Dominika, Baran, Joanna, Bielaniewicz, Julita, Gruza, Marcin, Janz, Arkadiusz, Kanclerz, Kamil, Kocoń, Anna, Koptyra, Bartłomiej, Mieleszczenko-Kowszewicz, Wiktoria, Miłkowski, Piotr, Oleksy, Marcin, Piasecki, Maciej, Radliński, Łukasz, Wojtasik, Konrad, Woźniak, Stanisław, Kazienko, Przemysław

arXiv.org Artificial Intelligence

OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT) and revolutionized the approach in artificial intelligence to human-model interaction. Several publications on ChatGPT evaluation test its effectiveness on well-known natural language processing (NLP) tasks. However, the existing studies are mostly non-automated and tested on a very limited scale. In this work, we examined ChatGPT's capabilities on 25 diverse analytical NLP tasks, most of them subjective even to humans, such as sentiment analysis, emotion recognition, offensiveness, and stance detection. In contrast, the other tasks require more objective reasoning like word sense disambiguation, linguistic acceptability, and question answering. We also evaluated the GPT-4 model on five selected subsets of NLP tasks. We automated the ChatGPT and GPT-4 prompting process and analyzed more than 49k responses. Our comparison of its results with available State-of-the-Art (SOTA) solutions showed that the average loss in quality of the ChatGPT model was about 25% for zero-shot and few-shot evaluation. For the GPT-4 model, the loss for semantic tasks is significantly lower than for ChatGPT. We showed that the more difficult the task (lower SOTA performance), the higher the ChatGPT loss. This applies especially to pragmatic NLP problems like emotion recognition. We also tested the ability to personalize ChatGPT responses for selected subjective tasks via Random Contextual Few-Shot Personalization, and we obtained significantly better user-based predictions. Additional qualitative analysis revealed a ChatGPT bias, most likely due to the rules imposed on human trainers by OpenAI. Our results provide the basis for a fundamental discussion of whether the high quality of recent predictive NLP models can indicate a tool's usefulness to society and how the learning and validation procedures for such systems should be established.


Why We're All Obsessed With the Mind-Blowing ChatGPT AI Chatbot - CNET

#artificialintelligence

Even if you aren't into artificial intelligence, pay attention, because this one is a big deal. The tool, from a power player in artificial intelligence called OpenAI, lets you type natural-language prompts. ChatGPT then offers conversational, if somewhat stilted, responses. The bot remembers the thread of your dialogue, using previous questions and answers to inform its next responses. It derives its answers from huge volumes of information on the internet. ChatGPT is a big deal. The tool seems pretty knowledgeable in areas where there's good training data for it to learn from.


Could ChatGPT herald the next stage for CX AI adoption? - UK News Group

#artificialintelligence

Over the last few weeks, you would have heard lots of noise about ChatGPT, the new model for conversational AI that was launched by OpenAI – the AI research and deployment company – at the end of November. What is particularly striking about ChatGPT is that it took just five days to reach one million signed-up users, and it's estimated that figure may already be over two million. In comparison, Instagram took three months to reach that number, Spotify five months, and Twitter two years. Why is it clearly capturing so much attention? And what's going on in the AI market, when just last month some commentators were questioning chatbot sector momentum – particularly with Amazon stripping costs and people out of its Alexa team?


Uses of ChatGPT: 30 incredible ways to use the AI-powered chatbot ChatGPT

#artificialintelligence

So, let's see what you can do with it, shall we? Rather than give you a description of our own, we thought, given the nature of this article, to let ChatGPT answer for itself. "I am a language model developed by OpenAI. I was trained on a diverse range of internet text, including websites, books, and more. This allows me to generate human-like text responses to a wide range of questions and prompts," ChatGPT explains. "My training data encompasses a wide range of topics, so I can converse on many subjects, including but not limited to science, history, mathematics, and current events. However, I am still just a machine, and while I can generate responses that are similar to what a human might say, I do not have thoughts, feelings, or consciousness," it adds.


How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection

Guo, Biyang, Zhang, Xin, Wang, Ziyuan, Jiang, Minqi, Nie, Jinran, Ding, Yuxuan, Yue, Jianwei, Wu, Yupeng

arXiv.org Artificial Intelligence

The introduction of ChatGPT has garnered widespread attention in both academic and industrial communities. ChatGPT is able to respond effectively to a wide range of human questions, providing fluent and comprehensive answers that significantly surpass previous public chatbots in terms of security and usefulness. On one hand, people are curious about how ChatGPT is able to achieve such strength and how far it is from human experts. On the other hand, people are starting to worry about the potential negative impacts that large language models (LLMs) like ChatGPT could have on society, such as fake news, plagiarism, and social security issues. In this work, we collected tens of thousands of comparison responses from both human experts and ChatGPT, with questions ranging from open-domain, financial, medical, legal, and psychological areas. We call the collected dataset the Human ChatGPT Comparison Corpus (HC3). Based on the HC3 dataset, we study the characteristics of ChatGPT's responses, the differences and gaps from human experts, and future directions for LLMs. We conducted comprehensive human evaluations and linguistic analyses of ChatGPT-generated content compared with that of humans, where many interesting results are revealed. After that, we conduct extensive experiments on how to effectively detect whether a certain text is generated by ChatGPT or humans. We build three different detection systems, explore several key factors that influence their effectiveness, and evaluate them in different scenarios. The dataset, code, and models are all publicly available at https://github.com/Hello-SimpleAI/chatgpt-comparison-detection.


ChatGPT: Smart, but Not Smart Enough - The New Stack

#artificialintelligence

Yes, AI can help with programming, but ChatGPT is not ready to be your programming buddy, especially regarding securing your code. Wouldn't it be great to have an AI pair programming friend to help you secure your code? But, while GitHub Copilot can be handy -- leaving aside whether it's ethical or legal -- AI's new darling chatbot, ChatGPT, isn't ready for programming prime-time. I'll give you that ChatGPT is going to make life much harder for high-school English teachers. Going forward, anyone who assigns a homework paper on To Kill a Mockingbird will be much more likely to get an AI-written document than any real student's thoughts about the literary masterpiece. But programming, especially secure programming, that's another story.


ChatGPT: What Is It & How Can You Use It?

#artificialintelligence

OpenAI introduced a long-form question-answering AI called ChatGPT that answers complex questions conversationally. It's a revolutionary technology because it's trained to learn what humans mean when they ask a question. Many users are awed at its ability to provide human-quality responses, inspiring the feeling that it may eventually have the power to disrupt how humans interact with computers and change how information is retrieved. ChatGPT is a large language model chatbot developed by OpenAI based on GPT-3.5. It has a remarkable ability to interact in conversational dialogue form and provide responses that can appear surprisingly human.