Goto

Collaborating Authors

 davinci


Acknowledgments

Neural Information Processing Systems

Our operating assumption is that the most popular sequences have been sufficiently present in the GPT -3 training data for the model to have a solid amount of information learned about them.



Harnessing the Power of LLMs: Automating Unit Test Generation for High-Performance Computing

Karanjai, Rabimba, Hussain, Aftab, Rabin, Md Rafiqul Islam, Xu, Lei, Shi, Weidong, Alipour, Mohammad Amin

arXiv.org Artificial Intelligence

Unit testing is crucial in software engineering for ensuring quality. However, it's not widely used in parallel and high-performance computing software, particularly scientific applications, due to their smaller, diverse user base and complex logic. These factors make unit testing challenging and expensive, as it requires specialized knowledge and existing automated tools are often ineffective. To address this, we propose an automated method for generating unit tests for such software, considering their unique features like complex logic and parallel processing. Recently, large language models (LLMs) have shown promise in coding and testing. We explored the capabilities of Davinci (text-davinci-002) and ChatGPT (gpt-3.5-turbo) in creating unit tests for C++ parallel programs. Our results show that LLMs can generate mostly correct and comprehensive unit tests, although they have some limitations, such as repetitive assertions and blank test cases.


A Novel Multi-Stage Prompting Approach for Language Agnostic MCQ Generation using GPT

Maity, Subhankar, Deroy, Aniket, Sarkar, Sudeshna

arXiv.org Artificial Intelligence

We introduce a multi-stage prompting approach (MSP) for the generation of multiple choice questions (MCQs), harnessing the capabilities of GPT models such as text-davinci-003 and GPT-4, renowned for their excellence across various NLP tasks. Our approach incorporates the innovative concept of chain-of-thought prompting, a progressive technique in which the GPT model is provided with a series of interconnected cues to guide the MCQ generation process. Automated evaluations consistently demonstrate the superiority of our proposed MSP method over the traditional single-stage prompting (SSP) baseline, resulting in the production of high-quality distractors. Furthermore, the one-shot MSP technique enhances automatic evaluation results, contributing to improved distractor generation in multiple languages, including English, German, Bengali, and Hindi. In human evaluations, questions generated using our approach exhibit superior levels of grammaticality, answerability, and difficulty, highlighting its efficacy in various languages.


Harnessing the Power of Prompt-based Techniques for Generating School-Level Questions using Large Language Models

Maity, Subhankar, Deroy, Aniket, Sarkar, Sudeshna

arXiv.org Artificial Intelligence

Designing high-quality educational questions is a challenging and time-consuming task. In this work, we propose a novel approach that utilizes prompt-based techniques to generate descriptive and reasoning-based questions. However, current question-answering (QA) datasets are inadequate for conducting our experiments on prompt-based question generation (QG) in an educational setting. Therefore, we curate a new QG dataset called EduProbe for school-level subjects, by leveraging the rich content of NCERT textbooks. We carefully annotate this dataset as quadruples of 1) Context: a segment upon which the question is formed; 2) Long Prompt: a long textual cue for the question (i.e., a longer sequence of words or phrases, covering the main theme of the context); 3) Short Prompt: a short textual cue for the question (i.e., a condensed representation of the key information or focus of the context); 4) Question: a deep question that aligns with the context and is coherent with the prompts. We investigate several prompt-based QG methods by fine-tuning pre-trained transformer-based large language models (LLMs), namely PEGASUS, T5, MBART, and BART. Moreover, we explore the performance of two general-purpose pre-trained LLMs such as Text-Davinci-003 and GPT-3.5-Turbo without any further training. By performing automatic evaluation, we show that T5 (with long prompt) outperforms all other models, but still falls short of the human baseline. Under human evaluation criteria, TextDavinci-003 usually shows better results than other models under various prompt settings. Even in the case of human evaluation criteria, QG models mostly fall short of the human baseline. Our code and dataset are available at: https://github.com/my625/PromptQG


A Question Answering Framework for Decontextualizing User-facing Snippets from Scientific Documents

Newman, Benjamin, Soldaini, Luca, Fok, Raymond, Cohan, Arman, Lo, Kyle

arXiv.org Artificial Intelligence

Many real-world applications (e.g., note taking, search) require extracting a sentence or paragraph from a document and showing that snippet to a human outside of the source document. Yet, users may find snippets difficult to understand as they lack context from the original document. In this work, we use language models to rewrite snippets from scientific documents to be read on their own. First, we define the requirements and challenges for this user-facing decontextualization task, such as clarifying where edits occur and handling references to other documents. Second, we propose a framework that decomposes the task into three stages: question generation, question answering, and rewriting. Using this framework, we collect gold decontextualizations from experienced scientific article readers. We then conduct a range of experiments across state-of-the-art commercial and open-source language models to identify how to best provide missing-but-relevant information to models for our task. Finally, we develop QaDecontext, a simple prompting strategy inspired by our framework that improves over end-to-end prompting. We conclude with analysis that finds, while rewriting is easy, question generation and answering remain challenging for today's models.


Ghostbuster: Detecting Text Ghostwritten by Large Language Models

Verma, Vivek, Fleisig, Eve, Tomlin, Nicholas, Klein, Dan

arXiv.org Artificial Intelligence

We introduce Ghostbuster, a state-of-the-art system for detecting AI-generated text. Our method works by passing documents through a series of weaker language models, running a structured search over possible combinations of their features, and then training a classifier on the selected features to predict whether documents are AI-generated. Crucially, Ghostbuster does not require access to token probabilities from the target model, making it useful for detecting text generated by black-box models or unknown model versions. In conjunction with our model, we release three new datasets of human-and AI-generated text as detection benchmarks in the domains of student essays, creative writing, and news articles. We compare Ghostbuster to a variety of existing detectors, including DetectGPT and GPTZero, as well as a new RoBERTa baseline. Ghostbuster achieves 99.0 F1 when evaluated across domains, which is 5.9 F1 higher than the best preexisting model. It also outperforms all previous approaches in generalization across writing domains (+7.5 F1), prompting strategies (+2.1 F1), and language models (+4.4 F1). We also analyze the robustness of our system to a variety of perturbations and paraphrasing attacks and evaluate its performance on documents written by non-native English speakers. Language models such as ChatGPT are capable of producing a wide range of fluent text that closely approximates human language use. However, the proliferation of these models has raised concerns about the authenticity and trustworthiness of text across a variety of domains. For example, fears that students are submitting assignments ghostwritten by language models have led many schools to adapt by restricting the use of ChatGPT in classrooms and homework assignments (Heaven, 2023). Meanwhile, because language models are prone to factual errors and hallucination, readers may desire to know if such tools have been used to ghostwrite news articles or other informative text when deciding whether or not to trust a source.


Server-side Rescoring of Spoken Entity-centric Knowledge Queries for Virtual Assistants

Zhang, Youyuan, Gondala, Sashank, Fraga-Silva, Thiago, Van Gysel, Christophe

arXiv.org Artificial Intelligence

On-device Virtual Assistants (VAs) powered by Automatic Speech Recognition (ASR) require effective knowledge integration for the challenging entity-rich query recognition. In this paper, we conduct an empirical study of modeling strategies for server-side rescoring of spoken information domain queries using various categories of Language Models (LMs) (N-gram word LMs, sub-word neural LMs). We investigate the combination of on-device and server-side signals, and demonstrate significant WER improvements of 23%-35% on various entity-centric query subpopulations by integrating various server-side LMs compared to performing ASR on-device only. We also perform a comparison between LMs trained on domain data and a GPT-3 variant offered by OpenAI as a baseline. Furthermore, we also show that model fusion of multiple server-side LMs trained from scratch most effectively combines complementary strengths of each model and integrates knowledge learned from domain-specific data to a VA ASR system.


Legal Question-Answering in the Indian Context: Efficacy, Challenges, and Potential of Modern AI Models

Nigam, Shubham Kumar, Mishra, Shubham Kumar, Mishra, Ayush Kumar, Shallum, Noel, Bhattacharya, Arnab

arXiv.org Artificial Intelligence

Legal QA platforms bear the promise to metamorphose the manner in which legal experts engage with jurisprudential documents. In this exposition, we embark on a comparative exploration of contemporary AI frameworks, gauging their adeptness in catering to the unique demands of the Indian legal milieu, with a keen emphasis on Indian Legal Question Answering (AILQA). Our discourse zeroes in on an array of retrieval and QA mechanisms, positioning the OpenAI GPT model as a reference point. The findings underscore the proficiency of prevailing AILQA paradigms in decoding natural language prompts and churning out precise responses. The ambit of this study is tethered to the Indian criminal legal landscape, distinguished by its intricate nature and associated logistical constraints. To ensure a holistic evaluation, we juxtapose empirical metrics with insights garnered from seasoned legal practitioners, thereby painting a comprehensive picture of AI's potential and challenges within the realm of Indian legal QA.


Cabbage Sweeter than Cake? Analysing the Potential of Large Language Models for Learning Conceptual Spaces

Chatterjee, Usashi, Gajbhiye, Amit, Schockaert, Steven

arXiv.org Artificial Intelligence

The theory of Conceptual Spaces is an influential cognitive-linguistic framework for representing the meaning of concepts. Conceptual spaces are constructed from a set of quality dimensions, which essentially correspond to primitive perceptual features (e.g. hue or size). These quality dimensions are usually learned from human judgements, which means that applications of conceptual spaces tend to be limited to narrow domains (e.g. modelling colour or taste). Encouraged by recent findings about the ability of Large Language Models (LLMs) to learn perceptually grounded representations, we explore the potential of such models for learning conceptual spaces. Our experiments show that LLMs can indeed be used for learning meaningful representations to some extent. However, we also find that fine-tuned models of the BERT family are able to match or even outperform the largest GPT-3 model, despite being 2 to 3 orders of magnitude smaller.