Large Language Model
Large Language Models Are State-of-the-Art Evaluators of Code Generation
Recent advancements in the field of natural language generation have facilitated the use of large language models to assess the quality of generated text. Although these models have shown promising results in tasks such as machine translation and summarization, their applicability in code generation tasks remains limited without human involvement. The complexity of programming concepts required for such tasks makes it difficult to develop evaluation metrics that align with human judgment. Token-matching-based metrics, such as BLEU, have demonstrated weak correlations with human practitioners in code generation tasks. Moreover, the utilization of human-written test suites to evaluate functional correctness can be challenging in domains with low resources. To overcome these obstacles, we propose a new evaluation framework based on GPT-3.5 (GPT-3.5-turbo). Our framework addresses the limitations of existing approaches by achieving superior correlations with functional correctness and human preferences, without the need for test oracles or references. We evaluate the efficacy of our framework on two different aspects (human preference and execution success) and four programming languages, comparing its performance with the state-of-the-art CodeBERTScore metric, which relies on a pre-trained model. Our results demonstrate that our framework surpasses CodeBERTScore, delivering high levels of accuracy and consistency across various programming languages and tasks.
Natural language generation (NLG) systems have seen significant progress with the development of large language models (LLMs). These models have shown great promise in generating high-quality and diverse texts that can be difficult to distinguish from human-written texts (Ouyang et al., 2022). However, evaluating the quality of NLG systems remains a challenging task, primarily due to the limitations of traditional evaluation metrics.
Token-matching-based metrics, such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), have been widely used to evaluate NLG systems but have demonstrated poor correlation with human judgment and a lack of ability to capture semantic meanings (Kocmi et al., 2021). Furthermore, these metrics require reference output, which can be challenging to obtain for new tasks and low-resource domains (Liu et al., 2023).
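To illustrate why token matching can miss semantic equivalence in code, the following is a minimal sketch of clipped n-gram precision, the building block of BLEU. It is not the full BLEU metric (no brevity penalty, no geometric mean over n-gram orders), and the function name, the whitespace tokenization, and the two example snippets are assumptions made for illustration only.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision of candidate against one reference.

    Tokens are split on whitespace (an illustrative simplification).
    """
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram counts at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total


# Two hypothetical snippets that both compute a mean, written differently.
reference = "return sum ( xs ) / len ( xs )"
candidate = "total = 0 ; for x in xs : total += x ; return total / len ( xs )"

if __name__ == "__main__":
    print(ngram_precision(candidate, reference, n=1))
    print(ngram_precision(candidate, reference, n=2))
```

Under these assumptions, the two snippets share only a fraction of their tokens even though they compute the same value, which is the kind of mismatch the passage attributes to token-matching metrics.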
Framing the News: From Human Perception to Large Language Model Inferences
del Barrio, David Alonso, Gatica-Perez, Daniel
Identifying the frames of news is important to understand the articles' vision, intention, message to be conveyed, and which aspects of the news are emphasized. Framing is a widely studied concept in journalism, and has emerged as a new topic in computing, with the potential to automate processes and facilitate the work of journalism professionals. In this paper, we study this issue with articles related to the Covid-19 anti-vaccine movement. First, to understand the perspectives used to treat this theme, we developed a protocol for human labeling of frames for 1786 headlines of No-Vax movement articles from European newspapers in 5 countries. Headlines are key units in the written press, and worthy of analysis, as many people read only headlines (or use them to decide whether to read further). Second, considering advances in Natural Language Processing (NLP) with large language models, we investigated two approaches for frame inference of news headlines: first with a GPT-3.5 fine-tuning approach, and second with GPT-3.5 prompt-engineering. Our work contributes to the study and analysis of how well these models can facilitate journalistic tasks like classification of frames, while examining whether the models are able to replicate human perception in the identification of these frames.
Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling
Nottingham, Kolby, Ammanabrolu, Prithviraj, Suhr, Alane, Choi, Yejin, Hajishirzi, Hannaneh, Singh, Sameer, Fox, Roy
Reinforcement learning (RL) agents typically learn tabula rasa, without prior knowledge of the world. However, if initialized with knowledge of high-level subgoals and transitions between subgoals, RL agents could utilize this Abstract World Model (AWM) for planning and exploration. We propose using few-shot large language models (LLMs) to hypothesize an AWM that will be verified through world experience, to improve the sample efficiency of RL agents. Our DECKARD agent applies LLM-guided exploration to item crafting in Minecraft in two phases: (1) the Dream phase, where the agent uses an LLM to decompose a task into a sequence of subgoals, the hypothesized AWM; and (2) the Wake phase, where the agent learns a modular policy for each subgoal and verifies or corrects the hypothesized AWM. Our method of hypothesizing an AWM with LLMs and then verifying the AWM based on agent experience not only increases sample efficiency over contemporary methods by an order of magnitude but is also robust to and corrects errors in the LLM, successfully blending noisy internet-scale information from LLMs with knowledge grounded in environment dynamics.
Large Language Models are Versatile Decomposers: Decompose Evidence and Questions for Table-based Reasoning
Ye, Yunhu, Hui, Binyuan, Yang, Min, Li, Binhua, Huang, Fei, Li, Yongbin
Table-based reasoning has shown remarkable progress in combining deep models with discrete reasoning, which requires reasoning over both free-form natural language (NL) questions and structured tabular data. However, previous table-based reasoning solutions usually suffer from significant performance degradation on huge evidence (tables). In addition, most existing methods struggle to reason over complex questions since the required information is scattered in different places. To alleviate the above challenges, we exploit large language models (LLMs) as decomposers for effective table-based reasoning, which (i) decompose huge evidence (a huge table) into sub-evidence (a small table) to mitigate the interference of useless information for table reasoning; and (ii) decompose complex questions into simpler sub-questions for text reasoning. Specifically, we first use the LLMs to break down the evidence (tables) involved in the current question, retaining the relevant evidence and excluding the remaining irrelevant evidence from the huge table. In addition, we propose a "parsing-execution-filling" strategy to alleviate the hallucination dilemma of the chain of thought by decoupling logic and numerical computation in each step. Extensive experiments show that our method can effectively leverage decomposed evidence and questions and outperforms the strong baselines on TabFact, WikiTableQuestion, and FetaQA datasets. Notably, our model outperforms human performance for the first time on the TabFact dataset.
ChatLog: Recording and Analyzing ChatGPT Across Time
Tu, Shangqing, Li, Chunyang, Yu, Jifan, Wang, Xiaozhi, Hou, Lei, Li, Juanzi
While there is abundant research on evaluating ChatGPT on natural language understanding and generation tasks, few studies have investigated how ChatGPT's behavior changes over time. In this paper, we collect a coarse-to-fine temporal dataset called ChatLog, consisting of two parts that update monthly and daily: ChatLog-Monthly is a dataset of 38,730 question-answer pairs collected every month, including questions from both reasoning and classification tasks. ChatLog-Daily, on the other hand, consists of ChatGPT's responses to 1000 identical questions for long-form generation every day. We conduct comprehensive automatic and human evaluation to provide evidence for the existence of evolving patterns in ChatGPT. We further analyze the unchanged characteristics of ChatGPT over time by extracting its knowledge and linguistic features. We find some stable features to improve the robustness of a RoBERTa-based detector on new versions of ChatGPT. We will continuously maintain our project at https://github.com/THU-KEG/ChatLog.
ZeroShotDataAug: Generating and Augmenting Training Data with ChatGPT
Ubani, Solomon, Polat, Suleyman Olcay, Nielsen, Rodney
Data augmentation is a technique to increase the size of the training data available to machine learning models without requiring additional human annotation of data. Increasing the size of training data, provided the additional data is somewhat diverse, is pertinent to enable model generalization especially in low resource tasks. The aim of this paper is to evaluate zero-shot prompting of ChatGPT for data augmentation in the low resource scenario. Wei and Zou [14] proposed Easy Data Augmentation (EDA) which is a technique based on word replacement that includes four types of operations: synonym replacement, random insertion, random deletion, and random swap. In synonym replacement, words with similar meanings are substituted for some of the original words in the text.
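The four EDA operations named above can be sketched as follows. This is a simplified illustration, not the reference implementation by Wei and Zou: the function names, the whitespace tokenization, and the toy `TOY_SYNONYMS` dictionary (standing in for a real thesaurus) are all assumptions.

```python
import random

# Hypothetical toy synonym table; a real setup would use a thesaurus resource.
TOY_SYNONYMS = {"quick": ["fast", "speedy"], "jumps": ["leaps", "hops"]}


def synonym_replacement(words, n, synonyms, rng):
    """Replace up to n words that have an entry in the synonym table."""
    out = list(words)
    positions = [i for i, w in enumerate(out) if w in synonyms]
    rng.shuffle(positions)
    for i in positions[:n]:
        out[i] = rng.choice(synonyms[out[i]])
    return out


def random_insertion(words, n, synonyms, rng):
    """Insert, n times, a synonym of a random word at a random position."""
    out = list(words)
    candidates = [w for w in words if w in synonyms]
    if not candidates:
        return out
    for _ in range(n):
        source = rng.choice(candidates)
        out.insert(rng.randint(0, len(out)), rng.choice(synonyms[source]))
    return out


def random_deletion(words, p, rng):
    """Drop each word with probability p, keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs give the same variants
    sentence = "the quick brown fox jumps over the lazy dog".split()
    print(" ".join(synonym_replacement(sentence, 1, TOY_SYNONYMS, rng)))
    print(" ".join(random_insertion(sentence, 1, TOY_SYNONYMS, rng)))
    print(" ".join(random_deletion(sentence, 0.2, rng)))
    print(" ".join(random_swap(sentence, 1, rng)))
```

Each function returns a new token list and leaves its input unchanged, so several augmented variants can be generated from one source sentence.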
Artificial intelligence: Frequently asked questions about AI
FOX Business correspondent Lydia Hu has the latest on jobs at risk as AI further develops on 'America's Newsroom.' The advancement of artificial intelligence is progressing at a breakneck pace. While the technology is changing rapidly, the basic principles behind AI aren't new. Artificial intelligence has been around for many years, and has been built upon by many different developers. Today, some of the most well known AIs include chatbots like ChatGPT and Google Bard, with many more to come.
Palantir shows off an AI that can go to war
Palantir already sells its domestic surveillance services to US Immigration and Customs Enforcement, so it should come as no surprise that the company founded by billionaire Peter Thiel is working to make inroads into the Pentagon as well. On Tuesday, the company released a video demo of its latest offering, the Palantir Artificial Intelligence Platform (AIP). While the system itself is simply designed to integrate large language models (LLMs) like OpenAI's GPT-4 or Google's BERT into privately-operated networks, the very first thing they did was apply it to the modern battlefield. In the video demo, a military operator tasked with monitoring the Eastern European theater discovers enemy forces massing near the border and responds by asking a ChatGPT-style digital assistant for help with deploying reconnaissance drones, ginning up tactical responses to the perceived aggression and even organizing the jamming of the enemy's communications. The AIP is shown helping estimate the enemy's composition and capabilities by launching a Reaper drone on a reconnaissance mission in response to the operator's request for better pictures, and suggesting appropriate responses given the discovery of an armored element.
OpenAI rolls out new ChatGPT features including ability to go incognito
Fox News correspondent Grady Trimble has the latest on fears the technology will spiral out of control on 'Special Report.' Artificial intelligence leader OpenAI has introduced the ability to turn off chat history in its popular chatbot ChatGPT. In a Tuesday blog post, the company said conversations that are started when chat history is disabled will not be used to train and improve its models and will not appear in the history sidebar. The controls are found in the ChatGPT settings and can be changed at any time. The mode rolled out to all users.
The Download: introducing The Education issue
Welcome to the Education Issue, our latest print magazine. It's becoming increasingly clear that we're in an entirely new place when it comes to the use of AI in education, and it is far from clear what that is going to mean. The world has changed, and there's no going back. Technologies like ChatGPT, OpenAI's massively mind-blowing generative AI software, will have all sorts of genuinely useful and transformative applications in the classroom. Yes, they will almost certainly also be used for cheating.