Large Language Model
PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training
Bonatti, Rogerio, Vemprala, Sai, Ma, Shuang, Frujeri, Felipe, Chen, Shuhang, Kapoor, Ashish
Robotics has long been a field riddled with complex systems architectures whose modules and connections, whether traditional or learning-based, require significant human expertise and prior knowledge. Inspired by large pre-trained language models, this work introduces a paradigm for pre-training a general purpose representation that can serve as a starting point for multiple tasks on a given robot. We present the Perception-Action Causal Transformer (PACT), a generative transformer-based architecture that aims to build representations directly from robot data in a self-supervised fashion. Through autoregressive prediction of states and actions over time, our model implicitly encodes dynamics and behaviors for a particular robot. Our experimental evaluation focuses on the domain of mobile agents, where we show that this robot-specific representation can function as a single starting point to achieve distinct tasks such as safe navigation, localization and mapping. We evaluate two form factors: a wheeled robot that uses a LiDAR sensor as perception input (MuSHR), and a simulated agent that uses first-person RGB images (Habitat). We show that finetuning small task-specific networks on top of the larger pretrained model results in significantly better performance compared to training a single model from scratch for all tasks simultaneously, and comparable performance to training a separate large model for each task independently. By sharing a common good-quality representation across tasks we can lower overall model capacity and speed up the real-time deployment of such systems.
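To make the idea of autoregressive prediction over states and actions concrete, the following minimal Python sketch interleaves a toy trajectory into one sequence and lists the (history, next element) pairs a causal model would be trained on. The interleaving order, the string placeholders, and the helper names `interleave` and `next_token_pairs` are illustrative assumptions; the abstract does not specify PACT's tokenization or training code.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, str]  # (kind, value), e.g. ("state", "s0"); hypothetical format


def interleave(states: Sequence[str], actions: Sequence[str]) -> List[Token]:
    """Merge per-timestep states and actions into one sequence: s0, a0, s1, a1, ..."""
    sequence: List[Token] = []
    for state, action in zip(states, actions):
        sequence.append(("state", state))
        sequence.append(("action", action))
    return sequence


def next_token_pairs(sequence: Sequence[Token]) -> List[Tuple[List[Token], Token]]:
    """Build (history, target) pairs: each element is predicted from all earlier ones."""
    return [(list(sequence[:i]), sequence[i]) for i in range(1, len(sequence))]


trajectory = interleave(["s0", "s1"], ["a0", "a1"])
pairs = next_token_pairs(trajectory)
for history, target in pairs:
    print(len(history), "->", target)
```

Each pair alternates between predicting an action from past states and actions and predicting the next state, which is one simple way to read "prediction of states and actions over time".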
In-context Learning and Induction Heads
Olsson, Catherine, Elhage, Nelson, Nanda, Neel, Joseph, Nicholas, DasSarma, Nova, Henighan, Tom, Mann, Ben, Askell, Amanda, Bai, Yuntao, Chen, Anna, Conerly, Tom, Drain, Dawn, Ganguli, Deep, Hatfield-Dodds, Zac, Hernandez, Danny, Johnston, Scott, Jones, Andy, Kernion, Jackson, Lovitt, Liane, Ndousse, Kamal, Amodei, Dario, Brown, Tom, Clark, Jack, Kaplan, Jared, McCandlish, Sam, Olah, Chris
"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss. We present six complementary lines of evidence, arguing that induction heads may be the mechanistic source of general in-context learning in transformer models of any size. For small attention-only models, we present strong, causal evidence; for larger models with MLPs, we present correlational evidence.
Multiple-Choice Question Generation: Towards an Automated Assessment Framework
Automated question generation is an important approach to enable personalisation of English comprehension assessment. Recently, transformer-based pretrained language models have demonstrated the ability to produce appropriate questions from a context paragraph. Typically, these systems are evaluated against a reference set of manually generated questions using n-gram based metrics, or manual qualitative assessment. Here, we focus on a fully automated multiple-choice question generation (MCQG) system where both the question and possible answers must be generated from the context paragraph. Applying n-gram based approaches is challenging for this form of system as the reference set is unlikely to capture the full range of possible questions and answer options. Conversely, manual assessment scales poorly and is expensive for MCQG system development. In this work, we propose a set of performance criteria that assess different aspects of the generated multiple-choice questions of interest. These qualities include grammatical correctness, answerability, diversity and complexity. Initial systems for each of these metrics are described, and individually evaluated on standard multiple-choice reading comprehension corpora.
DeepMind's new chatbot uses Google searches plus humans to give better answers
The difference between this approach and its predecessors is that DeepMind hopes to use "dialogue in the long term for safety," says Geoffrey Irving, a safety researcher at DeepMind. "That means we don't expect that the problems that we face in these models--either misinformation or stereotypes or whatever--are obvious at first glance, and we want to talk through them in detail. And that means between machines and humans as well," he says. DeepMind's idea of using human preferences to optimize how an AI model learns is not new, says Sara Hooker, who leads Cohere for AI, a nonprofit AI research lab. "But the improvements are convincing and show clear benefits to human-guided optimization of dialogue agents in a large-language-model setting," says Hooker. Douwe Kiela, a researcher at AI startup Hugging Face, says Sparrow is "a nice next step that follows a general trend in AI, where we are more seriously trying to improve the safety aspects of large-language-model deployments."
ProgPrompt: Generating Situated Robot Task Plans using Large Language Models
Singh, Ishika, Blukis, Valts, Mousavian, Arsalan, Goyal, Ankit, Xu, Danfei, Tremblay, Jonathan, Fox, Dieter, Thomason, Jesse, Garg, Animesh
Everyday household tasks require both commonsense understanding of the world and situated knowledge about the current environment. To create a task plan for "Make dinner," an agent needs common sense: object affordances, such as that the stove and microwave can be used for heating; logical sequences of actions, such as an oven must be preheated before food is added; and task relevance of objects and actions, such as heating and food are actions related to "dinner" in the first place. However, this reasoning is infeasible without state feedback. The agent needs to know what food is available in the current environment, such as whether the freezer contains [...] words, which then need to be mapped to actions and objects available to the agent. For example, if the LLM produced "reach in and pick up the jar of pickles," that string would have to neatly map to an executable action like "pick up jar." A key component missing in LLM-based task planning is state feedback from the environment. The fridge in the house might not contain chicken, soda, or pickles, but a high-level instruction "Make dinner" doesn't give us that world state information. Our work introduces situated awareness in LLM-based robot task planning.
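The role of state feedback can be sketched in a few lines of Python: a planned step is only turned into an executable action if the current world state supports it. The world representation, the `plan_step` helper, and the action string format are all hypothetical and chosen for illustration; they are not taken from the ProgPrompt system.

```python
from typing import Dict, Set


def plan_step(action: str, obj: str, container: str,
              world: Dict[str, Set[str]]) -> str:
    """Emit an executable step only when the object is actually present."""
    if obj in world.get(container, set()):
        return f"{action}({obj})"
    return f"skip: {obj} not found in {container}"


# Toy world state: what the agent observes, not what the instruction implies.
world = {"fridge": {"salmon", "milk"}}

print(plan_step("grab", "pickles", "fridge", world))  # skip: pickles not found in fridge
print(plan_step("grab", "salmon", "fridge", world))   # grab(salmon)
```

The instruction "Make dinner" alone cannot distinguish these two cases; the check against `world` is what supplies the missing information.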
A Case Report On The "A.I. Locked-In Problem": social concerns with modern NLP
Modern NLP models are becoming better conversational agents than their predecessors. Recurrent Neural Networks (RNNs) and especially Long Short-Term Memory (LSTM) features allow the agent to better store and use information about semantic content, a trend that has become even more pronounced with Transformer models. Large Language Models (LLMs) such as GPT-3 by OpenAI have become known to be able to construct and follow a narrative, which enables the system to adopt personas on the go, adapt them and play along in conversational stories. However, practical experimentation with GPT-3 shows that there is a recurring problem with these modern NLP systems, namely that they can "get stuck" in the narrative so that further conversations, prompt executions or commands become futile. This is here referred to as the "Locked-In Problem" and is exemplified with an experimental case report, followed by the practical and social concerns that accompany this problem.
Scope of Pre-trained Language Models for Detecting Conflicting Health Information
Gatto, Joseph, Basak, Madhusudan, Preum, Sarah M.
An increasing number of people now rely on online platforms to meet their health information needs. Thus identifying inconsistent or conflicting textual health information has become a safety-critical task. Health advice data poses a unique challenge where information that is accurate in the context of one diagnosis can be conflicting in the context of another. For example, people suffering from diabetes and hypertension often receive conflicting health advice on diet. This motivates the need for technologies which can provide contextualized, user-specific health advice. A crucial step towards contextualized advice is the ability to compare health advice statements and detect if and how they are conflicting. This is the task of health conflict detection (HCD). Given two pieces of health advice, the goal of HCD is to detect and categorize the type of conflict. It is a challenging task, as (i) automatically identifying and categorizing conflicts requires a deeper understanding of the semantics of the text, and (ii) the amount of available data is quite limited. In this study, we are the first to explore HCD in the context of pre-trained language models. We find that DeBERTa-v3 performs best with a mean F1 score of 0.68 across all experiments. We additionally investigate the challenges posed by different conflict types and how synthetic data improves a model's understanding of conflict-specific semantics. Finally, we highlight the difficulty in collecting real health conflicts and propose a human-in-the-loop synthetic data augmentation approach to expand existing HCD datasets. Our HCD training dataset is over 2x bigger than the existing HCD dataset and is made publicly available on GitHub.
Bias at a Second Glance: A Deep Dive into Bias for German Educational Peer-Review Data Modeling
Wambsganss, Thiemo, Swamy, Vinitra, Rietsche, Roman, Käser, Tanja
Natural Language Processing (NLP) has become increasingly utilized to provide adaptivity in educational applications. However, recent research has highlighted a variety of biases in pre-trained language models. While existing studies investigate bias in different domains, they are limited in addressing fine-grained analysis on educational and multilingual corpora. In this work, we analyze bias across text and through multiple architectures on a corpus of 9,165 German peer-reviews collected from university students over five years. Notably, our corpus includes labels such as helpfulness, quality, and critical aspect ratings from the peer-review recipient as well as demographic attributes. We conduct a Word Embedding Association Test (WEAT) analysis on (1) our collected corpus in connection with the clustered labels, (2) the most common pre-trained German language models (T5, BERT, and GPT-2) and GloVe embeddings, and (3) the language models after fine-tuning on our collected dataset. In contrast to our initial expectations, we found that our collected corpus does not reveal many biases in the co-occurrence analysis or in the GloVe embeddings. However, the pre-trained German language models show substantial conceptual, racial, and gender bias and have significant changes in bias across conceptual and racial axes during fine-tuning on the peer-review data. With our research, we aim to contribute to the fourth UN sustainability goal (quality education) with a novel dataset, an understanding of biases in natural language education data, and the potential harms of not counteracting biases in language models for educational tasks.
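For readers unfamiliar with WEAT, the effect size it reports can be sketched in pure Python as follows. This follows the commonly used formulation (difference of mean associations of two target sets with two attribute sets, divided by the standard deviation over all target words); the toy 2-dimensional vectors are invented for illustration, the use of the sample standard deviation is an assumption, and nothing here reproduces the authors' actual pipeline or data.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w: Vector, A: List[Vector], B: List[Vector]) -> float:
    """s(w, A, B): mean similarity to attribute set A minus mean similarity to B."""
    return mean([cosine(w, a) for a in A]) - mean([cosine(w, b) for b in B])


def weat_effect_size(X: List[Vector], Y: List[Vector],
                     A: List[Vector], B: List[Vector]) -> float:
    """Effect size: (mean_x s(x) - mean_y s(y)) / std of s(w) over X and Y."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Invented toy embeddings: X leans toward A, Y leans toward B.
X = [[1.0, 0.0], [0.9, 0.1]]
Y = [[0.0, 1.0], [0.1, 0.9]]
A = [[1.0, 0.0]]
B = [[0.0, 1.0]]
effect = weat_effect_size(X, Y, A, B)
print(round(effect, 3))
```

A positive value indicates that the first target set is more strongly associated with the first attribute set than the second target set is; swapping the target sets flips the sign.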
Selecting Better Samples from Pre-trained LLMs: A Case Study on Question Generation
Yuan, Xingdi, Wang, Tong, Wang, Yen-Hsiang, Fine, Emery, Abdelghani, Rania, Lucas, Pauline, Sauzéon, Hélène, Oudeyer, Pierre-Yves
Large Language Models (LLMs) have in recent years demonstrated impressive prowess in natural language generation. A common practice to improve generation diversity is to sample multiple outputs from the model. However, there is no simple and robust way of selecting the best output from these stochastic samples. As a case study framed in the context of question generation, we propose two prompt-based approaches to selecting high-quality questions from a set of LLM-generated candidates. Our method works under the constraints of 1) a black-box (non-modifiable) question generation model and 2) lack of access to human-annotated references -- both of which are realistic limitations for real-world deployment of LLMs. With automatic as well as human evaluations, we empirically demonstrate that our approach can effectively select questions of higher quality than greedy generation.
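The sample-then-select pattern described here can be sketched generically: draw several candidates from a black-box generator, score each without references, and keep the highest-scoring one. In the sketch below, `toy_score` is a deliberately crude, hypothetical stand-in for a scorer; the paper's prompt-based scoring approaches are not reproduced, and the candidate strings are invented.

```python
from typing import Callable, List


def select_best(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest reference-free score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def toy_score(question: str) -> float:
    """Hypothetical scorer: reward a closing question mark and lexical variety."""
    distinct_words = set(question.lower().split())
    ends_as_question = 1.0 if question.strip().endswith("?") else 0.0
    return ends_as_question + 0.1 * len(distinct_words)


candidates = [
    "What is the capital",
    "Why did the treaty of 1648 end the war?",
    "Is it?",
]
print(select_best(candidates, toy_score))  # Why did the treaty of 1648 end the war?
```

Because `select_best` only needs a scoring callable, the generator can stay a black box, which matches the first constraint stated above.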
DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation
Hong, Seongmin, Moon, Seungjae, Kim, Junsoo, Lee, Sungjae, Kim, Minsub, Lee, Dongsoo, Kim, Joo-Young
The Transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which requires processing a large input context in the summarization stage, followed by the generation stage that produces a single word at a time. Conventional platforms such as GPUs are specialized for the parallel processing of large inputs in the summarization stage, but their performance significantly degrades in the generation stage due to its sequential characteristic. Therefore, an efficient hardware platform is required to address the high latency caused by the sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both summarization and generation stages. DFX uses model parallelism and optimized dataflow that is model-and-hardware-aware for fast simultaneous workload execution among devices. Its compute cores operate on custom instructions and provide GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all of the channels of the high bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. DFX achieves 5.58x speedup and 3.99x energy efficiency over four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text generation workloads in cloud datacenters.
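The sequential characteristic of the generation stage can be seen in a bare autoregressive loop. The sketch below is a toy illustration only: `step` is a hypothetical stand-in for a model forward pass, and nothing here reflects the DFX hardware or GPT-2 itself.

```python
from typing import Callable, List, Sequence


def generate(context: Sequence[int],
             step: Callable[[List[int]], int],
             n_new: int) -> List[int]:
    """Produce n_new tokens one at a time.

    The first call to `step` sees the whole input context at once; every
    later call needs the token produced by the previous call, so the
    iterations of this loop cannot run in parallel with each other.
    """
    tokens = list(context)
    for _ in range(n_new):
        tokens.append(step(tokens))
    return tokens[len(context):]


# Toy "model": the next token is simply the current sequence length.
new_tokens = generate([1, 2, 3], lambda toks: len(toks), 3)
print(new_tokens)  # [3, 4, 5]
```

The data dependency between consecutive iterations is what limits how much hardware built for large parallel inputs can help during generation.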