Large Language Model
AI watch: UK electoral warning and OpenAI's move into London
Artificial intelligence is either going to save humanity or finish it off, depending on who you speak to. Either way, every week there are new developments and breakthroughs. The US company behind the ChatGPT chatbot, OpenAI, has announced that its first international office will be in London. The move is a boost for the UK prime minister, Rishi Sunak, who has described the AI race as one of the "greatest opportunities" for the country's tech industry. OpenAI said it chose the UK capital because of its "rich culture and exceptional talent pool".
Could AI movies like 'The Matrix' and 'Her' become a reality? Experts weigh in
Veritone CEO Ryan Steelberg says the Writers Guild needs to make sure its writers are protected as AI becomes more popular. While watching a film, viewers might ponder its legitimacy, questioning if what they're seeing on screen can happen in real life. Artificial intelligence is no different. AI is being heavily developed and used now to edit and enhance films, but the depths of its use have also been explored in futuristic science-fiction movies like "The Matrix" or "I, Robot." With the rise of AI platforms, including ChatGPT, AI appears to be infiltrating every industry.
How far is Language Model from 100% Few-shot Named Entity Recognition in Medical Domain
Recent advancements in language models (LMs) have led to the emergence of powerful models such as Small LMs (e.g., T5) and Large LMs (e.g., GPT-4). These models have demonstrated exceptional capabilities across a wide range of tasks, such as named entity recognition (NER) in the general domain. (We define SLMs as pre-trained models with fewer parameters compared to models like GPT-3/3.5/4, such as T5, BERT, and others.) Nevertheless, their efficacy in the medical domain remains uncertain, and medical NER demands high accuracy because of the particularity of the field. This paper aims to provide a thorough investigation comparing the performance of LMs in medical few-shot NER, to answer the question "How far are LMs from 100% few-shot NER in the medical domain?", and moreover to explore an effective entity recognizer to help improve NER performance. Based on our extensive experiments conducted on 16 NER models spanning from 2018 to 2023, our findings clearly indicate that LLMs outperform SLMs in few-shot medical NER tasks, given the presence of suitable examples and appropriate logical frameworks. Despite the overall superiority of LLMs in few-shot medical NER tasks, it is important to note that they still encounter some challenges, such as misidentification, wrong template prediction, etc. Building on previous findings, we introduce a simple and effective method called RT (Retrieving and Thinking), which acts as a retriever, finding relevant examples, and as a thinker, employing a step-by-step reasoning process. Experimental results show that our proposed RT framework significantly outperforms the strong open baselines on two open medical benchmark datasets.
Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks
Levinstein, B. A., Herrmann, Daniel A.
We consider the questions of whether or not large language models (LLMs) have beliefs, and, if they do, how we might measure them. First, we evaluate two existing approaches, one due to Azaria and Mitchell (2023) and the other to Burns et al. (2022). We provide empirical results that show that these methods fail to generalize in very basic ways. We then argue that, even if LLMs have beliefs, these methods are unlikely to be successful for conceptual reasons. Thus, there is still no lie-detector for LLMs. After describing our empirical results we take a step back and consider whether or not we should expect LLMs to have something like beliefs in the first place. We consider some recent arguments aiming to show that LLMs cannot have beliefs. We show that these arguments are misguided. We provide a more productive framing of questions surrounding the status of beliefs in LLMs, and highlight the empirical nature of the problem. We conclude by suggesting some concrete paths for future work.
Large Language Models (GPT) for automating feedback on programming assignments
Pankiewicz, Maciej, Baker, Ryan S.
Generating personalized feedback for programming assignments is demanding due to several factors, such as the complexity of code syntax and the different ways to correctly solve a task. In this experimental study, we automated the process of feedback generation by employing OpenAI's GPT-3.5 model to generate personalized hints for students solving programming assignments on an automated assessment platform. Students rated the usefulness of GPT-generated hints positively. The experimental group (with GPT hints enabled) relied less on the platform's regular feedback but performed better in terms of percentage of successful submissions across consecutive attempts for tasks where GPT hints were enabled. For tasks where the GPT feedback was made unavailable, the experimental group needed significantly less time to solve assignments. Furthermore, when GPT hints were unavailable, students in the experimental condition were initially less likely to solve the assignment correctly. This suggests potential over-reliance on GPT-generated feedback. However, students in the experimental condition were able to correct reasonably rapidly, reaching the same percentage correct after seven submission attempts. The availability of GPT hints did not significantly impact students' affective state.
SMILE: Evaluation and Domain Adaptation for Social Media Language Understanding
Bashlovkina, Vasilisa, Matthews, Riley, Kuang, Zhaobin, Baumgartner, Simon, Bendersky, Michael
We study the ability of transformer-based language models (LMs) to understand social media language. Social media (SM) language is distinct from standard written language, yet existing benchmarks fall short of capturing LM performance in this socially, economically, and politically important domain. We quantify the degree to which social media language differs from conventional language and conclude that the difference is significant both in terms of token distribution and rate of linguistic shift. Next, we introduce a new benchmark for Social MedIa Language Evaluation (SMILE) that covers four SM platforms and eleven tasks. Finally, we show that learning a tokenizer and pretraining on a mix of social media and conventional language yields an LM that outperforms the best similar-sized alternative by 4.2 points on the overall SMILE score.
Meta-training with Demonstration Retrieval for Efficient Few-shot Learning
Mueller, Aaron, Narang, Kanika, Mathias, Lambert, Wang, Qifan, Firooz, Hamed
Large language models show impressive results on few-shot NLP tasks. However, these models are memory and computation-intensive. Meta-training allows one to leverage smaller models for few-shot generalization in a domain-general and task-agnostic manner; however, these methods alone result in models that may not have sufficient parameterization or knowledge to adapt quickly to a large variety of tasks. To overcome this issue, we propose meta-training with demonstration retrieval, where we use a dense passage retriever to retrieve semantically similar labeled demonstrations to each example for more varied supervision. By separating external knowledge from model parameters, we can use meta-training to train parameter-efficient models that generalize well on a larger variety of tasks. We construct a meta-training set from UnifiedQA and CrossFit, and propose a demonstration bank based on UnifiedQA tasks. To our knowledge, our work is the first to combine retrieval with meta-training, to use DPR models to retrieve demonstrations, and to leverage demonstrations from many tasks simultaneously, rather than randomly sampling demonstrations from the training set of the target task. Our approach outperforms a variety of targeted parameter-efficient and retrieval-augmented few-shot methods on QA, NLI, and text classification tasks (including SQuAD, QNLI, and TREC). Our approach can be meta-trained and fine-tuned quickly on a single GPU.
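The retrieval step described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the embeddings below are hand-written toy vectors standing in for the output of a trained dense encoder, and the names `retrieve_demonstrations` and `demonstration_bank` are hypothetical. The sketch only shows the general idea of ranking labeled demonstrations by inner-product similarity to the input and prepending the best matches.

```python
def inner_product(a, b):
    """Similarity score between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_demonstrations(query_embedding, demonstration_bank, k=2):
    """Return the texts of the k demonstrations most similar to the query."""
    ranked = sorted(
        demonstration_bank,
        key=lambda demo: inner_product(query_embedding, demo["embedding"]),
        reverse=True,
    )
    return [demo["text"] for demo in ranked[:k]]


# Hypothetical toy bank; a real system would embed text with a trained encoder.
demonstration_bank = [
    {"text": "Q: What is the capital of France? A: Paris", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Review: great film. Label: positive", "embedding": [0.0, 0.2, 0.9]},
    {"text": "Q: Who wrote Hamlet? A: Shakespeare", "embedding": [0.8, 0.3, 0.1]},
]

# Assumed embedding of the new input "Q: What is the capital of Italy? A:".
query_embedding = [1.0, 0.2, 0.0]

retrieved = retrieve_demonstrations(query_embedding, demonstration_bank, k=2)

# The retrieved demonstrations are placed before the new input.
prompt = "\n".join(retrieved + ["Q: What is the capital of Italy? A:"])
```

Because the demonstrations live in an external bank rather than in the model weights, the bank can be extended without retraining, which is the separation of knowledge from parameters that the abstract refers to.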
Queer People are People First: Deconstructing Sexual Identity Stereotypes in Large Language Models
Dhingra, Harnoor, Jayashanker, Preetiha, Moghe, Sayali, Strubell, Emma
Large Language Models (LLMs) are trained primarily on minimally processed web text, which exhibits the same wide range of social biases held by the humans who created that content. Consequently, text generated by LLMs can inadvertently perpetuate stereotypes towards marginalized groups, like the LGBTQIA+ community. In this paper, we perform a comparative study of how LLMs generate text describing people with different sexual identities. Analyzing bias in the text generated by an LLM using regard score shows measurable bias against queer people. We then show that a post-hoc method based on chain-of-thought prompting using SHAP analysis can increase the regard of the sentence, representing a promising approach towards debiasing the output of LLMs in this setting.
Transformers in Healthcare: A Survey
Nerella, Subhash, Bandyopadhyay, Sabyasachi, Zhang, Jiaqing, Contreras, Miguel, Siegel, Scott, Bumin, Aysegul, Silva, Brandon, Sena, Jessica, Shickel, Benjamin, Bihorac, Azra, Khezeli, Kia, Rashidi, Parisa
In contrast, transformers employ a "Scaled Dot-Product Attention" mechanism that is parallelizable. This unique attention mechanism allows for large-scale pretraining. Additionally, self-supervised pretraining paradigms such as masked language modeling on large unlabeled datasets enable transformers to be trained without costly annotations. Although originally designed for the NLP domain [3], Transformers have been adapted to various domains such as computer vision [5, 6], remote sensing [7], time series [8], speech processing [9] and multimodal learning [10]. Consequently, modality-specific surveys emerged, focusing on medical imaging [11-13] and biomedical language models [14] in the medical domain. This paper aims to provide a comprehensive overview of Transformer models utilized across multiple modalities of data to address healthcare objectives. We discuss pre-training strategies to manage the lack of robust and annotated healthcare datasets. The rest of the paper is organized as follows: Section 2 discusses the strategy to search for relevant citations; Section 3 describes the architecture of the original transformer; Section 4 describes the two primary Transformer variants: the Bidirectional Encoder Representations from Transformers (BERT) and the Vision Transformer (ViT). Section 5 describes advancements in large language models (LLMs), and Sections 6 through 12 provide a review of Transformers in healthcare.
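As a rough illustration of the scaled dot-product attention mentioned above, the following sketch computes softmax(QK^T / sqrt(d_k)) V in plain Python. It is a toy, single-head version without masking or learned projections, and the function names are chosen here for illustration. Each query row is processed independently of the others, which is why the operation parallelizes well.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return outputs


# Two identical keys receive equal weight, so the output is the mean of the values.
attended = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
```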
Stay on topic with Classifier-Free Guidance
Sanchez, Guillaume, Fan, Honglu, Spangher, Alexander, Levi, Elad, Ammanamanchi, Pawan Sasanka, Biderman, Stella
Classifier-Free Guidance (CFG) [37] has recently emerged in text-to-image generation as a lightweight technique to encourage prompt-adherence in generations. In this work, we demonstrate that CFG can be used broadly as an inference-time technique in pure language modeling. We show that CFG (1) improves the performance of Pythia, GPT-2 and LLaMA-family models across an array of tasks: Q&A, reasoning, code generation, and machine translation, achieving SOTA on LAMBADA with LLaMA-7B over PaLM-540B; (2) brings improvements equivalent to a model with twice the parameter-count; (3) can stack alongside other inference-time methods like Chain-of-Thought and Self-Consistency, yielding further improvements in difficult tasks; (4) can be used to increase the faithfulness and coherence of assistants in challenging form-driven and content-driven prompts: in a human evaluation we show a 75% preference for GPT4All using CFG over baseline.
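A minimal sketch of how CFG can be applied to next-token logits at inference time, assuming the commonly used formulation in which the guided logits are the unconditional logits plus a scaled difference between conditional and unconditional logits. The function name `cfg_logits` and the toy numbers are illustrative assumptions, not taken from the paper; in practice the two logit vectors would come from running the same model with and without the prompt.

```python
import math


def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    """Combine logits as: uncond + guidance_scale * (cond - uncond)."""
    return [
        u + guidance_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy two-token vocabulary (assumed values, for illustration only).
cond = [2.0, 0.0]    # logits given the prompt
uncond = [1.0, 1.0]  # logits without the prompt

guided = cfg_logits(cond, uncond, guidance_scale=1.5)
guided_probs = softmax(guided)
```

With a guidance scale of 1 the result equals the ordinary conditional logits, and with 0 it equals the unconditional ones; values above 1 push the distribution further toward tokens that the prompt makes more likely.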