Large Language Model
OpenAI, Microsoft, and GitHub hit with lawsuit over Copilot
Lawyer and developer Matthew Butterick announced last month that he'd teamed up with the Joseph Saveri Law Firm to investigate Copilot. They wanted to know if and how the software infringed upon the legal rights of coders by scraping and emitting their work without proper attribution under current open-source licenses. Now, the firm has filed a class-action lawsuit in the District Court of Northern California in San Francisco. "We are challenging the legality of GitHub Copilot," Butterick said. "This is the first step in what will be a long journey. As far as we know, this is the first class-action case in the US challenging the training and output of AI systems. It will not be the last. AI systems are not exempt from the law. Those who create and operate these systems must remain accountable," he continued in a statement.
Productizing Large Language Models
Large Language Models (LLMs) are known for their near-magical ability to learn from very few examples -- as few as zero -- to create language wonders. LLMs can chat, write poetry, write code, and even do basic arithmetic. However, the same properties that make LLMs magical also make them challenging from an engineering perspective. At Replit we have deployed transformer-based language models of all sizes: 100M-parameter models for search and spam, 1-10B-parameter models for a code autocomplete product we call GhostWriter, and 100B-parameter models for features that require a higher reasoning ability. In this post we'll talk about what we've learned about building and hosting large language models.
Extended Multilingual Protest News Detection -- Shared Task 1, CASE 2021 and 2022
Hürriyetoğlu, Ali, Mutlu, Osman, Duruşan, Fırat, Uca, Onur, Gürel, Alaeddin Selçuk, Radford, Benjamin, Dai, Yaoyao, Hettiarachchi, Hansi, Stoehr, Niklas, Nomoto, Tadashi, Slavcheva, Milena, Vargas, Francielle, Javid, Aaqib, Beyhan, Fatih, Yörük, Erdem
We report results of the CASE 2022 Shared Task 1 on Multilingual Protest Event Detection. This task is a continuation of CASE 2021 and consists of four subtasks: i) document classification, ii) sentence classification, iii) event sentence coreference identification, and iv) event extraction. The CASE 2022 extension consists of expanding the test data with more data in previously available languages, namely, English, Hindi, Portuguese, and Spanish, and adding new test data in Mandarin, Turkish, and Urdu for Subtask 1, document classification. The training data from CASE 2021 in English, Portuguese and Spanish were utilized. Therefore, predicting document labels in Hindi, Mandarin, Turkish, and Urdu occurs in a zero-shot setting. The CASE 2022 workshop accepts reports on systems developed for predicting test data of CASE 2021 as well. We observe that the best systems submitted by CASE 2022 participants achieve between 79.71 and 84.06 F1-macro for new languages in a zero-shot setting. The winning approaches are mainly ensembling models and merging data in multiple languages. The best two submissions on CASE 2021 data outperform submissions from last year for Subtask 1 and Subtask 2 in all languages. Only the following scenarios were not outperformed by new submissions on CASE 2021: Subtask 3 Portuguese and Subtask 4 English.
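For readers less familiar with the metric reported above, F1-macro is the unweighted mean of the per-class F1 scores. The following is a minimal sketch in plain Python; the function name and the toy labels are invented for illustration and are not taken from the shared task's evaluation scripts.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy document labels (1 = protest news, 0 = other); invented data.
gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 0, 1, 1]
print(round(macro_f1(gold, pred), 4))
```

Because every class contributes equally, a system cannot score well by favoring the majority class, which matters for imbalanced data.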
TEMPERA: Test-Time Prompting via Reinforcement Learning
Zhang, Tianjun, Wang, Xuezhi, Zhou, Denny, Schuurmans, Dale, Gonzalez, Joseph E.
Careful prompt design is critical to the use of large language models in zero-shot or few-shot learning. As a consequence, there is a growing interest in automated methods to design optimal prompts. In this work, we propose TEst-tiMe Prompt Editing using Reinforcement leArning (TEMPERA). In contrast to prior prompt generation methods, TEMPERA can efficiently leverage prior knowledge, is adaptive to different queries, and provides an interpretable prompt for every query. To achieve this, we design a novel action space that allows flexible editing of the initial prompts covering a comprehensive set of commonly-used components like instructions, few-shot exemplars, and verbalizers. The proposed method achieves significant gains compared with recent SoTA approaches like prompt tuning, AutoPrompt, and RLPrompt, across a variety of tasks, including sentiment analysis, topic classification, natural language inference, and reading comprehension. Our method achieves an average 5.33x improvement in sample efficiency when compared to the traditional fine-tuning methods. With the recent advances in pre-training large language models (Brown et al., 2020; Fedus et al., 2021; Raffel et al., 2020; Chowdhery et al., 2022), prompting, or in-context learning, provides a data-efficient framework for performing NLU (Li & Liang, 2021; Shin et al., 2020b; Gao et al., 2020b). Such methods achieve impressive zero-shot and few-shot performance in many downstream tasks. However, the prompt often has to be carefully tuned to achieve consistent performance for each task (Lu et al., 2021). For example, prompt tuning aims to optimize a continuous prefix embedding via gradient descent and directly takes generated output from the frozen pre-trained language model (Lester et al., 2021; Liu et al., 2021b;a). In contrast, discrete prompt optimization focuses on constructing meaningful instructions, in-context exemplars and verbalizers (Brown et al., 2020; Gao et al., 2020b).
Prior work often performs black-box optimization or applies RL-based methods for direct generation (Deng et al., 2022; Sun et al., 2022; Prasad et al., 2022).
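To make the idea of editing discrete prompt components concrete, here is a rough sketch of a prompt built from an instruction, few-shot exemplars, and verbalizers, with two edit operations. Everything in it is an assumption for illustration: the class, the function names, and the prompt template are hypothetical, and the edits are applied by hand, whereas TEMPERA selects edits per query with a learned reinforcement learning policy.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Prompt:
    """Hypothetical container for the three discrete prompt components."""
    instruction: str
    exemplars: tuple   # (text, label) pairs used as in-context examples
    verbalizers: dict  # label -> word the model is expected to emit

    def render(self, query):
        lines = [self.instruction]
        for text, label in self.exemplars:
            lines.append(f"Review: {text} Sentiment: {self.verbalizers[label]}")
        lines.append(f"Review: {query} Sentiment:")
        return "\n".join(lines)


def swap_exemplars(prompt, i, j):
    """Edit action: reorder two in-context exemplars."""
    exemplars = list(prompt.exemplars)
    exemplars[i], exemplars[j] = exemplars[j], exemplars[i]
    return replace(prompt, exemplars=tuple(exemplars))


def change_verbalizer(prompt, label, word):
    """Edit action: map a label to a different output word."""
    return replace(prompt, verbalizers={**prompt.verbalizers, label: word})


base_prompt = Prompt(
    instruction="Classify the sentiment.",
    exemplars=(("Great film.", "pos"), ("Dull plot.", "neg")),
    verbalizers={"pos": "positive", "neg": "negative"},
)
edited_prompt = change_verbalizer(swap_exemplars(base_prompt, 0, 1), "pos", "great")
print(edited_prompt.render("Loved it."))
```

Each edit returns a new prompt and leaves the original untouched, so a sequence of edits stays readable as plain text at every step.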
Meta Trained an AI on 48M Science Papers. It Was Shut Down After 2 Days
In the first year of the pandemic, science happened at light speed. More than 100,000 papers were published on COVID in those first 12 months -- an unprecedented human effort that produced an unprecedented deluge of new information. It would have been impossible to read and comprehend every one of those studies. No human being could (and, perhaps, none would want to). Galactica is an artificial intelligence developed by Meta AI (formerly known as Facebook Artificial Intelligence Research) with the intention of using machine learning to "organize science."
How to create a zero-shot learning text classifier using Hugging Face & Streamlit!
Today I'm excited to have the opportunity to contribute to the 30DaysofStreamlit challenge via this hands-on tutorial! We will create a zero-shot learning text classifier using Hugging Face's Inference API and Distilbart! With it you will have the mighty power to classify keyphrases on-the-fly, fast, and without any ML training! You can set these labels dynamically to anything. Zero-shot learning (ZSL) differs from traditional machine learning methods as it deals with the ability to recognise objects *without* any training samples. Instead, it transfers knowledge from previously seen categories and uses auxiliary information to handle unseen ones.
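As a rough sketch of what such a classifier sends and receives, the snippet below builds a zero-shot classification request body and picks the top label from a response. The checkpoint name and endpoint URL are assumptions (the summary above names Distilbart but no specific model), the response is hand-written rather than real model output, and no network call is made.

```python
import json

# Assumed checkpoint and endpoint; check the current Hugging Face docs before use.
MODEL_ID = "valhalla/distilbart-mnli-12-3"
API_URL = f"https://api-inference.huggingface.co/models/{MODEL_ID}"


def build_payload(text, labels, multi_label=False):
    """JSON body for a zero-shot classification request."""
    if not labels:
        raise ValueError("at least one candidate label is required")
    return {
        "inputs": text,
        "parameters": {"candidate_labels": list(labels), "multi_label": multi_label},
    }


def top_label(response):
    """Return the (label, score) pair with the highest score from a response
    shaped like {"sequence": ..., "labels": [...], "scores": [...]}."""
    return max(zip(response["labels"], response["scores"]), key=lambda pair: pair[1])


payload = build_payload("Cheap flights to Lisbon", ["travel", "finance", "sports"])
print(json.dumps(payload, indent=2))

# Hand-written example response, NOT real model output.
fake_response = {
    "sequence": "Cheap flights to Lisbon",
    "labels": ["travel", "finance", "sports"],
    "scores": [0.9, 0.07, 0.03],
}
print(top_label(fake_response))
```

In a Streamlit app, the text and labels would typically come from input widgets, and the payload would be posted to the endpoint with an API token in an Authorization header.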
Stanford debuts first AI benchmark to help understand LLMs
In the world of artificial intelligence (AI) and machine learning (ML), 2022 has arguably been the year of foundation models, or AI models trained on a massive scale. From GPT-3 to DALL-E, from BLOOM to Imagen -- another day, it seems, another large language model (LLM) or text-to-image model. But until now, there have been no AI benchmarks to provide a standardized way to evaluate these models, which have developed at a rapidly-accelerated pace over the past couple of years.
Modeling Fine-grained Information via Knowledge-aware Hierarchical Graph for Zero-shot Entity Retrieval
Wu, Taiqiang, Bai, Xingyu, Guo, Weigang, Liu, Weijie, Li, Siheng, Yang, Yujiu
Zero-shot entity retrieval, aiming to link mentions to candidate entities under the zero-shot setting, is vital for many tasks in Natural Language Processing. Most existing methods represent mentions/entities via the sentence embeddings of corresponding context from the Pre-trained Language Model. However, we argue that such coarse-grained sentence embeddings cannot fully model the mentions/entities, especially when the attention scores towards mentions/entities are relatively low. In this work, we propose GER, a Graph enhanced Entity Retrieval framework, to capture more fine-grained information as complementary to sentence embeddings. We extract the knowledge units from the corresponding context and then construct a mention/entity centralized graph. Hence, we can learn the fine-grained information about mention/entity by aggregating information from these knowledge units. To avoid the graph information bottleneck for the central mention/entity node, we construct a hierarchical graph and design a novel Hierarchical Graph Attention Network (HGAN). Experimental results on popular benchmarks demonstrate that our proposed GER framework performs better than previous state-of-the-art models. The code is available at https://github.com/wutaiqiang/GER-WSDM2023.
Fixing Model Bugs with Natural Language Patches
Murty, Shikhar, Manning, Christopher D., Lundberg, Scott, Ribeiro, Marco Tulio
Current approaches for fixing systematic problems in NLP models (e.g. regex patches, finetuning on more data) are either brittle, or labor-intensive and liable to shortcuts. In contrast, humans often provide corrections to each other through natural language. Taking inspiration from this, we explore natural language patches -- declarative statements that allow developers to provide corrective feedback at the right level of abstraction, either overriding the model ("if a review gives 2 stars, the sentiment is negative") or providing additional information the model may lack ("if something is described as the bomb, then it is good"). We model the task of determining if a patch applies separately from the task of integrating patch information, and show that with a small amount of synthetic data, we can teach models to effectively use real patches on real data -- 1 to 7 patches improve accuracy by ~1-4 points on different slices of a sentiment analysis dataset, and F1 by 7 points on a relation extraction dataset. Finally, we show that finetuning on as many as 100 labeled examples may be needed to match the performance of a small set of language patches.
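One way to picture the split between deciding whether a patch applies and integrating what it says is a gate that blends the patch's prediction with the base model's. The sketch below is a toy under loud assumptions: the keyword-matching gate, the function names, the patch format, and the probabilities are all invented for illustration, whereas the paper trains neural models for both steps.

```python
def gate(condition_keywords, text):
    """Toy applicability score: 1.0 if every condition keyword occurs in the text."""
    lowered = text.lower()
    return 1.0 if all(keyword in lowered for keyword in condition_keywords) else 0.0


def apply_patch(base_prob_positive, patch, text):
    """Blend the patch's prediction with the base prediction using the gate score."""
    g = gate(patch["condition"], text)
    return g * patch["prob_positive"] + (1.0 - g) * base_prob_positive


# Hypothetical encoding of "if a review gives 2 stars, the sentiment is negative".
two_star_patch = {"condition": ["2 stars"], "prob_positive": 0.0}

# The base probabilities below are made up; a real system would query a model.
patched = apply_patch(0.8, two_star_patch, "Gave it 2 stars, nice decor though.")
untouched = apply_patch(0.9, two_star_patch, "Loved it, 5 stars.")
print(patched, untouched)
```

When the gate is 0 the base prediction passes through unchanged, so a patch only affects the inputs its condition covers.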
The Stack: 3 TB of permissively licensed source code
Kocetkov, Denis, Li, Raymond, Allal, Loubna Ben, Li, Jia, Mou, Chenghao, Ferrandis, Carlos Muñoz, Jernite, Yacine, Mitchell, Margaret, Hughes, Sean, Wolf, Thomas, Bahdanau, Dzmitry, von Werra, Leandro, de Vries, Harm
Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI)--not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possible to match previously reported HumanEval and MBPP performance using only permissively licensed data. We make the dataset available at https://hf.co/BigCode, provide a tool called "Am I in The Stack" (https://hf.co/spaces/bigcode/in-the-stack) for developers to search The Stack for copies of their code, and provide a process for code to be removed from the dataset by following the instructions at https://www.bigcode-project.org/docs/about/the-stack/.
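The abstract does not say how near-duplicates are detected, so the following is only an illustrative sketch of the general idea: treat two files as near-duplicates when the Jaccard similarity of their token shingles exceeds a threshold. The tokenizer, shingle size, threshold, and function names are assumptions, and the exact pairwise comparison used here would not scale to a multi-terabyte corpus, where approximate methods such as MinHash are the usual choice.

```python
import re


def shingles(code, k=3):
    """Set of k-token windows over a crude word/punctuation tokenization."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_deduplicate(files, threshold=0.7, k=3):
    """Keep a file only if it is below the threshold against every kept file."""
    kept = []  # (name, shingle set) pairs in arrival order
    for name, code in files:
        signature = shingles(code, k)
        if all(jaccard(signature, other) < threshold for _, other in kept):
            kept.append((name, signature))
    return [name for name, _ in kept]


# Toy corpus; b.py differs from a.py only by a trailing comment.
files = [
    ("a.py", "def add(a, b):\n    return a + b\n"),
    ("b.py", "def add(a, b):\n    return a + b  # sum\n"),
    ("c.py", "print('hello world')"),
]
print(near_deduplicate(files))
```

The first copy encountered is kept, so the result depends on file order; a production pipeline would need a deliberate rule for which duplicate survives.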