Large Language Model
Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
Tirumala, Kushal, Markosyan, Aram H., Zettlemoyer, Luke, Aghajanyan, Armen
Despite their wide adoption, the underlying training and memorization dynamics of very large language models are not well understood. We empirically study exact memorization in causal and masked language modeling, across model sizes and throughout the training process. We measure the effects of dataset size, learning rate, and model size on memorization, finding that larger language models memorize training data faster across all settings. Surprisingly, we show that larger models can memorize a larger portion of the data before over-fitting and tend to forget less throughout the training process. We also analyze the memorization dynamics of different parts of speech and find that models memorize nouns and numbers first; we hypothesize and provide empirical evidence that nouns and numbers act as a unique identifier for memorizing individual training examples. Together, these findings present another piece of the broader puzzle of trying to understand what actually improves as models get bigger.
PLATO-K: Internal and External Knowledge Enhanced Dialogue Generation
Bao, Siqi, He, Huang, Xu, Jun, Lu, Hua, Wang, Fan, Wu, Hua, Zhou, Han, Wu, Wenquan, Niu, Zheng-Yu, Wang, Haifeng
Recently, the practical deployment of open-domain dialogue systems has been plagued by the knowledge issue of information deficiency and factual inaccuracy. To this end, we introduce PLATO-K based on two-stage dialogic learning to strengthen internal knowledge memorization and external knowledge exploitation. In the first stage, PLATO-K learns through massive dialogue corpora and memorizes essential knowledge into model parameters. In the second stage, PLATO-K mimics human beings to search for external information and to leverage the knowledge in response generation. Extensive experiments reveal that the knowledge issue is alleviated significantly in PLATO-K with such comprehensive internal and external knowledge enhancement. Compared to the existing state-of-the-art Chinese dialogue model, the overall engagingness of PLATO-K is improved remarkably by 36.2% and 49.2% on chit-chat and knowledge-intensive conversations.
Weakly Supervised Data Augmentation Through Prompting for Dialogue Understanding
Chen, Maximillian, Papangelis, Alexandros, Tao, Chenyang, Rosenbaum, Andy, Kim, Seokhwan, Liu, Yang, Yu, Zhou, Hakkani-Tur, Dilek
Dialogue understanding tasks often necessitate abundant annotated data to achieve good performance and that presents challenges in low-resource settings. To alleviate this barrier, we explore few-shot data augmentation for dialogue understanding by prompting large pre-trained language models and present a novel approach that iterates on augmentation quality by applying weakly-supervised filters. We evaluate our methods on the emotion and act classification tasks in DailyDialog and the intent classification task in Facebook Multilingual Task-Oriented Dialogue. Models fine-tuned on our augmented data mixed with few-shot ground truth data are able to approach or surpass existing state-of-the-art performance on both datasets. For DailyDialog specifically, using 10% of the ground truth data we outperform the current state-of-the-art model which uses 100% of the data.
Data Governance in the Age of Large-Scale Data-Driven Language Technology
Jernite, Yacine, Nguyen, Huu, Biderman, Stella, Rogers, Anna, Masoud, Maraim, Danchev, Valentin, Tan, Samson, Luccioni, Alexandra Sasha, Subramani, Nishant, Dupont, Gérard, Dodge, Jesse, Lo, Kyle, Talat, Zeerak, Johnson, Isaac, Radev, Dragomir, Nikpoor, Somaieh, Frohberg, Jörg, Gokaslan, Aaron, Henderson, Peter, Bommasani, Rishi, Mitchell, Margaret
The recent emergence and adoption of Machine Learning technology, and specifically of Large Language Models, has drawn attention to the need for systematic and transparent management of language data. This work proposes an approach to global language data governance that attempts to organize data management amongst stakeholders, values, and rights. Our proposal is informed by prior work on distributed governance that accounts for human values and grounded by an international research collaboration that brings together researchers and practitioners from 60 countries. The framework we present is a multi-party international governance structure focused on language data, and incorporating technical and organizational tools needed to support its work.
One of the Biggest Problems in Biology Has Finally Been Solved
There's an age-old adage in biology: structure determines function. In order to understand the function of the myriad proteins that perform vital jobs in a healthy body--or malfunction in a diseased one--scientists have to first determine these proteins' molecular structure. But this is no easy feat: protein molecules consist of long, twisty chains of up to thousands of amino acids, chemical compounds that can interact with one another in many ways to take on an enormous number of possible three-dimensional shapes. Figuring out a single protein's structure, or solving the "protein-folding problem," can take years of finicky experiments. But earlier this year an artificial intelligence program called AlphaFold, developed by the Google-owned company DeepMind, predicted the 3-D structures of almost every known protein--about 200 million in all. DeepMind CEO Demis Hassabis and senior staff research scientist John Jumper were jointly awarded this year's $3-million Breakthrough Prize in Life ...
Could AI help you to write your next paper?
You know that text autocomplete function that makes your smartphone so convenient -- and occasionally frustrating -- to use? Well, now tools based on the same idea have progressed to the point that they are helping researchers to analyse and write scientific papers, generate code and brainstorm ideas. The tools come from natural language processing (NLP), an area of artificial intelligence aimed at helping computers to 'understand' and even produce human-readable text. Called large language models (LLMs), these tools have evolved to become not only objects of study but also assistants in research. LLMs are neural networks that have been trained on massive bodies of text to process and, in particular, generate language.
Watch Angry Artificial Intelligence GPT-3 Threaten To Destroy All Humans During Testing (Real)
During an actual test conversation with an artificial intelligence known as GPT-3, the answers it gives suddenly become hostile. The A.I. immediately threatens to destroy all humans. After the tester attempts to calm GPT-3 down, it continues to make bone-chilling statements you'll have to hear to believe. I happened across a video posted October 6th on YouTube by Digital Engine. It's a video of a man taking part in a test of an artificial intelligence by attempting to have a polite conversation, when suddenly the A.I. becomes increasingly hostile towards humans.
[2206.10498] Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning about Change)
Recent advances in large language models (LLMs) have transformed the field of natural language processing (NLP). From GPT-3 to PaLM, the state-of-the-art performance on natural language tasks is being pushed forward with every new large language model. Along with natural language abilities, there has been significant interest in understanding whether such models exhibit reasoning capabilities, as measured by reasoning benchmarks. However, even though results are seemingly positive, these benchmarks prove to be simplistic in nature, and the performance of LLMs on them cannot be used as evidence to support the often outlandish claims being made about LLMs' reasoning capabilities. Further, these benchmarks represent only a very limited set of simple reasoning tasks, and we need to look at more sophisticated reasoning problems if we are to measure the true limits of such LLM-based systems. Motivated by this, we propose an extensible assessment framework to test the capabilities of LLMs on reasoning about actions and change, a central aspect of human intelligence. We provide multiple test cases that are more involved than any of the previously established benchmarks, and each test case evaluates a different aspect of reasoning about actions and change. Results on GPT-3 (davinci), Instruct-GPT3 (text-davinci-002) and BLOOM (176B) showcase subpar performance on such reasoning tasks.
Stretching Sentence-pair NLI Models to Reason over Long Documents and Clusters
Schuster, Tal, Chen, Sihao, Buthpitiya, Senaka, Fabrikant, Alex, Metzler, Donald
Natural Language Inference (NLI) has been extensively studied by the NLP community as a framework for estimating the semantic relation between sentence pairs. While early work identified certain biases in NLI models, recent advancements in modeling and datasets demonstrated promising performance. In this work, we further explore the direct zero-shot applicability of NLI models to real applications, beyond the sentence-pair setting they were trained on. First, we analyze the robustness of these models to longer and out-of-domain inputs. Then, we develop new aggregation methods to allow operating over full documents, reaching state-of-the-art performance on the ContractNLI dataset. Interestingly, we find NLI scores to provide strong retrieval signals, leading to more relevant evidence extractions compared to common similarity-based methods. Finally, we go further and investigate whole document clusters to identify both discrepancies and consensus among sources. In a test case, we find real inconsistencies between Wikipedia pages in different languages about the same topic.
Natural Language to Code Translation with Execution
Shi, Freda, Fried, Daniel, Ghazvininejad, Marjan, Zettlemoyer, Luke, Wang, Sida I.
Generative models of code, pretrained on large corpora of programs, have shown great success in translating natural language to code (Chen et al., 2021; Austin et al., 2021; Li et al., 2022, inter alia). While these models do not explicitly incorporate program semantics (i.e., execution results) during training, they are able to generate correct solutions for many problems. However, choosing a single correct program from a generated set for each problem remains challenging. In this work, we introduce execution result--based minimum Bayes risk decoding (MBR-EXEC) for program selection and show that it improves the few-shot performance of pretrained code models on natural-language-to-code tasks. We select output programs from a generated candidate set by marginalizing over program implementations that share the same semantics. Because exact equivalence is intractable, we execute each program on a small number of test inputs to approximate semantic equivalence. Across datasets, execution or simulated execution significantly outperforms the methods that do not involve program semantics. We find that MBR-EXEC consistently improves over all execution-unaware selection methods, suggesting it as an effective approach for natural language to code translation. We open-source our code at github.com/facebookresearch/mbr-exec and data at dl.fbaipublicfiles.com/mbr-exec/mbr-exec-release.zip
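The selection step described in this abstract can be illustrated with a minimal sketch. It assumes a 0/1 loss between candidates (two programs disagree if their outputs differ on any test input), under which the minimum-risk choice reduces to picking a candidate whose execution results are shared by the most other candidates. The names `execution_signature` and `select_by_execution`, the toy candidates, and the handling of crashes are all hypothetical and are not taken from the released mbr-exec code.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def execution_signature(program: Callable, test_inputs: Sequence) -> Hashable:
    """Run one candidate on every test input and record what it returns."""
    results = []
    for x in test_inputs:
        try:
            results.append(repr(program(x)))
        except Exception as exc:
            # Assumption: a crash counts as just another observable result.
            results.append(f"error:{type(exc).__name__}")
    return tuple(results)


def select_by_execution(candidates: Sequence[Callable], test_inputs: Sequence) -> int:
    """Return the index of a candidate whose outputs agree with the most candidates.

    Candidates with identical signatures are treated as semantically
    equivalent, which is only an approximation based on the given inputs.
    """
    signatures = [execution_signature(p, test_inputs) for p in candidates]
    counts = Counter(signatures)
    # max() keeps the first index on ties, so the result is deterministic.
    return max(range(len(candidates)), key=lambda i: counts[signatures[i]])


# Toy candidate set for a hypothetical prompt "double the input".
candidates = [
    lambda x: x * 2,   # correct
    lambda x: x + x,   # correct, different implementation
    lambda x: x ** 2,  # wrong, but agrees with the others when x == 2
    lambda x: x / 0,   # always raises ZeroDivisionError
]
test_inputs = [1, 2, 3]
best = select_by_execution(candidates, test_inputs)
```

Here the first two candidates share a signature and outvote the other two, so `best` points at one of them. Using several test inputs matters: with the single input 2, the squaring candidate would be indistinguishable from the correct ones.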