Large Language Model
Meta-learning for downstream aware and agnostic pretraining
Luo, Hongyin, Dong, Shuyan, Chuang, Yung-Sung, Li, Shang-Wen
Neural network pretraining is gaining attention due to its outstanding performance in natural language processing applications. However, pretraining usually leverages predefined task sequences to learn general linguistic clues. The lack of a mechanism for choosing suitable tasks during pretraining makes learning and knowledge encoding inefficient. We thus propose using meta-learning to select the tasks that provide the most informative learning signals in each episode of pretraining. With the proposed method, we aim to achieve better efficiency in computation and memory usage, for both the pretraining process and the resulting networks, while maintaining performance. In this preliminary work, we discuss the algorithm and its two variants, downstream-aware and downstream-agnostic pretraining. Our experiment plan is also summarized; empirical results will be shared in future work.
China's GPT-3? BAAI Introduces Superscale Intelligence Model 'Wu Dao 1.0'
Since the May 2020 release of OpenAI's GPT-3, AI researchers have embraced super-large-scale pretraining models. Packing an epoch-making 175 billion parameters, GPT-3 has achieved excellent performance across multiple natural language processing (NLP) tasks. Despite their size and power, however, such models still lack common sense or cognitive abilities, and so struggle with complex reasoning tasks such as open dialogue, knowledge-based Q&A, and visual reasoning. In a bid to promote the research and development of China's own large-scale pretraining models and further explore universal intelligence from a more fundamental perspective, the Beijing Academy of Artificial Intelligence (BAAI) recently unveiled Wu Dao 1.0, China's first homegrown super-scale intelligent model system. The work was led by BAAI Research Academic Vice President and Tsinghua University Professor Tang Jie, with contributions from a team of more than 100 AI scientists from Peking University, Tsinghua University, Renmin University of China, the Chinese Academy of Sciences and other institutes.
Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models
The advent of the transformer has sparked rapid growth in the size of language models, far outpacing hardware improvements. (Dense) transformers are expected to reach the trillion-parameter scale in the near future, for which training requires thousands or even tens of thousands of GPUs. We investigate the challenges of training at this scale and beyond on commercially available hardware. In particular, we analyse the shortest possible training time for different configurations of distributed training, leveraging empirical scaling laws for language models to estimate the optimal (critical) batch size. Contrary to popular belief, we find no evidence for a memory wall, and instead argue that the real limitation, other than the cost, lies in the training duration. In addition to this analysis, we introduce two new methods, layered gradient accumulation and modular pipeline parallelism, which together cut the shortest training time by half. The methods also reduce data movement, lowering the network requirement to a point where a fast InfiniBand connection is not necessary. This increased network efficiency also improves on the methods introduced with the ZeRO optimizer, reducing the memory usage to a tiny fraction of the available GPU memory.
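For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal sketch of plain gradient accumulation, not the layered variant the authors introduce (the abstract does not describe its details). It assumes a toy one-parameter least-squares loss; the function names and data are hypothetical. It shows that summing micro-batch gradients, each weighted by its share of the batch, reproduces the full-batch gradient.

```python
# Plain gradient accumulation on a 1-D least-squares problem (illustrative only).
# This is the standard technique, NOT the paper's layered gradient accumulation.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients, weighting each by its share of the batch."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        total += grad_mse(w, mx, my) * len(mx) / n
    return total


# Hypothetical toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]

full = grad_mse(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
```

Because only one micro-batch is processed at a time, peak activation memory scales with the micro-batch size rather than the full batch size, at the cost of more sequential steps.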
Researchers open-source benchmarks measuring quality of AI-generated code
The applications of computer programming are vast in scope. And as computers become ubiquitous, the demand for quality code draws an ever-growing number of aspiring programmers to the profession. After years of study to become proficient at coding, experts learn to convert abstract ideas into concrete, executable programs. But what if AI could do the same? In recent years, large-scale AI language models have shown promise in generalizing to tasks including writing code, implying that humans' work may one day be supplemented by AI systems.
Microsoft, GPT-3, and the future of OpenAI
One of the biggest highlights of Build, Microsoft's annual software development conference, was the presentation of a tool that uses deep learning to generate source code for office applications. The tool uses GPT-3, a massive language model developed by OpenAI last year and made available to select developers, researchers, and startups in a paid application programming interface. Many have touted GPT-3 as the next-generation artificial intelligence technology that will usher in a new breed of applications and startups. Since GPT-3's release, many developers have found interesting and innovative uses for the language model. And several startups have declared that they will be using GPT-3 to build new or augment existing products. But creating a profitable and sustainable business around GPT-3 remains a challenge.
The Case for Translation-Invariant Self-Attention in Transformer-Based Language Models
Wennberg, Ulme, Henter, Gustav Eje
Mechanisms for encoding positional information are central to transformer-based language models. In this paper, we analyze the position embeddings of existing language models, finding strong evidence of translation invariance, both for the embeddings themselves and for their effect on self-attention. The degree of translation invariance increases during training and correlates positively with model performance. Our findings lead us to propose translation-invariant self-attention (TISA), which accounts for the relative position between tokens in an interpretable fashion without needing conventional position embeddings. Our proposal has several theoretical advantages over existing position-representation approaches. Experiments show that it improves on regular ALBERT on GLUE tasks, while adding orders of magnitude fewer positional parameters.
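The idea of a positional term that depends only on the relative offset between tokens can be sketched as follows. This is an illustrative toy, not the authors' TISA parameterization: the scoring function `position_bias` and its `width` parameter are assumptions made for the example. The point it shows is that a bias of the form f(j - i) is unchanged when both positions shift by the same amount.

```python
import math


def position_bias(offset, width=2.0):
    """Hypothetical score that depends only on the relative offset j - i."""
    return -abs(offset) / width


def attention_weights(content_scores, bias_fn):
    """Row-wise softmax of content scores plus a relative-position bias."""
    n = len(content_scores)
    weights = []
    for i in range(n):
        logits = [content_scores[i][j] + bias_fn(j - i) for j in range(n)]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - top) for value in logits]
        norm = sum(exps)
        weights.append([e / norm for e in exps])
    return weights


n = 5
# Bias matrix: entry (i, j) depends only on j - i, so every diagonal is constant.
bias = [[position_bias(j - i) for j in range(n)] for i in range(n)]

# With all-zero content scores, attention is driven by the positional bias alone.
zero_scores = [[0.0] * n for _ in range(n)]
weights = attention_weights(zero_scores, position_bias)
```

A bias of this kind needs parameters only for the scoring function, not one embedding vector per absolute position.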
Thoughts on the Alignment Implications of Scaling Language Models
By now, most of you have probably heard about GPT-3 and what it does. There's been a bunch of different opinions on what it means for alignment, and this post is yet another opinion from a slightly different perspective. Some background: I'm a part of EleutherAI, a decentralized research collective (read: glorified discord server - come join us on Discord for ML, alignment, and dank memes). We're best known for our ongoing effort to create a GPT-3-like large language model, and so we have a lot of experience working with transformer models and looking at scaling laws, but we also take alignment very seriously and spend a lot of time thinking about it. I also want to lay out some potential topics for future research that might be fruitful. By the way, I did consider that the scaling laws implications might be an infohazard, but I think that ship sailed the moment the GPT-3 paper went live, and since we've already been in a race for parameters for some time (see: Megatron-LM, Turing-NLG, Switch Transformer, PanGu-α/盘古α, HyperCLOVA, Wudao/悟道 2.0, among others), I don't really think this post is causing any non-negligible amount of desire for scaling.
multiPRover: Generating Multiple Proofs for Improved Interpretability in Rule Reasoning
Saha, Swarnadeep, Yadav, Prateek, Bansal, Mohit
We focus on a type of linguistic formal reasoning where the goal is to reason over explicit knowledge in the form of natural language facts and rules (Clark et al., 2020). A recent work, named PRover (Saha et al., 2020), performs such reasoning by answering a question and also generating a proof graph that explains the answer. However, compositional reasoning is not always unique and there may be multiple ways of reaching the correct answer. Thus, in our work, we address a new and challenging problem of generating multiple proof graphs for reasoning over natural language rule-bases. Each proof provides a different rationale for the answer, thereby improving the interpretability of such reasoning systems. In order to jointly learn from all proof graphs and exploit the correlations between multiple proofs for a question, we pose this task as a set generation problem over structured output spaces where each proof is represented as a directed graph. We propose two variants of a proof-set generation model, multiPRover. Our first model, Multilabel-multiPRover, generates a set of proofs via multi-label classification and implicit conditioning between the proofs; while the second model, Iterative-multiPRover, generates proofs iteratively by explicitly conditioning on the previously generated proofs. Experiments on multiple synthetic, zero-shot, and human-paraphrased datasets reveal that both multiPRover models significantly outperform PRover on datasets containing multiple gold proofs. Iterative-multiPRover obtains state-of-the-art proof F1 in zero-shot scenarios where all examples have single correct proofs. It also generalizes better to questions requiring higher depths of reasoning where multiple proofs are more frequent. Our code and models are publicly available at https://github.com/swarnaHub/multiPRover
Joint Retrieval and Generation Training for Grounded Text Generation
Zhang, Yizhe, Sun, Siqi, Gao, Xiang, Fang, Yuwei, Brockett, Chris, Galley, Michel, Gao, Jianfeng, Dolan, Bill
Recent advances in large-scale pre-training such as GPT-3 allow seemingly high quality text to be generated from a given prompt. However, such generation systems often suffer from problems of hallucinated facts, and are not inherently designed to incorporate useful external information. Grounded generation models appear to offer remedies, but their training typically relies on rarely-available parallel data where corresponding information-relevant documents are provided for context. We propose a framework that alleviates this data constraint by jointly training a grounded generator and document retriever on the language model signal. The model learns to reward retrieval of the documents with the highest utility in generation, and attentively combines them using a Mixture-of-Experts (MoE) ensemble to generate follow-on text. We demonstrate that both generator and retriever can take advantage of this joint training and work synergistically to produce more informative and relevant text in both prose and dialogue generation.
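One common way to combine several retrieved documents is to weight each document-conditioned prediction by a retrieval probability. The sketch below assumes that form; the abstract does not spell out the exact combination rule, so the function names, scores, and probabilities here are hypothetical and for illustration only.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def mixture_next_token(retrieval_scores, per_doc_token_probs):
    """Assumed mixture: p(token) = sum_k p(doc_k) * p(token | doc_k)."""
    doc_weights = softmax(retrieval_scores)
    vocab_size = len(per_doc_token_probs[0])
    return [
        sum(w * probs[t] for w, probs in zip(doc_weights, per_doc_token_probs))
        for t in range(vocab_size)
    ]


# Hypothetical retriever scores for three documents.
retrieval_scores = [2.0, 0.5, -1.0]
# Hypothetical next-token distributions over a 3-word vocabulary, one per document.
per_doc_token_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]

mixed = mixture_next_token(retrieval_scores, per_doc_token_probs)
```

Under this assumed form, the language-model loss on the mixed distribution depends on the retrieval scores, which is one way a retriever could receive a training signal without parallel grounding data.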