Large Language Model
Flacuna: Unleashing the Problem Solving Power of Vicuna using FLAN Fine-Tuning
Ghosal, Deepanway, Chia, Yew Ken, Majumder, Navonil, Poria, Soujanya
Recently, the release of INSTRUCTEVAL has provided valuable insights into the performance of large language models (LLMs) that utilize encoder-decoder or decoder-only architecture. Interestingly, despite being introduced four years ago, T5-based LLMs, such as FLAN-T5, continue to outperform the latest decoder-based LLMs, such as LLAMA and VICUNA, on tasks that require general problem-solving skills. This performance discrepancy can be attributed to three key factors: (1) Pre-training data, (2) Backbone architecture, and (3) Instruction dataset. In this technical report, our main focus is on investigating the impact of the third factor by leveraging VICUNA, a large language model based on LLAMA, which has undergone fine-tuning on ChatGPT conversations. To achieve this objective, we fine-tuned VICUNA using a customized instruction dataset collection called FLANMINI. This collection includes a subset of the large-scale instruction dataset known as FLAN, as well as various code-related datasets and conversational datasets derived from ChatGPT/GPT-4. This dataset comprises a large number of tasks that demand problem-solving skills. Our experimental findings strongly indicate that the enhanced problem-solving abilities of our model, FLACUNA, are obtained through fine-tuning VICUNA on the FLAN dataset, leading to significant improvements across numerous benchmark datasets in INSTRUCTEVAL. FLACUNA is publicly available at https://huggingface.co/declare-lab/flacuna-13b-v1.0.
A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets
Laskar, Md Tahmid Rahman, Bari, M Saiful, Rahman, Mizanur, Bhuiyan, Md Amran Hossen, Joty, Shafiq, Huang, Jimmy Xiangji
The development of large language models (LLMs) such as ChatGPT has brought a lot of attention recently. However, their evaluation in the benchmark academic datasets remains under-explored due to the difficulty of evaluating the generative outputs produced by this model against the ground truth. In this paper, we aim to present a thorough evaluation of ChatGPT's performance on diverse academic datasets, covering tasks like question-answering, text summarization, code generation, commonsense reasoning, mathematical problem-solving, machine translation, bias detection, and ethical considerations. Specifically, we evaluate ChatGPT across 140 tasks and analyze 255K responses it generates in these datasets. This makes our work the largest evaluation of ChatGPT in NLP benchmarks. In short, our study aims to validate the strengths and weaknesses of ChatGPT in various tasks and provide insights for future research using LLMs. We also report a new emergent ability to follow multi-query instructions that we mostly found in ChatGPT and other instruction-tuned models. Our extensive evaluation shows that even though ChatGPT is capable of performing a wide variety of tasks, and may obtain impressive performance in several benchmark datasets, it is still far from achieving the ability to reliably solve many challenging tasks. By providing a thorough assessment of ChatGPT's performance across diverse NLP tasks, this paper sets the stage for a targeted deployment of ChatGPT-like LLMs in real-world applications.
Mixture-of-Experts Meets Instruction Tuning:A Winning Combination for Large Language Models
Shen, Sheng, Hou, Le, Zhou, Yanqi, Du, Nan, Longpre, Shayne, Wei, Jason, Chung, Hyung Won, Zoph, Barret, Fedus, William, Chen, Xinyun, Vu, Tu, Wu, Yuexin, Chen, Wuyang, Webson, Albert, Li, Yunxuan, Zhao, Vincent, Yu, Hongkun, Keutzer, Kurt, Darrell, Trevor, Zhou, Denny
Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to Large Language Models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models. In particular, we conduct empirical studies across three experimental setups: (i) Direct finetuning on individual downstream tasks devoid of instruction tuning; (ii) Instructiontuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) Instruction tuning supplemented by further finetuning on individual downstream tasks. In the first scenario, MoE models overall underperform dense models of identical computational capacity. This narrative, however, dramatically changes with the introduction of instruction tuning (second and third scenario), used independently or in conjunction with task-specific finetuning. Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks, while using only a third of the FLOPs. The advancements embodied byFLAN-MOE inspire a reevaluation of the design principles of large-scale, high-performance language models in the framework of task-agnostic learning.
From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models
Feng, Shangbin, Park, Chan Young, Liu, Yuhan, Tsvetkov, Yulia
Language models (LMs) are pretrained on diverse data sources, including news, discussion forums, books, and online encyclopedias. A significant portion of this data includes opinions and perspectives which, on one hand, celebrate democracy and diversity of ideas, and on the other hand are inherently socially biased. Our work develops new methods to (1) measure political biases in LMs trained on such corpora, along social and economic axes, and (2) measure the fairness of downstream NLP models trained on top of politically biased LMs. We focus on hate speech and misinformation detection, aiming to empirically quantify the effects of political (social, economic) biases in pretraining data on the fairness of high-stakes social-oriented tasks. Our findings reveal that pretrained LMs do have political leanings that reinforce the polarization present in pretraining corpora, propagating social biases into hate speech predictions and misinformation detectors. We discuss the implications of our findings for NLP research and propose future directions to mitigate unfairness.
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
Hsieh, Cheng-Yu, Li, Chun-Liang, Yeh, Chih-Kuan, Nakhost, Hootan, Fujii, Yasuhisa, Ratner, Alexander, Krishna, Ranjay, Lee, Chen-Yu, Pfister, Tomas
Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) achieves so by leveraging less training data needed by finetuning or distillation. Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with much fewer labeled/unlabeled training examples. Second, compared to few-shot prompted LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our finetuned 770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80% of available data on a benchmark, whereas standard finetuning the same T5 model struggles to match even by using 100% of the dataset. We release the code at: https://github.com/google-research/distilling-step-by-step .
Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models
Kang, Junmo, Xu, Wei, Ritter, Alan
Fine-tuning large models is highly effective, however, inference can be expensive and produces carbon emissions. Knowledge distillation has been shown to be a practical solution to reduce inference costs, but the distillation process itself requires significant computational resources. Rather than buying or renting GPUs to fine-tune, then distill a large model, an NLP practitioner might instead choose to allocate the available budget to hire annotators and manually label additional fine-tuning data. In this paper, we investigate how to most efficiently use a fixed budget to build a compact model. Through extensive experiments on six diverse tasks, we show that distilling from T5-XXL (11B) to T5-Small (60M) is almost always a cost-efficient strategy compared to annotating more data to directly train a compact model (T5-Small). We further investigate how the optimal budget allocated towards computation varies across scenarios. We will make our code, datasets, annotation cost estimates, and baseline models available as a benchmark to support further work on cost-efficient training of compact models.
Unstructured and structured data: Can we have the best of both worlds with large language models?
We are witnessing rapid advancements in the area of large language models (LLMs). A search on Google Scholar shows there are about 3,910 papers with "large language models" in the titles of the papers in 2022. As of April 12, 2023, there are already 1700 articles with LLMs in their titles. In addition to Google Scholar, we are also witnessing huge volumes of blog posts, news articles, twitter feeds, and open-source repositories around LLMs that have sprung up in recent months. Perhaps ChatGPT (released on November 30, 2022) is the epitome of this LLM revolution that truly unleashed and showcased, to the masses, the power of what has been brewing in the natural language and machine learning communities in recent years.
Google's updated privacy policy states it can use public data to train its AI models
Google has updated its privacy policy to state that it can use publicly available data to help train its AI models. The tech giant has changed the wording of its policy over the weekend and switched "AI models" for "language models." It also stated that it could use publicly available information to build not just features, but full products like "Google Translate, Bard, and Cloud AI capabilities." By updating its policy, it's letting people know and making it clear that anything they publicly post online could be used to train Bard, its future versions and any other generative AI product Google develops. The tech giant has highlighted the changes to its privacy policy on its archive, but here's a copy of the pertinent part: Critics have been raising concerns about companies' use of information posted online to train their large language models for generative AI use.
'We have to flip the AI debate towards hope': Labour's techno-optimist, Darren Jones
In the same way as you upgrade your iPhone, we need to upgrade Britain." Labour MP Darren Jones believes artificial intelligence will bring an economic change on the scale of the industrial revolution, which politicians must be ready to shape. As chair of the business and trade select committee, the ambitious 36-year-old backbencher, who represents Bristol North West, has built a reputation for himself in Westminster as a tough interrogator. With speculation raging last week about the future of Thames Water, he took to the airwaves to criticise the way the heavily indebted sector has been regulated, saying he was "increasingly sick" of its failures. However, Jones is at his most animated when talking about AI. He has clashed with company bosses over their use of technology to monitor and control staff – including at Amazon and Royal Mail. But he is an evangelist for the upsides of innovation, including the arrival of large language models (LLMs) such as the hit dialogue-based AI software ChatGPT. "It's really important that we flip this debate.
Out-of-distributional risk bounds for neural operators with applications to the Helmholtz equation
Benitez, J. Antonio Lara, Furuya, Takashi, Faucher, Florian, Kratsios, Anastasis, Tricoche, Xavier, de Hoop, Maarten V.
Despite their remarkable success in approximating a wide range of operators defined by PDEs, existing neural operators (NOs) do not necessarily perform well for all physics problems. We focus here on high-frequency waves to highlight possible shortcomings. To resolve these, we propose a subfamily of NOs enabling an enhanced empirical approximation of the nonlinear operator mapping wave speed to solution, or boundary values for the Helmholtz equation on a bounded domain. The latter operator is commonly referred to as the ''forward'' operator in the study of inverse problems. Our methodology draws inspiration from transformers and techniques such as stochastic depth. Our experiments reveal certain surprises in the generalization and the relevance of introducing stochastic depth. Our NOs show superior performance as compared with standard NOs, not only for testing within the training distribution but also for out-of-distribution scenarios. To delve into this observation, we offer an in-depth analysis of the Rademacher complexity associated with our modified models and prove an upper bound tied to their stochastic depth that existing NOs do not satisfy. Furthermore, we obtain a novel out-of-distribution risk bound tailored to Gaussian measures on Banach spaces, again relating stochastic depth with the bound. We conclude by proposing a hypernetwork version of the subfamily of NOs as a surrogate model for the mentioned forward operator.