


Neural Information Processing Systems

Table 3 shows the full results with unlikelihood training and length normalization for the PEFT methods we considered, and ablates the losses. We use "Question:" and "Answer:" as part of the prompt. Since T0 is unable to perform ICL on its own, we also compare to T5+LM, the next-step-prediction language model upon which T0 is based. Due to memory constraints and because of its improved performance, we use ensemble ICL. Table 10 shows the T-Few ablation results. Per-dataset results of T-Few and the other top-5 methods on RAFT are shown in Table 11.

[Table residue: per-task columns (COPA, H-Swag, StoryCloze, Winogrande, WSC, WiC) and per-method rows (e.g. full-model fine-tuning of the 3B-parameter model) omitted; see the paper's appendix tables.]
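The unlikelihood and length-normalization losses mentioned above can be sketched as follows. This is a minimal illustration, not the paper's released code: the function name and argument names are assumptions, and it takes per-token log-probabilities for each answer choice as already computed by the model.

```python
import numpy as np

def tfew_losses(logp_correct, logp_incorrect):
    """Sketch of the three T-Few training losses.

    logp_correct:   per-token log-probs of the correct answer choice, shape (T,)
    logp_incorrect: list of per-token log-prob arrays, one per incorrect choice
    Returns (lm, ul, ln):
      lm: standard negative log-likelihood of the correct choice
      ul: unlikelihood loss pushing down tokens of incorrect choices,
          normalized by their total token count
      ln: length-normalized cross-entropy over all choices (correct = index 0)
    """
    lm = -logp_correct.sum()

    total_wrong_tokens = sum(len(lp) for lp in logp_incorrect)
    ul = -sum(np.log1p(-np.exp(lp)).sum() for lp in logp_incorrect)
    ul = ul / total_wrong_tokens

    # Score each choice by its mean per-token log-prob, then apply
    # softmax cross-entropy with the correct choice as the target.
    scores = np.array([logp_correct.mean()] + [lp.mean() for lp in logp_incorrect])
    ln = -(scores[0] - np.log(np.exp(scores).sum()))
    return lm, ul, ln
```

At inference time the same length-normalized scores can be used for rank classification: pick the choice with the highest mean per-token log-probability.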


Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning

Liu, Haokun, Tam, Derek, Muqeeth, Mohammed, Mohta, Jay, Huang, Tenghao, Bansal, Mohit, Raffel, Colin

arXiv.org Artificial Intelligence

Few-shot in-context learning (ICL) enables pre-trained language models to perform a previously-unseen task without any gradient-based training by feeding a small number of training examples as part of the input. ICL incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made. Parameter-efficient fine-tuning (PEFT) (e.g. adapter modules, prompt tuning, sparse update methods, etc.) offers an alternative paradigm where a small set of parameters is trained to enable a model to perform the new task. In this paper, we rigorously compare few-shot ICL and PEFT and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs. Along the way, we introduce a new PEFT method called (IA)$^3$ that scales activations by learned vectors, attaining stronger performance while only introducing a relatively tiny number of new parameters. We also propose a simple recipe based on the T0 model called T-Few that can be applied to new tasks without task-specific tuning or modifications. We validate the effectiveness of T-Few on completely unseen tasks by applying it to the RAFT benchmark, attaining super-human performance for the first time and outperforming the state-of-the-art by 6% absolute. All of the code used in our experiments is publicly available.
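The "scales activations by learned vectors" idea behind (IA)$^3$ can be sketched compactly: learned vectors element-wise rescale the attention keys and values (and, in the full method, the intermediate feed-forward activations). Initializing the vectors to ones leaves the pre-trained model's behavior unchanged at the start of training. A minimal numpy sketch of the attention part, with illustrative names and shapes (not the paper's actual implementation):

```python
import numpy as np

def ia3_attention(q, k, v, l_k, l_v):
    """Single-head attention with (IA)^3-style rescaling.

    q, k, v:   arrays of shape (seq_len, d)
    l_k, l_v:  learned (IA)^3 vectors of shape (d,); only these are
               trained, while q/k/v projections stay frozen.
    """
    k = k * l_k  # element-wise rescaling of keys
    v = v * l_v  # element-wise rescaling of values
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the last axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `l_k` and `l_v` initialized to all ones, the output matches vanilla attention exactly, which is why the method can be dropped into a frozen model and fine-tuned with only `d` new parameters per rescaled activation.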