Rozonoyer, Benjamin
Quasi-random Multi-Sample Inference for Large Language Models
Parashar, Aditya, Singh, Aditya Vikram, Amballa, Avinash, Lai, Jinlin, Rozonoyer, Benjamin
Large language models (LLMs) are often equipped with multi-sample decoding strategies. An LLM implicitly defines an arithmetic code book, facilitating efficient and embarrassingly parallelizable \textbf{arithmetic sampling}, which produces multiple samples from quasi-random codes. Traditional text generation methods, such as beam search and sampling-based techniques, have notable limitations: they either lack parallelizability or yield sample sets with limited diversity. This study explores the potential of arithmetic sampling, contrasting it with ancestral sampling across two decoding tasks that employ multi-sample inference: chain-of-thought reasoning with self-consistency and machine translation with minimum Bayes risk decoding. Our results demonstrate that arithmetic sampling produces more diverse samples, significantly improving reasoning and translation performance as the sample size increases. We observe a $\mathbf{3\text{-}5}$ percentage-point increase in accuracy on the GSM8K dataset and a $\mathbf{0.45\text{-}0.89}$ percentage-point increase in COMET score on WMT19 tasks with arithmetic sampling, without any significant computational overhead.
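To make the decoding idea concrete, here is a minimal sketch of arithmetic sampling with quasi-random codes, not the paper's implementation: it assumes a hypothetical `next_token_probs(prefix)` callable standing in for the LLM's next-token distribution, and decodes the sequence whose arithmetic-code interval contains a given code point. The code points form an evenly spaced lattice with one shared random offset.

```python
import random

def arithmetic_sample(next_token_probs, codepoint, eos_id, max_len=64):
    """Decode the sequence whose arithmetic-code interval contains `codepoint`.

    `next_token_probs(prefix)` is a placeholder for the LLM's next-token
    distribution: given a list of token ids, it returns a probability
    vector over the vocabulary.
    """
    prefix, lo, hi = [], 0.0, 1.0
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        width, cum = hi - lo, lo
        for tok, p in enumerate(probs):
            nxt = cum + p * width
            # Pick the token whose subinterval contains the code point
            # (fall back to the last token to absorb rounding error).
            if codepoint < nxt or tok == len(probs) - 1:
                prefix.append(tok)
                lo, hi = cum, nxt
                break
            cum = nxt
        if prefix[-1] == eos_id:
            break
    return prefix

def quasi_random_codepoints(n, rng=random):
    """Evenly spaced lattice of n codes, shifted by one shared random offset."""
    offset = rng.random()
    return [(offset + i / n) % 1.0 for i in range(n)]

# Toy usage: a uniform 4-token "model" where token 3 acts as EOS.
toy_model = lambda prefix: [0.25, 0.25, 0.25, 0.25]
samples = [arithmetic_sample(toy_model, u, eos_id=3)
           for u in quasi_random_codepoints(8)]
```

Because each code point is decoded independently, the samples can be produced in parallel across workers while each one remains marginally distributed like an ancestral sample; spreading the code points over the unit interval is what encourages the diversity the abstract reports.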
Multistage Collaborative Knowledge Distillation from Large Language Models for Semi-Supervised Sequence Generation
Zhao, Jiachen, Zhao, Wenlong, Drozdov, Andrew, Rozonoyer, Benjamin, Sultan, Md Arafat, Lee, Jay-Yoon, Iyyer, Mohit, McCallum, Andrew
We study semi-supervised sequence generation tasks where labeled data are too scarce to effectively finetune a model and, at the same time, few-shot prompting of a large language model (LLM) yields suboptimal performance. This happens when a task, such as parsing, is expensive to annotate and also unfamiliar to a pretrained LLM. In this paper, we present the finding that student models distilled from an in-context-learned LLM can often generalize better than their teacher on such tasks. Leveraging this finding, we present a new method -- multistage collaborative knowledge distillation from an LLM (MCKD) -- for such tasks. MCKD first few-shot prompts an LLM to produce pseudolabels for unlabeled data. At each intermediate knowledge distillation (KD) stage, a new pair of students is trained on disjoint partitions of the pseudolabeled data. Each student then produces new and improved pseudolabels for its unseen partition, which are used in the next stage of distillation. We demonstrate the advantage of multistage cross-partition labeling on several syntactic and semantic parsing tasks. On CRAFT biomedical parsing, for example, 3-stage MCKD with 50 labeled examples outperforms the prompted LLM and vanilla KD by 7.5% and 3.7% parsing F1, respectively, and matches the performance of supervised finetuning with 500 examples.
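The multistage procedure can be summarized in a short sketch, offered only as an illustration of the cross-partition labeling loop rather than the authors' implementation; `llm_pseudolabel`, `train_student`, and the students' `predict` method are hypothetical placeholders for few-shot prompting the LLM and finetuning or running a student model.

```python
import random

def mckd(unlabeled_inputs, llm_pseudolabel, train_student, num_stages=3, rng=random):
    """Sketch of multistage cross-partition knowledge distillation (MCKD).

    Placeholders (not real APIs):
      llm_pseudolabel(x) -> y_hat    # few-shot prompt an LLM for a pseudolabel
      train_student(pairs) -> model  # finetune a student; model.predict(x) -> y_hat
    """
    # Stage 0: few-shot prompt the LLM to pseudolabel the unlabeled pool.
    data = [(x, llm_pseudolabel(x)) for x in unlabeled_inputs]

    for _ in range(num_stages - 1):
        # Split the current pseudolabeled data into two disjoint partitions.
        rng.shuffle(data)
        half = len(data) // 2
        part_a, part_b = data[:half], data[half:]

        # Train one fresh student per partition.
        student_a = train_student(part_a)
        student_b = train_student(part_b)

        # Cross-partition relabeling: each student pseudolabels the partition
        # it did NOT train on; these improved labels feed the next stage.
        data = ([(x, student_b.predict(x)) for x, _ in part_a]
                + [(x, student_a.predict(x)) for x, _ in part_b])

    # Final stage: distill a single student from the latest pseudolabels.
    return train_student(data)
```

Training each student on one partition and having it relabel the partition it never saw is the key design choice: a student cannot simply memorize the labels it is asked to regenerate, which is why pseudolabel quality can improve from stage to stage.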