
 Chu, Hong-Min


NEFTune: Noisy Embeddings Improve Instruction Finetuning

arXiv.org Artificial Intelligence

We show that language model finetuning can be improved, sometimes dramatically, with a simple augmentation. NEFTune adds noise to the embedding vectors during training. Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement. Even powerful models further refined with RLHF, such as LLaMA-2-Chat, benefit from additional training with NEFTune.

The ability of LLMs to follow detailed instructions is vital to their usefulness. Generative language models are typically trained on raw web data and then fine-tuned on a comparatively small but carefully curated set of instruction data. Instruction fine-tuning is crucial to taming the power of LLMs, and the usefulness of a model is largely determined by our ability to get the most out of small instruction datasets.

In this paper, we propose to add random noise to the embedding vectors of the training data during the forward pass of fine-tuning. We show that this simple trick can improve the outcome of instruction fine-tuning, often by a large margin, with no additional compute or data overhead. Noisy Embedding Instruction Fine Tuning (NEFTune), while simple, has a strong impact on downstream conversational quality. When a raw LLM like LLaMA-2-7B is finetuned with noisy embeddings, its performance on AlpacaEval improves from 29.8% to 64.7% (Figure 1), an impressive boost of around 35 percentage points (Touvron et al., 2023b; Dubois et al., 2023). NEFTune yields this surprisingly large jump in performance on conversational tasks while maintaining performance on factual question-answering baselines. This technique seems to be a free lunch for LLM fine-tuning. NEFTune leads to substantial performance boosts across all of these datasets, showcasing the increased conversational quality of the generated answers.

The earliest forms of instruction finetuning, such as FLAN and T0 (Sanh et al., 2021; Wei et al., 2021), focused on cross-task generalization in language models. Encoder-decoder language models were finetuned on a broad range of NLP tasks (about 100) and then evaluated on a set of different tasks. This was later scaled up to include thousands of tasks, seeing further improvement over the original FLAN (Chung et al., 2022; Xu et al., 2022). Although these works showed that LLMs could be easily adapted to solve simple and classical NLP tasks, real-world scenarios require LLMs to provide free-form answers to open-ended queries. InstructGPT (Ouyang et al., 2022) was the first model to tackle open-ended queries with impressive performance. OpenAI further trained GPT-3 (Brown et al., 2020) using reinforcement learning from human feedback (RLHF) to align the model.
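Concretely, the noising step described above amounts to a few lines at the embedding layer: uniform noise in [-1, 1] scaled by alpha / sqrt(L * d), where L is the sequence length and d the embedding dimension. The sketch below is a minimal illustration of that recipe; the function name, hook placement, and default alpha are our assumptions, not the authors' released code.

```python
import torch

def neftune_noise(embeds: torch.Tensor, alpha: float = 5.0,
                  training: bool = True) -> torch.Tensor:
    """Add NEFTune-style noise at the embedding layer: uniform noise in
    [-1, 1] scaled by alpha / sqrt(L * d). Applied only during training;
    inference is unchanged. (Illustrative sketch, not the official code.)"""
    if not training:
        return embeds
    _, seq_len, dim = embeds.shape          # (batch, L, d)
    noise = torch.empty_like(embeds).uniform_(-1, 1)
    return embeds + (alpha / (seq_len * dim) ** 0.5) * noise
```

In practice this would be applied to the output of the model's token-embedding layer on each forward pass during finetuning, leaving evaluation untouched.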


Universal Guidance for Diffusion Models

arXiv.org Artificial Intelligence

Typical diffusion models are trained to accept a particular form of conditioning, most commonly text, and cannot be conditioned on other modalities without retraining. In this work, we propose a universal guidance algorithm that enables diffusion models to be controlled by arbitrary guidance modalities without the need to retrain any use-case-specific components. We show that our algorithm successfully generates high-quality images with guidance functions including segmentation, face recognition, object detection, and classifier signals.
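To make the idea concrete, guidance of this kind is commonly implemented by applying the guidance function to a one-step denoised estimate of the image and nudging the noise prediction along the resulting gradient. The sketch below illustrates that pattern for an epsilon-prediction model; the names, signature, and the assumption that `alpha_bar_t` is a scalar tensor are ours, not the paper's code.

```python
import torch

def guided_eps(model, x_t, t, alpha_bar_t, guidance_fn, target, scale=1.0):
    """One guided denoising step for an epsilon-prediction diffusion model.
    `model`, `guidance_fn`, and `scale` are hypothetical stand-ins;
    `alpha_bar_t` is assumed to be a scalar tensor for timestep t."""
    x_t = x_t.detach().requires_grad_(True)
    eps = model(x_t, t)
    # One-step estimate of the clean image, so an off-the-shelf guidance
    # network (segmenter, face recognizer, detector, classifier) sees an
    # approximately noise-free input.
    x0_hat = (x_t - torch.sqrt(1 - alpha_bar_t) * eps) / torch.sqrt(alpha_bar_t)
    loss = guidance_fn(x0_hat, target)        # e.g. a segmentation or face-ID loss
    grad = torch.autograd.grad(loss, x_t)[0]
    # Nudge the noise prediction against the guidance gradient.
    return eps + scale * torch.sqrt(1 - alpha_bar_t) * grad
```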


Active Learning at the ImageNet Scale

arXiv.org Artificial Intelligence

Active learning (AL) algorithms aim to identify an optimal subset of data for annotation, such that deep neural networks (DNNs) achieve better performance when trained on this labeled subset. AL is especially impactful in industrial-scale settings where data-labeling costs are high and practitioners use every tool at their disposal to improve model performance. The recent success of self-supervised pretraining (SSP) highlights the importance of harnessing abundant unlabeled data to boost model performance. By combining AL with SSP, we can make use of unlabeled data while simultaneously labeling and training on particularly informative samples. In this work, we study a combination of AL and SSP on ImageNet. We find that performance on small toy datasets (the typical benchmark setting in the literature) is not representative of performance on ImageNet, due to the class-imbalanced samples selected by an active learner. Among the existing baselines we test, popular AL algorithms fail to outperform random sampling across a variety of small- and large-scale settings. To remedy the class-imbalance problem, we propose Balanced Selection (BASE), a simple, scalable AL algorithm that consistently outperforms random sampling by selecting more class-balanced samples for annotation than existing methods. Our code is available at https://github.com/zeyademam/active_learning.
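One plausible reading of such balanced selection, sketched below, buckets unlabeled points by their predicted class and draws an equal share of low-margin (boundary-adjacent) points from each bucket. The margin scoring rule and function shape here are our assumptions, not the exact BASE algorithm; see the linked repository for the real implementation.

```python
import numpy as np

def balanced_selection(probs: np.ndarray, budget: int) -> list[int]:
    """Class-balanced active-selection sketch: for each predicted class,
    take an equal share of the points with the smallest top-2 margin
    (i.e. closest to a decision boundary). `probs` is (N, n_classes)."""
    preds = probs.argmax(axis=1)
    top2 = np.sort(probs, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]         # small margin = near the boundary
    per_class = budget // probs.shape[1]
    selected: list[int] = []
    for c in range(probs.shape[1]):
        idx = np.where(preds == c)[0]
        idx = idx[np.argsort(margin[idx])]   # most uncertain first
        selected.extend(idx[:per_class].tolist())
    return selected
```

Because the quota is enforced per predicted class, the annotated batch stays close to class-balanced even when the raw uncertainty ranking is dominated by a few classes.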


Deep Learning with a Rethinking Structure for Multi-label Classification

arXiv.org Machine Learning

Department of Computer Science and Information Engineering, National Taiwan University

Abstract

Multi-label classification (MLC) is an important class of machine learning problems that comes with a wide spectrum of applications, each demanding a possibly different evaluation criterion. When solving MLC problems, we generally expect the learning algorithm to take the hidden correlation of the labels into account to improve prediction performance. Extracting this hidden correlation is generally a challenging task. In this work, we propose a novel deep learning framework that better extracts the hidden correlation with the help of the memory structure within recurrent neural networks. The memory stores temporary guesses on the labels and effectively allows the framework to rethink the goodness and correlation of the guesses before making the final prediction. Furthermore, the rethinking process makes it easy to adapt to different evaluation criteria to match real-world application needs. In particular, the framework can be trained end to end with respect to any given MLC evaluation criterion. The end-to-end design can be seamlessly combined with other deep learning techniques to conquer challenging MLC problems like image tagging. Experimental results across many real-world data sets justify that the rethinking framework indeed improves MLC performance across different evaluation criteria and leads to superior performance over state-of-the-art MLC algorithms.

Keywords: multi-label, deep learning, cost-sensitive

1. Introduction

Human beings master a skill by working on and thinking through the same problem over and over again. When a difficult problem is given to us, multiple attempts go through our minds to simulate different possibilities. During this period, our understanding of the problem deepens, which in turn allows us to propose a better solution in the end. The deeper understanding comes from a piece of consolidated knowledge within our memory, which records how we build up the problem context through processing and predicting during the "rethinking" attempts.
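The rethinking loop described above lends itself to a compact recurrent implementation: feed the current label guess back in alongside the features, so each pass can exploit correlations exposed by the previous guess. The PyTorch sketch below is a minimal illustration under our own assumptions about layer sizes and the number of rethinking steps; it is not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class RethinkNet(nn.Module):
    """Recurrent 'rethinking' sketch: a GRU cell refines a temporary
    label-vector guess over several passes. Layer sizes and the number
    of rethinking steps are illustrative choices, not the paper's."""
    def __init__(self, feat_dim: int, n_labels: int,
                 hidden: int = 128, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.cell = nn.GRUCell(feat_dim + n_labels, hidden)
        self.out = nn.Linear(hidden, n_labels)

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        h = x.new_zeros(x.size(0), self.cell.hidden_size)
        guess = x.new_zeros(x.size(0), self.out.out_features)
        logits = []
        for _ in range(self.steps):
            # Feed the current guess back in so the next pass can
            # exploit label correlations revealed by earlier guesses.
            h = self.cell(torch.cat([x, torch.sigmoid(guess)], dim=1), h)
            guess = self.out(h)
            logits.append(guess)
        return logits  # supervise every step, or only the final guess
```

Training on a criterion-specific loss at the final (or every) step is what lets this kind of design adapt to different MLC evaluation criteria end to end.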


Scheduling in Visual Fog Computing: NP-Completeness and Practical Efficient Solutions

AAAI Conferences

The visual fog paradigm envisions tens of thousands of heterogeneous, camera-enabled edge devices distributed across the Internet, providing live sensing for a myriad of different visual processing applications. The scale, computational demands, and bandwidth needed for visual computing pipelines necessitate intelligently offloading work to distributed computing infrastructure, including the cloud, Internet gateway devices, and the edge devices themselves. This paper focuses on the visual fog scheduling problem of assigning visual computing tasks to various devices so as to optimize network utilization. We first prove this problem is NP-complete, and then formulate a practical, efficient solution. We demonstrate sub-minute computation time to optimally schedule 20,000 tasks across over 7,000 devices, and just 7 minutes to place 60,000 tasks across 20,000 devices, showing our approach is ready to meet the scale challenges introduced by visual fog.
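For intuition about the assignment problem itself, here is a toy greedy baseline, plainly not the paper's optimal formulation: place each task on the feasible device that currently adds the least network cost. The `id`/`demand`/`capacity` fields and the `net_cost` callback are hypothetical inputs for illustration.

```python
def greedy_schedule(tasks, devices, net_cost):
    """Toy greedy baseline for task-to-device assignment (NOT the paper's
    method): assign each task, largest demand first, to the feasible
    device with the lowest network cost. Tasks/devices are dicts with
    hypothetical 'id', 'demand', and 'capacity' fields."""
    load = {d["id"]: 0.0 for d in devices}
    assignment = {}
    for t in sorted(tasks, key=lambda t: -t["demand"]):   # big tasks first
        feasible = [d for d in devices
                    if load[d["id"]] + t["demand"] <= d["capacity"]]
        if not feasible:
            raise ValueError("no feasible device for task %r" % t["id"])
        best = min(feasible, key=lambda d: net_cost(t, d))
        assignment[t["id"]] = best["id"]
        load[best["id"]] += t["demand"]
    return assignment
```

A greedy heuristic like this gives no optimality guarantee, which is precisely why the paper's contribution is a formulation that schedules tens of thousands of tasks optimally in minutes.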