 opt-175b


FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

arXiv.org Artificial Intelligence

The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, it searches for efficient patterns to store and access tensors. FlexGen further compresses the weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at https://github.com/FMInference/FlexGen
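The abstract says weights and the attention cache are compressed to 4 bits but does not describe the scheme. The following is a minimal sketch of one common approach, group-wise min-max quantization, and is not taken from the FlexGen codebase. The function names, the group size of 4, and the sample values are illustrative assumptions.

```python
# Illustrative group-wise min-max quantization to 4 bits (an assumed scheme,
# not FlexGen's actual implementation). Pure standard-library Python.

def quantize_group(values, bits=4):
    """Map one group of floats to integer codes in [0, 2**bits - 1]."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0  # avoid division by zero
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale

def quantize(values, group_size=4, bits=4):
    """Quantize a flat list group by group; each group keeps its own lo/scale."""
    return [quantize_group(values[i:i + group_size], bits)
            for i in range(0, len(values), group_size)]

def dequantize(groups):
    """Reconstruct approximate floats from (codes, lo, scale) triples."""
    out = []
    for codes, lo, scale in groups:
        out.extend(lo + c * scale for c in codes)
    return out

def weight_bytes(n_params, bits):
    """Storage for n_params values at the given bit width (metadata ignored)."""
    return n_params * bits // 8

weights = [0.12, -0.40, 0.33, 0.05, 1.50, -1.20, 0.00, 0.75]  # made-up values
packed = quantize(weights)
restored = dequantize(packed)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Under these assumptions, each reconstructed value differs from the original by at most half of its group's scale. The `weight_bytes` helper shows why bit width matters here: 175 billion parameters occupy about 350 GB at 16 bits and about 87.5 GB at 4 bits, before counting per-group metadata, both far above a 16GB GPU, which is why offloading to CPU and disk is still needed.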


SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

arXiv.org Artificial Intelligence

Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same time. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including OPT, BLOOM, GLM, MT-NLG, and the LLaMA family. We demonstrate up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy. SmoothQuant enables serving a 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs. Code is available at https://github.com/mit-han-lab/smoothquant.
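The "mathematically equivalent transformation" can be illustrated with a small sketch: dividing each activation channel by a scale and multiplying the matching weight row by the same scale leaves the product unchanged while shrinking activation outliers. The abstract does not give the scale formula; the per-channel rule below (with a hypothetical `alpha` of 0.5), the helper names, and the toy matrices are assumptions for illustration, not the reference implementation.

```python
# Sketch of the smoothing idea: X @ W == (X / s) @ (s * W), applied per input
# channel. The scale rule and alpha=0.5 are illustrative assumptions, and the
# code assumes every channel has a nonzero activation and weight maximum.

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def smoothing_scales(x, w, alpha=0.5):
    """s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha) for each channel j."""
    scales = []
    for j in range(len(w)):
        act_max = max(abs(row[j]) for row in x)
        weight_max = max(abs(v) for v in w[j])
        scales.append(act_max ** alpha / weight_max ** (1 - alpha))
    return scales

def smooth(x, w, alpha=0.5):
    """Return (X / s, s * W): activations are shrunk, weights absorb the scale."""
    s = smoothing_scales(x, w, alpha)
    x_s = [[v / s[j] for j, v in enumerate(row)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(len(w))]
    return x_s, w_s

# Toy data: 2 tokens x 3 channels, where channel 1 holds activation outliers.
x = [[0.5, 40.0, -0.2], [-0.3, -60.0, 0.1]]
w = [[0.2, -0.1], [0.05, 0.3], [-0.4, 0.25]]  # 3 channels x 2 outputs

x_s, w_s = smooth(x, w)
y_ref = matmul(x, w)
y_smooth = matmul(x_s, w_s)
```

In this toy case the two products agree up to floating-point rounding, and the largest activation magnitude drops well below its original value of 60, at the cost of larger weights in the outlier channel. Because the scales depend only on calibration statistics, they can be folded into the weights offline, which matches the abstract's description of migrating quantization difficulty from activations to weights.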


Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values

arXiv.org Artificial Intelligence

Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values. Yet, human values can vary under diverse cultural conditions. Therefore, we introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command. Along with the task, we propose a practical approach that distills value-aligned knowledge from large-scale language models (LLMs) to construct value-aligned classifiers in two steps. First, we generate value-aligned training data from LLMs by prompt-based few-shot learning. Next, we fine-tune smaller classification models with the generated data for the task. Empirical results show that our VA-Models surpass multiple baselines by at least 15.56% on the F1-score, including few-shot learning with OPT-175B and existing text augmentation methods. We suggest that using classifiers with explicit human value input improves both inclusivity & explainability in AI.


Meta's BlenderBot 3 wants to chat – but can you trust it?

The Guardian

Last week, researchers at Facebook's parent company Meta released BlenderBot 3, a "publicly available chatbot that improves its skills and safety over time". The chatbot is built on top of Meta's OPT-175B language model, effectively the company's white-label version of the more famous GPT-3 AI. Like most state-of-the-art AIs these days, that was trained on a vast corpus of text scraped from the internet in questionable ways, and poured into a datacentre with thousands of expensive chips that turned the text into something approaching coherence. But where OPT-175B is a general-purpose textbot, able to do anything from write fiction and answer questions to generate spam emails, BlenderBot 3 is a narrower project: it can have a conversation with you. That focus allows it to bring in other expertise, though, and one of Meta's most significant successes is hooking the language model up to the broader internet.


Open source isn't working for AI

#artificialintelligence

Clearly, we need to do something about how we talk about open source and openness in general. It's been clear since at least 2006 when I rightly got smacked down for calling out Google and Yahoo! for holding back on open source. As Tim O'Reilly wrote at the time, in a cloud era of open source, "one of the motivations to share--the necessity of giving a copy of the source in order to let someone run your program--is truly gone." In fact, he went on, "Not only is it no longer required, in the case of the largest applications, it's no longer possible." That impossibility of sharing has roiled the definition of open source during the past decade, and it's now affecting the way we think about artificial intelligence (AI), as Mike Loukides recently noted.


Democratizing access to large-scale language models with OPT-175B

#artificialintelligence

We achieved 147 TFLOP/s/GPU utilization on NVIDIA's 80 GB A100 GPUs, roughly 17 percent higher than published by NVIDIA researchers on similar hardware. By sharing these baselines along with the codebase to train a 175B model efficiently, we have an opportunity to reduce our collective environmental footprint while also allowing new results and progress in the field to be measurable in a consistent manner. For AI research to advance, the broader scientific community must be able to work together with cutting-edge models to effectively explore their potential while also probing for their vulnerabilities at the same time. As with our previous open-science initiatives, such as the Image Similarity Challenge, the Deepfake Detection Challenge, and the Hateful Memes Challenge, Meta AI believes that collaboration across research organizations is critical to the responsible development of AI technologies. While there are many exciting developments in the space of large language models, the limitations and risks these models pose are still not well understood. Without direct access to these models, researchers are also limited in their ability to design detection and mitigation strategies for possible harm, which leaves detection and mitigation in the hands of only those with sufficient capital to access models of this scale. We hope that OPT-175B will bring more voices to the frontier of large language model creation, help the community collectively design responsible release strategies, and add an unprecedented level of transparency and openness to the development of large language models in the field. Access the open source code and small-scale pretrained models here, request access to OPT-175B here, and read the paper here. Pretrained models are all licensed under the OPT-175B License Agreement.


Meta AI Giving Away Its New Large Language Model

#artificialintelligence

AI researchers at Meta have created a massive new language model to rival OpenAI's GPT-3 and advance our understanding of large language models. And the company is giving it away as part of its effort to democratize AI. Open Pretrained Transformer (OPT-175B) is a language model with 175 billion parameters trained on publicly available data sets. According to Meta, 992 Nvidia A100 GPUs, each equipped with 80GB of onboard memory, were used over a training period of two months. To facilitate "community engagement", the release includes the pre-trained model, extensive notes about its development, a logbook detailing the training process, and the code needed to train and use the model.