Neiswanger, Willie
Uncertainty Quantification for Forward and Inverse Problems of PDEs via Latent Global Evolution
Wu, Tailin, Neiswanger, Willie, Zheng, Hongtao, Ermon, Stefano, Leskovec, Jure
Deep learning-based surrogate models have demonstrated remarkable advantages over classical solvers in terms of speed, often achieving speedups of 10 to 1000 times over traditional partial differential equation (PDE) solvers. However, a significant challenge hindering their widespread adoption in both scientific and industrial domains is the lack of understanding of their prediction uncertainties, particularly in scenarios that involve critical decision-making. To address this limitation, we propose a method that integrates efficient and precise uncertainty quantification into a deep learning-based surrogate model. Our method, termed Latent Evolution of PDEs with Uncertainty Quantification (LE-PDE-UQ), endows deep learning-based surrogate models with robust and efficient uncertainty quantification capabilities for both forward and inverse problems. LE-PDE-UQ leverages latent vectors within a latent space to evolve both the system's state and its corresponding uncertainty estimation. The latent vectors are decoded to provide predictions for the system's state as well as estimates of its uncertainty. In extensive experiments, we demonstrate the accurate uncertainty quantification performance of our approach, surpassing that of strong baselines including deep ensembles, Bayesian neural network layers, and dropout. Our method excels at propagating uncertainty over extended auto-regressive rollouts, making it suitable for scenarios involving long-term predictions. Our code is available at: https://github.com/AI4Science-WestlakeU/le-pde-uq.
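The core mechanism described above, evolving both the state and its uncertainty in a shared latent space, can be illustrated with a short sketch. The module sizes and names below are illustrative assumptions, not the authors' actual architecture; see the linked repository for the real implementation.

```python
# A minimal PyTorch sketch of the latent-evolution idea: encode the PDE
# state into a latent vector, evolve the latent forward in time, and decode
# both a state prediction and an uncertainty estimate at each step.
import torch
import torch.nn as nn

class LatentEvolutionUQ(nn.Module):
    def __init__(self, state_dim=128, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.evolve = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.decode_mean = nn.Linear(latent_dim, state_dim)
        self.decode_logvar = nn.Linear(latent_dim, state_dim)  # uncertainty head

    def forward(self, u0, n_steps):
        z = self.encoder(u0)
        means, logvars = [], []
        for _ in range(n_steps):
            z = self.evolve(z)  # autoregressive rollout entirely in latent space
            means.append(self.decode_mean(z))
            logvars.append(self.decode_logvar(z))
        return torch.stack(means), torch.stack(logvars)

model = LatentEvolutionUQ()
u0 = torch.randn(8, 128)            # batch of initial states
mean, logvar = model(u0, n_steps=10)
std = (0.5 * logvar).exp()          # per-component predictive uncertainty
```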
DeLLMa: A Framework for Decision Making Under Uncertainty with Large Language Models
Liu, Ollie, Fu, Deqing, Yogatama, Dani, Neiswanger, Willie
Large language models (LLMs) are increasingly used across society, including in domains like business, engineering, and medicine. These fields often grapple with decision-making under uncertainty, a critical yet challenging task. In this paper, we show that directly prompting LLMs on these types of decision-making problems yields poor results, especially as the problem complexity increases. To overcome this limitation, we propose DeLLMa (Decision-making Large Language Model assistant), a framework designed to enhance decision-making accuracy in uncertain environments. DeLLMa involves a multi-step scaffolding procedure, drawing upon principles from decision theory and utility theory, to provide an optimal and human-auditable decision-making process. We validate our framework on decision-making environments involving real agriculture and finance data. Our results show that DeLLMa can significantly improve LLM decision-making performance, achieving up to a 40% increase in accuracy over competing methods.
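The decision-theoretic core that DeLLMa's scaffolding builds on can be sketched in a few lines: forecast a distribution over unknown states, assign utilities to (action, state) pairs, and pick the action with maximal expected utility. In the paper these quantities are elicited from an LLM; the states, actions, and numbers below are hard-coded placeholders for illustration.

```python
# A hedged sketch of expected-utility maximization over forecasted states.
state_probs = {"price_up": 0.3, "price_flat": 0.5, "price_down": 0.2}
utility = {  # utility[action][state], illustrative values only
    "plant_corn":  {"price_up": 10, "price_flat": 4, "price_down": -2},
    "plant_wheat": {"price_up": 6,  "price_flat": 5, "price_down": 1},
}

def expected_utility(action):
    # Expected utility of an action under the forecasted state distribution.
    return sum(p * utility[action][s] for s, p in state_probs.items())

best = max(utility, key=expected_utility)
print(best, {a: expected_utility(a) for a in utility})
```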
Multipoint-BAX: A New Approach for Efficiently Tuning Particle Accelerator Emittance via Virtual Objectives
Miskovich, Sara A., Neiswanger, Willie, Colocho, William, Emma, Claudio, Garrahan, Jacqueline, Maxwell, Timothy, Mayes, Christopher, Ermon, Stefano, Edelen, Auralee, Ratner, Daniel
Although beam emittance is critical for the performance of high-brightness accelerators, optimization is often time-limited, as emittance calculations, commonly done via quadrupole scans, are typically slow. Such calculations are a type of $\textit{multipoint query}$, i.e., each query requires multiple secondary measurements. Traditional black-box optimizers such as Bayesian optimization are slow and inefficient when dealing with such objectives, as each query must acquire the full series of measurements yet returns only the emittance. We propose a new information-theoretic algorithm, Multipoint-BAX, for black-box optimization on multipoint queries, which queries and models individual beam-size measurements using techniques from Bayesian Algorithm Execution (BAX). Our method avoids the slow multipoint query on the accelerator by acquiring points through a $\textit{virtual objective}$, i.e., calculating the emittance objective from a fast learned model rather than directly from the accelerator. We use Multipoint-BAX to minimize emittance at the Linac Coherent Light Source (LCLS) and the Facility for Advanced Accelerator Experimental Tests II (FACET-II). In simulation, our method is 20$\times$ faster and more robust to noise compared to existing methods. In live tests, it matched the hand-tuned emittance at FACET-II and achieved a 24% lower emittance than hand-tuning at LCLS. Our method represents a conceptual shift for optimizing multipoint queries, and we anticipate that it can be readily adapted to similar problems in particle accelerators and other scientific instruments.
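The "virtual objective" idea can be sketched as follows: instead of running a full (slow) quadrupole scan per query, fit a probabilistic model to individual beam-size measurements and evaluate the emittance objective on the model. The parabola-fit emittance formula below is schematic, as the real calculation depends on beamline optics, and all measurement values are placeholders.

```python
# A simplified sketch of computing emittance from a learned model rather
# than directly from the accelerator.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

k_meas = np.array([-2.0, 0.0, 2.0])            # quad strengths measured so far
sigma2 = np.array([4.1, 1.2, 3.8])             # beam size squared (noisy)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.1)
gp.fit(k_meas[:, None], sigma2)

k_grid = np.linspace(-3, 3, 50)
virtual_scan = gp.predict(k_grid[:, None])     # model-based "quad scan"
a, b, c = np.polyfit(k_grid, virtual_scan, 2)  # sigma^2(k) ~ a*k^2 + b*k + c
virtual_emittance = np.sqrt(max(a * c - b**2 / 4, 0.0))  # schematic formula
print(virtual_emittance)
```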
LLM360: Towards Fully Transparent Open-Source LLMs
Liu, Zhengzhong, Qiao, Aurick, Neiswanger, Willie, Wang, Hongyi, Tan, Bowen, Tao, Tianhua, Li, Junbo, Wang, Yuqi, Sun, Suqi, Pangarkar, Omkar, Fan, Richard, Gu, Yi, Miller, Victor, Zhuang, Yonghao, He, Guowei, Li, Haonan, Koto, Fajri, Tang, Liping, Ranjan, Nikhil, Shen, Zhiqiang, Ren, Xuguang, Iriondo, Roberto, Mu, Cun, Hu, Zhiting, Schulze, Mark, Nakov, Preslav, Baldwin, Tim, Xing, Eric P.
The recent surge in open-source Large Language Models (LLMs), such as LLaMA, Falcon, and Mistral, provides diverse options for AI practitioners and researchers. However, most LLMs have only released partial artifacts, such as the final model weights or inference code, and technical reports increasingly limit their scope to high-level design choices and surface statistics. These choices hinder progress in the field by degrading transparency into the training of LLMs and forcing teams to rediscover many details in the training process. We present LLM360, an initiative to fully open-source LLMs, which advocates for all training code and data, model checkpoints, and intermediate results to be made available to the community. The goal of LLM360 is to support open and collaborative AI research by making the end-to-end LLM training process transparent and reproducible by everyone. As a first step of LLM360, we release two 7B parameter LLMs pre-trained from scratch, Amber and CrystalCoder, including their training code, data, intermediate checkpoints, and analyses (at https://www.llm360.ai). We are committed to continually pushing the boundaries of LLMs through this open-source effort. More large-scale and stronger models are underway and will be released in the future.
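As a usage note, the released checkpoints can be loaded with standard tooling. The repository id "LLM360/Amber" below is an assumption based on the project pages linked above; check https://www.llm360.ai for the actual artifact names and available intermediate checkpoints.

```python
# A hedged sketch of loading a released checkpoint with Hugging Face
# transformers (repo id assumed, not confirmed by the abstract).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LLM360/Amber")
model = AutoModelForCausalLM.from_pretrained("LLM360/Amber")

inputs = tokenizer("Open-source LLMs enable", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```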
Sample Efficient Reinforcement Learning from Human Feedback via Active Exploration
Mehta, Viraj, Das, Vikramjeet, Neopane, Ojash, Dai, Yijia, Bogunovic, Ilija, Schneider, Jeff, Neiswanger, Willie
Preference-based feedback is important for many applications in reinforcement learning where direct evaluation of a reward function is not feasible. A notable recent example arises in reinforcement learning from human feedback (RLHF) on large language models. For many applications of RLHF, the cost of acquiring the human feedback can be substantial. In this work, we take advantage of the fact that one can often choose contexts at which to obtain human feedback in order to most efficiently identify a good policy, and formalize this as an offline contextual dueling bandit problem. We give an upper-confidence-bound style algorithm for this problem and prove a polynomial worst-case regret bound. We then provide empirical confirmation in a synthetic setting that our approach outperforms existing methods. Finally, we extend the setting and methodology for practical use in RLHF training of large language models, where our method reaches better performance with fewer samples of human preferences than multiple baselines on three real-world datasets.
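The active-exploration idea can be illustrated with a small sketch: rather than sampling prompts uniformly for human preference labeling, pick the contexts where the current reward model is most uncertain. Ensemble disagreement is used below as an illustrative stand-in for the paper's upper-confidence-bound acquisition, and all values are placeholders.

```python
# A minimal sketch of choosing contexts (prompts) for preference labeling
# by reward-model uncertainty instead of uniform sampling.
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_models = 100, 5
# reward_estimates[i, j]: ensemble member j's estimated reward gap between
# two candidate responses for prompt i (placeholder random values).
reward_estimates = rng.normal(size=(n_prompts, n_models))

uncertainty = reward_estimates.std(axis=1)   # disagreement across the ensemble
batch = np.argsort(uncertainty)[-8:]         # most uncertain prompts
print("prompts to send to human labelers:", batch)
```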
Making Scalable Meta Learning Practical
Choe, Sang Keun, Mehta, Sanket Vaibhav, Ahn, Hwijeen, Neiswanger, Willie, Xie, Pengtao, Strubell, Emma, Xing, Eric
Despite its flexibility to learn diverse inductive biases in machine learning programs, meta learning (i.e., learning to learn) has long been recognized to suffer from poor scalability due to its tremendous compute/memory costs, training instability, and a lack of efficient distributed training support. In this work, we focus on making scalable meta learning practical by introducing SAMA, which combines advances in both implicit differentiation algorithms and systems. Specifically, SAMA is designed to flexibly support a broad range of adaptive optimizers at the base level of meta learning programs, while reducing computational burden by avoiding explicit computation of second-order gradient information and exploiting efficient distributed training techniques implemented for first-order gradients. Evaluated on multiple large-scale meta learning benchmarks, SAMA showcases up to a 1.7x/4.8x increase in throughput and a 2.0x/3.8x decrease in memory consumption on single-/multi-GPU setups, respectively, compared to other baseline meta learning algorithms. Furthermore, we show that SAMA-based data optimization leads to consistent improvements in text classification accuracy with BERT and RoBERTa large language models, and achieves state-of-the-art results in both small- and large-scale data pruning on image classification tasks, demonstrating the practical applicability of scalable meta learning across language and vision domains.
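The flavor of trick that avoids materializing second-order terms can be sketched with a DARTS-style finite-difference approximation: the mixed second-derivative term in the meta-gradient is estimated using only first-order gradient calls. This illustrates the general idea only; SAMA's actual algorithm and its systems optimizations differ, and the losses below are toy placeholders.

```python
# A hedged sketch of a first-order meta-gradient approximation.
import torch

w = torch.randn(10, requires_grad=True)      # base-level parameters
lam = torch.randn(10, requires_grad=True)    # meta-level parameters

def train_loss(w, lam):
    return ((w - lam) ** 2).sum()

def val_loss(w):
    return ((w - 1.0) ** 2).sum()

g_val = torch.autograd.grad(val_loss(w), w)[0]
eps = 0.01 / g_val.norm()

# Central difference of first-order gradients at w +/- eps * g_val: this
# approximates the mixed second-derivative term without ever forming it.
g_plus = torch.autograd.grad(train_loss(w + eps * g_val, lam), lam)[0]
g_minus = torch.autograd.grad(train_loss(w - eps * g_val, lam), lam)[0]
meta_grad = -(g_plus - g_minus) / (2 * eps)  # learning-rate factor omitted
print(meta_grad)
```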
SlimPajama-DC: Understanding Data Combinations for LLM Training
Shen, Zhiqiang, Tao, Tianhua, Ma, Liqun, Neiswanger, Willie, Liu, Zhengzhong, Wang, Hongyi, Tan, Bowen, Hestness, Joel, Vassilieva, Natalia, Soboleva, Daria, Xing, Eric
This paper aims to understand the impacts of various data combinations (e.g., web text, Wikipedia, GitHub, books) on the training of large language models using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset, refined and further deduplicated to 627B tokens from the extensive 1.2T-token RedPajama dataset contributed by Together. We term our research SlimPajama-DC, an empirical analysis designed to uncover fundamental characteristics and best practices associated with employing SlimPajama in the training of large language models. During our research with SlimPajama, two pivotal observations emerged: (1) Global deduplication vs. local deduplication. We analyze and discuss how global (across different dataset sources) and local (within a single dataset source) deduplication affects the performance of trained models. (2) Proportions of high-quality/highly-deduplicated multi-source datasets in the combination. To study this, we construct six configurations of the SlimPajama dataset and train a separate 1.3B Cerebras-GPT model (with ALiBi and SwiGLU) on each. Our best configuration outperforms the 1.3B model trained on RedPajama using the same number of training tokens by a significant margin. All our 1.3B models are trained on a Cerebras 16$\times$ CS-2 cluster with a total of 80 PFLOP/s in bf16 mixed precision. We further extend our findings (such as that increasing data diversity is crucial after global deduplication) to a 7B model with large-batch-size training. Our models and the separate SlimPajama-DC datasets are available at: https://huggingface.co/MBZUAI-LLM and https://huggingface.co/datasets/cerebras/SlimPajama-627B.
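The distinction between the two deduplication notions discussed above can be made concrete with a small sketch: local deduplication removes repeats within each source, while global deduplication removes repeats across all sources combined. Exact hashing is used here for clarity; large-scale pipelines typically rely on fuzzy methods such as MinHash.

```python
# A minimal sketch contrasting local vs. global deduplication.
sources = {
    "web":    ["doc a", "doc b", "doc a"],
    "github": ["doc b", "doc c"],
}

def dedup(docs, seen):
    out = []
    for d in docs:
        h = hash(d)
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

# Local: a fresh 'seen' set per source, so cross-source repeats survive.
local = {name: dedup(docs, set()) for name, docs in sources.items()}
# Global: one shared 'seen' set, so "doc b" is kept only once overall.
global_seen = set()
global_dedup = {name: dedup(docs, global_seen) for name, docs in sources.items()}
print(local)
print(global_dedup)
```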
Kernelized Offline Contextual Dueling Bandits
Mehta, Viraj, Neopane, Ojash, Das, Vikramjeet, Lin, Sen, Schneider, Jeff, Neiswanger, Willie
Preference-based feedback is important for many applications where direct evaluation of a reward function is not feasible. A notable recent example arises in reinforcement learning from human feedback on large language models. For many of these applications, the cost of acquiring the human feedback can be substantial or even prohibitive. In this work, we take advantage of the fact that the agent can often choose contexts at which to obtain human feedback in order to most efficiently identify a good policy, and introduce the offline contextual dueling bandit setting. We give an upper-confidence-bound style algorithm for this setting and prove a regret bound. We also give empirical confirmation that this method outperforms a similar strategy that uses uniformly sampled contexts.
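A kernelized upper-confidence-bound rule can be sketched as follows: maintain a Gaussian-process posterior over the (context, action) reward surface and score candidates by posterior mean plus a multiple of posterior standard deviation. This is illustrative only; the paper's algorithm operates on dueling (pairwise preference) feedback rather than direct rewards, and the data below is synthetic.

```python
# A hedged sketch of kernelized UCB selection over (context, action) pairs.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)
X_obs = rng.uniform(0, 1, size=(20, 2))    # observed (context, action) pairs
y_obs = np.sin(3 * X_obs[:, 0]) + rng.normal(0, 0.1, 20)

gp = GaussianProcessRegressor(kernel=RBF(0.2), alpha=0.01).fit(X_obs, y_obs)

X_cand = rng.uniform(0, 1, size=(200, 2))  # candidate pairs
mu, sd = gp.predict(X_cand, return_std=True)
ucb = mu + 2.0 * sd                        # optimism in the face of uncertainty
best = X_cand[np.argmax(ucb)]
print("next (context, action) to query:", best)
```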
Betty: An Automatic Differentiation Library for Multilevel Optimization
Choe, Sang Keun, Neiswanger, Willie, Xie, Pengtao, Xing, Eric
Gradient-based multilevel optimization (MLO) has gained attention as a framework for studying numerous problems, ranging from hyperparameter optimization and meta-learning to neural architecture search and reinforcement learning. However, gradients in MLO, which are obtained by composing best-response Jacobians via the chain rule, are notoriously difficult to implement and memory/compute intensive. To address these challenges, we introduce Betty, an automatic differentiation library for MLO.

Multilevel optimization addresses nested optimization scenarios in which upper-level optimization problems are constrained by lower-level optimization problems following an underlying hierarchical dependency. MLO has gained considerable attention as a unified mathematical framework for studying diverse problems including meta-learning (Finn et al., 2017; Rajeswaran et al., 2019), hyperparameter optimization (Franceschi et al., 2017), neural architecture search (Liu et al., 2019), and reinforcement learning (Konda & Tsitsiklis, 1999; Rajeswaran et al., 2020). While the majority of existing work is built upon bilevel optimization, the simplest case of MLO, there have been recent efforts to go beyond this two-level hierarchy. For example, Raghu et al. (2021) proposed trilevel optimization that combines hyperparameter optimization with two-level pretraining and finetuning. More generally, conducting joint optimization over machine learning pipelines consisting of multiple models and hyperparameter sets can be approached as deeper instances of MLO (Garg et al., 2022; Raghu et al., 2021; Somayajula et al., 2022; Such et al., 2020). Following its increasing popularity, a multitude of optimization algorithms have been proposed to solve MLO. Among them, gradient-based (or first-order) approaches (Pearlmutter & Siskind, 2008; Lorraine et al., 2020; Raghu et al., 2021; Sato et al., 2021) have recently received the limelight in the machine learning community, due to their ability to carry out efficient high-dimensional optimization, under which all of the applications listed above fall. Nevertheless, research in gradient-based MLO has been largely impeded by two major bottlenecks: the difficulty of correctly implementing these composed gradients, and their substantial memory and compute costs.
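The bilevel structure that MLO generalizes can be sketched concretely: an inner problem is (approximately) solved by a few gradient steps, and the outer parameters are updated by differentiating through that unrolled inner loop. This illustrates the problem class only; Betty's actual API and its implicit-differentiation machinery are more general and more efficient, and the losses below are toy placeholders.

```python
# A minimal sketch of bilevel optimization via unrolled differentiation.
import torch

lam = torch.tensor(0.5, requires_grad=True)  # outer variable (e.g., a regularization weight)

def inner_solve(lam, steps=20, lr=0.1):
    w = torch.zeros(5, requires_grad=True)   # inner variable
    for _ in range(steps):                   # unrolled inner optimization
        loss = ((w - 2.0) ** 2).sum() + lam * (w ** 2).sum()
        g = torch.autograd.grad(loss, w, create_graph=True)[0]
        w = w - lr * g                       # keep the graph for outer grads
    return w

w_star = inner_solve(lam)                    # approximate best response
outer_loss = ((w_star - 1.0) ** 2).sum()     # outer objective at best response
outer_loss.backward()                        # hypergradient via the chain rule
print(lam.grad)
```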
Generative Modeling Helps Weak Supervision (and Vice Versa)
Boecking, Benedikt, Roberts, Nicholas, Neiswanger, Willie, Ermon, Stefano, Sala, Frederic, Dubrawski, Artur
Many promising applications of supervised machine learning face hurdles in the acquisition of labeled data in sufficient quantity and quality, creating an expensive bottleneck. To overcome such limitations, techniques that do not depend on ground-truth labels have been studied, including weak supervision and generative modeling. While these techniques would seem to be usable in concert, improving one another, how to build an interface between them is not well understood. In this work, we propose a model fusing programmatic weak supervision and generative adversarial networks and provide theoretical justification motivating this fusion. The proposed approach captures discrete latent variables in the data alongside the weak-supervision-derived label estimate. Alignment of the two allows for better modeling of sample-dependent accuracies of the weak supervision sources, improving the estimate of unobserved labels. It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels. Additionally, its learned latent variables can be inspected qualitatively. The model outperforms baseline weak supervision label models on a number of multiclass image classification datasets, improves the quality of generated images, and further improves end-model performance through data augmentation with synthetic samples.

How can we get the most out of data when we do not have ground-truth labels? Two prominent paradigms operate in this setting. First, programmatic weak supervision frameworks use weak sources of training signal to train downstream supervised models, without needing access to ground-truth labels (Riedel et al., 2010; Ratner et al., 2016; Dehghani et al., 2017; Lang & Poon, 2021). Second, generative models enable learning data distributions, which can benefit downstream tasks. Intuitively, these two paradigms should complement each other, as each can be thought of as a different approach to extracting structure from unlabeled data. However, to date there is no simple way to combine them. Fusing generative models with weak supervision holds substantial promise; for example, it could yield large reductions in data acquisition costs for training complex models. Programmatic weak supervision replaces the need for manual annotations by applying so-called labeling functions to unlabeled data, producing weak labels that are combined into a pseudolabel for each sample.
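The programmatic-weak-supervision pipeline described in the last sentence can be sketched in a few lines: labeling functions vote on unlabeled examples, and votes are combined into a pseudolabel. Majority vote is used below as the simplest possible label model; the paper's approach instead learns source accuracies jointly with a generative adversarial network, and the labeling functions here are made-up examples.

```python
# A minimal sketch of labeling functions and pseudolabel aggregation.
import numpy as np

ABSTAIN = -1

def lf_contains_digit(x):
    return 1 if any(c.isdigit() for c in x) else ABSTAIN

def lf_short(x):
    return 0 if len(x) < 10 else ABSTAIN

def lf_has_id(x):
    return 1 if "id" in x.lower() else ABSTAIN

data = ["order 1234", "hello", "user ID 77", "greetings to everyone"]
votes = np.array([[lf(x) for lf in (lf_contains_digit, lf_short, lf_has_id)]
                  for x in data])

def majority_vote(row):
    valid = row[row != ABSTAIN]
    return np.bincount(valid).argmax() if len(valid) else ABSTAIN

pseudolabels = [majority_vote(r) for r in votes]
print(pseudolabels)  # pseudolabels that could supervise a downstream model
```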