Lan, Janice
LLM Pretraining with Continuous Concepts
Tack, Jihoon, Lanchantin, Jack, Yu, Jane, Cohen, Andrew, Kulikov, Ilia, Lan, Janice, Hao, Shibo, Tian, Yuandong, Weston, Jason, Li, Xian
Recent progress in large language models (LLMs) has revolutionized natural language processing (Brown et al., 2020; Dubey et al., 2024), making LLMs a core technology in various real-world applications, such as coding assistants (Roziere et al., 2023), search engines (Xuan-Quy et al., 2023), and personal AI assistants (Gao et al., 2023). Central to these breakthroughs is the simple paradigm of next token prediction, which leverages massive amounts of unlabeled text to uncover rich linguistic patterns (Radford et al., 2018, 2019). However, natural language tokens are often superficial (e.g., function words like "the" or "a"), necessitating substantial training for models to acquire high-level reasoning and conceptual understanding while also hindering their ability to tackle long-horizon tasks such as planning (LeCun, 2022; Bachmann and Nagarajan, 2024). To address this issue, recent studies have investigated methods that go beyond token-level signals by leveraging richer information to train models. For instance, some approaches target more expressive prediction objectives, such as predicting multiple tokens at once to better capture semantic relationships (Gloeckle et al., 2024; DeepSeek-AI, 2024), while others augment the input with rich signals, e.g., self-generated thought tokens (Zelikman et al., 2024) or fixed pause tokens (Goyal et al., 2024), prior to next token prediction. Moreover, emerging evidence suggests that LLMs inherently encode high-level concepts and reasoning processes in their latent representations (Deng et al., 2023; Yang et al., 2024), indicating that replacing discrete language tokens with continuous latent representations holds promise for improving reasoning efficiency (Hao et al., 2024). While token-level modeling remains important for coherent text generation, the key challenge is to enrich or supplement these natural language tokens so that LLMs can learn more abstract reasoning abilities and long-range dependencies. This raises a key question: can we augment the next token prediction objective to explicitly model concepts in a latent representation space, thereby bridging semantic abstraction and fine-grained token-level guidance? To this end, we draw inspiration from recent findings that Sparse Autoencoders (SAEs) can effectively isolate meaningful latent features in LLMs by capturing high-level semantic concepts (Cunningham et al., 2023;
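The concept-extraction idea referenced above relies on sparse autoencoders trained over LLM hidden states. Below is a minimal, illustrative PyTorch sketch of such an SAE, assuming a hypothetical hidden size and concept dictionary size; it is not the paper's architecture or training recipe.

# Minimal sketch of a sparse autoencoder (SAE) over LLM hidden states, in the
# spirit of the concept-extraction step described above. All names and
# hyperparameters here are illustrative assumptions, not the paper's recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_concepts: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_concepts)
        self.decoder = nn.Linear(d_concepts, d_model)

    def forward(self, h: torch.Tensor):
        # ReLU encoding gives sparse, non-negative concept activations.
        z = F.relu(self.encoder(h))
        h_hat = self.decoder(z)
        return z, h_hat

def sae_loss(h, h_hat, z, l1_coeff: float = 1e-3):
    # Reconstruction term keeps concepts faithful to the hidden state;
    # the L1 penalty encourages each token to activate few concepts.
    return F.mse_loss(h_hat, h) + l1_coeff * z.abs().mean()

# Toy usage: hidden states from a hypothetical LLM layer (batch, seq, d_model).
h = torch.randn(2, 16, 768)
sae = SparseAutoencoder(d_model=768, d_concepts=4096)
z, h_hat = sae(h)
loss = sae_loss(h, h_hat, z)
loss.backward()

The trade-off between reconstruction fidelity and sparsity is governed by the l1_coeff value assumed here; the paper's own choices may differ.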
Thinking LLMs: General Instruction Following with Thought Generation
Wu, Tianhao, Lan, Janice, Yuan, Weizhe, Jiao, Jiantao, Weston, Jason, Sukhbaatar, Sainbayar
LLMs are typically trained to answer user questions or follow instructions similarly to how human experts respond. However, in the standard alignment framework they lack the basic ability to think explicitly before answering. Thinking is important for complex questions that require reasoning and planning, but it can be applied to any task. We propose a training method for equipping existing LLMs with such thinking abilities for general instruction following without the use of additional human data. We achieve this via an iterative search and optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision. For each instruction, the thought candidates are scored by a judge model that evaluates only their responses, and are then optimized via preference optimization. We show that this procedure leads to superior performance on AlpacaEval and Arena-Hard, and shows gains from thinking on non-reasoning categories such as marketing, health, and general knowledge, in addition to more traditional reasoning and problem-solving tasks. Large Language Models (LLMs) are based on the Transformer architecture (Vaswani et al., 2017), which predicts the next token at each step. Each token takes the same amount of compute, so when LLMs are prompted with a user instruction, they have a fixed compute budget for generating the first response token regardless of the instruction's complexity. One way to increase the compute budget for harder instructions is to allow LLMs to think internally before outputting a response. This is similar to humans, who take more time to think before answering complex questions. One approach is to generate thoughts as text, which takes advantage of the natural language capabilities of LLMs. LLMs are pre-trained on text containing human-written thoughts, which are hence encoded into the model. Chain-of-Thought (CoT) (Wei et al., 2022) is a widely used prompting technique that elicits such behavior by asking the model to write down its reasoning steps. However, the use of CoT has been mostly limited to math and reasoning tasks; a meta-analysis by Sprague et al. (2024) found CoT methods to be unhelpful on tasks that do not involve math and logic. In this paper, we focus on general instruction following rather than math or logic tasks. We argue that "thinking" should have broad utility. For example, in a creative writing task, internal thoughts can be used to plan the overall structure and characters; in other tasks, they can be used to better understand the user instruction. Of course, it is likely that less thinking is required for simpler tasks, and more thinking for more complex ones. In general, we hypothesize that such Thinking LLMs will have an advantage on all sufficiently complex tasks.
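As a rough illustration of the iterative search and optimization loop described above, the sketch below collects preference pairs for one round: several thought-plus-response candidates are sampled per instruction, a judge scores only the responses, and the best and worst candidates form a pair for preference optimization. The generate, judge, and toy stub functions are hypothetical stand-ins, not the paper's prompts or models.

# Illustrative sketch of one round of the thought-search-and-preference loop
# described above. The generate/judge callables are hypothetical stand-ins;
# the paper's exact prompting, judging, and optimization details are not shown.
import random
from typing import Callable, List, Tuple

def thought_preference_round(
    instructions: List[str],
    generate: Callable[[str], Tuple[str, str]],   # returns (thought, response)
    judge: Callable[[str, str], float],           # scores the response only
    num_candidates: int = 4,
) -> List[Tuple[str, Tuple[str, str], Tuple[str, str]]]:
    """Collect (instruction, chosen, rejected) pairs for preference optimization."""
    pairs = []
    for instruction in instructions:
        candidates = [generate(instruction) for _ in range(num_candidates)]
        # The judge sees only the responses, so thoughts are optimized indirectly
        # through the quality of the answers they lead to.
        scored = sorted(candidates, key=lambda tr: judge(instruction, tr[1]))
        pairs.append((instruction, scored[-1], scored[0]))  # best vs. worst
    return pairs

# Toy stubs so the sketch runs end to end.
def toy_generate(instruction):
    return (f"thinking about: {instruction}", f"answer {random.random():.2f}")

def toy_judge(instruction, response):
    return float(response.split()[-1])

pairs = thought_preference_round(["Plan a short story."], toy_generate, toy_judge)
print(pairs[0][1], "preferred over", pairs[0][2])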
Rotationally Invariant Latent Distances for Uncertainty Estimation of Relaxed Energy Predictions by Graph Neural Network Potentials
Musielewicz, Joseph, Lan, Janice, Uyttendaele, Matt, Kitchin, John R.
Graph neural networks (GNNs) have been shown to be astonishingly capable models for molecular property prediction, particularly as surrogates for expensive density functional theory calculations of relaxed energies in novel materials discovery. However, one limitation of GNNs in this context is the lack of useful uncertainty prediction methods, as reliable uncertainty estimates are critical to the materials discovery pipeline. In this work, we show that uncertainty quantification for relaxed energy calculations is more complex than for other kinds of molecular property prediction, due to the effect that structure optimizations have on the error distribution. We propose that distribution-free techniques are more useful tools for assessing calibration, recalibrating, and developing uncertainty prediction methods for GNNs performing relaxed energy calculations. We also develop a relaxed energy task for evaluating uncertainty methods for equivariant GNNs, based on distribution-free recalibration and using the Open Catalyst Project dataset. We benchmark a set of popular uncertainty prediction methods on this task, and show that latent distance methods, with our novel improvements, are the most well-calibrated and economical approach for relaxed energy calculations. Finally, we demonstrate that our latent space distance method produces results which align with our expectations on a clustering example, and on specific equation of state and adsorbate coverage examples from outside the training dataset.
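A bare-bones sketch of the latent-distance idea discussed above: the distance from a test structure's latent embedding to its nearest training embeddings serves as an uncertainty score, which is then scaled with a simple distribution-free (conformal-style) calibration step. The embeddings, error units, and quantile rule here are illustrative assumptions rather than the paper's exact recalibration procedure.

# Minimal sketch of a latent-distance uncertainty estimate of the kind discussed
# above. The GNN latent vectors and energy errors are random placeholders.
import numpy as np

def latent_distance_scores(train_latents, query_latents, k=5):
    # Mean distance to the k nearest training latents, per query point.
    d = np.linalg.norm(query_latents[:, None, :] - train_latents[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

def recalibrate(cal_scores, cal_errors, alpha=0.1):
    # Distribution-free scaling: choose s so that |error| <= s * score holds
    # for roughly (1 - alpha) of the calibration set.
    ratios = np.abs(cal_errors) / np.maximum(cal_scores, 1e-12)
    return np.quantile(ratios, 1 - alpha)

rng = np.random.default_rng(0)
train_z = rng.normal(size=(500, 32))       # hypothetical GNN latent vectors
cal_z = rng.normal(size=(100, 32))
cal_err = rng.normal(scale=0.1, size=100)  # hypothetical energy errors (eV)
test_z = rng.normal(size=(10, 32))

scale = recalibrate(latent_distance_scores(train_z, cal_z), cal_err)
uncertainty = scale * latent_distance_scores(train_z, test_z)
print(uncertainty)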
AdsorbML: A Leap in Efficiency for Adsorption Energy Calculations using Generalizable Machine Learning Potentials
Lan, Janice, Palizhati, Aini, Shuaibi, Muhammed, Wood, Brandon M., Wander, Brook, Das, Abhishek, Uyttendaele, Matt, Zitnick, C. Lawrence, Ulissi, Zachary W.
Computational catalysis is playing an increasingly significant role in the design of catalysts across a wide range of applications. A common task for many computational methods is the need to accurately compute the adsorption energy for an adsorbate and a catalyst surface of interest. Traditionally, the identification of low-energy adsorbate-surface configurations relies on heuristic methods and researcher intuition. As the desire to perform high-throughput screening increases, it becomes challenging to use heuristics and intuition alone. In this paper, we demonstrate that machine learning potentials can be leveraged to identify low-energy adsorbate-surface configurations more accurately and efficiently. Our algorithm provides a spectrum of trade-offs between accuracy and efficiency, with one balanced option finding the lowest energy configuration 87.36% of the time while achieving a 2000x speedup in computation. To standardize benchmarking, we introduce the Open Catalyst Dense dataset containing nearly 1,000 diverse surfaces and 100,000 unique configurations.
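The screening strategy described above can be summarized in a few lines: relax every enumerated adsorbate placement with a cheap ML potential, then verify only the k most promising candidates with DFT. The sketch below is schematic; relax_ml and relax_dft are hypothetical stand-ins for an ML potential and a DFT code, and the paper's actual filtering and success criteria are not reproduced.

# Schematic sketch of ML-accelerated screening for low-energy adsorbate-surface
# configurations, as described in the abstract above.
from typing import Callable, List, Tuple

def screen_configurations(
    placements: List[object],
    relax_ml: Callable[[object], float],    # ML-relaxed energy (fast, approximate)
    relax_dft: Callable[[object], float],   # DFT-relaxed energy (slow, accurate)
    k: int = 5,
) -> Tuple[object, float]:
    """Return the lowest-DFT-energy configuration among the top-k ML candidates."""
    ml_ranked = sorted(placements, key=relax_ml)
    # Only the k best ML-ranked placements are verified with expensive DFT.
    verified = [(p, relax_dft(p)) for p in ml_ranked[:k]]
    return min(verified, key=lambda pe: pe[1])

# Toy usage with numeric "configurations" standing in for atomic structures.
best, energy = screen_configurations(
    placements=list(range(100)),
    relax_ml=lambda p: (p - 37) ** 2 + 0.5,   # fake ML energy surface
    relax_dft=lambda p: (p - 40) ** 2,        # fake DFT energy surface
    k=5,
)
print(best, energy)

Increasing k trades computation for a higher chance of recovering the true lowest-energy configuration, which is the spectrum of accuracy/efficiency trade-offs the abstract refers to.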
The Open Catalyst 2022 (OC22) Dataset and Challenges for Oxide Electrocatalysts
Tran, Richard, Lan, Janice, Shuaibi, Muhammed, Wood, Brandon M., Goyal, Siddharth, Das, Abhishek, Heras-Domingo, Javier, Kolluru, Adeesh, Rizvi, Ammar, Shoghi, Nima, Sriram, Anuroop, Therrien, Felix, Abed, Jehad, Voznyy, Oleksandr, Sargent, Edward H., Ulissi, Zachary, Zitnick, C. Lawrence
The development of machine learning models for electrocatalysts requires a broad set of training data to enable their use across a wide variety of materials. One class of materials that currently lacks sufficient training data is oxides, which are critical for the development of OER catalysts. To address this, we developed the OC22 dataset, consisting of 62,331 DFT relaxations (~9,854,504 single point calculations) across a range of oxide materials, coverages, and adsorbates. We define generalized total energy tasks that enable property prediction beyond adsorption energies; we test the baseline performance of several graph neural networks; and we provide pre-defined dataset splits to establish clear benchmarks for future efforts. In the most general task, GemNet-OC sees a ~36% improvement in energy predictions when combining the chemically dissimilar OC20 and OC22 datasets via fine-tuning. Similarly, we achieve a ~19% improvement in total energy predictions on OC20 and a ~9% improvement in force predictions on OC22 when using joint training. We demonstrate the practical utility of a top performing model by capturing literature adsorption energies and important OER scaling relationships. We expect OC22 to provide an important benchmark for models seeking to incorporate intricate long-range electrostatic and magnetic interactions in oxide surfaces. The dataset and baseline models are open sourced, and a public leaderboard is available to encourage continued community developments on the total energy tasks and data.
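As a loose illustration of the joint training referred to above, the sketch below interleaves batches from two datasets with different energy statistics while training a single model. The datasets and model here are toy placeholders, not the released OCP data or GemNet-OC.

# Rough sketch of joint training over two data sources, in the spirit of the
# OC20/OC22 joint training mentioned above. Everything here is a placeholder.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in datasets: (features, target energy) pairs with different statistics.
oc20_like = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
oc22_like = TensorDataset(torch.randn(256, 16), torch.randn(256, 1) * 10)

model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.L1Loss()

loader_a = DataLoader(oc20_like, batch_size=32, shuffle=True)
loader_b = DataLoader(oc22_like, batch_size=32, shuffle=True)

# Alternate batches from the two sources so each pass sees both distributions.
for (xa, ya), (xb, yb) in zip(loader_a, loader_b):
    for x, y in ((xa, ya), (xb, yb)):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()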
First-Order Preconditioning via Hypergradient Descent
Moskovitz, Ted, Wang, Rui, Lan, Janice, Kapoor, Sanyam, Miconi, Thomas, Yosinski, Jason, Rawal, Aditya
Standard gradient descent methods are susceptible to a range of issues that can impede training, such as high correlations and different scaling in parameter space. These difficulties can be addressed by second-order approaches that apply a preconditioning matrix to the gradient to improve convergence. Unfortunately, such algorithms typically struggle to scale to high-dimensional problems, in part because computing specific preconditioners such as the inverse Hessian or Fisher information matrix is highly expensive. We introduce first-order preconditioning (FOP), a fast, scalable approach that generalizes previous work on hypergradient descent (Almeida et al., 1998; Maclaurin et al., 2015; Baydin et al., 2017) to learn a preconditioning matrix using only first-order information. Experiments show that FOP improves the performance of standard deep learning optimizers on several visual classification tasks with minimal computational overhead. We also investigate the properties of the learned preconditioning matrices and perform a preliminary theoretical analysis of the algorithm. Despite these known difficulties, deep neural networks and other large-scale machine learning models are typically trained with simple variants of gradient descent, which is highly sensitive to such issues.
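To make the first-order idea concrete, the toy sketch below learns a preconditioning matrix P by hypergradient descent on an ill-conditioned quadratic: with the update theta_t = theta_{t-1} - lr * P @ g_{t-1}, the gradient of the loss with respect to P is approximately -lr * outer(g_t, g_{t-1}), so P itself can be updated with plain gradient descent using only first-order quantities. This is an illustrative toy, not the paper's exact FOP parameterization or experimental setup.

# Toy sketch of learning a preconditioner from first-order information only.
import numpy as np

def quadratic_loss_and_grad(theta, A, b):
    # Ill-conditioned quadratic: 0.5 * theta^T A theta - b^T theta.
    return 0.5 * theta @ A @ theta - b @ theta, A @ theta - b

rng = np.random.default_rng(0)
A = np.diag([100.0, 1.0])          # badly scaled curvature
b = np.array([1.0, 1.0])
theta = rng.normal(size=2)

P = np.eye(2)                      # learned preconditioner, starts as identity
lr, hyper_lr = 1e-2, 1e-4
prev_g = None

for step in range(500):
    loss, g = quadratic_loss_and_grad(theta, A, b)
    if prev_g is not None:
        # Hypergradient step on P: dL/dP is approximately -lr * outer(g_t, g_{t-1}).
        P += hyper_lr * lr * np.outer(g, prev_g)
    theta = theta - lr * P @ g
    prev_g = g

print("final loss:", quadratic_loss_and_grad(theta, A, b)[0])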
LCA: Loss Change Allocation for Neural Network Training
Lan, Janice, Liu, Rosanne, Zhou, Hattie, Yosinski, Jason
Neural networks enjoy widespread use, but many aspects of their training, representation, and operation are poorly understood. In particular, our view into the training process is limited, with a single scalar loss being the most common viewport into this high-dimensional, dynamic process. We propose a new window into training called Loss Change Allocation (LCA), in which credit for changes to the network loss is conservatively partitioned to the parameters. This measurement is accomplished by decomposing the components of an approximate path integral along the training trajectory using a Runge-Kutta integrator. This rich view shows which parameters are responsible for decreasing or increasing the loss during training, or which parameters "help" or "hurt" the network's learning, respectively. LCA may be summed over training iterations and/or over neurons, channels, or layers for increasingly coarse views. This new measurement device produces several insights into training. (1) We find that barely over 50% of parameters help during any given iteration. (2) Some entire layers hurt overall, moving on average against the training gradient, a phenomenon we hypothesize may be due to phase lag in an oscillatory training process. (3) Finally, increments in learning proceed in a synchronized manner across layers, often peaking on identical iterations.
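A toy version of the allocation computation helps make the idea concrete: the loss change at each step is split across parameters using the gradient along the parameter path, and the per-parameter contributions are accumulated over training. For brevity the sketch below uses a first-order approximation of the path integral, whereas the paper uses a Runge-Kutta scheme for a more exact decomposition; the linear-regression setup is purely illustrative.

# Toy sketch of Loss Change Allocation with a first-order path approximation.
import numpy as np

def loss(theta, X, y):
    return 0.5 * np.mean((X @ theta - y) ** 2)

def grad(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=64)

theta = np.zeros(4)
lr = 0.1
lca = np.zeros(4)  # per-parameter allocation, summed over iterations

for step in range(200):
    g = grad(theta, X, y)
    theta_next = theta - lr * g
    # Credit each parameter with grad_i * delta_theta_i for this step; negative
    # values mean the parameter "helped" (moved the loss down).
    lca += g * (theta_next - theta)
    theta = theta_next

print("per-parameter allocation:", lca)
print("sum of allocations:", lca.sum(), "vs actual loss change:",
      loss(theta, X, y) - loss(np.zeros(4), X, y))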
Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask
Zhou, Hattie, Lan, Janice, Liu, Rosanne, Yosinski, Jason
The recent "Lottery Ticket Hypothesis" paper by Frankle & Carbin showed that a simple approach to creating sparse networks (keep the large weights) results in models that are trainable from scratch, but only when starting from the same initial weights. The performance of these networks often exceeds the performance of the non-sparse base model, but for reasons that were not well understood. In this paper we study the three critical components of the Lottery Ticket (LT) algorithm, showing that each may be varied significantly without impacting the overall results. Ablating these factors leads to new insights for why LT networks perform as well as they do. We show why setting weights to zero is important, how signs are all you need to make the re-initialized network train, and why masking behaves like training. Finally, we discover the existence of Supermasks, or masks that can be applied to an untrained, randomly initialized network to produce a model with performance far better than chance (86% on MNIST, 41% on CIFAR-10).