
Collaborating Authors

 Cohen, Jonathan


Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models

arXiv.org Artificial Intelligence

Many recent studies have found evidence for emergent reasoning capabilities in large language models, but debate persists concerning the robustness of these capabilities, and the extent to which they depend on structured reasoning mechanisms. To shed light on these issues, we perform a comprehensive study of the internal mechanisms that support abstract rule induction in an open-source language model (Llama3-70B). We identify an emergent symbolic architecture that implements abstract reasoning via a series of three computations. In early layers, symbol abstraction heads convert input tokens to abstract variables based on the relations between those tokens. In intermediate layers, symbolic induction heads perform sequence induction over these abstract variables. Finally, in later layers, retrieval heads predict the next token by retrieving the value associated with the predicted abstract variable. These results point toward a resolution of the longstanding debate between symbolic and neural network approaches, suggesting that emergent reasoning in neural networks depends on the emergence of symbolic mechanisms.
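The three computations described above can be illustrated with a deliberately simplified toy. This is a sketch of the idea only, not the mechanism found in Llama3-70B: the function names, the first-appearance naming of variables, and the nonsense-word examples are all assumptions introduced for illustration.

```python
def abstract_symbols(tokens):
    """Stage 1 (symbol abstraction): replace each token with a variable
    named by the order of its first appearance (A, B, C, ...)."""
    names = {}
    variables = []
    for tok in tokens:
        if tok not in names:
            names[tok] = chr(ord("A") + len(names))
        variables.append(names[tok])
    return variables, names


def induce_next_symbol(example_patterns, partial):
    """Stage 2 (symbolic induction): find an in-context example whose
    abstract pattern starts with the partial pattern, and return the
    variable that follows it."""
    for pattern in example_patterns:
        if pattern[:len(partial)] == partial and len(pattern) > len(partial):
            return pattern[len(partial)]
    return None


def retrieve_value(symbol, names):
    """Stage 3 (retrieval): map the predicted variable back to the
    concrete token bound to it in the query."""
    for tok, name in names.items():
        if name == symbol:
            return tok
    return None


def predict(examples, query):
    """Run the three stages in order on a rule-induction prompt."""
    patterns = [abstract_symbols(ex)[0] for ex in examples]
    partial, names = abstract_symbols(query)
    symbol = induce_next_symbol(patterns, partial)
    return retrieve_value(symbol, names)


# Hypothetical ABA-rule prompt: two complete examples, one incomplete query.
examples = [["la", "ki", "la"], ["bo", "ru", "bo"]]
prediction = predict(examples, ["fe", "mo"])
print(prediction)
```

The toy completes the query correctly even though its tokens never occur in the examples, because the induction step operates on the variables rather than on the tokens.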


Nemotron-4 340B Technical Report

arXiv.org Artificial Intelligence

We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and their outputs. These models perform competitively with open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We believe that the community can benefit from these models in various research studies and commercial applications, especially for generating synthetic data to train smaller language models. Notably, over 98% of data used in our model alignment process is synthetically generated, showcasing the effectiveness of these models in generating synthetic data. To further support open research and facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in our model alignment process.
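The sizing claim can be checked with back-of-the-envelope arithmetic. The sketch below makes several assumptions that the abstract does not state: the nominal parameter count is taken as exactly 340 billion, each H100 is assumed to have 80 GB of memory, and only weight storage is counted (no activations or KV cache).

```python
PARAMS = 340e9            # assumption: nominal parameter count
GPUS = 8                  # one DGX H100, as stated in the abstract
HBM_PER_GPU_GB = 80       # assumption: 80 GB of memory per H100

# Weight storage at one byte per parameter (FP8) and two bytes (BF16).
fp8_weights_gb = PARAMS * 1 / 1e9
bf16_weights_gb = PARAMS * 2 / 1e9

total_hbm_gb = GPUS * HBM_PER_GPU_GB
fp8_headroom_gb = total_hbm_gb - fp8_weights_gb

print(f"FP8 weights:  {fp8_weights_gb:.0f} GB of {total_hbm_gb} GB")
print(f"BF16 weights: {bf16_weights_gb:.0f} GB of {total_hbm_gb} GB")
print(f"FP8 headroom: {fp8_headroom_gb:.0f} GB")
```

Under these assumptions the FP8 weights leave room for runtime state on a single node, while 16-bit weights alone would exceed the node's total memory.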


Nemotron-4 15B Technical Report

arXiv.org Artificial Intelligence

For example, Hoffmann et al. (2022) show that, given two roughly IsoFLOP GPT models with a similar data distribution, a 65-billion-parameter model trained on 1.4 trillion tokens and a 280-billion-parameter model trained on 300 billion tokens, the 65B model has better accuracy on downstream tasks. This trade-off of allocating compute towards training on more data as opposed to increasing model size is particularly appealing from an inference perspective, reducing latency and the amount of compute needed to serve models. As a consequence, a major focus of language modeling training efforts has shifted to collecting high-quality multi-trillion token datasets from public sources such as Common Crawl.
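To see why the two configurations are roughly IsoFLOP, one can apply the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. That approximation is an assumption brought in here, not something the passage states, and the helper name is hypothetical.

```python
def train_flops(params, tokens):
    """Approximate training compute with the 6 * N * D rule of thumb
    (an assumption; it ignores attention and other overheads)."""
    return 6 * params * tokens


small_model = train_flops(65e9, 1.4e12)   # 65B parameters, 1.4T tokens
large_model = train_flops(280e9, 300e9)   # 280B parameters, 300B tokens
ratio = small_model / large_model

print(f"65B model:  {small_model:.2e} FLOPs")
print(f"280B model: {large_model:.2e} FLOPs")
print(f"ratio:      {ratio:.2f}")
```

Under this approximation the two budgets differ by less than 10%, while the smaller model is about 4.3 times cheaper per token at inference time.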


NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

arXiv.org Artificial Intelligence

NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems. Guardrails (or rails for short) are a specific way of controlling the output of an LLM, such as not talking about topics considered harmful, following a predefined dialogue path, using a particular language style, and more. There are several mechanisms that allow LLM providers and developers to add guardrails that are embedded into a specific model at training time, e.g. using model alignment. In contrast, using a runtime inspired by dialogue management, NeMo Guardrails allows developers to add programmable rails to LLM applications; these are user-defined, independent of the underlying LLM, and interpretable. Our initial results show that the proposed approach can be used with several LLM providers to develop controllable and safe LLM applications using programmable rails.


Abstractors and relational cross-attention: An inductive bias for explicit relational reasoning in Transformers

arXiv.org Machine Learning

October 6, 2023 Abstract: An extension of Transformers is proposed that enables explicit relational reasoning through a novel module called the Abstractor. At the core of the Abstractor is a variant of attention called relational cross-attention. The approach is motivated by an architectural inductive bias for relational learning that disentangles relational information from extraneous features about individual objects. This enables explicit relational reasoning, supporting abstraction and generalization from limited data. The Abstractor is first evaluated on simple discriminative relational tasks and compared to existing relational architectures. Next, the Abstractor is evaluated on purely relational sequence-to-sequence tasks, where dramatic improvements are seen in sample efficiency compared to standard Transformers. Finally, Abstractors are evaluated on a collection of tasks based on mathematical problem solving, where modest but consistent improvements in performance and sample efficiency are observed.
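One way to picture the disentangling described above is the following sketch. It assumes a reading of relational cross-attention in which attention weights are computed from inner products between the input objects, while the attended values are input-independent symbol vectors; the abstract itself does not spell this out, so treat that reading, the omission of learned projections, and all names below as assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def relational_cross_attention(objects, symbols):
    """Attention weights come from object-object inner products (the
    relations); the values being mixed are the symbols, which carry no
    object features. Learned projections are omitted for brevity."""
    width = len(symbols[0])
    output = []
    for x_i in objects:
        weights = softmax([dot(x_i, x_j) for x_j in objects])
        mixed = [sum(w * s[d] for w, s in zip(weights, symbols))
                 for d in range(width)]
        output.append(mixed)
    return output


# Hypothetical symbols, one per position, independent of the input.
symbols = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Two inputs with different object features but identical pairwise
# inner products (the second is the first rotated by 90 degrees).
objects_a = [[1.0, 0.0], [0.0, 1.0]]
objects_b = [[0.0, 1.0], [-1.0, 0.0]]

out_a = relational_cross_attention(objects_a, symbols)
out_b = relational_cross_attention(objects_b, symbols)
print(out_a)
print(out_b)
```

Because the output depends on the objects only through their pairwise inner products, the two inputs produce the same result, which is the sense in which relational information is separated from object-level features in this sketch.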


Beyond Transformers for Function Learning

arXiv.org Artificial Intelligence

The ability to learn and predict simple functions is a key aspect of human intelligence. Recent works have started to explore this ability using transformer architectures; however, it remains unclear whether this is sufficient to recapitulate the extrapolation abilities of people in this domain. Here, we propose to address this gap by augmenting the transformer architecture with two simple inductive learning biases that are directly adapted from recent models of abstract reasoning in cognitive science. The results we report demonstrate that these biases are helpful in the context of large neural network models, and shed light on the types of inductive learning biases that may contribute to human abilities in extrapolation.


Modelling the development of counting with memory-augmented neural networks

arXiv.org Artificial Intelligence

Learning to count is an important example of the broader human capacity for systematic generalization, and the development of counting is often characterized by an inflection point when children rapidly acquire proficiency with the procedures that support this ability. We aimed to model this process by training a reinforcement learning agent to select $N$ items from a binary vector when instructed (known as the give-$N$ task). We found that a memory-augmented modular network architecture based on the recently proposed Emergent Symbol Binding Network (ESBN) exhibited an inflection during learning that resembled human development. This model was also capable of systematic extrapolation outside the range of its training set; for example, trained only to select between 1 and 10 items, it could succeed at selecting 11 to 15 items as long as it could make use of an arbitrary count sequence of at least that length. The close parallels to child development and the capacity for extrapolation suggest that our model could shed light on the emergence of systematicity in humans.


Learning Canonical Transformations

arXiv.org Artificial Intelligence

Humans understand a set of canonical geometric transformations (such as translation and rotation) that support generalization by being untethered to any specific object. We explore inductive biases that help a neural network model learn these transformations in pixel space in a way that can generalize out-of-domain. Specifically, we find that high training set diversity is sufficient for the extrapolation of translation to unseen shapes and scales, and that an iterative training scheme achieves significant extrapolation of rotation in time.


Thyroid Cancer Malignancy Prediction From Whole Slide Cytopathology Images

arXiv.org Artificial Intelligence

We consider preoperative prediction of thyroid cancer based on ultra-high-resolution whole-slide cytopathology images. Inspired by how human experts perform diagnosis, our approach first identifies and classifies diagnostic image regions containing informative thyroid cells, which only comprise a tiny fraction of the entire image. These local estimates are then aggregated into a single prediction of thyroid malignancy. Several unique characteristics of thyroid cytopathology guide our deep-learning-based approach. While our method is closely related to multiple-instance learning, it deviates from these methods by using a supervised procedure to extract diagnostically relevant regions. Moreover, we propose to simultaneously predict thyroid malignancy, as well as a diagnostic score assigned by a human expert, which further allows us to devise an improved training strategy. Experimental results show that the proposed algorithm achieves performance comparable to human experts, and demonstrate the potential of using the algorithm for screening and as an assistive tool for the improved diagnosis of indeterminate cases.