 intermediate value




Distinct Computations Emerge From Compositional Curricula in In-Context Learning

arXiv.org Artificial Intelligence

In-context learning (ICL) research often considers learning a function in-context through a uniform sample of input-output pairs. Here, we investigate how presenting a compositional subtask curriculum in context may alter the computations a transformer learns. We design a compositional algorithmic task based on the modular exponential, a double exponential task composed of two single exponential subtasks, and train transformer models to learn the task in-context. We compare (a) models trained using an in-context curriculum consisting of single exponential subtasks and (b) models trained directly on the double exponential task without such a curriculum. We show that models trained with a subtask curriculum can perform zero-shot inference on unseen compositional tasks and are more robust given the same context length. We study how the task and subtasks are represented across the two training regimes. We find that the models employ diverse strategies modulated by the specific curriculum design.
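To make the task structure concrete, here is a minimal sketch of how two single modular exponentials might compose into a double exponential, and how a subtask curriculum could be laid out in context. The abstract does not give the exact task definition, so the modulus `P`, the composition order, and the helper names `single_exp`, `double_exp`, and `curriculum_context` are all assumptions made for illustration.

```python
P = 23  # hypothetical small prime modulus, not taken from the paper


def single_exp(base: int, x: int, p: int = P) -> int:
    """Single exponential subtask: base**x mod p."""
    return pow(base, x, p)


def double_exp(a: int, b: int, x: int, p: int = P) -> int:
    """Assumed composition: the output of the first subtask is the
    exponent of the second, i.e. b**(a**x mod p) mod p."""
    return single_exp(b, single_exp(a, x, p), p)


def curriculum_context(a: int, b: int, xs: list, p: int = P) -> list:
    """Build an in-context sequence of (input, output) pairs:
    a block for each single subtask, then a block for the composed task."""
    block_a = [(x, single_exp(a, x, p)) for x in xs]
    block_b = [(x, single_exp(b, x, p)) for x in xs]
    block_ab = [(x, double_exp(a, b, x, p)) for x in xs]
    return block_a + block_b + block_ab


context = curriculum_context(2, 3, [1, 2, 3])
```

A model trained without the curriculum would, under this reading, see only the final `block_ab` pairs.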


Chain-of-Thought Tokens are Computer Program Variables

arXiv.org Artificial Intelligence

Chain-of-thought (CoT) requires large language models (LLMs) to generate intermediate steps before reaching the final answer, and has proven effective in helping LLMs solve complex reasoning tasks. However, the inner mechanism of CoT remains largely unclear. In this paper, we empirically study the role of CoT tokens in LLMs on two compositional tasks: multi-digit multiplication and dynamic programming. While CoT is essential for solving these problems, we find that preserving only the tokens that store intermediate results achieves comparable performance. Furthermore, we observe that storing intermediate results in an alternative latent form does not affect model performance. We also randomly intervene on some values in the CoT and observe that subsequent CoT tokens and the final answer change correspondingly. These findings suggest that CoT tokens may function like variables in computer programs, but with potential drawbacks such as unintended shortcuts and computational complexity limits between tokens. The code and data are available at https://github.com/solitaryzero/CoTs_are_Variables.
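The variable analogy can be illustrated with ordinary long multiplication, where each partial product is an intermediate result that later steps read. The sketch below is only an analogy in plain Python, not the authors' experimental setup; the function name `multiply_with_trace` and its `override` parameter are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def multiply_with_trace(
    a: int, b: int, override: Optional[Dict[int, int]] = None
) -> Tuple[List[int], int]:
    """Long multiplication that records each partial product.

    `override` (hypothetical) maps a step index to a replacement value,
    mimicking an intervention on one stored intermediate result.
    """
    override = override or {}
    partials = []
    for i, digit in enumerate(reversed(str(b))):
        value = a * int(digit) * 10 ** i   # partial product for this digit
        value = override.get(i, value)     # an intervention replaces the stored value
        partials.append(value)
    # The final answer depends only on the stored intermediate values.
    return partials, sum(partials)


clean_partials, clean_answer = multiply_with_trace(123, 45)
_, intervened_answer = multiply_with_trace(123, 45, override={0: 600})
```

Changing one stored partial product changes the final sum accordingly, which mirrors the intervention result described in the abstract.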


MINI-SEQUENCE TRANSFORMER: Optimizing Intermediate Memory for Long Sequences Training

arXiv.org Artificial Intelligence

We introduce Mini-Sequence Transformer (MsT), a simple and effective methodology for highly efficient and accurate LLM training with extremely long sequences. MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage. Integrated with activation recomputation, it enables significant memory savings in both forward and backward passes. In experiments with the Llama3-8B model, with MsT, we measure no degradation in throughput or convergence even with 12x longer sequences than standard implementations due to our careful memory optimizations. MsT is fully general, implementation-agnostic, and requires minimal code changes to integrate with existing LLM training frameworks.
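The partitioning idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the MsT implementation: it assumes a block that acts on each token independently and expands it into a wide intermediate before reducing it, and the names `wide_block` and `mini_sequence_forward` are hypothetical. Peak memory is approximated by counting intermediate values alive at once.

```python
from typing import List, Tuple

Token = List[float]


def wide_block(tokens: List[Token], expand: int) -> Tuple[List[float], int]:
    """Token-wise block: expand each token into a wide intermediate, then reduce.

    Returns the outputs and the number of intermediate values materialized.
    """
    wide = [[x * k for x in tok for k in range(1, expand + 1)] for tok in tokens]
    intermediate_size = sum(len(row) for row in wide)
    outputs = [sum(row) for row in wide]
    return outputs, intermediate_size


def mini_sequence_forward(
    tokens: List[Token], expand: int, chunk_size: int
) -> Tuple[List[float], int]:
    """Process the sequence chunk by chunk so only one chunk's wide
    intermediate exists at any time."""
    outputs: List[float] = []
    peak = 0
    for start in range(0, len(tokens), chunk_size):
        out, size = wide_block(tokens[start:start + chunk_size], expand)
        outputs.extend(out)
        peak = max(peak, size)
    return outputs, peak


sequence = [[float(i), float(i + 1)] for i in range(8)]
full_out, full_peak = wide_block(sequence, expand=4)
mini_out, mini_peak = mini_sequence_forward(sequence, expand=4, chunk_size=2)
```

The outputs match because the block never mixes information across tokens; a layer that does mix tokens could not be chunked this naively.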


Profiling checkpointing schedules in adjoint ST-AD

arXiv.org Artificial Intelligence

Source-transformation algorithmic differentiation (ST-AD) in its adjoint mode transforms a primal code that evaluates some original function into an adjoint code that computes its gradient. It is well known [9] that the most efficient implementation of the adjoint code must progress backwards of the original computation, progressively using values originating from the primal execution. The amount of values used grows linearly with the run time of the primal code and, since they are used in the reverse of their production order, their management (data-flow reversal) is a key issue that requires a delicate trade-off between storage and recomputation. This work focuses on one particular setting, where data-flow reversal is primarily done through a stack and the memory cost of this stack is mitigated through a classical storage/recomputation trade-off known as ...

Section 4 discusses the information that we found most appropriate to guide the choice of activated checkpoints, and an algorithm to extract this information at run-time by profiling execution of the adjoint code. Section 5 discusses implementation of this profiling in an existing source-transformation AD tool, and section 6 applies it to two realistic test-cases taken from the MITgcm code suite. We will show how the developer can achieve a significant performance gain by exploiting the profiling results. In section 7, we come back to some limitations of our proposed approach and discuss how they could be overcome, before concluding in section 8.

2 Our checkpointing model / setting

In our setting, data-flow reversal is achieved by storing intermediate values of the primal execution.
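The storage/recomputation trade-off can be sketched on a toy primal code that repeatedly applies `sin`. This is a generic illustration, not the paper's profiling algorithm or any AD tool's generated code; the function names and the fixed `segment` length are assumptions. The first variant pushes every intermediate value on a stack; the second keeps only periodic snapshots and recomputes each segment before reversing it.

```python
import math


def adjoint_store_all(x0: float, n: int):
    """Reverse sweep using a stack holding every intermediate value."""
    stack = []
    x = x0
    for _ in range(n):
        stack.append(x)          # store the value the reverse sweep will need
        x = math.sin(x)
    peak = len(stack)
    xbar = 1.0
    while stack:
        xbar *= math.cos(stack.pop())   # values consumed in reverse order
    return xbar, peak


def adjoint_checkpointed(x0: float, n: int, segment: int):
    """Snapshot the state every `segment` steps, then recompute each
    segment's intermediate values just before reversing it."""
    checkpoints = []
    x = x0
    for k in range(n):
        if k % segment == 0:
            checkpoints.append((k, x))
        x = math.sin(x)
    xbar = 1.0
    peak = 0
    while checkpoints:
        k0, x = checkpoints.pop()
        stack = []
        for _ in range(k0, min(k0 + segment, n)):
            stack.append(x)      # recomputation refills the stack for one segment
            x = math.sin(x)
        peak = max(peak, len(checkpoints) + len(stack))
        while stack:
            xbar *= math.cos(stack.pop())
    return xbar, peak


grad_all, peak_all = adjoint_store_all(0.5, 12)
grad_ckp, peak_ckp = adjoint_checkpointed(0.5, 12, segment=4)
```

Both variants return the same derivative of the final value with respect to `x0`; the checkpointed one holds fewer values at once at the cost of running each forward step a second time.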


Tracing and Manipulating Intermediate Values in Neural Math Problem Solvers

arXiv.org Artificial Intelligence

How language models process complex input that requires multiple steps of inference is not well understood. Previous research has shown that information about intermediate values of these inputs can be extracted from the activations of the models, but it is unclear where that information is encoded and whether that information is indeed used during inference. We introduce a method for analyzing how a Transformer model processes these inputs by focusing on simple arithmetic problems and their intermediate values. To trace where information about intermediate values is encoded, we measure the correlation between intermediate values and the activations of the model using principal component analysis (PCA). Then, we perform a causal intervention by manipulating model weights. This intervention shows that the weights identified via tracing are not merely correlated with intermediate values, but causally related to model predictions. Our findings show that the model has a locality to certain intermediate values, and this is useful for enhancing the interpretability of the models.


Combined Pruning for Nested Cross-Validation to Accelerate Automated Hyperparameter Optimization for Embedded Feature Selection in High-Dimensional Data with Very Small Sample Sizes

arXiv.org Artificial Intelligence

Background: Embedded feature selection in high-dimensional data with very small sample sizes requires optimized hyperparameters for the model building process. For this hyperparameter optimization, nested cross-validation must be applied to avoid a biased performance estimation. The resulting repeated training with high-dimensional data leads to very long computation times. Moreover, a high variance in the individual performance evaluation metrics is likely, caused by outliers in tiny validation sets. Therefore, early stopping with standard pruning algorithms to save time risks discarding promising hyperparameter sets. Result: To speed up feature selection for high-dimensional data with tiny sample size, we adapt the use of a state-of-the-art asynchronous successive halving pruner. In addition, we combine it with two complementary pruning strategies based on domain or prior knowledge. One pruning strategy immediately stops computing trials with semantically meaningless results for the selected hyperparameter combinations. The other is a new extrapolating threshold pruning strategy suitable for nested cross-validation with a high variance of performance evaluation metrics. In repeated experiments, our combined pruning strategy keeps all promising trials. At the same time, the calculation time is substantially reduced compared to using a state-of-the-art asynchronous successive halving pruner alone. Up to 81.3% fewer models were trained while achieving the same optimization result. Conclusion: The proposed combined pruning strategy accelerates data analysis or enables deeper searches for hyperparameters within the same computation time. This leads to significant savings in time, money and energy consumption, opening the door to advanced, time-consuming analyses.
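The abstract does not spell out the extrapolating threshold rule, so the following is only one plausible reading, labeled as an assumption: a trial is stopped once even an optimistic extrapolation of its remaining folds cannot reach a user-chosen threshold. The function name `should_prune` and its parameters are hypothetical, and the paper's actual rule may differ.

```python
from typing import List


def should_prune(
    fold_scores: List[float],
    total_folds: int,
    threshold: float,
    best_possible: float = 1.0,
) -> bool:
    """Optimistic-bound pruning (an assumed stand-in for the paper's rule).

    Extrapolate the final mean by assuming every remaining fold achieves
    `best_possible`; prune only if that optimistic mean is still below
    `threshold`. A single outlier fold therefore cannot kill a trial that
    could still recover.
    """
    remaining = total_folds - len(fold_scores)
    optimistic_mean = (sum(fold_scores) + remaining * best_possible) / total_folds
    return optimistic_mean < threshold


hopeless = should_prune([0.2, 0.3], total_folds=5, threshold=0.8)
recoverable = should_prune([0.6, 0.7], total_folds=5, threshold=0.8)
```

Because the bound is optimistic, this rule never discards a trial that could still reach the threshold, which is the property the abstract emphasizes for high-variance validation scores.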


The Composability of Intermediate Values in Composable Inductive Programming

arXiv.org Artificial Intelligence

It is believed that mechanisms including intermediate values enable composable inductive programming (CIP) to be used to produce software of any size. We present the results of a study that investigated the relationships between program size, the number of intermediate values and the number of test cases used to specify programs using CIP. In the study, 96,000 programs of various sizes were randomly generated, decomposed into fragments and transformed into test cases. The test cases were then used to regenerate new versions of the original programs using Zoea. The results show linear relationships between the number of intermediate values and regenerated program size, and between the number of test cases and regenerated program size, within the size range studied. In addition, as program size increases, there is increasing scope for trading off the number of test cases against the number of intermediate values and vice versa.


BUSTLE: Bottom-up program-Synthesis Through Learning-guided Exploration

arXiv.org Machine Learning

Program synthesis is challenging largely because of the difficulty of search in a large space of programs. Human programmers routinely tackle the task of writing complex programs by writing sub-programs and then analysing their intermediate results to compose them in appropriate ways. Motivated by this intuition, we present a new synthesis approach that leverages learning to guide a bottom-up search over programs. In particular, we train a model to prioritize compositions of intermediate values during search conditioned on a given set of input-output examples. This is a powerful combination because of several emergent properties: First, in bottom-up search, intermediate programs can be executed, providing semantic information to the neural network. Second, given the concrete values from those executions, we can exploit rich features based on recent work on property signatures. Finally, bottom-up search allows the system substantial flexibility in what order to generate the solution, allowing the synthesizer to build up a program from multiple smaller sub-programs. Overall, our empirical evaluation finds that the combination of learning and bottom-up search is remarkably effective, even with simple supervised learning approaches. We demonstrate the effectiveness of our technique on a new data set for synthesis of string transformation programs.
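A minimal sketch of bottom-up search over string programs follows. It is not BUSTLE itself: the learned model that prioritizes compositions is replaced here by plain size-ordered enumeration, and the three-operation language (`upper`, `first`, `concat`) and the function name `bottom_up_synthesize` are assumptions. What it does show is the property the abstract highlights, that every intermediate program is executed on the examples, so its concrete values are available to guide or prune the search.

```python
from itertools import product
from typing import List, Optional

# Hypothetical toy DSL: name -> (arity, implementation)
OPS = {
    "upper": (1, lambda a: a.upper()),
    "first": (1, lambda a: a[:1]),
    "concat": (2, lambda a, b: a + b),
}


def bottom_up_synthesize(
    inputs: List[str], outputs: List[str], max_size: int = 5
) -> Optional[str]:
    """Enumerate programs by size, executing each on all example inputs."""
    target = tuple(outputs)
    start = tuple(inputs)
    if start == target:
        return "x"
    by_size = {1: [("x", start)]}   # size -> list of (expression, values)
    seen = {start}                  # value tuples already produced
    for size in range(2, max_size + 1):
        by_size[size] = []
        for name, (arity, fn) in OPS.items():
            for arg_sizes in product(range(1, size), repeat=arity):
                if sum(arg_sizes) != size - 1:
                    continue
                for args in product(*(by_size[s] for s in arg_sizes)):
                    # Execute the candidate on every example.
                    values = tuple(fn(*vals) for vals in zip(*(a[1] for a in args)))
                    if values in seen:
                        continue    # same behavior as a smaller program
                    seen.add(values)
                    expr = f"{name}({', '.join(a[0] for a in args)})"
                    if values == target:
                        return expr
                    by_size[size].append((expr, values))
    return None


program = bottom_up_synthesize(["alice", "bob"], ["Aalice", "Bbob"])
```

A learned prioritizer would slot in where candidates are appended to `by_size`, for example by reordering or reweighting them based on their executed values.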