Dissecting Chain-of-Thought: Compositionality through In-Context Filtering and Learning
Chain-of-thought (CoT) is a method that enables language models to handle complex reasoning tasks by decomposing them into simpler steps. Despite its success, the underlying mechanics of CoT are not yet fully understood. To shed light on this, our study investigates the impact of CoT on the ability of transformers to in-context learn a simple-to-study yet general family of compositional functions: multi-layer perceptrons (MLPs). In this setting, we find that the success of CoT can be attributed to breaking down in-context learning of a compositional function into two distinct phases: focusing on and filtering the data relevant to each step of the composition, and in-context learning the single-step composition function. Through both experimental and theoretical evidence, we demonstrate how CoT significantly reduces the sample complexity of in-context learning (ICL) and facilitates the learning of complex functions that non-CoT methods struggle with. Furthermore, we illustrate how transformers can transition from vanilla in-context learning to mastering a compositional function with CoT simply by incorporating additional layers that perform the necessary data filtering for CoT via the attention mechanism. Beyond these test-time benefits, we show that CoT accelerates pretraining by learning shortcuts to represent complex functions, and that filtering plays an important role in this process. These findings collectively provide insights into the mechanics of CoT, inviting further investigation of its role in complex reasoning tasks.
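As a concrete illustration of the two-phase view, the sketch below mimics CoT on a two-layer ReLU MLP, with plain least squares standing in for the transformer's in-context learner: the "filtering" phase restricts attention to the (x, h) or (h, y) pairs relevant to the current step, and each step then becomes a simple single-layer problem. All dimensions and data here are invented for the example; this is a toy analogy, not the paper's transformer experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 4                      # input and hidden dimensions (arbitrary)
W1 = rng.normal(size=(k, d))     # layer-1 weights of the target MLP
w2 = rng.normal(size=k)          # layer-2 weights
relu = lambda z: np.maximum(z, 0.0)

# In-context demonstrations. With CoT, each example exposes the
# intermediate activation h, so the prompt contains (x, h, y) triples.
n = 50
X = rng.normal(size=(n, d))
H = relu(X @ W1.T)               # the intermediate "step"
y = H @ w2

# Phase 1 ("filtering" + single-step ICL): attend only to (x, h) pairs.
# For each hidden unit, restrict to examples where the ReLU is active;
# on those, h_j = w_j . x is linear, so least squares recovers w_j.
W1_hat = np.zeros_like(W1)
for j in range(k):
    on = H[:, j] > 0
    W1_hat[j] = np.linalg.lstsq(X[on], H[on, j], rcond=None)[0]

# Phase 2: attend only to (h, y) pairs and solve the linear output step.
w2_hat = np.linalg.lstsq(H, y, rcond=None)[0]

# Compose the two learned steps on a fresh query.
x_q = rng.normal(size=d)
pred = relu(W1_hat @ x_q) @ w2_hat
true = relu(W1 @ x_q) @ w2
print(abs(pred - true) < 1e-6)   # → True
```

Without the intermediate h in the prompt, the learner would have to fit the full nonlinear composition from (x, y) pairs alone, which is exactly the harder non-CoT problem the abstract contrasts against.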
0668e20b3c9e9185b04b3d2a9dc8fa2d-AuthorFeedback.pdf
First of all, we thank all reviewers for their valuable time and feedback. We thank the reviewers for pointing out typos and grammatical errors, which we have now fixed. R1: We are afraid the reviewer may have misunderstood some parts of the paper. We refer to the original paper for further details about the approximation of the variational posterior, and we have clarified this in the main paper.
Supplementary Material for "Lattice partition recovery with dyadic CART"
In this document, we provide further technical details and all the proofs of the results in "Lattice partition recovery with dyadic CART". Throughout, we repeatedly use the notion of two rectangles being adjacent. This section contains the proofs of the main results from Section 2. Theorem 1 demonstrates the one-sided consistency of DCART. The proof of (15) is identical to that of Theorem S1, with one difference. Assumption 1 then leads to a contradiction. We use Fano's method in this proof.
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
Web-crawled datasets underlie the impressive "zero-shot" performance of multimodal models, such as CLIP for classification and Stable Diffusion for image generation. However, it is unclear how meaningful the notion of "zero-shot" generalization is for such models, because the extent to which their pretraining datasets encompass the downstream concepts used in "zero-shot" evaluation is unknown. In this work, we ask: How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets? We comprehensively investigate this question across 34 models and 5 standard pretraining datasets, generating over 300GB of data artifacts. We consistently find that, far from exhibiting "zero-shot" generalization, multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance, following a sample-inefficient log-linear scaling trend. This trend persists even when controlling for sample-level similarity between pretraining and evaluation datasets [81], and when testing on purely synthetic data distributions [52]. Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models across the board perform poorly. We contribute this long-tail test dataset as the Let it Wag! benchmark to further research in this direction. Taken together, our study reveals an exponential need for training data, which implies that the key to "zero-shot" generalization capabilities under large-scale training data and compute paradigms remains to be found.
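The log-linear trend above can be written as accuracy ≈ a · log10(frequency) + b: a fixed additive accuracy gain per tenfold increase in concept frequency, i.e., exponentially more data for linear improvement. A minimal sketch of fitting such a trend, with made-up numbers rather than the paper's measurements:

```python
import numpy as np

# Hypothetical concept frequencies (pretraining occurrences) and
# downstream accuracies constructed to follow a log-linear trend;
# these are illustrative values, not the paper's data.
freq = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
acc = 0.08 * np.log10(freq) + 0.10

# Fit accuracy as a linear function of log10(frequency).
slope, intercept = np.polyfit(np.log10(freq), acc, 1)
print(round(slope, 3), round(intercept, 3))   # → 0.08 0.1

# Under this fit, each 10x increase in concept frequency buys the
# same fixed accuracy gain (here, 8 points) -- the sample-inefficient
# scaling the abstract describes.
```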
A Missing preliminaries
A mechanism R is randomized if for every profile of reported valuations b it outputs a randomized allocation, i.e., it returns integral allocations drawn from a probability distribution corresponding to a randomized allocation. Since every randomized allocation has an associated expected fractional allocation, the output of a randomized mechanism for reported valuations b can also be interpreted as representing a fractional allocation. Notice that for randomized mechanisms, the definition of NOM takes an expectation over the randomness of the mechanism, while the minimum/maximum are over the reports of the other agents; we sometimes write "NOM in expectation" when referring specifically to a randomized mechanism. The PS-Lottery algorithm is based on the well-known probabilistic serial algorithm, which outputs fractional allocations that are envy-free. At a high level, the PS-Lottery algorithm uses Birkhoff's algorithm to decompose the resulting fractional allocation into a probability distribution over integral allocations. For the sake of completeness, the PS-Lottery algorithm is formally described in Appendix B.1.
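As a minimal sketch of the decomposition step (assuming the simplest one-item-per-agent case, where fractional allocations are doubly stochastic matrices and integral allocations are permutations), Birkhoff's algorithm repeatedly extracts a permutation from the support of the matrix and subtracts it with the largest feasible probability:

```python
import numpy as np

def find_permutation(support):
    """Backtracking search for a permutation inside a boolean support mask.
    Birkhoff's theorem guarantees one exists for doubly stochastic matrices."""
    n = len(support)
    perm, used = [-1] * n, [False] * n
    def place(row):
        if row == n:
            return True
        for col in range(n):
            if support[row][col] and not used[col]:
                used[col], perm[row] = True, col
                if place(row + 1):
                    return True
                used[col] = False
        return False
    return perm if place(0) else None

def birkhoff(M, tol=1e-9):
    """Decompose a doubly stochastic matrix into a convex combination of
    permutation matrices, i.e., a lottery over integral allocations."""
    M = M.astype(float).copy()
    lottery = []
    while M.max() > tol:
        perm = find_permutation(M > tol)
        p = min(M[i, perm[i]] for i in range(len(perm)))
        lottery.append((p, tuple(perm)))
        for i, j in enumerate(perm):
            M[i, j] -= p
    return lottery

# A fractional allocation of 3 items among 3 agents (rows: agents).
frac = np.array([[0.5, 0.5, 0.0],
                 [0.5, 0.0, 0.5],
                 [0.0, 0.5, 0.5]])
for prob, perm in birkhoff(frac):
    print(prob, perm)
# → 0.5 (0, 2, 1)
# → 0.5 (1, 0, 2)
```

Sampling a permutation with these probabilities realizes the fractional allocation in expectation, which is how a randomized mechanism's output can be read either way, as described above.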
Robust Gaussian Processes via Relevance Pursuit
Sebastian Ament, Elizabeth Santorella, David Eriksson (Meta)
Gaussian processes (GPs) are non-parametric probabilistic regression models that are popular due to their flexibility, data efficiency, and well-calibrated uncertainty estimates. However, standard GP models assume homoskedastic Gaussian noise, while many real-world applications are subject to non-Gaussian corruptions. Variants of GPs that are more robust to alternative noise models have been proposed, but they entail significant trade-offs between accuracy and robustness, and between computational requirements and theoretical guarantees. In this work, we propose and study a GP model that achieves robustness against sparse outliers by inferring data-point-specific noise levels via a sequential selection procedure that maximizes the log marginal likelihood, which we refer to as relevance pursuit. We show, surprisingly, that the model can be parameterized such that the associated log marginal likelihood is strongly concave in the data-point-specific noise variances, a property rarely found in either robust regression objectives or GP marginal likelihoods. This in turn implies the weak submodularity of the corresponding subset selection problem, and thereby yields approximation guarantees for the proposed algorithm. We compare the model's performance to that of other approaches on diverse regression and Bayesian optimization tasks, including the challenging but common setting of sparse corruptions of the labels within or close to the function range.
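A toy numpy sketch of the greedy selection idea described above (not the authors' implementation): at each step, one data point is granted an extra noise variance rho, and the point whose inflation most increases the log marginal likelihood is kept, so corrupted labels are progressively "explained away". The kernel, hyperparameters, and data below are all invented for illustration.

```python
import numpy as np

def rbf(X, ls=0.1):
    """Squared-exponential kernel on 1-D inputs (illustrative choice)."""
    d = X[:, None] - X[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def log_marginal_likelihood(K, y):
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(y) * np.log(2 * np.pi))

def relevance_pursuit(X, y, num_steps, noise=0.1, rho=10.0):
    """Greedy sequential selection of data-point-specific noise variances:
    each step adds extra variance rho to the single point that most
    increases the GP log marginal likelihood."""
    n = len(y)
    base = rbf(X) + noise * np.eye(n)
    extra = np.zeros(n)
    selected = []
    for _ in range(num_steps):
        gains = np.full(n, -np.inf)
        for i in range(n):
            if i not in selected:
                trial = extra.copy()
                trial[i] = rho
                gains[i] = log_marginal_likelihood(base + np.diag(trial), y)
        best = int(np.argmax(gains))
        selected.append(best)
        extra[best] = rho
    return selected

rng = np.random.default_rng(1)
X = np.linspace(0.0, 1.0, 25)
y = np.sin(2 * np.pi * X) + 0.02 * rng.normal(size=25)
y[5] += 6.0                     # sparse label corruptions
y[17] -= 6.0
outliers = relevance_pursuit(X, y, num_steps=2)
print(sorted(outliers))         # the two corrupted points should be flagged
```

This brute-force variant re-evaluates the marginal likelihood for every candidate at every step; the paper's parameterization and concavity result are what make the real procedure efficient and give it approximation guarantees.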