
Collaborating Authors: Gallé, Matthias


If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs

arXiv.org Artificial Intelligence

Model merging has shown great promise at combining expert models, but the benefit of merging is unclear when merging "generalist" models trained on many tasks. We explore merging in the context of large (~100B-parameter) models, by recycling checkpoints that exhibit tradeoffs among different tasks. Such checkpoints are often created in the process of developing a frontier model, and many suboptimal ones are usually discarded. Given a pool of model checkpoints obtained from different training runs (e.g., different stages, objectives, hyperparameters, and data mixtures), which naturally show tradeoffs across different language capabilities (e.g., instruction following vs. code generation), we investigate whether merging can recycle such suboptimal models into a Pareto-optimal one. Our optimization algorithm tunes the weight of each checkpoint in a linear combination, resulting in a Pareto-optimal model that outperforms both individual models and merge-based baselines. Further analysis shows that good merges tend to include almost all checkpoints with non-zero weights, indicating that even seemingly bad initial checkpoints can contribute to good final merges.
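The core mechanic described here, merging checkpoints as a weighted linear combination of parameters, can be sketched in a few lines. This is a minimal illustration only; the paper's actual optimizer for tuning the mixing weights is not reproduced, and the checkpoint paths are hypothetical.

```python
# Minimal sketch of linear checkpoint merging (illustrative; the paper's
# weight-tuning optimization is not reproduced here).
import torch

def linear_merge(state_dicts, weights):
    """Merge checkpoints as a weighted linear combination of parameters."""
    assert len(state_dicts) == len(weights)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name].float() for sd, w in zip(state_dicts, weights))
    return merged

# Hypothetical usage: weights would come from an outer search loop that
# evaluates candidate weight vectors on held-out tasks.
# checkpoints = [torch.load(p, map_location="cpu") for p in checkpoint_paths]
# merged = linear_merge(checkpoints, weights=[0.5, 0.3, 0.2])
```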


Commit0: Library Generation from Scratch

arXiv.org Artificial Intelligence

Commit0 is a benchmark that challenges AI agents to write libraries from scratch. Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests, with the goal of producing an implementation of this API accordingly. The implementation is validated by running these unit tests. Our experiments demonstrate that while current agents can pass some unit tests, none can yet reproduce a full library. Results also show that interactive feedback is quite useful for models to generate code that passes more unit tests, validating the benchmark's support for such feedback.

AI agents have been improving rapidly in ability, particularly in domains such as problem-solving, math, and coding. Tasks related to software development have been a particularly promising area due to both their clarity of evaluation and their economic value. This has motivated the release of several coding benchmarks in recent years (Hendrycks et al., 2021a; Chen et al., 2021; Zhuo et al., 2024). A notable example is SWE-bench (Jimenez et al., 2024), which assesses the ability of agents to generate patches that resolve real-world GitHub issues. While valuable, these tasks generally remain within the skill set of an experienced software engineer, and if LLM systems continue to improve at current rates, they will become completely solvable. We are interested in benchmarks that lie beyond both the frontier of expert human ability and that of current models.
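A sketch of the kind of interactive test-feedback loop the benchmark enables is shown below. The `agent_generate` callable and the pytest invocation are hypothetical stand-ins, not the benchmark's actual harness.

```python
# Illustrative test-feedback loop, assuming a pytest-based test suite.
# `agent_generate` is a hypothetical stand-in for any code-writing agent.
import subprocess

def run_tests(repo_dir):
    """Run the library's unit tests; return (passed, captured output)."""
    proc = subprocess.run(
        ["python", "-m", "pytest", "--tb=short"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr

def solve(spec, repo_dir, agent_generate, max_rounds=5):
    feedback = ""
    for _ in range(max_rounds):
        agent_generate(spec, feedback, repo_dir)  # writes an implementation
        passed, output = run_tests(repo_dir)
        if passed:
            return True
        feedback = output  # failing tests become the next round's feedback
    return False
```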


On Leakage of Code Generation Evaluation Datasets

arXiv.org Artificial Intelligence

Code generation has emerged as an important skill for large language models to master. Measuring recent progress in code generation has relied on few, critical benchmarks to judge performance between model families and checkpoints. While many recent sophisticated evaluation datasets have been proposed (Jain et al., 2024; Jimenez et al., 2024), [...] A second possibility is that contamination happens indirectly through the use of synthetic data -- a widespread paradigm used in particular to increase code capabilities by generating additional code training tokens. Finally, we argue that final model selection might have been overly influenced by their performance on these datasets, overfitting to performance on these metrics over general-purpose code-oriented skills.
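One generic way to probe the direct form of leakage discussed here is to measure n-gram overlap between training documents and benchmark solutions. The sketch below is a standard contamination heuristic, not necessarily the measurement used in the paper.

```python
# Generic n-gram overlap probe for direct data leakage (illustrative only;
# not necessarily the paper's exact measurement).
def ngrams(text, n=13):
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(benchmark_solutions, training_docs, n=13):
    """Fraction of benchmark solutions sharing an n-gram with training data."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    hits = sum(1 for sol in benchmark_solutions if ngrams(sol, n) & train_grams)
    return hits / max(len(benchmark_solutions), 1)
```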


LLMCRIT: Teaching Large Language Models to Use Criteria

arXiv.org Artificial Intelligence

Humans follow criteria when they execute tasks, and these criteria are directly used to assess the quality of task completion. Therefore, having models learn to use criteria to provide feedback can help humans or models perform tasks better. However, existing research in this field tends to consider only a limited set of criteria or quality-assessment aspects. To fill this gap, we propose a general framework that enables large language models (LLMs) to use comprehensive criteria for a task when delivering natural language feedback on task execution. In particular, we present a model-in-the-loop framework that semi-automatically derives criteria from collected guidelines for different writing tasks and constructs in-context demonstrations for each criterion. We operationalize this idea on three tasks from real-world scenarios: paper introduction writing, Python code writing, and Reddit post writing, and we evaluate our feedback generation framework using different LLMs. The results reveal the fine-grained effects of incorporating criteria and demonstrations and provide valuable insights into how to teach LLMs to use criteria more effectively.
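The framework's central move, pairing each criterion with an in-context demonstration when prompting for feedback, can be sketched as below. The prompt template and the `llm` callable are hypothetical stand-ins for any chat-completion API, not the paper's exact prompts.

```python
# Sketch of criterion-conditioned feedback generation. The prompt wording
# and `llm` callable are hypothetical, not the paper's actual templates.
def make_prompt(task, criterion, demonstration, submission):
    return (
        f"Task: {task}\n"
        f"Criterion: {criterion['name']} -- {criterion['description']}\n"
        f"Example feedback for this criterion:\n{demonstration}\n\n"
        f"Submission:\n{submission}\n\n"
        "Give natural-language feedback on the submission for this criterion."
    )

def criteria_feedback(llm, task, criteria, demos, submission):
    """One feedback string per criterion, each grounded by its own demo."""
    return {
        c["name"]: llm(make_prompt(task, c, demos[c["name"]], submission))
        for c in criteria
    }
```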


Improving Reward Models with Synthetic Critiques

arXiv.org Artificial Intelligence

Reward models (RMs) play a critical role in aligning language models through the process of reinforcement learning from human feedback. RMs are trained to predict a score reflecting human preference, which requires significant time and cost for human annotation. Additionally, RMs tend to quickly overfit to superficial features in the training set, hindering their generalization performance on unseen distributions. We propose a novel approach that uses synthetic natural language critiques generated by large language models to provide additional feedback, evaluating aspects such as instruction following, correctness, and style. This offers richer signals and more robust features for RMs to assess and score on. We demonstrate that high-quality critiques improve the performance and data efficiency of RMs initialized from different pretrained models. Conversely, we also show that low-quality critiques negatively impact performance. Furthermore, incorporating critiques enhances the interpretability and robustness of RM training.
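The idea can be sketched as conditioning the reward model on a generated critique alongside the prompt and response. The function names and prompt wording below are hypothetical; this is a shape-of-the-approach sketch, not the paper's implementation.

```python
# Sketch of critique-augmented reward scoring: a critique is generated and
# appended to the (prompt, response) pair before the RM scores it.
# `reward_model` and `critique_llm` are hypothetical callables.
def critique_augmented_score(reward_model, critique_llm, prompt, response):
    critique = critique_llm(
        "Critique the following answer for instruction following, "
        f"correctness, and style.\n\nPrompt: {prompt}\n\nAnswer: {response}"
    )
    rm_input = f"{prompt}\n{response}\nCritique: {critique}"
    return reward_model(rm_input)  # scalar preference score
```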


Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

arXiv.org Artificial Intelligence

AI alignment in the shape of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high-performance large language models. Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning. We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF, and we advocate for a less computationally expensive method that preserves and even increases performance. We revisit the formulation of alignment from human preferences in the context of RL. Keeping simplicity as a guiding principle, we show that many components of PPO are unnecessary in an RLHF context and that far simpler REINFORCE-style optimization variants outperform both PPO and newly proposed "RL-free" methods such as DPO and RAFT. Our work suggests that careful adaptation to the characteristics of LLM alignment makes it possible to benefit from online RL optimization at low cost.
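For concreteness, a minimal REINFORCE-with-baseline loss is shown below in PyTorch. This is the textbook estimator only; the paper's specific variants (e.g., leave-one-out baselines over multiple samples per prompt) are not reproduced.

```python
# Minimal REINFORCE-with-baseline loss (PyTorch). Illustrative sketch of the
# textbook estimator; the paper's exact variants are not shown.
import torch

def reinforce_loss(logprobs, rewards):
    """logprobs: (batch,) sum of per-token log-probs of each sampled completion.
    rewards:  (batch,) scalar reward per completion (e.g., from an RM)."""
    baseline = rewards.mean()                 # simple batch-mean baseline
    advantage = rewards - baseline
    return -(advantage.detach() * logprobs).mean()
```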


BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

arXiv.org Artificial Intelligence

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built through a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
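Since the checkpoints are openly released on the Hugging Face Hub under the bigscience organization, they load with the standard transformers API. The example below uses the small 560M variant so it runs on a single machine; the full 176B model requires multi-GPU sharding.

```python
# Loading a released BLOOM checkpoint with Hugging Face transformers
# (the 560M variant shown here; "bigscience/bloom" is the full 176B model).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

inputs = tokenizer("BLOOM is a multilingual model that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```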


A Maximum Matching Algorithm for Basis Selection in Spectral Learning

arXiv.org Machine Learning

We present a solution to scale spectral algorithms for learning sequence functions. We are interested in the case where these functions are sparse (that is, for most sequences they return 0). Spectral algorithms reduce the learning problem to computing a singular value decomposition (SVD) of a special type of matrix called the Hankel matrix, which is designed to capture the relevant statistics of the training sequences. Crucially, capturing long-range dependencies requires very large Hankel matrices, so the computation of the SVD becomes a critical bottleneck. Our solution finds a subset of rows and columns of the Hankel matrix that realizes a compact and informative Hankel submatrix. The novelty lies in the way this subset is selected: we exploit a maximum bipartite matching algorithm to look for a sub-block with full structural rank, and we show how the computation of this sub-block can be further improved by exploiting the specific structure of Hankel matrices.
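The combinatorial core, finding a sub-block with full structural rank via maximum bipartite matching on the nonzero pattern, can be demonstrated with SciPy's sparse-graph routines. This is a toy illustration on a small sparse matrix, not the paper's code or its Hankel-specific speedups.

```python
# Maximum bipartite matching over the nonzero pattern of a sparse matrix
# yields a row/column subset with full structural rank (toy illustration).
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching, structural_rank

H = csr_matrix(np.array([
    [1, 0, 0, 2],
    [0, 3, 0, 0],
    [0, 0, 0, 4],
]))  # stand-in for a large, sparse Hankel block

match = maximum_bipartite_matching(H, perm_type="column")  # column per row, -1 if unmatched
rows = np.where(match != -1)[0]   # rows covered by the matching
cols = match[rows]                # their matched columns
sub = H[rows][:, cols]            # sub-block with full structural rank
print(structural_rank(sub), sub.shape)  # 3 (3, 3)
```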


What Can I Do Now? Guiding Users in a World of Automated Decisions

arXiv.org Machine Learning

More and more of the processes governing our lives include an automated decision step, in which -- based on a feature vector derived from an applicant -- an algorithm has the decision power over the final outcome. Here we present a simple idea that gives some of this power back to the applicant, by providing her with alternatives that would make the decision algorithm decide differently. It is based on a formalization reminiscent of methods used for evasion attacks, and consists of enumerating the subspaces in which the classifier decides the desired output. We implement this for the specific case of decision forests (ensemble methods based on decision trees), mapping the problem to an iterative version of enumerating $k$-cliques.
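The single-tree building block of such an enumeration can be sketched with scikit-learn: walk a fitted decision tree and collect the axis-aligned regions where it predicts the desired class. The forest-level step described in the abstract (intersecting regions across trees via k-clique enumeration) is not reproduced here.

```python
# Sketch: enumerate the axis-aligned regions where a fitted sklearn decision
# tree predicts the desired class. The paper's forest-level k-clique
# enumeration across trees is not reproduced.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def desired_regions(tree: DecisionTreeClassifier, desired_class, n_features):
    t = tree.tree_
    regions = []

    def walk(node, lo, hi):
        if t.children_left[node] == -1:  # leaf node
            if np.argmax(t.value[node]) == desired_class:
                regions.append((lo.copy(), hi.copy()))
            return
        f, thr = t.feature[node], t.threshold[node]
        old = hi[f]; hi[f] = min(hi[f], thr)
        walk(t.children_left[node], lo, hi)   # branch where x[f] <= thr
        hi[f] = old
        old = lo[f]; lo[f] = max(lo[f], thr)
        walk(t.children_right[node], lo, hi)  # branch where x[f] > thr
        lo[f] = old

    walk(0, np.full(n_features, -np.inf), np.full(n_features, np.inf))
    return regions  # each region is a (lower, upper) bound box over features
```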