Campos, Jon Ander
From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions
Rakotonirina, Nathanaël Carraz, Hamdy, Mohammed, Campos, Jon Ander, Weber, Lucas, Testoni, Alberto, Fadaee, Marzieh, Pezzelle, Sandro, Del Tredici, Marco
Large Language Models (LLMs) are increasingly used in working environments for a wide range of tasks, excelling at solving individual problems in isolation. However, are they also able to effectively collaborate over long-term interactions? To investigate this, we introduce MemoryCode, a synthetic multi-session dataset designed to test LLMs' ability to track and execute simple coding instructions amid irrelevant information, simulating a realistic setting. While all the models we tested handle isolated instructions well, even the performance of state-of-the-art models like GPT-4o deteriorates when instructions are spread across sessions. Our analysis suggests this is due to their failure to retrieve and integrate information over long instruction chains. Our results highlight a fundamental limitation of current LLMs, restricting their ability to collaborate effectively in long interactions.
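To make the setup concrete, here is a minimal sketch of what a MemoryCode-style evaluation instance could look like; the session schema, the instruction checks, and the evaluate helper are illustrative assumptions, not the dataset's actual format.

```python
import re

# Hypothetical illustration of a MemoryCode-style instance (the dataset's real
# schema may differ). Each session mixes a coding instruction with unrelated
# filler, and the final task requires applying every instruction given so far.
sessions = [
    {"mentor": "Remember: always start function names with 'x_'.",
     "filler": "We also chatted about the new office coffee machine."},
    {"mentor": "From now on, every function needs a one-line docstring.",
     "filler": "The quarterly planning meeting moved to Thursday."},
]
final_task = "Write a function that returns the square of a number."

# Simple regex checks standing in for the instruction-following evaluation.
checks = {
    "prefix":    lambda code: re.search(r"def\s+x_\w+", code) is not None,
    "docstring": lambda code: re.search(r'def .*:\s*\n\s+"""', code) is not None,
}

def evaluate(model_output: str) -> dict:
    """Return which accumulated instructions the generated code respects."""
    return {name: check(model_output) for name, check in checks.items()}

# Example: a response that follows both instructions across sessions.
candidate = 'def x_square(n):\n    """Return the square of n."""\n    return n * n\n'
print(evaluate(candidate))  # {'prefix': True, 'docstring': True}
```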
LLMs can implicitly learn from mistakes in-context
Alazraki, Lisa, Mozes, Maximilian, Campos, Jon Ander, Tan, Yi Chern, Rei, Marek, Bartolo, Max
Learning from mistakes is a fundamental feature of human intelligence. Previous work has shown that Large Language Models (LLMs) can also learn from incorrect answers when provided with a comprehensive rationale detailing why an answer is wrong or how to correct it. In this work, we examine whether LLMs can learn from mistakes in mathematical reasoning tasks when these explanations are not provided. We investigate if LLMs are able to implicitly infer such rationales simply from observing both incorrect and correct answers. Surprisingly, we find that LLMs perform better, on average, when rationales are eliminated from the context and incorrect answers are simply shown alongside correct ones. This approach also substantially outperforms chain-of-thought prompting in our evaluations. We show that these results are consistent across LLMs of different sizes and varying reasoning abilities. Further, we carry out an in-depth analysis and show that prompting with both wrong and correct answers leads to greater performance and better generalisation than introducing additional, more diverse question-answer pairs into the context. Finally, we show that new rationales generated by models that have only observed incorrect and correct answers are rated by humans as highly as those produced with the aid of exemplar rationales. Our results demonstrate that LLMs are indeed capable of in-context implicit learning.
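As an illustration of the prompting setup described above, the sketch below builds an in-context prompt that pairs incorrect and correct answers without any rationale; the exemplars and the build_prompt helper are hypothetical, not the paper's exact template.

```python
# A minimal sketch (not the paper's prompt format) of the in-context setup:
# each exemplar shows an incorrect answer next to the correct one, with no
# explanation of why the first answer is wrong.
exemplars = [
    {"question": "What is 17 * 24?", "incorrect": "398", "correct": "408"},
    {"question": "What is 15% of 60?", "incorrect": "12", "correct": "9"},
]

def build_prompt(exemplars, target_question: str) -> str:
    """Concatenate (question, incorrect answer, correct answer) exemplars,
    then append the new question to be answered."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Incorrect answer: {ex['incorrect']}\n"
            f"Correct answer: {ex['correct']}\n"
        )
    parts.append(f"Question: {target_question}\nCorrect answer:")
    return "\n".join(parts)

print(build_prompt(exemplars, "What is 23 * 19?"))
```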
Aya Expanse: Combining Research Breakthroughs for a New Multilingual Frontier
Dang, John, Singh, Shivalika, D'souza, Daniel, Ahmadian, Arash, Salamanca, Alejandro, Smith, Madeline, Peppin, Aidan, Hong, Sungjin, Govindassamy, Manoj, Zhao, Terrence, Kublik, Sandra, Amer, Meor, Aryabumi, Viraat, Campos, Jon Ander, Tan, Yi-Chern, Kocmi, Tom, Strub, Florian, Grinsztajn, Nathan, Flet-Berliac, Yannis, Locatelli, Acyr, Lin, Hangyu, Talupuru, Dwarak, Venkitesh, Bharat, Cairuz, David, Yang, Bowen, Chung, Tim, Ko, Wei-Yin, Shi, Sylvie Shang, Shukayev, Amir, Bae, Sammie, Piktus, Aleksandra, Castagné, Roman, Cruz-Salinas, Felipe, Kim, Eddie, Crawhall-Stein, Lucas, Morisot, Adrien, Roy, Sudip, Blunsom, Phil, Zhang, Ivan, Gomez, Aidan, Frosst, Nick, Fadaee, Marzieh, Ermis, Beyza, Üstün, Ahmet, Hooker, Sara
We introduce the Aya Expanse model family, a new generation of 8B and 32B parameter multilingual language models, aiming to address the critical challenge of developing highly performant multilingual models that match or surpass the capabilities of monolingual models. By leveraging several years of research at Cohere For AI and Cohere, including advancements in data arbitrage, multilingual preference training, and model merging, Aya Expanse sets a new state-of-the-art in multilingual performance. Our evaluations on the Arena-Hard-Auto dataset, translated into 23 languages, demonstrate that Aya Expanse 8B and 32B outperform leading open-weight models in their respective parameter classes, including Gemma 2, Qwen 2.5, and Llama 3.1, achieving up to a 76.6% win-rate. Notably, Aya Expanse 32B outperforms Llama 3.1 70B, a model with twice as many parameters, achieving a 54.0% win-rate. In this short technical report, we present extended evaluation results for the Aya Expanse model family and release their open-weights, together with a new multilingual evaluation dataset m-ArenaHard.
Improving Reward Models with Synthetic Critiques
Ye, Zihuiwen, Greenlee-Scott, Fraser, Bartolo, Max, Blunsom, Phil, Campos, Jon Ander, Gallé, Matthias
Reward models (RMs) play a critical role in aligning language models through reinforcement learning from human feedback. RMs are trained to predict a score reflecting human preference, which requires significant time and cost for human annotation. Additionally, RMs tend to quickly overfit on superficial features in the training set, hindering their generalization performance on unseen distributions. We propose a novel approach using synthetic natural language critiques generated by large language models to provide additional feedback, evaluating aspects such as instruction following, correctness, and style. This provides richer signals and more robust features for RMs to use when assessing and scoring responses. We demonstrate that high-quality critiques improve the performance and data efficiency of RMs initialized from different pretrained models. Conversely, we also show that low-quality critiques negatively impact performance. Furthermore, incorporating critiques enhances the interpretability and robustness of RM training.
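The sketch below illustrates the general idea of conditioning a reward model on a synthetic critique; generate_critique and build_rm_input are hypothetical stand-ins for the LLM call and the RM input format actually used in the paper.

```python
# A hedged sketch of augmenting a reward model's input with a synthetic
# critique. `generate_critique` stands in for a large language model call;
# the concrete prompt and RM architecture in the paper may differ.
def generate_critique(prompt: str, response: str) -> str:
    """Placeholder for an LLM call that critiques instruction following,
    correctness, and style of `response` given `prompt`."""
    return ("The response addresses the instruction but does not verify "
            "its factual claims and is overly verbose.")

def build_rm_input(prompt: str, response: str, critique: str) -> str:
    """Format (prompt, response, critique) into a single sequence that the
    reward model scores, so the critique acts as an additional feature."""
    return (f"Prompt: {prompt}\n"
            f"Response: {response}\n"
            f"Critique: {critique}\n"
            f"Score:")

prompt = "Summarise the main findings of the attached report in two sentences."
response = "The report shows revenue grew 12% and costs fell slightly."
critique = generate_critique(prompt, response)
print(build_rm_input(prompt, response, critique))
```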
Aya 23: Open Weight Releases to Further Multilingual Progress
Aryabumi, Viraat, Dang, John, Talupuru, Dwarak, Dash, Saurabh, Cairuz, David, Lin, Hangyu, Venkitesh, Bharat, Smith, Madeline, Campos, Jon Ander, Tan, Yi Chern, Marchisio, Kelly, Bartolo, Max, Ruder, Sebastian, Locatelli, Acyr, Kreutzer, Julia, Frosst, Nick, Gomez, Aidan, Blunsom, Phil, Fadaee, Marzieh, Üstün, Ahmet, Hooker, Sara
This technical report introduces Aya 23, a family of multilingual language models. Aya 23 builds on the recent release of the Aya model (Üstün et al., 2024), focusing on pairing a highly performant pre-trained model with the recently released Aya collection (Singh et al., 2024). The result is a powerful multilingual large language model serving 23 languages, expanding state-of-the-art language modeling capabilities to approximately half of the world's population. The Aya model covered 101 languages, whereas Aya 23 is an experiment in depth vs. breadth, exploring the impact of allocating more capacity to fewer languages that are included during pre-training. Aya 23 outperforms both previous massively multilingual models like Aya 101 for the languages it covers, as well as widely used models like Gemma, Mistral and Mixtral on an extensive range of discriminative and generative tasks. We release the open weights for both the 8B and 35B models as part of our continued commitment to expanding access to multilingual progress.
When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively
Labruna, Tiziano, Campos, Jon Ander, Azkune, Gorka
In this paper, we demonstrate how Large Language Models (LLMs) can effectively learn to use an off-the-shelf information retrieval (IR) system specifically when additional context is required to answer a given question. Given the performance of IR systems, the optimal strategy for question answering does not always entail external information retrieval; rather, it often involves leveraging the parametric memory of the LLM itself. Prior research has identified this phenomenon in the PopQA dataset, wherein the most popular questions are effectively addressed using the LLM's parametric memory, while less popular ones require IR system usage. Following this, we propose a tailored training approach for LLMs, leveraging existing open-domain question answering datasets. Here, LLMs are trained to generate a special token that signals when the IR system should be queried, i.e. when the question cannot be answered from parametric memory alone.
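A minimal sketch of this adaptive-retrieval behaviour is shown below; the token name and the generate and retrieve functions are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of answering with an optional retrieval step. The special
# token name and both helper functions are placeholders.
RETRIEVAL_TOKEN = "<RET>"  # hypothetical marker emitted when context is needed

def generate(prompt: str) -> str:
    """Placeholder for an LLM call; a fine-tuned model either answers directly
    or emits RETRIEVAL_TOKEN when it needs external context."""
    if prompt.startswith("Context:"):
        return "Bernardo Atxaga"
    return RETRIEVAL_TOKEN

def retrieve(question: str) -> str:
    """Placeholder for an off-the-shelf IR system (e.g. a Wikipedia retriever)."""
    return "Passage about Basque literature ..."

def answer(question: str) -> str:
    first_pass = generate(f"Question: {question}\nAnswer:")
    if first_pass.strip() != RETRIEVAL_TOKEN:
        # The model is confident it can answer from parametric memory.
        return first_pass
    # Otherwise, call the retriever and answer again with the extra context.
    context = retrieve(question)
    return generate(f"Context: {context}\nQuestion: {question}\nAnswer:")

print(answer("Who wrote the novel 'Obabakoak'?"))
```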
NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark
Sainz, Oscar, Campos, Jon Ander, García-Ferrero, Iker, Etxaniz, Julen, de Lacalle, Oier Lopez, Agirre, Eneko
In this position paper, we argue that the classical evaluation of Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble. The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark and then evaluated on the same benchmark. The extent of the problem is unknown, as it is not straightforward to measure. Contamination causes an overestimation of the performance of a contaminated model on a target benchmark and associated task with respect to its non-contaminated counterpart. The consequences can be very harmful, with wrong scientific conclusions being published while other correct ones are discarded. This position paper defines different levels of data contamination and argues for a community effort, including the development of automatic and semi-automatic measures to detect when data from a benchmark was exposed to a model, and suggestions for flagging papers with conclusions that are compromised by data contamination.
Unsupervised Domain Adaption for Neural Information Retrieval
Dominguez, Carlos, Campos, Jon Ander, Agirre, Eneko, Azkune, Gorka
Neural information retrieval requires costly annotated data for each target domain to be competitive. Synthetic annotation by query generation using Large Language Models or rule-based string manipulation has been proposed as an alternative, but their relative merits have not been analysed. In this paper, we compare both methods head-to-head using the same neural IR architecture. We focus on the BEIR benchmark, which includes test datasets from several domains with no training data, and explore two scenarios: zero-shot, where the supervised system is trained on a large out-of-domain dataset (MS-MARCO); and unsupervised domain adaptation, where, in addition to MS-MARCO, the system is fine-tuned on synthetic data from the target domain. Our results indicate that Large Language Models outperform rule-based methods in all scenarios by a large margin and, more importantly, that unsupervised domain adaptation is effective compared to applying a supervised IR system in a zero-shot fashion. In addition, we explore several sizes of open Large Language Models for generating the synthetic queries.
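The sketch below contrasts the two synthetic-annotation strategies compared in the paper; rule_based_query, llm_query, and build_training_pairs are simplified, hypothetical stand-ins for the actual generators and fine-tuning pipeline.

```python
import random

# Hedged illustration of the two synthetic-annotation strategies: rule-based
# string manipulation vs. LLM query generation, both producing (query, document)
# pairs for fine-tuning a retriever on the target domain.
def rule_based_query(document: str) -> str:
    """Rule-based string manipulation: sample a short span of the document
    and treat it as a pseudo-query."""
    words = document.split()
    start = random.randrange(max(1, len(words) - 6))
    return " ".join(words[start:start + 6])

def llm_query(document: str) -> str:
    """Placeholder for prompting an open LLM to write a question that the
    document answers."""
    return "What do neural retrievers do with queries and documents?"

def build_training_pairs(target_domain_docs):
    """Produce (query, positive document) pairs used to fine-tune the
    retriever on the target domain, on top of MS-MARCO supervision."""
    return [(llm_query(doc), doc) for doc in target_domain_docs]

docs = ["Neural retrievers map queries and documents into a shared vector space."]
print(rule_based_query(docs[0]))
print(build_training_pairs(docs))
```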
IXA/Cogcomp at SemEval-2023 Task 2: Context-enriched Multilingual Named Entity Recognition using Knowledge Bases
García-Ferrero, Iker, Campos, Jon Ander, Sainz, Oscar, Salaberria, Ander, Roth, Dan
Named Entity Recognition (NER) is a core natural language processing task in which pretrained language models have shown remarkable performance. However, standard benchmarks like CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) do not address many of the challenges that deployed NER systems face, such as having to classify emerging or complex entities in a fine-grained way. In this paper we present a novel NER cascade approach comprising three steps: first, identifying candidate entities in the input sentence; second, linking each candidate to an existing knowledge base; third, predicting the fine-grained category for each entity candidate. We empirically demonstrate the significance of external knowledge bases in accurately classifying fine-grained and emerging entities. Our system exhibits robust performance in the MultiCoNER2 (Fetahu et al., 2023b) shared task, even in the low-resource language setting where we leverage knowledge bases of high-resource languages.
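The following sketch illustrates the three-step cascade at a toy scale; the candidate detector, the in-memory knowledge base, and the classify heuristic are placeholders for the pretrained models and large multilingual knowledge bases used in the actual system.

```python
import re

# Toy stand-in for an external knowledge base keyed by entity mention.
KNOWLEDGE_BASE = {
    "Jon Ander Campos": "Researcher working on natural language processing.",
    "SemEval": "A series of international NLP shared-task evaluations.",
}

def identify_candidates(sentence: str):
    """Step 1: propose candidate entity spans (naively, runs of capitalised words)."""
    return re.findall(r"(?:[A-Z]\w*(?:\s+[A-Z]\w*)*)", sentence)

def link_to_kb(candidate: str):
    """Step 2: link the candidate to a knowledge-base entry, if one exists."""
    return KNOWLEDGE_BASE.get(candidate)

def classify(candidate: str, kb_description):
    """Step 3: predict a fine-grained category, using the KB description as
    extra context (a real system feeds this to a fine-grained classifier)."""
    if kb_description is None:
        return "UNKNOWN"
    return "PERSON" if "Researcher" in kb_description else "EVENT"

sentence = "Jon Ander Campos presented the system at SemEval this year."
for cand in identify_candidates(sentence):
    print(cand, "->", classify(cand, link_to_kb(cand)))
```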
Training Language Models with Language Feedback at Scale
Scheurer, Jérémy, Campos, Jon Ander, Korbak, Tomasz, Chan, Jun Shern, Chen, Angelica, Cho, Kyunghyun, Perez, Ethan
Pretrained language models often generate outputs that are not in line with human preferences, such as harmful text or factually incorrect summaries. Recent work approaches the above issues by learning from a simple form of human feedback: comparisons between pairs of model-generated outputs. However, comparison feedback only conveys limited information about human preferences. In this paper, we introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback. ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements. Second, selecting the refinement incorporating the most feedback. Third, finetuning the language model to maximize the likelihood of the chosen refinement given the input. We show theoretically that ILF can be viewed as Bayesian Inference, similar to Reinforcement Learning from human feedback. We evaluate ILF's effectiveness on a carefully-controlled toy task and a realistic summarization task. Our experiments demonstrate that large language models accurately incorporate feedback and that finetuning with ILF scales well with the dataset size, even outperforming finetuning on human summaries. Learning from both language and comparison feedback outperforms learning from each alone, achieving human-level summarization performance.
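Below is a hedged sketch of a single ILF iteration; refine, select_best, and ilf_step are illustrative placeholders for the LM sampling, refinement selection, and fine-tuning steps, not the paper's code.

```python
# Hedged sketch of one ILF iteration: (1) sample refinements conditioned on the
# input, the initial output, and language feedback; (2) select the refinement
# that best incorporates the feedback; (3) fine-tune on the chosen refinements.
def refine(task_input: str, initial_output: str, feedback: str, n: int = 3):
    """Step 1: placeholder for conditioning the LM on input, its initial output,
    and the language feedback to sample several candidate refinements."""
    return [f"{initial_output} (refined, attempt {i})" for i in range(n)]

def select_best(refinements, feedback: str) -> str:
    """Step 2: pick the refinement that incorporates the feedback most fully
    (in practice, scored by a model; here, a trivial stand-in)."""
    return max(refinements, key=len)

def ilf_step(dataset):
    """Step 3: collect (input, chosen refinement) pairs to be used for
    fine-tuning the LM via maximum likelihood."""
    finetune_pairs = []
    for task_input, initial_output, feedback in dataset:
        candidates = refine(task_input, initial_output, feedback)
        finetune_pairs.append((task_input, select_best(candidates, feedback)))
    return finetune_pairs  # passed to a standard supervised fine-tuning loop

example = [("Summarise the article.", "A short but vague summary.",
            "The summary omits the main result; mention it explicitly.")]
print(ilf_step(example))
```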