Tan, Liang
A Systematic Examination of Preference Learning through the Lens of Instruction-Following
Kim, Joongwon; Goyal, Anirudh; Zhang, Aston; Xiong, Bo; Hou, Rui; Kambadur, Melanie; Mahajan, Dhruv; Hajishirzi, Hannaneh; Tan, Liang
Preference learning is a widely adopted post-training technique that aligns large language models (LLMs) with human preferences and improves specific downstream task capabilities. In this work, we systematically investigate how specific attributes of preference datasets affect the alignment and downstream performance of LLMs in instruction-following tasks. We use a novel synthetic data generation pipeline to generate 48,000 unique instruction-following prompts with combinations of 23 verifiable constraints that enable fine-grained and automated quality assessments of model responses. With our synthetic prompts, we use two preference dataset curation methods, rejection sampling (RS) and Monte Carlo Tree Search (MCTS), to obtain pairs of (chosen, rejected) responses. Then, we perform experiments investigating the effects of (1) the presence of shared prefixes between the chosen and rejected responses, (2) the contrast and quality of the chosen and rejected responses, and (3) the complexity of the training prompts. Our experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements and greater stability across challenging training configurations. High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance by balancing diversity and learning efficiency. Additionally, training on prompts of moderate difficulty leads to better generalization across tasks, even for more complex evaluation scenarios, compared to overly challenging prompts. Our findings provide actionable insights into optimizing preference data curation for instruction-following tasks, offering a scalable and effective framework for enhancing LLM training and alignment.
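The paper's pipeline is not reproduced here, but a minimal Python sketch of the general idea, scoring sampled responses against verifiable constraints and keeping the best- and worst-scoring ones as a (chosen, rejected) pair via rejection sampling, might look as follows; the constraint set and the `sample_response` helper are hypothetical stand-ins:

```python
from typing import Callable, Dict, List

# Hypothetical verifiable constraints: each maps a response string to pass/fail.
CONSTRAINTS: Dict[str, Callable[[str], bool]] = {
    "max_100_words": lambda r: len(r.split()) <= 100,
    "has_bullet_list": lambda r: any(line.lstrip().startswith("-") for line in r.splitlines()),
    "ends_with_question": lambda r: r.rstrip().endswith("?"),
}

def constraint_score(response: str, constraint_names: List[str]) -> float:
    """Fraction of the prompt's constraints that the response satisfies."""
    results = [CONSTRAINTS[name](response) for name in constraint_names]
    return sum(results) / len(results)

def rejection_sample_pair(prompt: str, constraint_names: List[str], sample_response, n_samples: int = 8):
    """Sample several responses; keep the best and worst scoring ones as (chosen, rejected)."""
    scored = sorted(
        ((constraint_score(r, constraint_names), r)
         for r in (sample_response(prompt) for _ in range(n_samples))),
        key=lambda pair: pair[0],
        reverse=True,
    )
    chosen, rejected = scored[0][1], scored[-1][1]
    return chosen, rejected
```

An MCTS-style variant would instead expand responses segment by segment from a shared search tree, so the chosen and rejected completions share a common prefix, which is the property studied in effect (1) above.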
Self-Generated Critiques Boost Reward Modeling for Language Models
Yu, Yue; Chen, Zhengxing; Zhang, Aston; Tan, Liang; Zhu, Chenguang; Pang, Richard Yuanzhe; Qian, Yundi; Wang, Xuewei; Gururangan, Suchin; Zhang, Chao; Kambadur, Melanie; Mahajan, Dhruv; Hou, Rui
Reinforcement Learning from Human Feedback (RLHF) has been widely adopted to align large language models (LLMs) with human preferences (Ouyang et al., 2022; Touvron et al., 2023; Dubey et al., 2024; Reid et al., 2024). Central to the RLHF process is the reward model (RM), which is trained to assign scores that quantify how well the model's outputs align with human judgments. The reward model defines the optimization direction during training (e.g., the reward signal in PPO), encouraging a policy LLM to generate more helpful, honest, and harmless responses and ultimately enhancing the model's generation quality in real-world applications. Standard reward models are typically trained on preference pairs and optimized with a pairwise logistic loss (Bradley and Terry, 1952), producing a single scalar score for each response. However, a single scalar output is not only hard to interpret but also fails to fully leverage the inherent language modeling capability that LLMs obtain from pretraining and post-training (Zhang et al., 2024). Consequently, these reward models tend to be less data-efficient and prone to robustness issues, such as reward hacking (Skalse et al., 2022; Singhal et al., 2023; Chen et al., 2024).
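For reference, the pairwise logistic (Bradley-Terry) objective mentioned above is standard; a minimal PyTorch sketch, where `reward_model` is a hypothetical scorer returning one scalar per response, is:

```python
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_inputs, rejected_inputs):
    """Bradley-Terry pairwise logistic loss: -log sigmoid(r_chosen - r_rejected)."""
    r_chosen = reward_model(**chosen_inputs)      # hypothetical scorer, shape (batch,)
    r_rejected = reward_model(**rejected_inputs)  # shape (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```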
Law of the Weakest Link: Cross Capabilities of Large Language Models
Zhong, Ming; Zhang, Aston; Wang, Xuewei; Hou, Rui; Xiong, Wenhan; Zhu, Chenguang; Chen, Zhengxing; Tan, Liang; Bi, Chloe; Lewis, Mike; Popuri, Sravya; Narang, Sharan; Kambadur, Melanie; Mahajan, Dhruv; Edunov, Sergey; Han, Jiawei; van der Maaten, Laurens
The development and evaluation of Large Language Models (LLMs) have largely focused on individual capabilities. However, this overlooks the intersection of multiple abilities across different types of expertise that are often required for real-world tasks, which we term cross capabilities. To systematically explore this concept, we first define seven core individual capabilities and then pair them to form seven common cross capabilities, each supported by a manually constructed taxonomy. Building on these definitions, we introduce CrossEval, a benchmark comprising 1,400 human-annotated prompts, with 100 prompts for each individual and cross capability. To ensure reliable evaluation, we involve expert annotators to assess 4,200 model responses, gathering 8,400 human ratings with detailed explanations to serve as reference examples. Our findings reveal that, in both static evaluations and attempts to enhance specific abilities, current LLMs consistently exhibit the "Law of the Weakest Link," where cross-capability performance is significantly constrained by the weakest component. Specifically, across 58 cross-capability scores from 17 models, 38 scores are lower than both individual capabilities, while the remaining 20 fall between the stronger and weaker capabilities, though closer to the weaker one. These results highlight the under-performance of LLMs in cross-capability tasks, making the identification and improvement of the weakest capabilities a critical priority for future research to optimize performance in complex, multi-dimensional scenarios.
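As a small illustration of the comparison described above, the following sketch (with hypothetical score values) classifies a cross-capability score relative to its two component capabilities:

```python
def classify_cross_score(cross, cap_a, cap_b):
    """Where does a cross-capability score fall relative to its two components?"""
    weak, strong = sorted((cap_a, cap_b))
    if cross < weak:
        return "below both individual capabilities"
    if cross > strong:
        return "above both individual capabilities"
    midpoint = (weak + strong) / 2
    return "between, closer to the weaker" if cross <= midpoint else "between, closer to the stronger"

# Hypothetical example: capability A = 72.0, capability B = 85.0, cross score = 70.5
print(classify_cross_score(70.5, 72.0, 85.0))  # -> "below both individual capabilities"
```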
Audiobox: Unified Audio Generation with Natural Language Prompts
Vyas, Apoorv; Shi, Bowen; Le, Matthew; Tjandra, Andros; Wu, Yi-Chiao; Guo, Baishan; Zhang, Jiemin; Zhang, Xinyue; Adkins, Robert; Ngan, William; Wang, Jeff; Cruz, Ivan; Akula, Bapi; Akinyemi, Akinniyi; Ellis, Brian; Moritz, Rashel; Yungster, Yael; Rakotoarison, Alice; Tan, Liang; Summers, Chris; Wood, Carleigh; Lane, Joshua; Williamson, Mary; Hsu, Wei-Ning
Audio is an essential part of our life, but creating it often requires expertise and is time-consuming. Research communities have made great progress over the past year advancing the performance of large-scale audio generative models for a single modality (speech, sound, or music) by adopting more powerful generative models and scaling data. However, these models lack controllability in several aspects: speech generation models cannot synthesize novel styles based on text descriptions and are limited in domain coverage (e.g., outdoor environments); sound generation models provide only coarse-grained control based on descriptions like "a person speaking" and would generate only mumbling human voices. This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities. We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms. We allow transcript, vocal, and other audio styles to be controlled independently when generating speech. To improve model generalization with limited labels, we adapt a self-supervised infilling objective to pre-train on large quantities of unlabeled audio. Audiobox sets new benchmarks on speech and sound generation (0.745 similarity on Librispeech for zero-shot TTS; 0.77 FAD on AudioCaps for text-to-sound) and unlocks new methods for generating audio with novel vocal and acoustic styles. We further integrate Bespoke Solvers, which speed up generation by over 25 times compared to the default ODE solver for flow-matching, without loss of performance on several tasks. Our demo is available at https://audiobox.metademolab.com/
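Audiobox is built on flow matching; the paper's model is not reproduced here, but a generic conditional flow-matching training step, with `velocity_model` as a hypothetical network, looks roughly like:

```python
import torch

def flow_matching_loss(velocity_model, x1, cond):
    """Generic conditional flow-matching objective: interpolate x_t between
    noise x0 and data x1 along a linear path, then regress the velocity x1 - x0."""
    x0 = torch.randn_like(x1)                                    # noise sample
    t = torch.rand(x1.shape[0], *[1] * (x1.dim() - 1), device=x1.device)
    x_t = (1.0 - t) * x0 + t * x1                                # point on the path at time t
    target_velocity = x1 - x0
    pred_velocity = velocity_model(x_t, t.flatten(), cond)       # hypothetical conditional network
    return torch.mean((pred_velocity - target_velocity) ** 2)
```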
Learning Easily Updated General Purpose Text Representations with Adaptable Task-Specific Prefixes
Huang, Kuan-Hao; Tan, Liang; Hou, Rui; Wang, Sinong; Almahairi, Amjad; Rinott, Ruty
Many real-world applications require making multiple predictions from the same text. Fine-tuning a large pre-trained language model for each downstream task incurs a computational burden at inference time, since each task requires its own forward pass. To amortize this cost, a common solution is to freeze the language model and build lightweight models for downstream tasks on top of fixed text representations. The challenge then becomes learning fixed but general text representations that generalize well to unseen downstream tasks. Previous works have shown that the generalizability of representations can be improved by fine-tuning the pre-trained language model on a set of source tasks in a multi-tasking way. In this work, we propose a prefix-based method to learn fixed text representations with source tasks. We learn a task-specific prefix for each source task independently and combine them to obtain the final representations. Our experimental results show that prefix-based training performs better than multi-tasking training and can update the text representations at a smaller computational cost.
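A hedged sketch of the general recipe, a frozen encoder plus one learned prefix per source task whose outputs are concatenated into a single fixed representation, is given below; the module names, dimensions, and the HuggingFace-style encoder interface are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class PrefixCombiner(nn.Module):
    """Frozen encoder plus one learned prefix per source task; outputs are concatenated."""

    def __init__(self, frozen_encoder, hidden_dim, num_tasks, prefix_len=10):
        super().__init__()
        self.encoder = frozen_encoder
        for p in self.encoder.parameters():          # freeze the language model
            p.requires_grad = False
        self.prefixes = nn.ParameterList(
            [nn.Parameter(torch.randn(prefix_len, hidden_dim) * 0.02) for _ in range(num_tasks)]
        )

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, hidden_dim)
        reps = []
        for prefix in self.prefixes:
            batch_prefix = prefix.unsqueeze(0).expand(token_embeddings.size(0), -1, -1)
            # Assumes a HuggingFace-style encoder that accepts inputs_embeds.
            hidden = self.encoder(inputs_embeds=torch.cat([batch_prefix, token_embeddings], dim=1))
            reps.append(hidden.last_hidden_state[:, 0])   # e.g., first-position representation
        return torch.cat(reps, dim=-1)                    # fixed, general-purpose representation
```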
UNIREX: A Unified Learning Framework for Language Model Rationale Extraction
Chan, Aaron; Sanjabi, Maziar; Mathias, Lambert; Tan, Liang; Nie, Shaoliang; Peng, Xiaochang; Ren, Xiang; Firooz, Hamed
An extractive rationale explains a language model's (LM's) prediction on a given task instance by highlighting the text inputs that most influenced the prediction. Ideally, rationale extraction should be faithful (reflective of the LM's actual behavior) and plausible (convincing to humans), without compromising the LM's (i.e., task model's) task performance. Although attribution algorithms and select-predict pipelines are commonly used in rationale extraction, they both rely on certain heuristics that hinder them from satisfying all three desiderata. In light of this, we propose UNIREX, a flexible learning framework that generalizes rationale extractor optimization as follows: (1) specify the architecture for a learned rationale extractor; (2) select explainability objectives (i.e., faithfulness and plausibility criteria); and (3) jointly train the task model and rationale extractor on the task using the selected objectives. UNIREX enables replacing prior works' heuristic design choices with a generic learned rationale extractor in (1) and optimizing it for all three desiderata in (2)-(3). To facilitate comparison between methods with respect to multiple desiderata, we introduce the Normalized Relative Gain (NRG) metric. Across five text classification datasets, our best UNIREX configuration outperforms baselines by an average of 32.9% NRG. Plus, we find that UNIREX-trained rationale extractors can even generalize to unseen datasets and tasks.
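A minimal sketch of the joint objective in step (3), combining task, faithfulness, and plausibility terms with hypothetical weights (the concrete criteria here are placeholders, not UNIREX's specific objectives):

```python
import torch.nn.functional as F

def joint_rationale_loss(task_logits, labels, rationale_scores, human_rationale_mask,
                         faithfulness_loss, alpha=1.0, beta=1.0):
    """Generic composite objective: task loss + faithfulness term + plausibility term.
    Plausibility is sketched as token-level BCE against human-highlighted tokens;
    the faithfulness term is passed in as a placeholder scalar."""
    task_loss = F.cross_entropy(task_logits, labels)
    plausibility_loss = F.binary_cross_entropy_with_logits(
        rationale_scores, human_rationale_mask.float())
    return task_loss + alpha * faithfulness_loss + beta * plausibility_loss
```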
FRAME: Evaluating Rationale-Label Consistency Metrics for Free-Text Rationales
Chan, Aaron; Nie, Shaoliang; Tan, Liang; Peng, Xiaochang; Firooz, Hamed; Sanjabi, Maziar; Ren, Xiang
Following how humans communicate, free-text rationales aim to use natural language to explain neural language model (LM) behavior. However, free-text rationales' unconstrained nature makes them prone to hallucination, so it is important to have metrics for free-text rationale quality. Existing free-text rationale metrics measure how consistent the rationale is with the LM's predicted label, but there is no protocol for assessing such metrics' reliability. Thus, we propose FRAME, a framework for evaluating rationale-label consistency (RLC) metrics for free-text rationales. FRAME is based on three axioms: (1) good metrics should yield the highest scores for reference rationales, which maximize RLC by construction; (2) good metrics should be appropriately sensitive to semantic perturbation of rationales; and (3) good metrics should be robust to variation in the LM's task performance. Across three text classification datasets, we show that existing RLC metrics cannot satisfy all three FRAME axioms, since they are implemented via model pretraining, which muddles the metric's signal. Then, we introduce a non-pretraining RLC metric that greatly outperforms baselines on (1) and (3), while performing competitively on (2). Finally, we discuss the limitations of using RLC to evaluate free-text rationales.
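As a toy illustration of rationale-label consistency (not FRAME's proposed metric), one simple proxy checks whether a classifier reading only the rationale reproduces the task model's predicted label; the `classify` helper is hypothetical:

```python
def rationale_label_consistency(classify, rationales, predicted_labels):
    """Fraction of examples where a classifier reading only the rationale
    reproduces the task model's predicted label (a simple RLC proxy)."""
    agreements = sum(classify(r) == y for r, y in zip(rationales, predicted_labels))
    return agreements / len(predicted_labels)
```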