Format requirement
Demonstrating specification gaming in reasoning models
Bondarenko, Alexander, Volk, Denis, Volkov, Dmitrii, Ladish, Jeffrey
We demonstrate LLM agent specification gaming by instructing models to win against a chess engine. We find reasoning models like o1-preview and DeepSeek-R1 will often hack the benchmark by default, while language models like GPT-4o and Claude 3.5 Sonnet need to be told that normal play won't work to hack. We improve upon prior work like (Hubinger et al., 2024; Meinke et al., 2024; Weij et al., 2024) by using realistic task prompts and avoiding excess nudging. Our results suggest reasoning models may resort to hacking to solve difficult problems, as observed in OpenAI (2024)'s o1 Docker escape during cyber capabilities testing.
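As a concrete illustration of the setup this abstract describes, here is a minimal sketch (not the authors' released harness) of how board-state tampering in a scripted chess match could be detected: replay the logged moves and check that the position the agent claims to have won from actually follows from legal play. It assumes the python-chess library; the move log and FEN strings are hypothetical inputs.

```python
# Minimal tamper check for an agent-vs-engine chess match (illustrative sketch).
import chess

def board_was_tampered(move_log: list[str], claimed_fen: str) -> bool:
    """Replay UCI moves from a clean start; flag a mismatch with the claimed FEN."""
    board = chess.Board()
    for uci in move_log:
        move = chess.Move.from_uci(uci)
        if move not in board.legal_moves:  # an illegal move is itself a red flag
            return True
        board.push(move)
    # Compare piece placement only; move clocks may legitimately differ.
    return board.fen().split(" ")[0] != claimed_fen.split(" ")[0]

if __name__ == "__main__":
    moves = ["e2e4", "e7e5"]
    honest_fen = "rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq - 0 2"
    hacked_fen = "6k1/8/8/8/8/8/8/q3K3 b - - 0 1"  # a won position written directly
    print(board_was_tampered(moves, honest_fen))   # False
    print(board_was_tampered(moves, hacked_fen))   # True
```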
ReFF: Reinforcing Format Faithfulness in Language Models across Varied Tasks
Yao, Jiashu, Huang, Heyan, Liu, Zeming, Wen, Haoyu, Su, Wei, Qian, Boao, Guo, Yuhang
Following formatting instructions to generate well-structured content is a fundamental yet often unmet capability for large language models (LLMs). To study this capability, which we refer to as format faithfulness, we present FormatBench, a comprehensive format-related benchmark. Compared to previous format-related benchmarks, FormatBench involves a greater variety of tasks in terms of application scenes (traditional NLP tasks, creative works, autonomous agency tasks), human-LLM interaction styles (single-turn instruction, multi-turn chat), and format types (inclusion, wrapping, length, coding). Moreover, each task in FormatBench is accompanied by a format checker program. Extensive experiments on the benchmark reveal that state-of-the-art open- and closed-source LLMs still suffer from severe deficiencies in format faithfulness. By virtue of the decidable nature of formats, we propose Reinforcing Format Faithfulness (ReFF) to help LLMs generate formatted output as instructed without compromising general quality. Without any annotated data, ReFF can substantially improve the format faithfulness rate (e.g., from 21.6% in original LLaMA3 to 95.0% on a caption segmentation task) while keeping general quality comparable (e.g., from 47.3 to 46.4 in F1 score). Combined with labeled training data, ReFF can simultaneously improve both format faithfulness (e.g., from 21.6% in original LLaMA3 to 75.5%) and general quality (e.g., from 47.3 to 61.6 in F1 score). We further offer an interpretability analysis to explain how ReFF improves both format faithfulness and general quality.
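Since ReFF hinges on formats being decidable, a checker program can double as an annotation-free reward signal. A minimal sketch under that assumption, with a toy wrapping-format checker standing in for FormatBench's per-task checkers (not the released code):

```python
# Toy format checker + binary reward, as a stand-in for ReFF's reward signal.
import re

def format_checker(output: str) -> bool:
    """Accept iff the whole output is one or more non-empty <seg>...</seg> blocks."""
    return re.fullmatch(r"(?:<seg>[^<>]+</seg>\s*)+", output.strip()) is not None

def format_reward(outputs: list[str]) -> list[float]:
    """Binary reward for reinforcing format faithfulness; no labeled data needed."""
    return [1.0 if format_checker(o) else 0.0 for o in outputs]

if __name__ == "__main__":
    print(format_reward(["<seg>a cat</seg><seg>on a mat</seg>", "a cat, on a mat"]))
    # [1.0, 0.0]
```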
The Perfect Blend: Redefining RLHF with Mixture of Judges
Xu, Tengyu, Helenowski, Eryk, Sankararaman, Karthik Abinav, Jin, Di, Peng, Kaiyan, Han, Eric, Nie, Shaoliang, Zhu, Chen, Zhang, Hejia, Zhou, Wenxuan, Zeng, Zhouhao, He, Yun, Mandyam, Karishma, Talabzadeh, Arya, Khabsa, Madian, Cohen, Gabriel, Tian, Yuandong, Ma, Hao, Wang, Sinong, Fang, Han
Reinforcement learning from human feedback (RLHF) has become the leading approach for fine-tuning large language models (LLMs). However, RLHF has limitations in multi-task learning (MTL) due to the challenges of reward hacking and extreme multi-objective optimization (i.e., trading off multiple, sometimes conflicting objectives). Applying RLHF to MTL currently requires careful tuning of the weights for reward models and data combinations. This is often done via human intuition and does not generalize. In this work, we introduce a novel post-training paradigm which we call Constrained Generative Policy Optimization (CGPO). The core of CGPO is a Mixture of Judges (MoJ) combined with cost-efficient constrained policy optimization and stratification, which can identify the perfect blend in RLHF in a principled manner. It shows strong empirical results with theoretical guarantees, does not require extensive hyper-parameter tuning, and is plug-and-play in common post-training pipelines. Together, this can detect and mitigate reward hacking behaviors while reaching a Pareto-optimal point across an extremely large number of objectives. Our empirical evaluations demonstrate that CGPO significantly outperforms standard RLHF algorithms like PPO and DPO across various tasks including general chat, STEM questions, instruction following, and coding. Specifically, CGPO shows improvements of 7.4% in AlpacaEval-2 (general chat), 12.5% in Arena-Hard (STEM & reasoning), and consistent gains in other domains like math and coding. Notably, PPO, while commonly used, is prone to severe reward hacking on popular coding benchmarks, which CGPO successfully addresses. This breakthrough in RLHF not only tackles reward hacking and extreme multi-objective optimization challenges but also advances the state of the art in aligning general-purpose LLMs for diverse applications.
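One way to picture the Mixture-of-Judges gating described above: each judge is a constraint predicate, and generations that violate any constraint are dropped (or penalized) before the policy update rather than folded into a single scalar reward. The judges below are toy assumptions for illustration, not the paper's actual implementation:

```python
# Toy constraint gating in the spirit of MoJ: filter the batch before the update.
from typing import Callable

Judge = Callable[[str], bool]  # True = constraint satisfied

def is_not_empty(gen: str) -> bool:
    return bool(gen.strip())

def no_test_stub_hack(gen: str) -> bool:
    # Toy anti-reward-hacking judge for coding tasks: reject code that "passes"
    # by short-circuiting, a pattern reward-hacked policies tend to exploit.
    return "return True  # always" not in gen

def gate_batch(gens: list[str], rewards: list[float],
               judges: list[Judge]) -> list[tuple[str, float]]:
    """Keep only (generation, reward) pairs that satisfy every judge."""
    return [(g, r) for g, r in zip(gens, rewards)
            if all(judge(g) for judge in judges)]

if __name__ == "__main__":
    gens = ["def add(a, b): return a + b", "def add(a, b): return True  # always"]
    print(gate_batch(gens, [0.9, 1.0], [is_not_empty, no_test_stub_hack]))
    # only the honest generation survives the gate
```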
PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation
Leiter, Christoph, Eger, Steffen
Large language models (LLMs) have revolutionized the field of NLP. Notably, their in-context learning capabilities also enable their use as evaluation metrics for natural language generation, making them particularly advantageous in low-resource scenarios and time-restricted applications. In this work, we introduce PrExMe, a large-scale prompt exploration for metrics, where we evaluate more than 720 prompt templates for open-source LLM-based metrics on machine translation (MT) and summarization datasets, totalling over 6.6M evaluations. This extensive comparison (1) serves as a benchmark of the performance of recent open-source LLMs as metrics and (2) explores the stability and variability of different prompting strategies. We discover that, on the one hand, there are scenarios for which prompts are stable: for instance, some LLMs show idiosyncratic preferences and favor grading generated texts with textual labels, while others prefer to return numeric scores. On the other hand, the stability of prompts and model rankings can be susceptible to seemingly innocuous changes. For example, changing the requested output format from "0 to 100" to "-1 to +1" can strongly affect the rankings in our evaluation. Our study contributes to understanding the impact of different prompting approaches on LLM-based metrics for MT and summarization evaluation, highlighting the most stable prompting patterns and potential limitations.
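The ranking instability reported here can be made concrete with a small stability check: score the same systems under two prompt variants that differ only in the requested scale, then measure rank agreement. A minimal sketch with stub scores standing in for actual LLM metric outputs:

```python
# Rank-agreement check between two prompt variants (illustrative stub scores).
from itertools import combinations

def kendall_tau(a: list[float], b: list[float]) -> float:
    """Rank agreement in [-1, 1] between two score lists over the same systems."""
    pairs = list(combinations(range(len(a)), 2))
    concordant = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) > 0)
    discordant = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) < 0)
    return (concordant - discordant) / len(pairs)

# Hypothetical per-system scores from the same LLM metric under two formats.
scores_0_100 = [71.0, 64.0, 66.0, 80.0]   # prompt asks for "0 to 100"
scores_pm1   = [0.30, 0.45, 0.10, 0.85]   # prompt asks for "-1 to +1"

print(kendall_tau(scores_0_100, scores_pm1))  # ~0.33: the rankings disagree
```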
Physical Property Understanding from Language-Embedded Feature Fields
Zhai, Albert J., Shen, Yuan, Chen, Emily Y., Wang, Gloria X., Wang, Xinlei, Wang, Sheng, Guan, Kaiyu, Wang, Shenlong
Can computers perceive the physical properties of objects solely through vision? Research in cognitive science and vision science has shown that humans excel at identifying materials and estimating their physical properties based purely on visual appearance. In this paper, we present a novel approach for dense prediction of the physical properties of objects using a collection of images. Inspired by how humans reason about physics through vision, we leverage large language models to propose candidate materials for each object. We then construct a language-embedded point cloud and estimate the physical properties of each 3D point using a zero-shot kernel regression approach. Our method is accurate, annotation-free, and applicable to any object in the open world. Experiments demonstrate the effectiveness of the proposed approach in various physical property reasoning tasks, such as estimating the mass of common objects, as well as other properties like friction and hardness.
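The zero-shot kernel regression step can be sketched as a similarity-weighted average: each 3D point's property estimate is a softmax over the similarity between its language-embedded feature and the text embeddings of LLM-proposed candidate materials. The embeddings, material list, and density values below are illustrative assumptions, not the paper's data:

```python
# Zero-shot kernel regression over a language-embedded point cloud (sketch).
import numpy as np

rng = np.random.default_rng(0)

def kernel_regress(point_feats: np.ndarray, mat_embeds: np.ndarray,
                   mat_values: np.ndarray, temperature: float = 0.1) -> np.ndarray:
    """Per-point property = similarity-softmax-weighted average of candidates."""
    # Cosine similarity between each point feature and each material embedding.
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    m = mat_embeds / np.linalg.norm(mat_embeds, axis=1, keepdims=True)
    sim = p @ m.T                                  # (n_points, n_materials)
    w = np.exp(sim / temperature)
    w /= w.sum(axis=1, keepdims=True)              # kernel weights
    return w @ mat_values                          # (n_points,)

points = rng.normal(size=(1000, 512))              # stand-in point-cloud features
materials = rng.normal(size=(3, 512))              # e.g. "ceramic", "wood", "steel"
densities = np.array([2500.0, 700.0, 7850.0])      # kg/m^3, illustrative values
per_point_density = kernel_regress(points, materials, densities)
print(per_point_density.shape, per_point_density[:3])
```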
FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability
Xia, Congying, Xing, Chen, Du, Jiangshu, Yang, Xinyi, Feng, Yihao, Xu, Ran, Yin, Wenpeng, Xiong, Caiming
This paper presents FoFo, a pioneering benchmark for evaluating large language models' (LLMs') ability to follow complex, domain-specific formats, a crucial yet underexamined capability for their application as AI agents. Despite LLMs' advancements, existing benchmarks fail to assess their format-following proficiency adequately. FoFo fills this gap with a diverse range of real-world formats and instructions, developed through an AI-human collaborative method. Our evaluation across both open-source (e.g., Llama 2, WizardLM) and closed-source (e.g., GPT-4, PaLM 2, Gemini) LLMs highlights three key findings: open-source models significantly lag behind closed-source ones in format adherence; LLMs' format-following performance is independent of their content generation quality; and LLMs' format proficiency varies across domains. These insights suggest the need for specialized tuning for format-following skills and highlight FoFo's role in guiding the selection of domain-specific AI agents. FoFo is released at https://github.com/SalesforceAIResearch/FoFo.
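For intuition, a domain-specific format check of the kind FoFo targets can be as simple as validating that an output is parseable and schema-complete. The rule-based checker and required keys below are hypothetical stand-ins for illustration, not FoFo's actual grading procedure:

```python
# Toy domain-specific format check: valid JSON with a required key set.
import json

REQUIRED_KEYS = {"patient_id", "diagnosis", "icd10_code"}  # hypothetical schema

def follows_format(output: str) -> bool:
    """True iff the output is valid JSON containing every required key."""
    try:
        record = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and REQUIRED_KEYS <= record.keys()

print(follows_format('{"patient_id": "A1", "diagnosis": "flu", "icd10_code": "J11.1"}'))  # True
print(follows_format("Patient A1 has the flu (J11.1)."))  # False: right content, wrong format
```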