gemini-1
CODECRASH: Exposing LLMFragility to Misleading Natural Language in Code Reasoning
Large Language Models (LLMs) have recently demonstrated strong capabilities in code-related tasks, but their robustness in code reasoning under perturbations remains underexplored. We introduce CODECRASH, a stress-testing framework with 1,279 questions from CRUXEVAL and LIVECODEBENCH, designed to evaluate reasoning reliability under structural perturbations and misleading natural language (NL) contexts. Through a systematic evaluation of 17 LLMs, we find that models often shortcut reasoning by over-relying on NL cues, leading to an average performance degradation of 23.2% in output prediction tasks. Even with Chain-of-Thought reasoning, models on average still have a 13.8%drop due to distractibility and rationalization, revealing a lack of critical reasoning capability to distinguish the actual code behaviors. While Large Reasoning Models with internal reasoning mechanisms improve robustness by fostering critical thinking, plausible yet incorrect hints can trigger pathological self-reflection, causing 2 3 times token consumption and even catastrophic cognitive dissonance in extreme cases for QwQ-32B. We refer to this phenomenon as Reasoning Collapse. CODECRASH provides a rigorous benchmark for evaluating robustness in code reasoning, guiding future research and development toward more reliable and resilient models.
Distributive Fairness in Large Language Models: Evaluating Alignment with Human Values
The growing interest in employing large language models (LLMs) for decision-making in social and economic contexts has raised questions about their potential to function as agents in these domains. A significant number of societal problems involve the distribution of resources, where fairness, along with economic efficiency, play a critical role in the desirability of outcomes. In this paper, we examine whether LLM responses adhere to fundamental fairness concepts such as equitability, envy-freeness, and Rawlsian maximin, and investigate their alignment with human preferences. We evaluate the performance of several LLMs, providing a comparative benchmark of their ability to reflect these measures. Our results demonstrate a lack of alignment between current LLM responses and human distributional preferences. Moreover, LLMs are unable to utilize money as a transferable resource to mitigate inequality. Nonetheless, we demonstrate a stark contrast when (some) LLMs are tasked with selecting from a predefined menu of options rather than generating one. In addition, we analyze the robustness of LLM responses to variations in semantic factors (e.g., intentions or personas) or non-semantic prompting changes (e.g., templates or orderings). Finally, we highlight potential strategies aimed at enhancing the alignment of LLM behavior with well-established fairness concepts.
Towards Visualization-of-Thought Jailbreak Attack against Large Visual Language Models
As Visual Language Models (VLMs) continue to evolve, they have demonstrated increasingly sophisticated logical reasoning capabilities and multimodal thought generation, opening doors to widespread applications. However, this advancement raises serious concerns about content security, particularly when these models process complex multimodal inputs requiring intricate reasoning. When faced with these safety challenges, the critical competition between logical reasoning and safety objectives of VLMs is often overlooked in previous works. In this paper, we introduce Visualization-of-Thought Attack (VoTA), a novel and automated attack framework that strategically constructs chains of images with risky visual thoughts to challenge victim models.
Appendix
A.1 Details of Dimension Design We argue that multi-dimensional evaluation is significant to visual caption evaluation and is more comprehensive than previous work. So how to choose proper dimensions? We refer to existing VQA benchmarks [62, 63, 64, 65] and visual generation benchmarks [31, 32, 33]. VQA benchmarks usually design various types of questions to include multi-dimensional evaluation and analysis of MLLMs. For instance, MMBench [64] defines 20 ability dimensions, including attribute recognition, attribute comparison, action recognition, spatial relationship, physical property, OCR, object localization, image style, image scene, identity reasoning, etc. MVBench [64] covers 20 challenging video tasks including action, object, position, count, scene, pose, attribute, character, cognition, etc. Due to the flexible design of questions, VQA benchmarks can be naturally built with comprehensive dimensions. Different from the VQA task, the visual caption task does not require specific questions, but inspects the alignment of visual and textual information. Visual generation is the inverse task of visual captioning, as it requires models to generate specific visual content based on detailed textual descriptions. GenEval [31] designs 6 different tasks to evaluate text-to-image alignment, including single object, two object, counting, colors, position, and attribute binding. VBench [32] comprises 16 dimensions, including subject consistency, background consistency, object class, human action, color, spatial relationship, scene, style, etc. We follow their explored dimensions to design proper dimensions for visual captioning. Finally, we design 6 views, covering object, global, text, camera, temporal, and knowledge. The object-related view includes object category, object color, object 1 number, and spatial relation, the global-related view includes scene and style, the text-related view evaluates the OCR capability of captions, the camera-related view covers the camera angle and movement, the temporal-related view contains action and event, and we also design a view to evaluate the knowledge of MLLMs, i.e., character identification. We believe these dimensions contribute to a comprehensive visual caption benchmarking.
5f2809607f692d79a01c05c43d702883-Paper-Datasets_and_Benchmarks_Track.pdf
V multimodal isual captioning large benchma language rks models have become (MLLMs), outdated as the with brief the ground-truth emergence of sentences modern and benchmarks centric incomplete traditional evaluation, visual attempt metri the elem cs to y address remain f ent ail co to v assess limited erage.
OmniBench: Towards The Future of Universal Omni-Language Models
Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains underexplored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as the omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities.