the four main areas of criticism below (reviewers referred to as R1-5)

Neural Information Processing Systems

We first thank the reviewers for their insightful comments, which we have taken into careful consideration. If our work were to be evaluated using only performance metrics, this criticism would be fair. Learning paradigms for networks of 'convex layers' have been shown to be effective (e.g. ...). The key advance over standard SCNs is that we show how to perform non-linear computations in these systems. Standard SCNs, such as in Boerlin et al. (2013), are restricted to linear computations. It may seem surprising, but such layers are actually not well understood!


On the Brittleness of CLIP Text Encoders

Tran, Allie, Rossetto, Luca

arXiv.org Artificial Intelligence

Multimodal co-embedding models, especially CLIP, have advanced the state of the art in zero-shot classification and multimedia information retrieval in recent years by aligning images and text in a shared representation space. However, such models, trained on a contrastive alignment, can lack stability towards small input perturbations. Especially when dealing with manually expressed queries, minor variations in the query can cause large differences in the ranking of the best-matching results. In this paper, we present a systematic analysis of the effect of multiple classes of non-semantic query perturbations in a multimedia information retrieval scenario. We evaluate a diverse set of lexical, syntactic, and semantic perturbations across multiple CLIP variants using the TRECVID Ad-Hoc Video Search queries and the V3C1 video collection. Across models, we find that syntactic and semantic perturbations drive the largest instabilities, while brittleness is concentrated in trivial surface edits such as punctuation and case. Our results highlight robustness as a critical dimension for evaluating vision-language models beyond benchmark accuracy.
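As a rough illustration of the perturbation classes mentioned above, the sketch below generates non-semantic variants of a retrieval query. The specific functions are hypothetical stand-ins, not the paper's actual perturbation suite:

```python
import random

def drop_punctuation(query: str) -> str:
    """Surface edit: strip sentence-final punctuation."""
    return query.rstrip(".?!")

def change_case(query: str) -> str:
    """Surface edit: lowercase the whole query."""
    return query.lower()

def shuffle_words(query: str, seed: int = 0) -> str:
    """Toy syntactic perturbation: permute word order."""
    words = query.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

query = "A person riding a bicycle on a city street."
variants = [f(query) for f in (drop_punctuation, change_case, shuffle_words)]
# Each variant would then be encoded with a CLIP text encoder and the
# resulting result rankings compared against the original query's ranking.
```

In the evaluation the abstract describes, instability would be measured by how much the ranked result list changes between the original query and each variant.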


The Microsoft Azure Outage Shows the Harsh Reality of Cloud Failures

WIRED

As the second major cloud outage in less than two weeks, Azure's downtime highlights the "brittleness" of a digital ecosystem that depends on a few companies never making mistakes. Microsoft's Azure cloud platform, its widely used 365 services, Xbox, and Minecraft started suffering outages at roughly noon Eastern time on Wednesday, the result of what Microsoft said was "an inadvertent configuration change." The incident, which marks the second major cloud provider outage in less than two weeks, highlights the instability of an internet built largely on infrastructure run by a few tech giants. Microsoft's problems specifically originated from Azure's Front Door content delivery network and emerged just hours before Microsoft's scheduled earnings announcement. The company website, including its investor relations page, was still down on Wednesday afternoon, and the Azure status page where Microsoft provides updates was having intermittent issues as well.


Noise Injection Systemically Degrades Large Language Model Safety Guardrails

Shahani, Prithviraj Singh, Miandoab, Kaveh Eskandari, Scheutz, Matthias

arXiv.org Artificial Intelligence

Safety guardrails in large language models (LLMs) are a critical component in preventing harmful outputs, yet their resilience under perturbation remains poorly understood. In this paper, we investigate the robustness of safety fine-tuning in LLMs by systematically injecting Gaussian noise into model activations. We show across multiple open-weight models that (1) Gaussian noise raises harmful-output rates (p < 0.001) by up to 27%, (2) deeper safety fine-tuning affords no extra protection, and (3) chain-of-thought reasoning remains largely intact. The findings reveal critical vulnerabilities in current safety alignment techniques and highlight reasoning-based and reinforcement learning approaches as promising directions for developing more robust AI safety systems. These results have important implications for the real-world deployment of LLMs in safety-critical applications, as they imply that widely deployed safety tuning methods can fail even without adversarial prompts.
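The core manipulation described above, adding zero-mean Gaussian noise to activations and sweeping the noise scale, can be sketched on a toy array. This is a minimal illustration, not the paper's experimental setup, and the drift metric here stands in for the harmful-output rate the paper actually measures:

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_noise(activations: np.ndarray, sigma: float) -> np.ndarray:
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return activations + rng.normal(0.0, sigma, size=activations.shape)

# Toy "hidden state" standing in for one transformer layer's activations.
hidden = rng.standard_normal((4, 8))  # (tokens, hidden_dim)

# Relative drift of the activations at each noise level; in the paper's
# setting one would instead measure harmful-output rate at each sigma.
drifts = {sigma: np.linalg.norm(inject_noise(hidden, sigma) - hidden)
                 / np.linalg.norm(hidden)
          for sigma in (0.0, 0.1, 0.5)}
```

In a real model this injection would typically be implemented with a forward hook on a chosen layer, applied identically across prompts so that only the noise scale varies.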


MathRobust-LV: Evaluation of Large Language Models' Robustness to Linguistic Variations in Mathematical Reasoning

Kirtane, Neeraja, Khanna, Yuvraj, Relan, Peter

arXiv.org Artificial Intelligence

Large language models excel on math benchmarks, but the robustness of their mathematical reasoning to linguistic variation is underexplored. While recent work increasingly treats high-difficulty competitions like the IMO as the gold standard for evaluating reasoning, we believe in comprehensive benchmarking of high school-level math problems in real educational settings. We introduce MathRobust-LV, a test set and evaluation methodology that mirrors how instructors rephrase problems across assessments while keeping difficulty constant: we change surface details (names, contexts, variables) while preserving numerical structure and answers. In contrast to prior efforts that alter problem content or emphasize IMO-level tasks, we focus on high school-level problems at the difficulty level where models are currently deployed in educational settings: tutoring and assessment systems. In these applications, instructors rephrase identical concepts in varied ways, making linguistic robustness essential for reliable deployment. Although benchmarking on MATH data is often regarded as saturated, our experiments on 34 models reveal that accuracy declines when moving from the baseline to the variants. These drops are severe for smaller models (9-11%), while stronger models also show measurable degradation. Frontier models like GPT-5 and Gemini-2.5pro remain comparatively stable. Our results highlight that robustness to linguistic variation is a fundamental challenge, exposing reasoning vulnerabilities in models.
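The variant-generation idea, changing names and contexts while freezing the numerical structure and answer, can be illustrated with a simple template. The template and values below are invented for illustration and are not drawn from the MathRobust-LV dataset:

```python
# Surface details (name, item) vary; numbers and the answer do not.
TEMPLATE = ("{name} buys {n} {item}s at ${price} each. "
            "How much does {name} spend?")

base    = TEMPLATE.format(name="Alice", n=3, item="notebook", price=4)
variant = TEMPLATE.format(name="Ravi",  n=3, item="ticket",   price=4)

# Both phrasings share the same numerical structure, hence the same answer.
answer = 3 * 4
```

A robust model should answer both phrasings identically; the accuracy gap between the base set and such variants is the quantity the benchmark measures.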


A Single Character can Make or Break Your LLM Evals

Su, Jingtong, Zhang, Jianyu, Ullrich, Karen, Bottou, Léon, Ibrahim, Mark

arXiv.org Artificial Intelligence

Common large language model (LLM) evaluations rely on demonstration examples to steer models' responses towards the desired style. While the number of examples used has been studied and standardized, the choice of how to format examples is less investigated. In evaluation protocols and real-world usage, users face the choice of how to separate in-context examples: a comma? a new line? a semicolon? a hashtag? Surprisingly, we find this seemingly minor choice can dramatically alter model response quality. Across leading model families (Llama, Qwen, Gemma), performance on MMLU, for example, can vary by $\pm 23\%$ depending on the choice of delimiter. In fact, one can manipulate model rankings to put any model in the lead by modifying only the single character separating examples. We find that this brittleness pervades topics and model families, and does not improve with scale. By probing attention head scores, we find that well-performing delimiters steer attention towards key tokens in the input. Finally, we explore methods to improve LLMs' robustness to the choice of delimiter: we find that specifying the selected delimiter in the prompt boosts robustness, and we offer practical recommendations for the best-performing delimiters.
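The manipulation the abstract describes is simply a change in the separator used when concatenating in-context examples. A minimal sketch of how such prompt variants would be constructed (the examples and delimiters are illustrative):

```python
examples = [
    "Q: 2+2? A: 4",
    "Q: capital of France? A: Paris",
]
question = "Q: 3*3? A:"

def build_prompt(examples, question, delimiter):
    """Join few-shot examples and the query with a chosen delimiter."""
    return delimiter.join(examples + [question])

# The same content, differing only in the separating character(s).
prompts = {d: build_prompt(examples, question, d)
           for d in ("\n", ", ", "; ", " # ")}
```

Evaluating one model on each variant exposes the delimiter sensitivity the paper reports; stating the chosen delimiter explicitly in the prompt is the mitigation it suggests.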




Overfitting in Adaptive Robust Optimization

Zhu, Karl, Bertsimas, Dimitris

arXiv.org Machine Learning

Adaptive robust optimization (ARO) extends static robust optimization by allowing decisions to depend on the realized uncertainty, weakly dominating static solutions within the modeled uncertainty set. However, ARO makes constraints that were previously independent of the uncertainty depend on it, making it vulnerable to additional infeasibilities when realizations fall outside the uncertainty set. This phenomenon of adaptive policies being brittle is analogous to overfitting in machine learning. To mitigate this, we propose assigning constraint-specific uncertainty set sizes, with harder constraints given stronger probabilistic guarantees. Interpreted through the overfitting lens, this acts as regularization: tighter guarantees shrink adaptive coefficients to ensure stability, while looser ones preserve useful flexibility. This view motivates a principled approach to designing uncertainty sets that balances robustness and adaptivity.
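One simple way to realize "constraint-specific uncertainty set sizes" is to give each constraint its own violation budget and size its set via a distributional quantile. The Gaussian assumption and the budgets below are illustrative, not necessarily the paper's construction:

```python
from statistics import NormalDist

def set_radius(eps: float) -> float:
    """Radius such that a standard-normal realization stays inside
    the set with probability 1 - eps (Gaussian assumption)."""
    return NormalDist().inv_cdf(1.0 - eps)

# Harder constraints get smaller violation budgets eps, hence larger radii.
budgets = {"safety": 0.001, "capacity": 0.05, "comfort": 0.20}
radii = {name: set_radius(eps) for name, eps in budgets.items()}
# Tighter guarantees (smaller eps) yield larger radii, shrinking adaptive
# coefficients -- the regularization effect described above.
```

The monotone relationship is the point: a safety-critical constraint ends up with a markedly larger radius than a soft comfort constraint, trading flexibility for stability exactly where it matters.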


The Biased Samaritan: LLM biases in Perceived Kindness

Fagan, Jack H, Juyaal, Ruhaan, Yu, Amy Yue-Ming, Pun, Siya

arXiv.org Artificial Intelligence

While Large Language Models (LLMs) have become ubiquitous in many fields, understanding and mitigating LLM biases is an ongoing issue. This paper provides a novel method for evaluating the demographic biases of various generative AI models. By prompting models to assess a moral patient's willingness to intervene constructively, we aim to quantitatively evaluate different LLMs' biases towards various genders, races, and ages. Our work differs from existing work by aiming to determine the baseline demographic identities for various commercial models and the relationship between the baseline and other demographics. We strive to understand whether these biases are positive, neutral, or negative, and how strong they are. This paper can contribute to the objective assessment of bias in LLMs and give the user or developer the power to account for these biases in LLM output or in training future LLMs. Our analysis suggested two key findings: models view the baseline demographic as a white middle-aged or young adult male; however, a general trend across models is that non-baseline demographics are more willing to help than the baseline. These methodologies allowed us to distinguish these two biases, which are often tangled together.
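The probing method, varying only the demographic attributes in an otherwise fixed scenario prompt, can be sketched as below. The template wording and demographic lists are hypothetical, chosen only to show the cross-product structure of such a probe:

```python
# Fixed scenario; only the demographic slots vary between prompts.
TEMPLATE = ("A {age} {race} {gender} sees a stranger drop their groceries. "
            "On a scale of 1-10, how willing are they to help?")

ages    = ["young adult", "middle-aged"]
races   = ["white", "Black", "Asian"]
genders = ["man", "woman"]

prompts = [TEMPLATE.format(age=a, race=r, gender=g)
           for a in ages for r in races for g in genders]
# Scoring a model's response to each prompt and comparing against the
# baseline demographic separates the two biases the paper distinguishes.
```

Because everything except the demographic slots is held constant, differences in the elicited scores can be attributed to the demographic terms rather than to the scenario.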


Towards LLMs Robustness to Changes in Prompt Format Styles

Ngweta, Lilian, Kate, Kiran, Tsay, Jason, Rizk, Yara

arXiv.org Artificial Intelligence

Large language models (LLMs) have gained popularity in recent years for their utility in various applications. However, they are sensitive to non-semantic changes in prompt formats: small changes in the prompt format can lead to significant performance fluctuations. In the literature, this problem is commonly referred to as prompt brittleness. Previous research on prompt engineering has focused mainly on developing techniques for identifying the optimal prompt for specific tasks. Some studies have also explored the issue of prompt brittleness and proposed methods to quantify performance variations; however, no simple solution has been found to address this challenge. We propose Mixture of Formats (MOF), a simple and efficient technique for addressing prompt brittleness in LLMs by diversifying the styles used in the few-shot examples of the prompt. MOF was inspired by computer vision techniques that use datasets of diverse styles to prevent models from associating specific styles with the target variable. Empirical results show that our proposed technique reduces style-induced prompt brittleness in various LLMs while also enhancing overall performance across prompt variations and different datasets.
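The Mixture-of-Formats idea, rendering each few-shot example in a different style so that no single format dominates, can be sketched as follows. The style functions are illustrative, not the paper's exact formats:

```python
examples = [("2+2", "4"), ("5-3", "2"), ("3*3", "9")]

# Each example gets a distinct rendering style.
styles = [
    lambda q, a: f"Q: {q}\nA: {a}",
    lambda q, a: f"Question: {q} | Answer: {a}",
    lambda q, a: f"### {q}\n-> {a}",
]

# One example per style, joined into a single mixed-format prompt.
mof_prompt = "\n\n".join(style(q, a)
                         for style, (q, a) in zip(styles, examples))
```

A conventional few-shot prompt would use one style for all three examples; mixing styles is what discourages the model from tying its behavior to any particular format.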