underspecification



Accounting for Underspecification in Statistical Claims of Model Superiority

Sanchez, Thomas, Gordaliza, Pedro M., Cuadra, Meritxell Bach

arXiv.org Artificial Intelligence

Machine learning methods are increasingly applied in medical imaging, yet many reported improvements lack statistical robustness: recent works have highlighted that small but significant performance gains are highly likely to be false positives. However, these analyses do not take underspecification into account, that is, the fact that models achieving similar validation scores may behave differently on unseen data due to random initialization or training dynamics. Here, we extend a recent statistical framework modeling false outperformance claims to include underspecification as an additional variance component. Our simulations demonstrate that even modest seed variability (approximately 1%) substantially increases the evidence required to support superiority claims. Our findings underscore the need for explicit modeling of training variance when validating medical imaging systems.


What Prompts Don't Say: Understanding and Managing Underspecification in LLM Prompts

Yang, Chenyang, Shi, Yike, Ma, Qianou, Liu, Michael Xieyang, Kästner, Christian, Wu, Tongshuang

arXiv.org Artificial Intelligence

Prompt underspecification is a common challenge when interacting with LLMs. In this paper, we present an in-depth analysis of this problem, showing that while LLMs can often infer unspecified requirements by default (41.1%), such behavior is fragile: underspecified prompts are 2x as likely to regress across model or prompt changes, sometimes with accuracy drops exceeding 20%. This instability makes it difficult to reliably build LLM applications. Moreover, simply specifying all requirements does not consistently help, as models have limited instruction-following ability and requirements can conflict. Standard prompt optimizers likewise provide little benefit. To address these issues, we propose requirements-aware prompt optimization mechanisms that improve performance by 4.8% on average over baselines. We further advocate for a systematic process of proactive requirements discovery, evaluation, and monitoring to better manage prompt underspecification in practice.


On Aligning Prediction Models with Clinical Experiential Learning: A Prostate Cancer Case Study

Vallon, Jacqueline J., Overman, William, Xu, Wanqiao, Panjwani, Neil, Ling, Xi, Vij, Sushmita, Bagshaw, Hilary P., Leppert, John T., Shah, Sumit, Sonn, Geoffrey, Srinivas, Sandy, Pollom, Erqi, Buyyounouski, Mark K., Bayati, Mohsen

arXiv.org Artificial Intelligence

Over the past decade, the use of machine learning (ML) models in healthcare applications has rapidly increased. Despite high performance, modern ML models do not always capture patterns the end user requires. For example, a model may predict a non-monotonically decreasing relationship between cancer stage and survival, keeping all other features fixed. In this paper, we present a reproducible framework for investigating this misalignment between model behavior and clinical experiential learning, focusing on the effects of underspecification of modern ML pipelines. In a prostate cancer outcome prediction case study, we first identify and address these inconsistencies by incorporating clinical knowledge, collected by a survey, via constraints into the ML model, and subsequently analyze the impact on model performance and behavior across degrees of underspecification. The approach shows that aligning the ML model with clinical experiential learning is possible without compromising performance. Motivated by recent literature in generative AI, we further examine the feasibility of a feedback-driven alignment approach in non-generative AI clinical risk prediction models through a randomized experiment with clinicians. Our findings illustrate that, by eliciting clinicians' model preferences using our proposed methodology, the larger the difference in how the constrained and unconstrained models make predictions for a patient, the more apparent the difference is in clinical interpretation.




Small Edits, Big Consequences: Telling Good from Bad Robustness in Large Language Models

Ismailov, Altynbek, Asanova, Salia

arXiv.org Artificial Intelligence

Large language models (LLMs) now write code in settings where misreading a single word can break safety or cost money, yet we still expect them to overlook stray typos. To probe where useful robustness ends and harmful insensitivity begins, we compile 50 LeetCode problems and craft three minimal prompt perturbations that should vary in importance: (i) progressive underspecification deleting 10% of words per step; (ii) lexical flip swapping a pivotal quantifier ("max" to "min"); and (iii) jargon inflation replacing a common noun with an obscure technical synonym. Six frontier models, including three "reasoning-tuned" versions, solve each mutated prompt, and their Python outputs are checked against the original test suites to reveal whether they reused the baseline solution or adapted. Among 11,853 generations we observe a sharp double asymmetry. Models remain correct in 85% of cases even after 90% of the prompt is missing, showing over-robustness to underspecification, yet only 54% react to a single quantifier flip that reverses the task, with reasoning-tuned variants even less sensitive than their bases. Jargon edits lie in between, passing through 56%. Current LLMs thus blur the line between harmless noise and meaning-changing edits, often treating both as ignorable. Masking salient anchors such as function names can force re-evaluation. We advocate evaluation and training protocols that reward differential sensitivity: stay steady under benign noise but adapt, or refuse, when semantics truly change.


Principles of semantic and functional efficiency in grammatical patterning

Cheng, Emily, Franzon, Francesca

arXiv.org Artificial Intelligence

Grammatical features such as number and gender serve two central functions in human languages. While they encode salient semantic attributes like numerosity and animacy, they also offload sentence processing cost by predictably linking words together via grammatical agreement. Grammars exhibit consistent organizational patterns across diverse languages, invariably rooted in a semantic foundation, a widely confirmed but still theoretically unexplained phenomenon. To explain the basis of universal grammatical patterns, we unify two fundamental properties of grammar, semantic encoding and agreement-based predictability, into a single information-theoretic objective under cognitive constraints. Our analyses reveal that grammatical organization provably inherits from perceptual attributes, but that grammars empirically prioritize functional goals, promoting efficient language processing over semantic encoding.


Do Pre-Trained Language Models Detect and Understand Semantic Underspecification? Ask the DUST!

Wildenburg, Frank, Hanna, Michael, Pezzelle, Sandro

arXiv.org Artificial Intelligence

In everyday language use, speakers frequently utter and interpret sentences that are semantically underspecified, namely, whose content is insufficient to fully convey their message or interpret them univocally. For example, to interpret the underspecified sentence "Don't spend too much", which leaves implicit what (not) to spend, additional linguistic context or outside knowledge is needed. In this work, we propose a novel Dataset of semantically Underspecified Sentences grouped by Type (DUST) and use it to study whether pre-trained language models (LMs) correctly identify and interpret underspecified sentences. We find that newer LMs are reasonably able to identify underspecified sentences when explicitly prompted. However, interpreting them correctly is much harder for any LMs. Our experiments show that when interpreting underspecified sentences, LMs exhibit little uncertainty, contrary to what theoretical accounts of underspecification would predict. Overall, our study reveals limitations in current models' processing of sentence semantics and highlights the importance of using naturalistic data and communicative scenarios when evaluating LMs' language capabilities.


Beyond development: Challenges in deploying machine learning models for structural engineering applications

Esteghamati, Mohsen Zaker, Bean, Brennan, Burton, Henry V., Naser, M. Z.

arXiv.org Machine Learning

Machine learning (ML)-based solutions are rapidly changing the landscape of many fields, including structural engineering. Despite their promising performance, these approaches are usually only demonstrated as proof-of-concept in structural engineering, and are rarely deployed for real-world applications. This paper aims to illustrate the challenges of developing ML models suitable for deployment through two illustrative examples. Among various pitfalls, the presented discussion focuses on model overfitting and underspecification, training data representativeness, variable omission bias, and cross-validation.