confound
Evaluating the Impact of Stimulus Quality in Investigations of LLM Language Performance
Pistotti, Timothy, Brown, Jason, Witbrock, Michael
Recent studies employing Large Language Models (LLMs) to test the Argument from the Poverty of the Stimulus (APS) have yielded contrasting results across syntactic phenomena. This paper investigates the hypothesis that characteristics of the stimuli used in recent studies, including lexical ambiguities and structural complexities, may confound model performance. A methodology is proposed for re-evaluating LLM competence on syntactic prediction, focusing on GPT-2. This involves: 1) establishing a baseline on previously used (both filtered and unfiltered) stimuli, and 2) generating a new, refined dataset using a state-of-the-art (SOTA) generative LLM (Gemini 2.5 Pro Preview) guided by linguistically informed templates designed to mitigate the identified confounds. Our preliminary findings indicate that GPT-2 performs notably better on these refined parasitic gap (PG) stimuli than on the baselines, suggesting that stimulus quality significantly influences outcomes in surprisal-based evaluations of LLM syntactic competence.
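The surprisal-based evaluation referred to here can be reproduced in a few lines. Below is a minimal sketch, assuming the Hugging Face transformers library and GPT-2; the minimal pair is an illustrative agreement example, not an item from the paper's stimulus set.

```python
# Minimal surprisal comparison for a minimal pair under GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def total_surprisal(sentence: str) -> float:
    """Sum of per-token surprisals (in bits) under GPT-2."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability of each actual next token given its prefix.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs[torch.arange(ids.size(1) - 1), ids[0, 1:]]
    return -(token_lp / torch.log(torch.tensor(2.0))).sum().item()

# The model "passes" the pair if the grammatical variant is less surprising.
print(total_surprisal("The keys to the cabinet are on the table."))
print(total_surprisal("The keys to the cabinet is on the table."))
```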
From Prompts to Constructs: A Dual-Validity Framework for LLM Research in Psychology
Large language models (LLMs) are rapidly being adopted across psychology, serving as research tools, experimental subjects, human simulators, and computational models of cognition. However, the application of human measurement tools to these systems can produce contradictory results, raising concerns that many findings are measurement phantoms--statistical artifacts rather than genuine psychological phenomena. In this Perspective, we argue that building a robust science of AI psychology requires integrating two of our field's foundational pillars: the principles of reliable measurement and the standards for sound causal inference. We present a dual-validity framework to guide this integration, which clarifies how the evidence needed to support a claim scales with its scientific ambition. Using an LLM to classify text may require only basic accuracy checks, whereas claiming it can simulate anxiety demands a far more rigorous validation process. Current practice systematically fails to meet these requirements, often treating statistical pattern matching as evidence of psychological phenomena. The same model output--endorsing "I am anxious"--requires different validation strategies depending on whether researchers claim to measure, characterize, simulate, or model psychological constructs. Moving forward requires developing computational analogues of psychological constructs and establishing clear, scalable standards of evidence rather than the uncritical application of human measurement tools.
Reviews: On Testing for Biases in Peer Review
Thank you to the authors for the detailed response. It addresses most of my concerns. I hope the authors do include a discussion of effect sizes as they suggest in the response, since effect sizes are perhaps the most important thing to assess for a problem like this. I now see that I misunderstood the importance of the assignment mechanism as a confound in experimental design, compared to a simple random-assignment analysis; thank you for that clarification. The paper would still be strengthened if it related the problem to how it is addressed in the causal inference and experimental design literature, but the work is a worthwhile contribution on its own.
Leading Whitespaces of Language Models' Subword Vocabulary Pose a Confound for Calculating Word Probabilities
Oh, Byung-Doh, Schuler, William
Word-by-word conditional probabilities from Transformer-based language models are increasingly being used to evaluate their predictions over minimal pairs or to model the incremental processing difficulty of human readers. In this paper, we argue that there is a confound posed by the subword tokenization scheme of such language models, which has gone unaddressed thus far: tokens in the subword vocabulary of most language models have leading whitespaces and therefore do not naturally define stop probabilities of words. We first prove that this can result in word probabilities that sum to more than one, thereby violating the axiom that $\mathsf{P}(\Omega) = 1$. This property results in a misallocation of word-by-word surprisal, where the unacceptability of the current word's end is incorrectly carried over to the next word. Additionally, such implicit prediction of word boundaries by language models is incongruous with psycholinguistic experiments in which human subjects directly observe upcoming word boundaries. We present a simple decoding technique that reallocates the probability of the trailing whitespace into that of the current word, resolving this confound. As a case study, we show that this yields significantly different estimates of garden-path effects in transitive/intransitive sentences, where a comma is strongly expected before the critical word.
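As a concrete illustration of the proposed fix, the sketch below folds the trailing word boundary into the current word's probability. It is a minimal reconstruction under stated assumptions (GPT-2's "Ġ"-prefixed vocabulary, boundary mass approximated by all whitespace-initial tokens), not the authors' released implementation.

```python
# Fold the probability of the upcoming word boundary into the current word.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Vocabulary mask: tokens that begin a new whitespace-prefixed word.
boundary = torch.tensor(
    [tok.startswith("Ġ")
     for tok in tokenizer.convert_ids_to_tokens(list(range(len(tokenizer))))]
)

def boundary_adjusted_log_prob(prefix: str, word: str) -> float:
    """log P(word, boundary | prefix): the word's subtokens plus the
    probability that the *next* token starts a new word."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    word_ids = tokenizer(" " + word, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, word_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0], dim=-1)
    n = prefix_ids.size(1)
    # Probability of the word's own subword tokens given the prefix.
    word_lp = log_probs[torch.arange(n - 1, ids.size(1) - 1), ids[0, n:]].sum()
    # Reaccount the trailing boundary: total mass of whitespace-initial
    # continuations after the word's final subtoken (EOS ignored for brevity).
    boundary_lp = torch.logsumexp(log_probs[-1][boundary], dim=0)
    return (word_lp + boundary_lp).item()

print(boundary_adjusted_log_prob("The dog", "barked"))
```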
Mole Recruitment: Poisoning of Image Classifiers via Selective Batch Sampling
Wisdom, Ethan, Gokhale, Tejas, Xiao, Chaowei, Yang, Yezhou
In this work, we present a data poisoning attack that confounds machine learning models without any manipulation of the image or label. This is achieved simply by leveraging the most confounding natural samples found within the training data itself, in a new form of targeted attack coined "Mole Recruitment." We define moles as the training samples of a class that appear most similar to samples of another class, and show that simply restructuring training batches with an optimal number of moles can lead to significant degradation in the performance of the targeted class. We show the efficacy of this novel attack in an offline setting across several standard image classification datasets, and demonstrate the real-world viability of the attack in a continual learning (CL) setting. Our analysis reveals that state-of-the-art models are susceptible to Mole Recruitment, exposing a previously undetected vulnerability of image classifiers.
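A minimal sketch of the mole-selection idea follows: it scores source-class samples by their mean cosine similarity to the target class in some feature space. The similarity measure and the batch-construction schedule are assumptions here; the paper's exact procedure may differ.

```python
# Pick the source-class samples that look most like the target class.
import numpy as np

def find_moles(feats: np.ndarray, labels: np.ndarray,
               source_class: int, target_class: int, k: int) -> np.ndarray:
    """Return indices of the k source-class samples most similar
    (on average) to target-class samples."""
    src_idx = np.where(labels == source_class)[0]
    tgt = feats[labels == target_class]
    # Normalize rows so dot products are cosine similarities.
    norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = norm(feats[src_idx]) @ norm(tgt).T        # (n_src, n_tgt)
    mean_sim = sims.mean(axis=1)
    return src_idx[np.argsort(-mean_sim)[:k]]

# Toy usage: 200 random 64-d features, two classes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 64))
labels = rng.integers(0, 2, size=200)
moles = find_moles(feats, labels, source_class=0, target_class=1, k=8)
# Training batches front-loaded with these moles would then be fed to the model.
print(moles)
```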
Confounds and Overestimations in Fake Review Detection: Experimentally Controlling for Product-Ownership and Data-Origin
Soldner, Felix, Kleinberg, Bennett, Johnson, Shane
The popularity of online shopping is steadily increasing. At the same time, fake product reviews are published widely and have the potential to affect consumer purchasing behavior. In response, previous work has developed automated methods utilizing natural language processing to detect fake product reviews. However, studies vary considerably in how well they detect deceptive reviews, and the reasons for such differences are unclear. A contributing factor may be the multitude of strategies used to collect data, introducing potential confounds which affect detection performance. Two possible confounds are data-origin (i.e., the dataset is composed of more than one source) and product ownership (i.e., reviews written by individuals who do or do not own the reviewed product). In the present study, we investigate the effect of both confounds on fake review detection. Using an experimental design, we manipulate data-origin, product ownership, review polarity, and veracity. Supervised learning analysis suggests that review veracity (60.26-69.87%) is somewhat detectable, but reviews additionally confounded with product ownership (66.19-74.17%) or with data-origin (84.44-86.94%) are easier to classify. Review veracity is most easily classified if confounded with product ownership and data-origin combined (87.78-88.12%). These findings are moderated by review polarity.
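To make the inflation effect concrete, here is a small synthetic demonstration (invented toy data, not the study's corpus): when the data source is perfectly confounded with veracity, a bag-of-words classifier scores far above its deconfounded counterpart by keying on source style alone.

```python
# Toy demonstration of confound-inflated "veracity" detection accuracy.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
base = ["great product works well", "bad quality broke fast",
        "love it would buy again", "terrible waste of money"]
# Each source has its own stylistic tic, independent of veracity.
source_tic = {0: " shipped quickly", 1: " per the description"}

def make_review(source: int) -> str:
    return rng.choice(base) + source_tic[int(source)]

n = 400
veracity = rng.integers(0, 2, size=n)            # 0 = fake, 1 = genuine
# Confounded setting: source is identical to the veracity label.
conf_texts = [make_review(v) for v in veracity]
# Deconfounded setting: source assigned independently of veracity.
deconf_texts = [make_review(rng.integers(0, 2)) for _ in veracity]

for name, texts in [("confounded", conf_texts), ("deconfounded", deconf_texts)]:
    X = CountVectorizer().fit_transform(texts)
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, veracity, cv=5).mean()
    print(name, round(acc, 3))
# The confounded classifier scores near 1.0 by keying on source style; the
# deconfounded one hovers near chance, since the toy texts carry no real signal.
```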
Confound-leakage: Confound Removal in Machine Learning Leads to Leakage
Hamdan, Sami, Love, Bradley C., von Polier, Georg G., Weis, Susanne, Schwender, Holger, Eickhoff, Simon B., Patil, Kaustubh R.
Machine learning (ML) approaches to data analysis are now widely adopted in many fields, including epidemiology and medicine. To apply these approaches, confounds must first be removed, as is commonly done by featurewise removal of their variance via linear regression before applying ML. Here, we show that this common approach to confound removal biases ML models, leading to misleading results. Specifically, this common deconfounding approach can leak information such that null or moderate effects become amplified to near-perfect prediction when nonlinear ML approaches are subsequently applied. We identify and evaluate possible mechanisms for such confound-leakage and provide practical guidance to mitigate its negative impact. We demonstrate the real-world importance of confound-leakage by analyzing a clinical dataset in which accuracy is overestimated for predicting attention deficit hyperactivity disorder (ADHD) with depression as a confound. Our results have wide-reaching implications for the implementation and deployment of ML workflows and urge caution against naïve use of standard confound removal approaches.
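For reference, the conventional deconfounding step at issue looks roughly like the sketch below: each feature is residualized on the confound with a linear regression before the downstream model is fit. This is a generic reconstruction of the common practice the paper critiques, not the authors' code; note too that fitting the residualization on the full dataset (as here, for brevity) is itself a leakage risk in a real train/test split.

```python
# Standard featurewise confound removal by linear-regression residualization.
import numpy as np
from sklearn.linear_model import LinearRegression

def residualize(X: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Remove the linear effect of confound c from every column of X."""
    c = c.reshape(-1, 1)
    X_clean = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        reg = LinearRegression().fit(c, X[:, j])
        X_clean[:, j] = X[:, j] - reg.predict(c)
    return X_clean

# Toy data: 300 samples, 5 features, one continuous confound.
rng = np.random.default_rng(0)
c = rng.normal(size=300)
X = rng.normal(size=(300, 5)) + 0.8 * c[:, None]   # features contaminated by c
X_clean = residualize(X, c)
# Featurewise correlations with c are now near zero, yet the paper shows a
# nonlinear model downstream can still recover confound information.
print(np.corrcoef(X_clean.T, c)[-1, :-1].round(3))
```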
Adversarial confound regression and uncertainty measurements to classify heterogeneous clinical MRI in Mass General Brigham
Leming, Matthew, Das, Sudeshna, Im, Hyungsoon
Automated disease detection in neuroimaging holds promise to improve the diagnostic ability of radiologists, but routinely collected clinical data frequently contain technical and demographic confounding factors that cause data to differ between sites and to be systematically associated with the disease of interest, negatively affecting the robustness of diagnostic models. There is a critical need for diagnostic deep learning models that can train on such imbalanced datasets without being influenced by these confounds. In this work, we introduce a novel deep learning architecture, MUCRAN (Multi-Confound Regression Adversarial Network), to train a deep learning model on clinical brain MRI while regressing out demographic and technical confounding factors. We trained MUCRAN on 17,076 clinical T1 axial brain MRIs collected from Massachusetts General Hospital before 2019 and demonstrated that it could successfully regress out major confounding factors in this large clinical dataset. We also applied a method for quantifying uncertainty across an ensemble of these models to automatically exclude out-of-distribution data in Alzheimer's disease (AD) detection. By combining MUCRAN and the uncertainty quantification method, we showed consistent and significant increases in AD detection accuracy for newly collected MGH data (post-2019) and for data from other hospitals. MUCRAN offers a generalizable approach to deep-learning-based automatic disease detection on heterogeneous clinical data.
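A common way to implement this kind of adversarial confound regression is a gradient reversal layer, sketched below in PyTorch. This is an illustrative pattern, not MUCRAN's published architecture; the layer sizes, the age confound, and the training loop are placeholders.

```python
# Adversarial confound regression via gradient reversal (illustrative).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        # Reverse (and scale) gradients flowing back into the encoder.
        return -ctx.lam * grad, None

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
disease_head = nn.Linear(64, 2)       # diagnosis classifier
confound_head = nn.Linear(64, 1)      # e.g., age regressor (adversary)
opt = torch.optim.Adam([*encoder.parameters(), *disease_head.parameters(),
                        *confound_head.parameters()], lr=1e-3)

x = torch.randn(16, 32)               # stand-in for image features
y = torch.randint(0, 2, (16,))        # diagnosis labels
c = torch.randn(16, 1)                # confound values (e.g., age)

for _ in range(100):
    z = encoder(x)
    loss = nn.functional.cross_entropy(disease_head(z), y)
    # Reversed gradients push the encoder to REMOVE confound information
    # while the confound head still tries to predict it.
    loss = loss + nn.functional.mse_loss(confound_head(GradReverse.apply(z, 1.0)), c)
    opt.zero_grad(); loss.backward(); opt.step()
```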
Topics to Avoid: Demoting Latent Confounds in Text Classification
Kumar, Sachin, Wintner, Shuly, Smith, Noah A., Tsvetkov, Yulia
Despite impressive performance on many text classification tasks, deep neural networks tend to learn frequent superficial patterns that are specific to the training data and do not always generalize well. In this work, we observe this limitation with respect to the task of native language identification. We find that standard text classifiers which perform well on the test set end up learning topical features that are confounds of the prediction task (e.g., if the input text mentions Sweden, the classifier predicts that the author's native language is Swedish). We propose a method that represents the latent topical confounds and a model that "unlearns" confounding features by predicting both the label of the input text and the confound; the two predictors are trained adversarially, in an alternating fashion, to learn a text representation that predicts the correct label but is less prone to using information about the confound. We show that this model generalizes better and learns features that are indicative of the writing style rather than the content.
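The alternating adversarial scheme can be sketched as follows (an illustrative reconstruction, not the authors' released code): one step fits the confound predictor on the frozen representation; the other updates the encoder and label head while penalizing confound predictability. All dimensions and toy tensors are placeholders.

```python
# Alternating adversarial "unlearning" of a topical confound (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU())
label_head = nn.Linear(128, 5)         # e.g., native-language classes
confound_head = nn.Linear(128, 20)     # e.g., latent topic IDs

opt_main = torch.optim.Adam([*encoder.parameters(), *label_head.parameters()], lr=1e-3)
opt_conf = torch.optim.Adam(confound_head.parameters(), lr=1e-3)

x = torch.randn(32, 300)               # stand-in text representations
y = torch.randint(0, 5, (32,))         # labels
t = torch.randint(0, 20, (32,))        # topical confound assignments

for _ in range(100):
    # Step A: fit the adversary on the frozen representation.
    conf_loss = F.cross_entropy(confound_head(encoder(x).detach()), t)
    opt_conf.zero_grad(); conf_loss.backward(); opt_conf.step()
    # Step B: predict the label while making the confound UNpredictable
    # (the encoder is rewarded for raising the adversary's loss).
    z = encoder(x)
    main_loss = F.cross_entropy(label_head(z), y) - F.cross_entropy(confound_head(z), t)
    opt_main.zero_grad(); main_loss.backward(); opt_main.step()
```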
Robust Text Classification under Confounding Shift
Landeiro, Virgile, Culotta, Aron
As statistical classifiers become integrated into real-world applications, it is important to consider not only their accuracy but also their robustness to changes in the data distribution. Although identifying and controlling for confounding variables Z, correlated with both the input X of a classifier and its output Y, has been assiduously studied in empirical social science, it is often neglected in text classification. This can be understood by noting that, if we assume the impact of confounding variables does not change between the time we fit a model and the time we use it, then prediction accuracy should be only slightly affected. We show in this paper that this assumption often does not hold and that when the influence of a confounding variable changes from training time to prediction time (i.e., under confounding shift), classifier accuracy can degrade rapidly. We use Pearl's back-door adjustment as a predictive framework to develop a model robust to confounding shift, under the condition that Z is observed at training time. Our approach does not draw any causal conclusions, but through experiments on six datasets we show that it outperforms baselines 1) in controlled cases where confounding shift is manually injected between fitting time and prediction time, 2) in natural experiments where confounding shift appears either abruptly or gradually, and 3) in cases where there are one or more confounders. Finally, we discuss several issues we encountered during this research, such as the effect of noise in the observation of Z and the importance of controlling only for confounding variables.
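Concretely, the back-door adjustment P(y | do(x)) = Σ_z P(y | x, z) P(z) can be used predictively by fitting label models per stratum of an observed discrete confound and averaging with the training-time P(z). The sketch below is one simple way to realize this, with invented toy data; it is not necessarily the authors' exact estimator.

```python
# Back-door-adjusted prediction with a discrete confound z observed at training.
import numpy as np
from sklearn.linear_model import LogisticRegression

class BackdoorClassifier:
    def fit(self, X, y, z):
        self.z_values, counts = np.unique(z, return_counts=True)
        self.p_z = counts / counts.sum()             # P(z) at training time
        self.models = {v: LogisticRegression(max_iter=1000).fit(X[z == v], y[z == v])
                       for v in self.z_values}
        return self

    def predict_proba(self, X):
        # P(y | do(x)) = sum_z P(y | x, z) P(z), with P(z) frozen from
        # training, so a shifted test-time z distribution cannot bias it.
        return sum(p * m.predict_proba(X)
                   for p, m in zip(self.p_z, self.models.values()))

# Toy usage with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
z = rng.integers(0, 2, size=500)
y = ((X[:, 0] + 0.5 * z + rng.normal(scale=0.5, size=500)) > 0).astype(int)
clf = BackdoorClassifier().fit(X, y, z)
print(clf.predict_proba(X[:3]).round(3))
```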