 confound


Evaluating The Impact of Stimulus Quality in Investigations of LLM Language Performance

Pistotti, Timothy, Brown, Jason, Witbrock, Michael

arXiv.org Artificial Intelligence

Recent studies employing Large Language Models (LLMs) to test the Argument from the Poverty of the Stimulus (APS) have yielded contrasting results across syntactic phenomena. This paper investigates the hypothesis that characteristics of the stimuli used in recent studies, including lexical ambiguities and structural complexities, may confound model performance. A methodology is proposed for re-evaluating LLM competence on syntactic prediction, focusing on GPT-2. This involves: 1) establishing a baseline on previously used (both filtered and unfiltered) stimuli, and 2) generating a new, refined dataset using a state-of-the-art (SOTA) generative LLM (Gemini 2.5 Pro Preview) guided by linguistically-informed templates designed to mitigate identified confounds. Our preliminary findings indicate that GPT-2 demonstrates notably improved performance on these refined PG stimuli compared to baselines, suggesting that stimulus quality significantly influences outcomes in surprisal-based evaluations of LLM syntactic competency.


From Prompts to Constructs: A Dual-Validity Framework for LLM Research in Psychology

Lin, Zhicheng

arXiv.org Artificial Intelligence

Large language models (LLMs) are rapidly being adopted across psychology, serving as research tools, experimental subjects, human simulators, and computational models of cognition. However, the application of human measurement tools to these systems can produce contradictory results, raising concerns that many findings are measurement phantoms--statistical artifacts rather than genuine psychological phenomena. In this Perspective, we argue that building a robust science of AI psychology requires integrating two of our field's foundational pillars: the principles of reliable measurement and the standards for sound causal inference. We present a dual-validity framework to guide this integration, which clarifies how the evidence needed to support a claim scales with its scientific ambition. Using an LLM to classify text may require only basic accuracy checks, whereas claiming it can simulate anxiety demands a far more rigorous validation process. Current practice systematically fails to meet these requirements, often treating statistical pattern matching as evidence of psychological phenomena. The same model output--endorsing "I am anxious"--requires different validation strategies depending on whether researchers claim to measure, characterize, simulate, or model psychological constructs. Moving forward requires developing computational analogues of psychological constructs and establishing clear, scalable standards of evidence rather than the uncritical application of human measurement tools.


Reviews: On Testing for Biases in Peer Review

Neural Information Processing Systems

Thank you to the authors for the detailed response. It addresses most of my concerns. I hope the authors do include a discussion of effect sizes as they suggest in the response, since effect sizes are perhaps the most important thing to assess for a problem like this. I now see I misunderstood the importance of the assignment mechanism's confound in experimental design, compared to simple random assignment analysis; thank you for that clarification. The paper would still be strengthened if it related the problem to how it's addressed in the causal inference and experimental design literature, but the work is still a worthwhile contribution on its own.


Leading Whitespaces of Language Models' Subword Vocabulary Poses a Confound for Calculating Word Probabilities

Oh, Byung-Doh, Schuler, William

arXiv.org Artificial Intelligence

Word-by-word conditional probabilities from Transformer-based language models are increasingly being used to evaluate their predictions over minimal pairs or to model the incremental processing difficulty of human readers. In this paper, we argue that there is a confound posed by the subword tokenization scheme of such language models, which has gone unaddressed thus far. This is due to the fact that tokens in the subword vocabulary of most language models have leading whitespaces and therefore do not naturally define stop probabilities of words. We first prove that this can result in word probabilities that sum to more than one, thereby violating the axiom that $\mathsf{P}(\Omega) = 1$. This property results in a misallocation of word-by-word surprisal, where the unacceptability of the current 'end of word' is incorrectly carried over to the next word. Additionally, this implicit prediction of word boundaries by language models is incongruous with psycholinguistic experiments where human subjects directly observe upcoming word boundaries. We present a simple decoding technique to reaccount the probability of the trailing whitespace into that of the current word, which resolves this confound. As a case study, we show that this results in significantly different estimates of garden-path effects in transitive/intransitive sentences, where a comma is strongly expected before the critical word.


Mole Recruitment: Poisoning of Image Classifiers via Selective Batch Sampling

Wisdom, Ethan, Gokhale, Tejas, Xiao, Chaowei, Yang, Yezhou

arXiv.org Artificial Intelligence

In this work, we present a data poisoning attack that confounds machine learning models without any manipulation of the image or label. This is achieved by simply leveraging the most confounding natural samples found within the training data itself, in a new form of a targeted attack coined "Mole Recruitment." We define moles as the training samples of a class that appear most similar to samples of another class, and show that simply restructuring training batches with an optimal number of moles can lead to significant degradation in the performance of the targeted class. We show the efficacy of this novel attack in an offline setting across several standard image classification datasets, and demonstrate the real-world viability of this attack in a continual learning (CL) setting. Our analysis reveals that state-of-the-art models are susceptible to Mole Recruitment, thereby exposing a previously undetected vulnerability of image classifiers.


Confounds and Overestimations in Fake Review Detection: Experimentally Controlling for Product-Ownership and Data-Origin

Soldner, Felix, Kleinberg, Bennett, Johnson, Shane

arXiv.org Artificial Intelligence

The popularity of online shopping is steadily increasing. At the same time, fake product reviews are published widely and have the potential to affect consumer purchasing behavior. In response, previous work has developed automated methods utilizing natural language processing approaches to detect fake product reviews. However, studies vary considerably in how well they succeed in detecting deceptive reviews, and the reasons for such differences are unclear. A contributing factor may be the multitude of strategies used to collect data, introducing potential confounds which affect detection performance. Two possible confounds are data-origin (i.e., the dataset is composed of more than one source) and product ownership (i.e., reviews written by individuals who own or do not own the reviewed product). In the present study, we investigate the effect of both confounds for fake review detection. Using an experimental design, we manipulate data-origin, product ownership, review polarity, and veracity. Supervised learning analysis suggests that review veracity (60.26 - 69.87%) is somewhat detectable but reviews additionally confounded with product-ownership (66.19 - 74.17%), or with data-origin (84.44 - 86.94%) are easier to classify. Review veracity is most easily classified if confounded with product-ownership and data-origin combined (87.78 - 88.12%). These findings are moderated by review polarity.


Confound-leakage: Confound Removal in Machine Learning Leads to Leakage

Hamdan, Sami, Love, Bradley C., von Polier, Georg G., Weis, Susanne, Schwender, Holger, Eickhoff, Simon B., Patil, Kaustubh R.

arXiv.org Artificial Intelligence

Machine learning (ML) approaches to data analysis are now widely adopted in many fields including epidemiology and medicine. To apply these approaches, confounds must first be removed as is commonly done by featurewise removal of their variance by linear regression before applying ML. Here, we show this common approach to confound removal biases ML models, leading to misleading results. Specifically, this common deconfounding approach can leak information such that what are null or moderate effects become amplified to near-perfect prediction when nonlinear ML approaches are subsequently applied. We identify and evaluate possible mechanisms for such confound-leakage and provide practical guidance to mitigate its negative impact. We demonstrate the real-world importance of confound-leakage by analyzing a clinical dataset where accuracy is overestimated for predicting attention deficit hyperactivity disorder (ADHD) with depression as a confound. Our results have wide-reaching implications for implementation and deployment of ML workflows and beg caution against naïve use of standard confound removal approaches.
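
The "featurewise removal of their variance by linear regression" mentioned in this abstract can be illustrated with a minimal sketch. It assumes a single continuous confound and ordinary least squares with an intercept. The function name `remove_confound` and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np

def remove_confound(X, c):
    """Residualize each feature column of X on the confound c.

    Hypothetical helper for illustration only: fits one ordinary
    least-squares regression (with intercept) per feature and
    returns the residuals.
    """
    X = np.asarray(X, dtype=float)
    # Design matrix: intercept column plus the confound column(s).
    C = np.column_stack([np.ones(len(X)), np.asarray(c, dtype=float)])
    # lstsq solves for all feature columns at once.
    beta, *_ = np.linalg.lstsq(C, X, rcond=None)
    return X - C @ beta

# Synthetic data (assumption): feature 0 depends on the confound, feature 1 does not.
rng = np.random.default_rng(0)
confound = rng.normal(size=200)
features = np.column_stack([
    2.0 * confound + rng.normal(size=200),
    rng.normal(size=200),
])

residuals = remove_confound(features, confound)

# After residualization, each feature is linearly uncorrelated with the confound.
max_abs_corr = max(
    abs(np.corrcoef(residuals[:, j], confound)[0, 1])
    for j in range(residuals.shape[1])
)
```

This step removes only the linear association between each feature and the confound. The abstract's point is that a subsequent nonlinear learner may still recover confound-related information from the residuals.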


Adversarial confound regression and uncertainty measurements to classify heterogeneous clinical MRI in Mass General Brigham

Leming, Matthew, Das, Sudeshna, Im, Hyungsoon

arXiv.org Artificial Intelligence

Automated disease detection in neuroimaging holds promise to improve the diagnostic ability of radiologists, but routinely collected clinical data frequently contains technical and demographic confounding factors that cause data to both differ between sites and be systematically associated with the disease of interest, thus negatively affecting the robustness of diagnostic models. There is a critical need for diagnostic deep learning models that can train on such imbalanced datasets without being influenced by these confounds. In this work, we introduce a novel deep learning architecture, MUCRAN (Multi-Confound Regression Adversarial Network), to train a deep learning model on clinical brain MRI while regressing demographic and technical confounding factors. We trained MUCRAN using 17,076 clinical T1 Axial brain MRIs collected from Massachusetts General Hospital before 2019 and demonstrated that MUCRAN could successfully regress major confounding factors in the vast clinical data. We also applied a method for quantifying uncertainty across an ensemble of these models to automatically exclude out-of-distribution data in AD detection. By combining MUCRAN and the uncertainty quantification method, we showed consistent and significant increases in the AD detection accuracy for newly collected MGH data (post-2019) and for data from other hospitals. MUCRAN offers a generalizable approach for heterogeneous clinical data for deep-learning-based automatic disease detection.


Topics to Avoid: Demoting Latent Confounds in Text Classification

Kumar, Sachin, Wintner, Shuly, Smith, Noah A., Tsvetkov, Yulia

arXiv.org Machine Learning

Despite impressive performance on many text classification tasks, deep neural networks tend to learn frequent superficial patterns that are specific to the training data and do not always generalize well. In this work, we observe this limitation with respect to the task of native language identification. We find that standard text classifiers which perform well on the test set end up learning topical features which are confounds of the prediction task (e.g., if the input text mentions Sweden, the classifier predicts that the author's native language is Swedish). We propose a method that represents the latent topical confounds and a model which "unlearns" confounding features by predicting both the label of the input text and the confound; but we train the two predictors adversarially in an alternating fashion to learn a text representation that predicts the correct label but is less prone to using information about the confound. We show that this model generalizes better and learns features that are indicative of the writing style rather than the content.


Robust Text Classification under Confounding Shift

Landeiro, Virgile, Culotta, Aron

Journal of Artificial Intelligence Research

As statistical classifiers become integrated into real-world applications, it is important to consider not only their accuracy but also their robustness to changes in the data distribution. Although identifying and controlling for confounding variables Z - correlated with both the input X of a classifier and its output Y - has been assiduously studied in empirical social science, it is often neglected in text classification. This can be understood by the fact that, if we assume that the impact of confounding variables does not change between the time we fit a model and the time we use it, then prediction accuracy should only be slightly affected. We show in this paper that this assumption often does not hold and that when the influence of a confounding variable changes from training time to prediction time (i.e., under confounding shift), the classifier accuracy can degrade rapidly. We use Pearl's back-door adjustment as a predictive framework to develop a model robust to confounding shift under the condition that Z is observed at training time. Our approach does not make any causal conclusions, but by experimenting on 6 datasets, we show that our approach is able to outperform baselines 1) in controlled cases where confounding shift is manually injected between fitting time and prediction time, 2) in natural experiments where confounding shift appears either abruptly or gradually, and 3) in cases where there is one or multiple confounders. Finally, we discuss multiple issues we encountered during this research such as the effect of noise in the observation of Z and the importance of only controlling for confounding variables.
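
The back-door adjustment named in this abstract is the standard formula p(y | do(x)) = Σ_z p(y | x, z) p(z). A minimal sketch of that arithmetic for a single input x and a discrete confounder Z follows. The function name `backdoor_adjust` and the probability tables are hypothetical illustrations, not the authors' model or data.

```python
import numpy as np

def backdoor_adjust(p_y_given_xz, p_z):
    """Compute sum_z p(y | x, z) * p(z) for one fixed input x.

    p_y_given_xz: array of shape (n_z, n_y); row z holds p(y | x, z).
    p_z:          array of shape (n_z,); marginal distribution of Z.
    Hypothetical helper for illustration only.
    """
    p_y_given_xz = np.asarray(p_y_given_xz, dtype=float)
    p_z = np.asarray(p_z, dtype=float)
    # Weight each confounder-specific prediction by p(z) and sum over z.
    return p_z @ p_y_given_xz

# Made-up numbers (assumption): two confounder values, two labels.
p_y_given_xz = np.array([
    [0.9, 0.1],  # p(y | x, z=0)
    [0.2, 0.8],  # p(y | x, z=1)
])
p_z = np.array([0.5, 0.5])

adjusted = backdoor_adjust(p_y_given_xz, p_z)
print(adjusted)  # approximately [0.55 0.45]
```

The prediction is averaged over the marginal p(z) instead of p(z | x), so it does not rely on whatever association between X and Z held in the training data.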