item response theory


MAQuA: Adaptive Question-Asking for Multidimensional Mental Health Screening using Item Response Theory

Varadarajan, Vasudha, Xu, Hui, Boehme, Rebecca Astrid, Mirstrom, Mariam Marlan, Sikstrom, Sverker, Schwartz, H. Andrew

arXiv.org Artificial Intelligence

Recent advances in large language models (LLMs) offer new opportunities for scalable, interactive mental health assessment, but excessive querying by LLMs burdens users and is inefficient for real-world screening across transdiagnostic symptom profiles. We introduce MAQuA, an adaptive question-asking framework for simultaneous, multidimensional mental health screening. Combining multi-outcome modeling on language responses with item response theory (IRT) and factor analysis, MAQuA selects the questions with the most informative responses across multiple dimensions at each turn to optimize diagnostic information, improving accuracy and potentially reducing response burden. Empirical results on a novel dataset reveal that MAQuA reduces the number of assessment questions required for score stabilization by 50-87% compared to random ordering (e.g., achieving stable depression scores with 71% fewer questions and eating disorder scores with 85% fewer questions). MAQuA demonstrates robust performance across both internalizing (depression, anxiety) and externalizing (substance use, eating disorder) domains, with early stopping strategies further reducing patient time and burden. These findings position MAQuA as a powerful and efficient tool for scalable, nuanced, and interactive mental health screening, advancing the integration of LLM-based agents into real-world clinical workflows.
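The abstract does not spell out the selection rule; as a minimal sketch of the general idea, assuming a 2PL item model and a hypothetical per-dimension item format (not MAQuA's actual item bank or code), the next question can be chosen by summing Fisher information across the screening dimensions at the current severity estimates:

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL response probability for ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Fisher information a single item carries about theta under the 2PL model."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def select_next_item(theta_hat, items, asked):
    """Pick the unasked item with the largest information summed over dimensions.
    `items` uses a hypothetical format: per-dimension discriminations 'a' and
    difficulties 'b'."""
    best, best_info = None, -np.inf
    for idx, item in enumerate(items):
        if idx in asked:
            continue
        info = sum(fisher_information(theta_hat[d], item["a"][d], item["b"][d])
                   for d in range(len(theta_hat)))
        if info > best_info:
            best, best_info = idx, info
    return best

# Toy example: two dimensions (say, depression and anxiety), three candidate items.
theta_hat = np.array([0.2, -0.5])   # current severity estimates
items = [{"a": [1.2, 0.1], "b": [0.0, 0.0]},
         {"a": [0.3, 1.5], "b": [-0.4, -0.6]},
         {"a": [0.8, 0.9], "b": [1.0, 1.0]}]
print(select_next_item(theta_hat, items, asked=set()))
```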



JE-IRT: A Geometric Lens on LLM Abilities through Joint Embedding Item Response Theory

Yao, Louie Hong, Jarvis, Nicholas, Zhan, Tiffany, Ghosh, Saptarshi, Liu, Linfeng, Jiang, Tianyu

arXiv.org Artificial Intelligence

Standard LLM evaluation practices compress diverse abilities into single scores, obscuring their inherently multidimensional nature. We present JE-IRT, a geometric item-response framework that embeds both LLMs and questions in a shared space. For question embeddings, the direction encodes semantics and the norm encodes difficulty, while correctness on each question is determined by the geometric interaction between the model and question embeddings. This geometry replaces a global ranking of LLMs with topical specialization and enables smooth variation across related questions. Building on this framework, our experimental results reveal that out-of-distribution behavior can be explained through directional alignment, and that larger norms consistently indicate harder questions. Moreover, JE-IRT naturally supports generalization: once the space is learned, new LLMs are added by fitting a single embedding. The learned space further reveals an LLM-internal taxonomy that only partially aligns with human-defined subject categories. JE-IRT thus establishes a unified and interpretable geometric lens that connects LLM abilities with the structure of questions, offering a distinctive perspective on model evaluation and generalization.
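One plausible reading of the geometric interaction, offered purely for illustration and not necessarily JE-IRT's exact scoring function: project the model embedding onto the question's direction (its topic) and compare that topical ability against the question's norm (its difficulty).

```python
import numpy as np

def correctness_prob(model_emb, question_emb):
    """Illustrative geometric response model: the question's direction encodes its
    topic, its norm encodes difficulty, and the model's ability along that topic
    is compared against the difficulty. An assumption, not the paper's exact form."""
    norm = np.linalg.norm(question_emb)
    direction = question_emb / norm
    ability_along_topic = model_emb @ direction
    return 1.0 / (1.0 + np.exp(-(ability_along_topic - norm)))

model_emb = np.array([1.5, 0.2, -0.3])    # toy learned LLM embedding
easy_q = np.array([0.4, 0.05, -0.1])      # small norm -> easy question
hard_q = np.array([2.4, 0.3, -0.6])       # same direction, larger norm -> hard question
print(correctness_prob(model_emb, easy_q), correctness_prob(model_emb, hard_q))
```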



Beyond Random Sampling: Instance Quality-Based Data Partitioning via Item Response Theory

Cardoso, Lucas, Santos, Vitor, Filho, José Ribeiro, Prudêncio, Ricardo, Kawasaki, Regiane, Alves, Ronnie

arXiv.org Artificial Intelligence

Robust validation of Machine Learning (ML) models is essential, but traditional data partitioning approaches often ignore the intrinsic quality of each instance. This study proposes the use of Item Response Theory (IRT) parameters to characterize and guide the partitioning of datasets in the model validation stage. The impact of IRT-informed partitioning strategies on the performance of several ML models was evaluated on four tabular datasets. The results demonstrate that IRT reveals an inherent heterogeneity among instances and highlights the existence of informative subgroups of instances within the same dataset. Based on IRT, balanced partitions were created that consistently help to better understand the bias-variance tradeoff of the models. In addition, the guessing parameter proved to be a determining factor: training on high-guessing instances can significantly impair model performance, resulting in cases with accuracy below 50%, while other partitions reached more than 70% on the same dataset.
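As a rough sketch of the partitioning idea, assuming per-instance 3PL parameters (discrimination a, difficulty b, guessing c) have already been estimated by some prior IRT fit, instances can be split on the guessing parameter; the 0.25 threshold below is illustrative, not the paper's choice.

```python
def partition_by_guessing(instance_params, threshold=0.25):
    """Split instances into low- and high-guessing groups via the 3PL guessing
    parameter c (the threshold is an illustrative choice, not the paper's)."""
    low = [i for i, p in enumerate(instance_params) if p["c"] <= threshold]
    high = [i for i, p in enumerate(instance_params) if p["c"] > threshold]
    return low, high

# Toy per-instance 3PL parameters, e.g. estimated beforehand from a pool of
# models' predictions on the dataset.
instance_params = [{"a": 1.1, "b": 0.3, "c": 0.05},
                   {"a": 0.7, "b": -0.8, "c": 0.40},
                   {"a": 1.6, "b": 1.2, "c": 0.10}]
low_guessing, high_guessing = partition_by_guessing(instance_params)
print(low_guessing, high_guessing)
```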


The Polish Vocabulary Size Test: A Novel Adaptive Test for Receptive Vocabulary Assessment

Fokin, Danil, Płużyczka, Monika, Golovin, Grigory

arXiv.org Artificial Intelligence

We present the Polish Vocabulary Size Test (PVST), a novel tool for assessing the receptive vocabulary size of both native and non-native Polish speakers. Based on Item Response Theory and Computerized Adaptive Testing, PVST dynamically adjusts to each test-taker's proficiency level, ensuring high accuracy while keeping the test duration short. To validate the test, a pilot study was conducted with 1,475 participants. Native Polish speakers demonstrated significantly larger vocabularies compared to non-native speakers. For native speakers, vocabulary size showed a strong positive correlation with age. The PVST is available online at myvocab.info/pl.
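A minimal computerized-adaptive-testing loop in the spirit of such a test (illustrative only; PVST's item bank, response model, and stopping rule may differ), with maximum-information item selection and an expected-a-posteriori ability update:

```python
import numpy as np

def rasch_p(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def eap_update(responses, difficulties, grid=np.linspace(-4, 4, 161)):
    """Expected-a-posteriori ability estimate and its standard error,
    with a standard normal prior over ability."""
    log_post = -0.5 * grid ** 2
    for x, b in zip(responses, difficulties):
        p = rasch_p(grid, b)
        log_post += np.where(x == 1, np.log(p), np.log(1.0 - p))
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    theta = float((grid * post).sum())
    se = float(np.sqrt(((grid - theta) ** 2 * post).sum()))
    return theta, se

bank = np.linspace(-3, 3, 50)             # toy item difficulties (e.g. frequency bands)
asked, responses = [], []
theta, se = 0.0, np.inf
while se > 0.35 and len(asked) < len(bank):
    p = rasch_p(theta, bank)
    info = p * (1.0 - p)                  # Rasch item information at current theta
    info[asked] = -1.0                    # never repeat an item
    nxt = int(np.argmax(info))
    asked.append(nxt)
    responses.append(int(np.random.rand() < rasch_p(0.8, bank[nxt])))  # simulated test-taker
    theta, se = eap_update(responses, bank[asked])
print(theta, se, len(asked))
```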


Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension?

Srivatsa, KV Aditya, Maurya, Kaushal Kumar, Kochmar, Ekaterina

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly used as proxy students in the development of Intelligent Tutoring Systems (ITSs) and in piloting test questions. However, to what extent these proxy students accurately emulate the behavior and characteristics of real students remains an open question. To investigate this, we collected a dataset of 489 items from the National Assessment of Educational Progress (NAEP), covering mathematics and reading comprehension in grades 4, 8, and 12. We then apply an Item Response Theory (IRT) model to position 11 diverse and state-of-the-art LLMs on the same ability scale as real student populations. Our findings reveal that, without guidance, strong general-purpose models consistently outperform the average student at every grade, while weaker or domain-mismatched models may align incidentally. Using grade-enforcement prompts changes models' performance, but whether they align with the average grade-level student remains highly model- and prompt-specific: no evaluated model-prompt pair fits the bill across subjects and grades, underscoring the need for new training and evaluation strategies. We conclude by providing guidelines for the selection of viable proxies based on our findings.
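The core positioning step can be sketched as follows, assuming item parameters calibrated on real student data are held fixed and a single ability value is fit to an LLM's binary response pattern (toy values below, not NAEP calibrations):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def twopl_p(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def locate_llm(responses, a, b):
    """Maximum-likelihood ability for one LLM, holding the item parameters fixed
    (in the paper's setting, those parameters come from real student data)."""
    def neg_log_lik(theta):
        p = twopl_p(theta, a, b)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1.0 - p))
    return minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x

# Toy calibration: six items with (a, b); 1 means the LLM answered correctly.
a = np.array([1.0, 1.3, 0.8, 1.1, 0.9, 1.4])
b = np.array([-1.0, -0.2, 0.4, 0.9, 1.5, 2.1])
llm_responses = np.array([1, 1, 1, 1, 0, 0])
print(locate_llm(llm_responses, a, b))
```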


General Scales Unlock AI Evaluation with Explanatory and Predictive Power

Zhou, Lexin, Pacchiardi, Lorenzo, Martínez-Plumed, Fernando, Collins, Katherine M., Moros-Daval, Yael, Zhang, Seraphina, Zhao, Qinlin, Huang, Yitian, Sun, Luning, Prunty, Jonathan E., Li, Zongqian, Sánchez-García, Pablo, Chen, Kexin Jiang, Casares, Pablo A. M., Zu, Jiyun, Burden, John, Mehrbakhsh, Behzad, Stillwell, David, Cebrian, Manuel, Wang, Jindong, Henderson, Peter, Wu, Sherry Tongshuang, Kyllonen, Patrick C., Cheke, Lucy, Xie, Xing, Hernández-Orallo, José

arXiv.org Artificial Intelligence

Ensuring safe and effective use of AI requires understanding and anticipating its performance on novel tasks, from advanced scientific challenges to transformed workplace activities. So far, benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems, given the low transferability across diverse tasks. In this paper, we introduce general scales for AI evaluation that can explain what common AI benchmarks really measure, extract ability profiles of AI systems, and predict their performance for new task instances, in- and out-of-distribution. Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate. Illustrated for 15 large language models and 63 tasks, high explanatory power is unleashed from inspecting the demand and ability profiles, bringing insights on the sensitivity and specificity exhibited by different benchmarks, and how knowledge, metacognition and reasoning are affected by model size, chain-of-thought and distillation. Surprisingly, high predictive power at the instance level becomes possible using these demand levels, providing superior estimates over black-box baseline predictors based on embeddings or finetuning, especially in out-of-distribution settings (new tasks and new benchmarks). The scales, rubrics, battery, techniques and results presented here represent a major step for AI evaluation, underpinning the reliable deployment of AI in the years ahead. (Collaborative platform: https://kinds-of-intelligence-cfi.github.io/ADELE.)
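The abstract does not give the predictor's functional form; as a loose illustration of predicting instance-level success from a demand profile and an ability profile, one could score the tightest ability-minus-demand margin (purely an assumption, not ADELE's model):

```python
import numpy as np

def predicted_success(ability_profile, instance_demands, slope=1.5):
    """Purely illustrative: success probability driven by the tightest margin
    between a system's ability and the instance's demand on each rubric scale."""
    margins = np.asarray(ability_profile) - np.asarray(instance_demands)
    return float(1.0 / (1.0 + np.exp(-slope * margins.min())))

# Toy profiles on three hypothetical rubric dimensions.
print(predicted_success([3.2, 2.5, 2.8], [2.0, 2.4, 1.9]))   # all demands met -> high
print(predicted_success([3.2, 2.5, 2.8], [2.0, 3.5, 1.9]))   # one demand exceeds ability -> low
```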


A Spectral Approach to Item Response Theory

Neural Information Processing Systems

The Rasch model is one of the most fundamental models in item response theory and has wide-ranging applications from education testing to recommendation systems. In a universe with n users and m items, the Rasch model assumes that the binary response X_{li} \in \{0,1\} of a user l with parameter \theta^*_l to an item i with parameter \beta^*_i (e.g., a user likes a movie, a student correctly solves a problem) is distributed as \mathbb{P}(X_{li} = 1) = 1/(1 + \exp(-(\theta^*_l - \beta^*_i))). In this paper, we propose a new item estimation algorithm for this celebrated model (i.e., to estimate \beta^*). The core of our algorithm is the computation of the stationary distribution of a Markov chain defined on an item-item graph. We complement our algorithmic contributions with finite-sample error guarantees, the first of their kind in the literature, showing that our algorithm is consistent and enjoys favorable optimality properties.
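A simplified sketch in the spirit of such spectral estimators (the paper's exact Markov-chain construction, normalization, and guarantees differ; the smoothing term below is added only to keep the toy chain irreducible):

```python
import numpy as np

def spectral_item_estimates(X):
    """Simplified spectral-style estimator of Rasch item difficulties.
    X[l, i] in {0, 1, nan}: response of user l to item i (nan = not observed)."""
    n_items = X.shape[1]
    M = np.zeros((n_items, n_items))
    for i in range(n_items):
        for j in range(n_items):
            if i == j:
                continue
            both = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            # users who solved item j but failed item i suggest i is the harder item
            M[i, j] = np.sum((X[both, i] == 0) & (X[both, j] == 1))
    M += 0.1                                  # smoothing keeps the toy chain irreducible
    np.fill_diagonal(M, 0.0)
    # row-stochastic transition matrix for a Markov chain on the item-item graph
    d = M.sum(axis=1).max()
    P = M / d
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))
    # stationary distribution = leading left eigenvector of P
    vals, vecs = np.linalg.eig(P.T)
    pi = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    pi /= pi.sum()
    # easier items accumulate stationary mass; difficulties recovered up to a shift
    beta_hat = -np.log(pi)
    return beta_hat - beta_hat.mean()

# Toy data: 5 users x 3 items, nan = unanswered.
X = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1],
              [np.nan, 1, 0], [1, 0, np.nan]], dtype=float)
print(spectral_item_estimates(X))
```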


An Item Response Theory-based R Module for Algorithm Portfolio Analysis

Oldfield, Brodie, Kandanaarachchi, Sevvandi, Xu, Ziqi, Muñoz, Mario Andrés

arXiv.org Artificial Intelligence

Experimental evaluation is crucial in AI research, especially for assessing algorithms across diverse tasks. Many studies often evaluate a limited set of algorithms, failing to fully understand their strengths and weaknesses within a comprehensive portfolio. This paper introduces an Item Response Theory (IRT) based analysis tool for algorithm portfolio evaluation called AIRT-Module. Traditionally used in educational psychometrics, IRT models test question difficulty and student ability using responses to test questions. Adapting IRT to algorithm evaluation, the AIRT-Module contains a Shiny web application and the R package airt. AIRT-Module uses algorithm performance measures to compute anomalousness, consistency, and difficulty limits for an algorithm and the difficulty of test instances. The strengths and weaknesses of algorithms are visualised using the difficulty spectrum of the test instances. AIRT-Module offers a detailed understanding of algorithm capabilities across varied test instances, thus enhancing comprehensive AI method assessment. It is available at https://sevvandi.shinyapps.io/AIRT/ .
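The airt package's actual computations (anomalousness, consistency, difficulty limits) are not reproduced here; as a rough illustration of the underlying mapping, a performance matrix can be recast so that algorithms play the role of test-takers and instances the role of items before any IRT fit:

```python
import numpy as np

# Illustrative reshaping only -- not the airt package's actual computations.
# perf[a, i]: performance of algorithm a on problem instance i (higher is better).
perf = np.array([[0.92, 0.40, 0.75],
                 [0.88, 0.35, 0.80],
                 [0.60, 0.55, 0.30]])

# Dichotomize against a hypothetical "good enough" threshold so the matrix looks
# like an IRT response matrix (algorithms = examinees, instances = items).
responses = (perf >= 0.7).astype(int)

# Classical proxies one might inspect before a full IRT fit:
instance_difficulty = 1.0 - responses.mean(axis=0)   # harder = fewer algorithms succeed
algorithm_ability = responses.mean(axis=1)           # stronger = more instances handled
print(instance_difficulty, algorithm_ability)
```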