AITopics | irt model

Collaborating Authors

irt model

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction

Scarlatos, Alexander, Fernandez, Nigel, Ormerod, Christopher, Lottridge, Susan, Lan, Andrew

arXiv.org Artificial IntelligenceSep-19-2025

Item (question) difficulties play a crucial role in educational assessments, enabling accurate and efficient assessment of student abilities and personalization to maximize learning outcomes. Traditionally, estimating item difficulties can be costly, requiring real students to respond to items, followed by fitting an item response theory (IRT) model to get difficulty estimates. This approach cannot be applied to the cold-start setting for previously unseen items either. In this work, we present SMART (Simulated Students Aligned with IRT), a novel method for aligning simulated students with instructed ability, which can then be used in simulations to predict the difficulty of open-ended items. We achieve this alignment using direct preference optimization (DPO), where we form preference pairs based on how likely responses are under a ground-truth IRT model. We perform a simulation by generating thousands of responses, evaluating them with a large language model (LLM)-based scoring model, and fit the resulting data to an IRT model to obtain item difficulty estimates. Through extensive experiments on two real-world student response datasets, we show that SMART outperforms other item difficulty prediction methods by leveraging its improved ability alignment.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2507.05129

Country:

North America > United States (1.00)
Europe (1.00)
North America > Mexico > Mexico City (0.14)

Genre: Research Report > Promising Solution (0.48)

Industry:

Education > Assessment & Standards (0.88)
Education > Educational Technology > Educational Software > Computer Based Training (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

Add feedback

The Course Difficulty Analysis Cookbook

Baucks, Frederik, Schmucker, Robin, Wiskott, Laurenz

arXiv.org Artificial IntelligenceAug-20-2025

Curriculum analytics (CA) studies curriculum structure and student data to ensure the quality of educational programs. An essential aspect is studying course properties, which involves assigning each course a representative difficulty value. This is critical for several aspects of CA, such as quality control (e.g., monitoring variations over time), course comparisons (e.g., articulation), and course recommendation (e.g., advising). Measuring course difficulty requires careful consideration of multiple factors: First, when difficulty measures are sensitive to the performance level of enrolled students, it can bias interpretations by overlooking student diversity. By assessing difficulty independently of enrolled students' performances, we can reduce the risk of bias and enable fair, representative assessments of difficulty. Second, from a measurement theoretic perspective, the measurement must be reliable and valid to provide a robust basis for subsequent analyses. Third, difficulty measures should account for covariates, such as the characteristics of individual students within a diverse populations (e.g., transfer status). In recent years, various notions of difficulty have been proposed. This paper provides the first comprehensive review and comparison of existing approaches for assessing course difficulty based on grade point averages and latent trait modeling. It further offers a hands-on tutorial on model selection, assumption checking, and practical CA applications. These applications include monitoring course difficulty over time and detecting courses with disparate outcomes between distinct groups of students (e.g., dropouts vs. graduates), ultimately aiming to promote high-quality, fair, and equitable learning experiences. To support further research and application, we provide an open-source software package and artificial datasets, facilitating reproducibility and adoption.

artificial intelligence, data mining, machine learning, (22 more...)

arXiv.org Artificial Intelligence

2508.13218

Country:

Europe (1.00)
North America > United States > California (0.45)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (1.00)
Instructional Material > Course Syllabus & Notes (1.00)

Industry:

Education > Educational Setting > Higher Education (1.00)
Education > Curriculum (0.92)
Education > Assessment & Standards > Student Performance (0.68)

Technology:

Information Technology > Software (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(3 more...)

Add feedback

BanditCAT and AutoIRT: Machine Learning Approaches to Computerized Adaptive Testing and Item Calibration

Sharpnack, James, Hao, Kevin, Mulcaire, Phoebe, Bicknell, Klinton, LaFlair, Geoff, Yancey, Kevin, von Davier, Alina A.

arXiv.org Machine LearningOct-28-2024

In this paper, we present a complete framework for quickly calibrating and administering a robust large-scale computerized adaptive test (CAT) with a small number of responses. Calibration - learning item parameters in a test - is done using AutoIRT, a new method that uses automated machine learning (AutoML) in combination with item response theory (IRT), originally proposed in [Sharpnack et al., 2024]. AutoIRT trains a non-parametric AutoML grading model using item features, followed by an item-specific parametric model, which results in an explanatory IRT model. In our work, we use tabular AutoML tools (AutoGluon.tabular, [Erickson et al., 2020]) along with BERT embeddings and linguistically motivated NLP features. In this framework, we use Bayesian updating to obtain test taker ability posterior distributions for administration and scoring. For administration of our adaptive test, we propose the BanditCAT framework, a methodology motivated by casting the problem in the contextual bandit framework and utilizing item response theory (IRT). The key insight lies in defining the bandit reward as the Fisher information for the selected item, given the latent test taker ability from IRT assumptions. We use Thompson sampling to balance between exploring items with different psychometric characteristics and selecting highly discriminative items that give more precise information about ability. To control item exposure, we inject noise through an additional randomization step before computing the Fisher information. This framework was used to initially launch two new item types on the DET practice test using limited training data. We outline some reliability and exposure metrics for the 5 practice test experiments that utilized this framework.

artificial intelligence, bayesian inference, machine learning, (18 more...)

arXiv.org Machine Learning

2410.21033

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Middlesex County > Reading (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
(4 more...)

Genre: Research Report (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

Fairness Evaluation with Item Response Theory

Xu, Ziqi, Kandanaarachchi, Sevvandi, Ong, Cheng Soon, Ntoutsi, Eirini

arXiv.org Artificial IntelligenceOct-20-2024

Item Response Theory (IRT) has been widely used in educational psychometrics to assess student ability, as well as the difficulty and discrimination of test questions. In this context, discrimination specifically refers to how effectively a question distinguishes between students of different ability levels, and it does not carry any connotation related to fairness. In recent years, IRT has been successfully used to evaluate the predictive performance of Machine Learning (ML) models, but this paper marks its first application in fairness evaluation. In this paper, we propose a novel Fair-IRT framework to evaluate a set of predictive models on a set of individuals, while simultaneously eliciting specific parameters, namely, the ability to make fair predictions (a feature of predictive models), as well as the discrimination and difficulty of individuals that affect the prediction results. Furthermore, we conduct a series of experiments to comprehensively understand the implications of these parameters for fairness evaluation. Detailed explanations for item characteristic curves (ICCs) are provided for particular individuals. We propose the flatness of ICCs to disentangle the unfairness between individuals and predictive models. The experiments demonstrate the effectiveness of this framework as a fairness evaluation tool. Two real-world case studies illustrate its potential application in evaluating fairness in both classification and regression tasks. Our paper aligns well with the Responsible Web track by proposing a Fair-IRT framework to evaluate fairness in ML models, which directly contributes to the development of a more inclusive, equitable, and trustworthy AI.

data mining, machine learning, predictive model, (19 more...)

arXiv.org Artificial Intelligence

2411.02414

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
Asia (0.04)
Oceania > Australia > Australian Capital Territory > Canberra (0.04)
(3 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Law (1.00)
Education > Curriculum > Subject-Specific Education (0.47)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales

Wallmark, Joakim, Josefsson, Maria, Wiberg, Marie

arXiv.org Machine LearningOct-2-2024

Item Response Theory (IRT) is a powerful statistical approach for evaluating test items and determining test taker abilities through response analysis. An IRT model that better fits the data leads to more accurate latent trait estimates. In this study, we present a new model for multiple choice data, the monotone multiple choice (MMC) model, which we fit using autoencoders. Using both simulated scenarios and real data from the Swedish Scholastic Aptitude Test, we demonstrate empirically that the MMC model outperforms the traditional nominal response IRT model in terms of fit. Furthermore, we illustrate how the latent trait scale from any fitted IRT model can be transformed into a ratio scale, aiding in score interpretation and making it easier to compare different types of IRT models. We refer to these new scales as bit scales. Bit scales are especially useful for models for which minimal or no assumptions are made for the latent trait scale distributions, such as for the autoencoder fitted models in this study.

irt model, mmc model, test taker, (14 more...)

arXiv.org Machine Learning

2410.0148

Country:

North America > United States > Tennessee > Knox County > Knoxville (0.04)
North America > United States > Massachusetts > Middlesex County > Reading (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Education > Assessment & Standards > Assessment Methods (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Implicit assessment of language learning during practice as accurate as explicit testing

Hou, Jue, Katinskaia, Anisia, Vu, Anh-Duc, Yangarber, Roman

arXiv.org Artificial IntelligenceSep-24-2024

Assessment of proficiency of the learner is an essential part of Intelligent Tutoring Systems (ITS). We use Item Response Theory (IRT) in computer-aided language learning for assessment of student ability in two contexts: in test sessions, and in exercises during practice sessions. Exhaustive testing across a wide range of skills can provide a detailed picture of proficiency, but may be undesirable for a number of reasons. Therefore, we first aim to replace exhaustive tests with efficient but accurate adaptive tests. We use learner data collected from exhaustive tests under imperfect conditions, to train an IRT model to guide adaptive tests. Simulations and experiments with real learner data confirm that this approach is efficient and accurate. Second, we explore whether we can accurately estimate learner ability directly from the context of practice with exercises, without testing. We transform learner data collected from exercise sessions into a form that can be used for IRT modeling. This is done by linking the exercises to {\em linguistic constructs}; the constructs are then treated as "items" within IRT. We present results from large-scale studies with thousands of learners. Using teacher assessments of student ability as "ground truth," we compare the estimates obtained from tests vs. those from exercises. The experiments confirm that the IRT models can produce accurate ability estimation based on exercises.

artificial intelligence, learner, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2409.16133

Country:

Europe > Finland > Uusimaa > Helsinki (0.05)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(4 more...)

Genre: Research Report (0.82)

Industry:

Education > Curriculum > Subject-Specific Education (0.85)
Education > Educational Technology > Educational Software > Computer Based Training (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

AutoIRT: Calibrating Item Response Theory Models with Automated Machine Learning

Sharpnack, James, Mulcaire, Phoebe, Bicknell, Klinton, LaFlair, Geoff, Yancey, Kevin

arXiv.org Artificial IntelligenceSep-13-2024

Item response theory (IRT) is a class of interpretable factor models that are widely used in computerized adaptive tests (CATs), such as language proficiency tests. Traditionally, these are fit using parametric mixed effects models on the probability of a test taker getting the correct answer to a test item (i.e., question). Neural net extensions of these models, such as BertIRT, require specialized architectures and parameter tuning. We propose a multistage fitting procedure that is compatible with out-of-the-box Automated Machine Learning (AutoML) tools. It is based on a Monte Carlo EM (MCEM) outer loop with a two stage inner loop, which trains a non-parametric AutoML grade model using item features followed by an item specific parametric model. This greatly accelerates the modeling workflow for scoring tests. We demonstrate its effectiveness by applying it to the Duolingo English Test, a high stakes, online English proficiency test. We show that the resulting model is typically more well calibrated, gets better predictive performance, and more accurate scores than existing methods (non-explanatory IRT models and explanatory IRT models like BERT-IRT). Along the way, we provide a brief survey of machine learning methods for calibration of item parameters for CATs.

autoirt, item parameter, test taker, (15 more...)

arXiv.org Artificial Intelligence

2409.08823

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
(5 more...)

Genre:

Research Report (0.82)
Overview (0.68)

Industry: Education > Curriculum > Subject-Specific Education (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

Psychometric Alignment: Capturing Human Knowledge Distributions via Language Models

He-Yueya, Joy, Ma, Wanjing Anya, Gandhi, Kanishk, Domingue, Benjamin W., Brunskill, Emma, Goodman, Noah D.

arXiv.org Artificial IntelligenceJul-22-2024

Language models (LMs) are increasingly used to simulate human-like responses in scenarios where accurately mimicking a population's behavior can guide decision-making, such as in developing educational materials and designing public policies. The objective of these simulations is for LMs to capture the variations in human responses, rather than merely providing the expected correct answers. Prior work has shown that LMs often generate unrealistically accurate responses, but there are no established metrics to quantify how closely the knowledge distribution of LMs aligns with that of humans. To address this, we introduce "psychometric alignment," a metric that measures the extent to which LMs reflect human knowledge distribution. Assessing this alignment involves collecting responses from both LMs and humans to the same set of test items and using Item Response Theory to analyze the differences in item functioning between the groups. We demonstrate that our metric can capture important variations in populations that traditional metrics, like differences in accuracy, fail to capture. We apply this metric to assess existing LMs for their alignment with human knowledge distributions across three real-world domains. We find significant misalignment between LMs and human populations, though using persona-based prompts can improve alignment. Interestingly, smaller LMs tend to achieve greater psychometric alignment than larger LMs. Further, training LMs on human response data from the target distribution enhances their psychometric alignment on unseen test items, but the effectiveness of such training varies across domains.

dataset, lms, psychometric alignment, (12 more...)

arXiv.org Artificial Intelligence

2407.15645

Country:

North America > United States > Virginia (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > United Kingdom > England (0.04)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)

Genre:

Questionnaire & Opinion Survey (0.68)
Research Report (0.64)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.71)

Add feedback

$\texttt{metabench}$ -- A Sparse Benchmark to Measure General Ability in Large Language Models

Kipnis, Alex, Voudouris, Konstantinos, Buschoff, Luca M. Schulze, Schulz, Eric

arXiv.org Artificial IntelligenceJul-4-2024

Large Language Models (LLMs) vary in their abilities on a range of tasks. Initiatives such as the $\texttt{Open LLM Leaderboard}$ aim to quantify these differences with several large benchmarks (sets of test items to which an LLM can respond either correctly or incorrectly). However, high correlations within and between benchmark scores suggest that (1) there exists a small set of common underlying abilities that these benchmarks measure, and (2) items tap into redundant information and the benchmarks may thus be considerably compressed. We use data from $n > 5000$ LLMs to identify the most informative items of six benchmarks, ARC, GSM8K, HellaSwag, MMLU, TruthfulQA and WinoGrande (with $d=28,632$ items in total). From them we distill a sparse benchmark, $\texttt{metabench}$, that has less than $3\%$ of the original size of all six benchmarks combined. This new sparse benchmark goes beyond point scores by yielding estimators of the underlying benchmark-specific abilities. We show that these estimators (1) can be used to reconstruct each original $\textit{individual}$ benchmark score with, on average, $1.5\%$ root mean square error (RMSE), (2) reconstruct the original $\textit{total}$ score with $0.8\%$ RMSE, and (3) have a single underlying common factor whose Spearman correlation with the total score is $r = 0.93$.

benchmark, latent ability, llm, (15 more...)

arXiv.org Artificial Intelligence

2407.12844

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Europe > Austria > Vienna (0.14)
North America > United States > Illinois > Cook County > Evanston (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine (0.68)
Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)

Add feedback

Gaining Insights into Group-Level Course Difficulty via Differential Course Functioning

Baucks, Frederik, Schmucker, Robin, Borchers, Conrad, Pardos, Zachary A., Wiskott, Laurenz

arXiv.org Artificial IntelligenceMay-7-2024

Curriculum Analytics (CA) studies curriculum structure and student data to ensure the quality of educational programs. One desirable property of courses within curricula is that they are not unexpectedly more difficult for students of different backgrounds. While prior work points to likely variations in course difficulty across student groups, robust methodologies for capturing such variations are scarce, and existing approaches do not adequately decouple course-specific difficulty from students' general performance levels. The present study introduces Differential Course Functioning (DCF) as an Item Response Theory (IRT)-based CA methodology. DCF controls for student performance levels and examines whether significant differences exist in how distinct student groups succeed in a given course. Leveraging data from over 20,000 students at a large public university, we demonstrate DCF's ability to detect inequities in undergraduate course difficulty across student groups described by grade achievement. We compare major pairs with high co-enrollment and transfer students to their non-transfer peers. For the former, our findings suggest a link between DCF effect sizes and the alignment of course content to student home department motivating interventions targeted towards improving course preparedness. For the latter, results suggest minor variations in course-specific difficulty between transfer and non-transfer students. While this is desirable, it also suggests that interventions targeted toward mitigating grade achievement gaps in transfer students should encompass comprehensive support beyond enhancing preparedness for individual courses. By providing more nuanced and equitable assessments of academic performance and difficulties experienced by diverse student populations, DCF could support policymakers, course articulation officers, and student advisors.

course difficulty, dcf effect, student, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3657604.3662028

2406.04348

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
North America > United States > California > Alameda County > Berkeley (0.14)
(13 more...)

Genre:

Research Report > New Finding (1.00)
Instructional Material > Course Syllabus & Notes (1.00)

Industry: Education > Educational Setting > Higher Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Data Science > Data Mining (0.68)
(3 more...)

Add feedback