irt
Latency-Response Theory Model: Evaluating Large Language Models via Response Accuracy and Chain-of-Thought Length
Xu, Zhiyu, Liu, Jia, Wang, Yixin, Gu, Yuqi
The proliferation of Large Language Models (LLMs) necessitates valid evaluation methods to guide downstream applications and actionable future improvements. The Item Response Theory (IRT) has recently emerged as a promising framework for evaluating LLMs via their response accuracy. Beyond simple response accuracy, LLMs' chain of thought (CoT) lengths serve as a vital indicator of their reasoning ability. To leverage the CoT length information to assist the evaluation of LLMs, we propose Latency-Response Theory (LaRT) to jointly model the response accuracy and CoT length by introducing the latent ability, latent speed, and a key correlation parameter between them. We derive an efficient estimation algorithm and establish rigorous identifiability results for the population parameters to ensure the statistical validity of estimation. Theoretical asymptotic analyses and simulation studies demonstrate LaRT's advantages over IRT in terms of higher estimation accuracy and shorter confidence intervals for latent traits. A key finding is that the asymptotic estimation precision of the latent ability under LaRT exceeds that of IRT whenever the latent ability and latent speed are correlated. We collect real responses from diverse LLMs on popular benchmark datasets. The application of LaRT reveals a strong negative correlation between the latent ability and latent speed in all benchmarks, with stronger correlation for more difficult benchmarks. This finding supports the intuition that higher reasoning ability correlates with slower speed and longer response latency. LaRT yields different LLM rankings than IRT and outperforms IRT across multiple key evaluation metrics including predictive power, item efficiency, ranking validity, and LLM evaluation efficiency. Code and data are available at https://github.com/Toby-X/Latency-Response-Theory-Model.
Optimal Foraging in Memory Retrieval: Evaluating Random Walks and Metropolis-Hastings Sampling in Modern Semantic Spaces
Human memory retrieval often resembles ecological foraging where animals search for food in a "patchy" environment. Optimal foraging means strict adherences to the Marginal V alue Thereom (MVT) in which individuals exploit a "patch" of semantically related concepts until it becomes less rewarding, then switch to a new cluster. While human behavioral data suggests foraging-like patterns in semantic fluency tasks, it is still unknown whether modern high-dimensional embedding spaces provide a sufficient representation for algorithms to closely match observed human behavior. By leveraging state-of-the-art embeddings and prior clustering and human semantic fluency data I find that random walks on these semantic embedding spaces produces results consistent with optimal foraging and the MVT. Surprisingly, introducing Metropolis-Hastings, an adaptive algorithm expected to model strategic acceptance and rejection of new clusters, does not produce results consistent with observed human behavior. These findings challenge the assumption that sophisticated sampling mechanisms inherently provide better cognitive models of memory retrieval. Instead, they highlight that appropriately structured semantic embeddings, even with minimalist sampling approaches, can produce near-optimal foraging dynamics. In doing so, my results support the perspective of Hills (2012) rather than Abbott (2015), demonstrating that modern embed-dings can approximate human memory foraging without relying on complex acceptance criteria.
Beyond Random Sampling: Instance Quality-Based Data Partitioning via Item Response Theory
Cardoso, Lucas, Santos, Vitor, Filho, José Ribeiro, Prudêncio, Ricardo, Kawasaki, Regiane, Alves, Ronnie
Robust validation of Machine Learning (ML) models is essential, but traditional data partitioning approaches often ignore the intrinsic quality of each instance. This study proposes the use of Item Response Theory (IRT) parameters to characterize and guide the partitioning of datasets in the model validation stage. The impact of IRT-informed partitioning strategies on the performance of several ML models in four tabular datasets was evaluated. The results obtained demonstrate that IRT reveals an inherent heterogeneity of the instances and highlights the existence of informative subgroups of instances within the same dataset. Based on IRT, balanced partitions were created that consistently help to better understand the tradeoff between bias and variance of the models. In addition, the guessing parameter proved to be a determining factor: training with high-guessing instances can significantly impair model performance and resulted in cases with accuracy below 50%, while other partitions reached more than 70% in the same dataset.
Enhancing Essay Cohesion Assessment: A Novel Item Response Theory Approach
Rosa, Bruno Alexandre, Oliveira, Hilário, Rodrigues, Luiz, Oliveira, Eduardo Araujo, Mello, Rafael Ferreira
Essays are considered a valuable mechanism for evaluating learning outcomes in writing. Textual cohesion is an essential characteristic of a text, as it facilitates the establishment of meaning between its parts. Automatically scoring cohesion in essays presents a challenge in the field of educational artificial intelligence. The machine learning algorithms used to evaluate texts generally do not consider the individual characteristics of the instances that comprise the analysed corpus. In this meaning, item response theory can be adapted to the context of machine learning, characterising the ability, difficulty and discrimination of the models used. This work proposes and analyses the performance of a cohesion score prediction approach based on item response theory to adjust the scores generated by machine learning models. In this study, the corpus selected for the experiments consisted of the extended Essay-BR, which includes 6,563 essays in the style of the National High School Exam (ENEM), and the Brazilian Portuguese Narrative Essays, comprising 1,235 essays written by 5th to 9th grade students from public schools. We extracted 325 linguistic features and treated the problem as a machine learning regression task. The experimental results indicate that the proposed approach outperforms conventional machine learning models and ensemble methods in several evaluation metrics. This research explores a potential approach for improving the automatic evaluation of cohesion in educational essays.
Enhancing Classifier Evaluation: A Fairer Benchmarking Strategy Based on Ability and Robustness
Cardoso, Lucas, Santos, Vitor, Ribeiro, José, Kawasaki, Regiane, Prudêncio, Ricardo, Alves, Ronnie
Benchmarking is a fundamental practice in machine learning (ML) for comparing the performance of classification algorithms. However, traditional evaluation methods often overlook a critical aspect: the joint consideration of dataset complexity and an algorithm's ability to generalize. Without this dual perspective, assessments may favor models that perform well on easy instances while failing to capture their true robustness. To address this limitation, this study introduces a novel evaluation methodology that combines Item Response Theory (IRT) with the Glicko-2 rating system, originally developed to measure player strength in competitive games. IRT assesses classifier ability based on performance over difficult instances, while Glicko-2 updates performance metrics - such as rating, deviation, and volatility - via simulated tournaments between classifiers. This combined approach provides a fairer and more nuanced measure of algorithm capability. A case study using the OpenML-CC18 benchmark showed that only 15% of the datasets are truly challenging and that a reduced subset with 50% of the original datasets offers comparable evaluation power. Among the algorithms tested, Random Forest achieved the highest ability score. The results highlight the importance of improving benchmark design by focusing on dataset quality and adopting evaluation strategies that reflect both difficulty and classifier proficiency.
Investigating the Zone of Proximal Development of Language Models for In-Context Learning
In this paper, we introduce a learning analytics framework to analyze the in-context learning (ICL) behavior of large language models (LLMs) through the lens of the Zone of Proximal Development (ZPD), an established theory in educational psychology. ZPD delineates the space between what a learner is capable of doing unsupported and what the learner cannot do even with support. We adapt this concept to ICL, measuring the ZPD of LLMs based on model performance on individual examples with and without ICL. Furthermore, we propose an item response theory (IRT) model to predict the distribution of zones for LLMs. Our findings reveal a series of intricate and multifaceted behaviors of ICL, providing new insights into understanding and leveraging this technique. Finally, we demonstrate how our framework can enhance LLM in both inference and fine-tuning scenarios: (1) By predicting a model's zone of proximal development, we selectively apply ICL to queries that are most likely to benefit from demonstrations, achieving a better balance between inference cost and performance; (2) We propose a human-like curriculum for fine-tuning, which prioritizes examples within the model's ZPD. The curriculum results in improved performance, and we explain its effectiveness through an analysis of the training dynamics of LLMs.
Utility-inspired Reward Transformations Improve Reinforcement Learning Training of Language Models
Maura-Rivero, Roberto-Rafael, Nagpal, Chirag, Patel, Roma, Visin, Francesco
Current methods that train large language models (LLMs) with reinforcement learning feedback, often resort to averaging outputs of multiple rewards functions during training. This overlooks crucial aspects of individual reward dimensions and inter-reward dependencies that can lead to sub-optimal outcomes in generations. In this work, we show how linear aggregation of rewards exhibits some vulnerabilities that can lead to undesired properties of generated text. We then propose a transformation of reward functions inspired by economic theory of utility functions (specifically Inada conditions), that enhances sensitivity to low reward values while diminishing sensitivity to already high values. We compare our approach to the existing baseline methods that linearly aggregate rewards and show how the Inada-inspired reward feedback is superior to traditional weighted averaging. We quantitatively and qualitatively analyse the difference in the methods, and see that models trained with Inada-transformations score as more helpful while being less harmful.
Standing on the shoulders of giants
Cardoso, Lucas Felipe Ferraro, Filho, José de Sousa Ribeiro, Santos, Vitor Cirilo Araujo, Frances, Regiane Silva Kawasaki, Alves, Ronnie Cley de Oliveira
Although fundamental to the advancement of Machine Learning, the classic evaluation metrics extracted from the confusion matrix, such as precision and F1, are limited. Such metrics only offer a quantitative view of the models' performance, without considering the complexity of the data or the quality of the hit. To overcome these limitations, recent research has introduced the use of psychometric metrics such as Item Response Theory (IRT), which allows an assessment at the level of latent characteristics of instances. This work investigates how IRT concepts can enrich a confusion matrix in order to identify which model is the most appropriate among options with similar performance. In the study carried out, IRT does not replace, but complements classical metrics by offering a new layer of evaluation and observation of the fine behavior of models in specific instances. It was also observed that there is 97% confidence that the score from the IRT has different contributions from 66% of the classical metrics analyzed.