AITopics | irt

Collaborating Authors

irt

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Latency-Response Theory Model: Evaluating Large Language Models via Response Accuracy and Chain-of-Thought Length

Xu, Zhiyu, Liu, Jia, Wang, Yixin, Gu, Yuqi

arXiv.org Machine LearningDec-12-2025

The proliferation of Large Language Models (LLMs) necessitates valid evaluation methods to guide downstream applications and actionable future improvements. The Item Response Theory (IRT) has recently emerged as a promising framework for evaluating LLMs via their response accuracy. Beyond simple response accuracy, LLMs' chain of thought (CoT) lengths serve as a vital indicator of their reasoning ability. To leverage the CoT length information to assist the evaluation of LLMs, we propose Latency-Response Theory (LaRT) to jointly model the response accuracy and CoT length by introducing the latent ability, latent speed, and a key correlation parameter between them. We derive an efficient estimation algorithm and establish rigorous identifiability results for the population parameters to ensure the statistical validity of estimation. Theoretical asymptotic analyses and simulation studies demonstrate LaRT's advantages over IRT in terms of higher estimation accuracy and shorter confidence intervals for latent traits. A key finding is that the asymptotic estimation precision of the latent ability under LaRT exceeds that of IRT whenever the latent ability and latent speed are correlated. We collect real responses from diverse LLMs on popular benchmark datasets. The application of LaRT reveals a strong negative correlation between the latent ability and latent speed in all benchmarks, with stronger correlation for more difficult benchmarks. This finding supports the intuition that higher reasoning ability correlates with slower speed and longer response latency. LaRT yields different LLM rankings than IRT and outperforms IRT across multiple key evaluation metrics including predictive power, item efficiency, ranking validity, and LLM evaluation efficiency. Code and data are available at https://github.com/Toby-X/Latency-Response-Theory-Model.

lart, latent ability, llm, (16 more...)

arXiv.org Machine Learning

2512.07019

Country:

North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > Michigan (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Ukraine > Kyiv Oblast > Kyiv (0.04)

Genre: Research Report (1.00)

Industry: Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Optimal Foraging in Memory Retrieval: Evaluating Random Walks and Metropolis-Hastings Sampling in Modern Semantic Spaces

Moore, James

arXiv.org Artificial IntelligenceNov-18-2025

Human memory retrieval often resembles ecological foraging where animals search for food in a "patchy" environment. Optimal foraging means strict adherences to the Marginal V alue Thereom (MVT) in which individuals exploit a "patch" of semantically related concepts until it becomes less rewarding, then switch to a new cluster. While human behavioral data suggests foraging-like patterns in semantic fluency tasks, it is still unknown whether modern high-dimensional embedding spaces provide a sufficient representation for algorithms to closely match observed human behavior. By leveraging state-of-the-art embeddings and prior clustering and human semantic fluency data I find that random walks on these semantic embedding spaces produces results consistent with optimal foraging and the MVT. Surprisingly, introducing Metropolis-Hastings, an adaptive algorithm expected to model strategic acceptance and rejection of new clusters, does not produce results consistent with observed human behavior. These findings challenge the assumption that sophisticated sampling mechanisms inherently provide better cognitive models of memory retrieval. Instead, they highlight that appropriately structured semantic embeddings, even with minimalist sampling approaches, can produce near-optimal foraging dynamics. In doing so, my results support the perspective of Hills (2012) rather than Abbott (2015), demonstrating that modern embed-dings can approximate human memory foraging without relying on complex acceptance criteria.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.12759

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.95)

Add feedback

1def1713ebf17722cbe300cfc1c88558-AuthorFeedback.pdf

Neural Information Processing SystemsOct-2-2025, 09:13:34 GMT

artificial intelligence, reviewer, tensor, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning (0.50)

Add feedback

Blind Video Temporal Consistency via Deep Video Prior

Neural Information Processing SystemsOct-2-2025, 00:41:35 GMT

Applying image processing algorithms independently to each video frame often leads to temporal inconsistency in the resulting video. To address this issue, we present a novel and general approach for blind video temporal consistency.

artificial intelligence, machine learning, temporal consistency, (19 more...)

Neural Information Processing Systems

Country: North America (0.28)

Genre: Research Report (0.94)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

Add feedback

Beyond Random Sampling: Instance Quality-Based Data Partitioning via Item Response Theory

Cardoso, Lucas, Santos, Vitor, Filho, José Ribeiro, Prudêncio, Ricardo, Kawasaki, Regiane, Alves, Ronnie

arXiv.org Artificial IntelligenceAug-15-2025

Robust validation of Machine Learning (ML) models is essential, but traditional data partitioning approaches often ignore the intrinsic quality of each instance. This study proposes the use of Item Response Theory (IRT) parameters to characterize and guide the partitioning of datasets in the model validation stage. The impact of IRT-informed partitioning strategies on the performance of several ML models in four tabular datasets was evaluated. The results obtained demonstrate that IRT reveals an inherent heterogeneity of the instances and highlights the existence of informative subgroups of instances within the same dataset. Based on IRT, balanced partitions were created that consistently help to better understand the tradeoff between bias and variance of the models. In addition, the guessing parameter proved to be a determining factor: training with high-guessing instances can significantly impair model performance and resulted in cases with accuracy below 50%, while other partitions reached more than 70% in the same dataset.

artificial intelligence, dataset, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2508.10628

Country: South America > Brazil (0.47)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area (0.95)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Add feedback

Enhancing Essay Cohesion Assessment: A Novel Item Response Theory Approach

Rosa, Bruno Alexandre, Oliveira, Hilário, Rodrigues, Luiz, Oliveira, Eduardo Araujo, Mello, Rafael Ferreira

arXiv.org Artificial IntelligenceJul-14-2025

Essays are considered a valuable mechanism for evaluating learning outcomes in writing. Textual cohesion is an essential characteristic of a text, as it facilitates the establishment of meaning between its parts. Automatically scoring cohesion in essays presents a challenge in the field of educational artificial intelligence. The machine learning algorithms used to evaluate texts generally do not consider the individual characteristics of the instances that comprise the analysed corpus. In this meaning, item response theory can be adapted to the context of machine learning, characterising the ability, difficulty and discrimination of the models used. This work proposes and analyses the performance of a cohesion score prediction approach based on item response theory to adjust the scores generated by machine learning models. In this study, the corpus selected for the experiments consisted of the extended Essay-BR, which includes 6,563 essays in the style of the National High School Exam (ENEM), and the Brazilian Portuguese Narrative Essays, comprising 1,235 essays written by 5th to 9th grade students from public schools. We extracted 325 linguistic features and treated the problem as a machine learning regression task. The experimental results indicate that the proposed approach outperforms conventional machine learning models and ensemble methods in several evaluation metrics. This research explores a potential approach for improving the automatic evaluation of cohesion in educational essays.

cohesion, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2507.08487

Country: South America > Brazil (0.68)

Genre: Research Report > New Finding (0.88)

Industry:

Education > Educational Setting > K-12 Education > Secondary School (0.48)
Education > Educational Technology > Educational Software (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.67)

Add feedback

Enhancing Classifier Evaluation: A Fairer Benchmarking Strategy Based on Ability and Robustness

Cardoso, Lucas, Santos, Vitor, Ribeiro, José, Kawasaki, Regiane, Prudêncio, Ricardo, Alves, Ronnie

arXiv.org Artificial IntelligenceApr-15-2025

Benchmarking is a fundamental practice in machine learning (ML) for comparing the performance of classification algorithms. However, traditional evaluation methods often overlook a critical aspect: the joint consideration of dataset complexity and an algorithm's ability to generalize. Without this dual perspective, assessments may favor models that perform well on easy instances while failing to capture their true robustness. To address this limitation, this study introduces a novel evaluation methodology that combines Item Response Theory (IRT) with the Glicko-2 rating system, originally developed to measure player strength in competitive games. IRT assesses classifier ability based on performance over difficult instances, while Glicko-2 updates performance metrics - such as rating, deviation, and volatility - via simulated tournaments between classifiers. This combined approach provides a fairer and more nuanced measure of algorithm capability. A case study using the OpenML-CC18 benchmark showed that only 15% of the datasets are truly challenging and that a reduced subset with 50% of the original datasets offers comparable evaluation power. Among the algorithms tested, Random Forest achieved the highest ability score. The results highlight the importance of improving benchmark design by focusing on dataset quality and adopting evaluation strategies that reflect both difficulty and classifier proficiency.

artificial intelligence, decision tree learning, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2504.09759

Country:

South America > Brazil (0.28)
North America > United States (0.28)

Genre:

Research Report (1.00)
Workflow (0.68)

Industry:

Education (0.67)
Leisure & Entertainment > Games (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.34)

Add feedback

Investigating the Zone of Proximal Development of Language Models for In-Context Learning

Cui, Peng, Sachan, Mrinmaya

arXiv.org Artificial IntelligenceFeb-10-2025

In this paper, we introduce a learning analytics framework to analyze the in-context learning (ICL) behavior of large language models (LLMs) through the lens of the Zone of Proximal Development (ZPD), an established theory in educational psychology. ZPD delineates the space between what a learner is capable of doing unsupported and what the learner cannot do even with support. We adapt this concept to ICL, measuring the ZPD of LLMs based on model performance on individual examples with and without ICL. Furthermore, we propose an item response theory (IRT) model to predict the distribution of zones for LLMs. Our findings reveal a series of intricate and multifaceted behaviors of ICL, providing new insights into understanding and leveraging this technique. Finally, we demonstrate how our framework can enhance LLM in both inference and fine-tuning scenarios: (1) By predicting a model's zone of proximal development, we selectively apply ICL to queries that are most likely to benefit from demonstrations, achieving a better balance between inference cost and performance; (2) We propose a human-like curriculum for fine-tuning, which prioritizes examples within the model's ZPD. The curriculum results in improved performance, and we explain its effectiveness through an analysis of the training dynamics of LLMs.

demonstration, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2502.0699

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(5 more...)

Genre: Research Report > New Finding (0.66)

Industry: Education (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Utility-inspired Reward Transformations Improve Reinforcement Learning Training of Language Models

Maura-Rivero, Roberto-Rafael, Nagpal, Chirag, Patel, Roma, Visin, Francesco

arXiv.org Artificial IntelligenceJan-8-2025

Current methods that train large language models (LLMs) with reinforcement learning feedback, often resort to averaging outputs of multiple rewards functions during training. This overlooks crucial aspects of individual reward dimensions and inter-reward dependencies that can lead to sub-optimal outcomes in generations. In this work, we show how linear aggregation of rewards exhibits some vulnerabilities that can lead to undesired properties of generated text. We then propose a transformation of reward functions inspired by economic theory of utility functions (specifically Inada conditions), that enhances sensitivity to low reward values while diminishing sensitivity to already high values. We compare our approach to the existing baseline methods that linearly aggregate rewards and show how the Inada-inspired reward feedback is superior to traditional weighted averaging. We quantitatively and qualitatively analyse the difference in the methods, and see that models trained with Inada-transformations score as more helpful while being less harmful.

large language model, machine learning, reinforcement learning, (19 more...)

arXiv.org Artificial Intelligence

2501.06248

Genre: Research Report (0.64)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Standing on the shoulders of giants

Cardoso, Lucas Felipe Ferraro, Filho, José de Sousa Ribeiro, Santos, Vitor Cirilo Araujo, Frances, Regiane Silva Kawasaki, Alves, Ronnie Cley de Oliveira

arXiv.org Machine LearningSep-6-2024

Although fundamental to the advancement of Machine Learning, the classic evaluation metrics extracted from the confusion matrix, such as precision and F1, are limited. Such metrics only offer a quantitative view of the models' performance, without considering the complexity of the data or the quality of the hit. To overcome these limitations, recent research has introduced the use of psychometric metrics such as Item Response Theory (IRT), which allows an assessment at the level of latent characteristics of instances. This work investigates how IRT concepts can enrich a confusion matrix in order to identify which model is the most appropriate among options with similar performance. In the study carried out, IRT does not replace, but complements classical metrics by offering a new layer of evaluation and observation of the fine behavior of models in specific instances. It was also observed that there is 97% confidence that the score from the IRT has different contributions from 66% of the classical metrics analyzed.

confusion matrix, dataset, discrimination, (16 more...)

arXiv.org Machine Learning

2409.03151

Country:

South America > Brazil > Pará > Belém (0.04)
Europe > France (0.04)
North America > United States > District of Columbia > Washington (0.04)

Genre: Research Report (0.64)

Industry: Health & Medicine (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Add feedback