Goto

Collaborating Authors

 subtest


A Machine Learning-Based Framework to Shorten the Questionnaire for Assessing Autism Intervention

arXiv.org Artificial Intelligence

Caregivers of individuals with autism spectrum disorder (ASD) often find the 77-item Autism Treatment Evaluation Checklist (ATEC) burdensome, limiting its use for routine monitoring. This study introduces a generalizable machine learning framework that seeks to shorten assessments while maintaining evaluative accuracy. Using longitudinal ATEC data from 60 autistic children receiving therapy, we applied feature selection and cross-validation techniques to identify the most predictive items across two assessment goals: longitudinal therapy tracking and point-in-time severity estimation. For progress monitoring, the framework identified 16 items (21% of the original questionnaire) that retained strong correlation with total score change and full subdomain coverage. We also generated smaller subsets (1-7 items) for efficient approximations. For point-in-time severity assessment, our model achieved over 80% classification accuracy using just 13 items (17% of the original set). While demonstrated on ATEC, the methodology-based on subset optimization, model interpretability, and statistical rigor-is broadly applicable to other high-dimensional psychometric tools. The resulting framework could potentially enable more accessible, frequent, and scalable assessments and offer a data-driven approach for AI-supported interventions across neurodevelopmental and psychiatric contexts.


Efficiency Without Cognitive Change: Evidence from Human Interaction with Narrow AI Systems

arXiv.org Artificial Intelligence

The growing integration of artificial intelligence (AI) into human cognition raises a fundamental question: does AI merely improve efficiency, or does it alter how we think? This study experimentally tested whether short-term exposure to narrow AI tools enhances core cognitive abilities or simply optimizes task performance. Thirty young adults completed standardized neuropsychological assessments embedded in a seven-week protocol with a four-week online intervention involving problem-solving and verbal comprehension tasks, either with or without AI support (ChatGPT). While AI-assisted participants completed several tasks faster and more accurately, no significant pre-post differences emerged in standardized measures of problem solving or verbal comprehension. These results demonstrate efficiency gains without cognitive change, suggesting that current narrow AI systems serve as cognitive scaffolds extending performance without transforming underlying mental capacities. The findings highlight the need for ethical and educational frameworks that promote critical and autonomous thinking in an increasingly AI-augmented cognitive ecology.


Efficient Ensemble Conditional Independence Test Framework for Causal Discovery

arXiv.org Machine Learning

Constraint-based causal discovery relies on numerous conditional independence tests (CITs), but its practical applicability is severely constrained by the prohibitive computational cost, especially as CITs themselves have high time complexity with respect to the sample size. To address this key bottleneck, we introduce the Ensemble Conditional Independence Test (E-CIT), a general and plug-and-play framework. E-CIT operates on an intuitive divide-and-aggregate strategy: it partitions the data into subsets, applies a given base CIT independently to each subset, and aggregates the resulting p-values using a novel method grounded in the properties of stable distributions. This framework reduces the computational complexity of a base CIT to linear in the sample size when the subset size is fixed. Moreover, our tailored p-value combination method offers theoretical consistency guarantees under mild conditions on the subtests. Experimental results demonstrate that E-CIT not only significantly reduces the computational burden of CITs and causal discovery but also achieves competitive performance. Notably, it exhibits an improvement in complex testing scenarios, particularly on real-world datasets.


Exploring the Potential of Large Language Models for Estimating the Reading Comprehension Question Difficulty

arXiv.org Artificial Intelligence

Reading comprehension is a key for individual success, yet the assessment of question difficulty remains challenging due to the extensive human annotation and large-scale testing required by traditional methods such as linguistic analysis and Item Response Theory (IRT). While these robust approaches provide valuable insights, their scalability is limited. There is potential for Large Language Models (LLMs) to automate question difficulty estimation; however, this area remains underexplored. Our study investigates the effectiveness of LLMs, specifically OpenAI's GPT-4o and o1, in estimating the difficulty of reading comprehension questions using the Study Aid and Reading Assessment (SARA) dataset. We evaluated both the accuracy of the models in answering comprehension questions and their ability to classify difficulty levels as defined by IRT. The results indicate that, while the models yield difficulty estimates that align meaningfully with derived IRT parameters, there are notable differences in their sensitivity to extreme item characteristics. These findings suggest that LLMs can serve as the scalable method for automated difficulty assessment, particularly in dynamic interactions between learners and Adaptive Instructional Systems (AIS), bridging the gap between traditional psychometric techniques and modern AIS for reading comprehension and paving the way for more adaptive and personalized educational assessments. The manuscript has been accepted for presentation at the 27th International Conference on Human-Computer Interaction in Gothenburg, Sweden, from June 22-27, 2025.


The Cognitive Capabilities of Generative AI: A Comparative Analysis with Human Benchmarks

arXiv.org Artificial Intelligence

There is increasing interest in tracking the capabilities of general intelligence foundation models. This study benchmarks leading large language models and vision language models against human performance on the Wechsler Adult Intelligence Scale (WAIS-IV), a comprehensive, population-normed assessment of underlying human cognition and intellectual abilities, with a focus on the domains of VerbalComprehension (VCI), Working Memory (WMI), and Perceptual Reasoning (PRI). Most models demonstrated exceptional capabilities in the storage, retrieval, and manipulation of tokens such as arbitrary sequences of letters and numbers, with performance on the Working Memory Index (WMI) greater or equal to the 99.5th percentile when compared to human population normative ability. Performance on the Verbal Comprehension Index (VCI) which measures retrieval of acquired information, and linguistic understanding about the meaning of words and their relationships to each other, also demonstrated consistent performance at or above the 98th percentile. Despite these broad strengths, we observed consistently poor performance on the Perceptual Reasoning Index (PRI; range 0.1-10th percentile) from multimodal models indicating profound inability to interpret and reason on visual information. Smaller and older model versions consistently performed worse, indicating that training data, parameter count and advances in tuning are resulting in significant advances in cognitive ability.


Early Stopping Based on Repeated Significance

arXiv.org Artificial Intelligence

For a bucket test with a single criterion for success and a fixed number of samples or testing period, requiring a $p$-value less than a specified value of $\alpha$ for the success criterion produces statistical confidence at level $1 - \alpha$. For multiple criteria, a Bonferroni correction that partitions $\alpha$ among the criteria produces statistical confidence, at the cost of requiring lower $p$-values for each criterion. The same concept can be applied to decisions about early stopping, but that can lead to strict requirements for $p$-values. We show how to address that challenge by requiring criteria to be successful at multiple decision points.


Unveiling the General Intelligence Factor in Language Models: A Psychometric Approach

arXiv.org Artificial Intelligence

This study uncovers the factor of general intelligence, or g, in language models, extending the psychometric theory traditionally applied to humans and certain animal species. Utilizing factor analysis on two extensive datasets - Open LLM Leaderboard with 1,232 models and General Language Understanding Evaluation (GLUE) Leaderboard with 88 models - we find compelling evidence for a unidimensional, highly stable g factor that accounts for 85% of the variance in model performance. The study also finds a moderate correlation of .49 between model size and g. The discovery of g in language models offers a unified metric for model evaluation and opens new avenues for more robust, g-based model ability assessment. These findings lay the foundation for understanding and future research on artificial general intelligence from a psychometric perspective and have practical implications for model evaluation and development.


Detection of developmental language disorder in Cypriot Greek children using a machine learning neural network algorithm

arXiv.org Artificial Intelligence

Children with developmental language disorder (DLD) encounter difficulties in acquiring various language structures. Early identification and intervention are crucial to prevent negative long-term outcomes impacting the academic, social, and emotional development of children. The study aims to develop an automated method for the identification of DLD using artificial intelligence, specifically a neural network machine learning algorithm. This protocol is applied for the first time in a Cypriot Greek child population with DLD. The neural network model was trained using perceptual and production data elicited from 15 children with DLD and 15 healthy controls in the age range of 7;10 - 10;4. The k-fold technique was used to crossvalidate the algorithm. The performance of the model was evaluated using metrics such as accuracy, precision, recall, F1 score, and ROC/AUC curve to assess its ability to make accurate predictions on a set of unseen data. The results demonstrated high classification values for all metrics, indicating the high accuracy of the neural model in classifying children with DLD. Additionally, the variable importance analysis revealed that the language production skills of children had a more significant impact on the performance of the model compared to perception skills. Machine learning paradigms provide effective discrimination between children with DLD and those with TD, with the potential to enhance clinical assessment and facilitate earlier and more efficient detection of the disorder.


Thinking in PolAR Pictures: Using Rotation-Friendly Mental Images to Solve Leiter-R Form Completion

AAAI Conferences

The Leiter International Performance Scale-Revised (Leiter-R) is a standardized cognitive test that seeks to "provide a nonverbal measure of general intelligence by sampling a wide variety of functions from memory to nonverbal reasoning." Understanding the computational building blocks of nonverbal cognition, as measured by the Leiter-R, is an important step towards understanding human nonverbal cognition, especially with respect to typical and atypical trajectories of child development. One subtest of the Leiter-R, Form Completion, involves synthesizing and localizing a visual figure from its constituent slices. Form Completion poses an interesting nonverbal problem that seems to combine several aspects of visual memory, mental rotation, and visual search. We describe a new computational cognitive model that addresses Form Completion using a novel, mental-rotation-friendly image representation that we call the Polar Augmented Resolution (PolAR) Picture, which enables high-fidelity mental rotation operations. We present preliminary results using actual Leiter-R test items and discuss directions for future work.


An Approach to Evaluate AI Commonsense Reasoning Systems

AAAI Conferences

We propose and give a preliminary test of a new metric for the quality of the commonsense knowledge and reasoning of large AI databases: Using the same measurement as is used for a four-year-old, namely, an IQ test for young children. We report on results obtained us- ing test questions we wrote in the spirit of the questions of the Wechsler Preschool and Primary Scale of Intelligence, Third Edition (WPPSI-III) on the ConceptNet system, which were, on the whole, quite strong.