 fluid intelligence


Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation

arXiv.org Artificial Intelligence

Recent advances in large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. However, whether LLMs possess genuine fluid intelligence (i.e., the ability to reason abstractly and generalize rules in novel situations) remains an open question. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. To address these limitations, we propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework. DRE-Bench consists of 36 abstract reasoning tasks organized across four cognitive levels, with each task featuring multiple dynamic variants that test the same underlying latent rule. This design enables fine-grained, interpretable, and reliable assessments of fluid intelligence. We evaluate a range of state-of-the-art LLMs, including both general LLMs (GPT-4o, Claude 3.7) and reasoning LLMs (o1, DeepSeek-R1, QwQ, Skywork-OR1). Experimental results reveal that although most LLMs achieve competent and robust performance in low-level cognition, they struggle with high-level cognition and exhibit limited generalization as task complexity grows. Our findings highlight the gap between current LLMs and true human-like fluid intelligence and offer a new path for systematically tracking reasoning progress in LLMs.


Bridging Foundation Models and Efficient Architectures: A Modular Brain Imaging Framework with Local Masking and Pretrained Representation Learning

arXiv.org Artificial Intelligence

Functional connectivity (FC) derived from resting-state fMRI plays a critical role in personalized predictions such as age and cognitive performance. However, applying foundation models (FMs) to fMRI data remains challenging due to its high dimensionality, computational complexity, and the difficulty in capturing complex spatiotemporal dynamics and indirect region-of-interest (ROI) interactions. To address these limitations, we propose a modular neuroimaging framework that integrates principles from FMs with efficient, domain-specific architectures. Our approach begins with a Local Masked Autoencoder (LMAE) for pretraining, which reduces the influence of hemodynamic response function (HRF) dynamics and suppresses noise. This is followed by a Random Walk Mixture of Experts (RWMOE) module that clusters features across spatial and temporal dimensions, effectively capturing intricate brain interactions. Finally, a state-space model (SSM)-based predictor performs downstream task inference. Evaluated on the Cambridge Centre for Ageing and Neuroscience (Cam-CAN) dataset, our framework achieved mean absolute errors (MAEs) of 5.343 for age prediction and 2.940 for fluid intelligence, with Pearson correlation coefficients (PCCs) of 0.928 and 0.887, respectively, outperforming existing state-of-the-art methods. Visualization of expert distribution weights further enhances interpretability by identifying key brain regions. This work provides a robust, interpretable alternative to LLM-based approaches for fMRI analysis, offering novel insights into brain aging and cognitive function.
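For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how MAE and PCC are conventionally computed. It is not the authors' evaluation code; the function names and the sample ages are hypothetical and chosen only for illustration.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_correlation(y_true, y_pred):
    """Pearson correlation coefficient between targets and predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Hypothetical ages (years) and model predictions, for illustration only.
ages = [25.0, 40.0, 55.0, 70.0]
predicted = [28.0, 38.0, 57.0, 65.0]

mae = mean_absolute_error(ages, predicted)
pcc = pearson_correlation(ages, predicted)
print(f"MAE = {mae:.3f}, PCC = {pcc:.3f}")
```

MAE is expressed in the units of the target (years for age), while PCC is unitless, which is why the two are typically reported together.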


The Man Out to Prove How Dumb AI Still Is

The Atlantic - Technology

They want to build AI models that achieve "artificial general intelligence," or AGI, matching or exceeding the capabilities of the human mind. The difference between these two men is that Altman has suggested that his company, OpenAI, has practically built the technology already. Chollet, a French computer scientist and one of the industry's sharpest skeptics, has said that notion is "absolutely clown shoes." When I spoke with him earlier this year, Chollet told me that AI companies have long been "intellectually lazy" in suggesting that their machines are on the path to a kind of supreme knowledge. At this point, those claims are based largely on the programs' ability to pass specific tests (such as the LSAT, Advanced Placement Biology, and even an introductory sommelier exam).


Understanding LLMs' Fluid Intelligence Deficiency: An Analysis of the ARC Task

arXiv.org Artificial Intelligence

While LLMs have exhibited strong performance on various NLP tasks, it is noteworthy that most of these tasks rely on utilizing the vast amount of knowledge encoded in LLMs' parameters, rather than solving new problems without prior knowledge. In cognitive research, the latter ability is referred to as fluid intelligence, which is considered to be critical for assessing human intelligence. Recent research on fluid intelligence assessments has highlighted significant deficiencies in LLMs' abilities. In this paper, we analyze the challenges LLMs face in demonstrating fluid intelligence through controlled experiments, using the most representative ARC task as an example. Our study revealed three major limitations in existing LLMs: limited ability for skill composition, unfamiliarity with abstract input formats, and the intrinsic deficiency of left-to-right decoding. Our data and code are available at https://wujunjie1998.github.io/araoc-benchmark.github.io/.


Individual Text Corpora Predict Openness, Interests, Knowledge and Level of Education

arXiv.org Artificial Intelligence

Here we examine whether the personality dimension of openness to experience can be predicted from the individual Google search history. By web scraping, individual text corpora (ICs) were generated from 214 participants with a mean number of 5 million word tokens. We trained word2vec models and used the similarities of each IC to label words, which were derived from a lexical approach of personality. These IC-label-word similarities were utilized as predictive features in neural models. For training and validation, we relied on 179 participants and held out a test sample of 35 participants. A grid search with varying numbers of predictive features, hidden units and boost factors was performed. As model selection criterion, we used R² in the validation samples penalized by the absolute R² difference between training and validation. The selected neural model explained 35% of the openness variance in the test sample, while an ensemble model with the same architecture often provided slightly more stable predictions for intellectual interests, knowledge in humanities and level of education. Finally, a learning curve analysis suggested that around 500 training participants are required for generalizable predictions. We discuss ICs as a complement or replacement of survey-based psychodiagnostics.
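The model selection criterion described above can be illustrated with a short sketch. The abstract does not give the exact formula, so the code assumes the simplest reading: the validation R² minus the absolute gap between training and validation R², with a penalty weight of 1. The function name, candidate names, and scores are all hypothetical.

```python
def penalized_r2(r2_train, r2_val):
    """Validation R² penalized by the train/validation gap.

    Assumed form (not confirmed by the abstract):
    score = R²_val - |R²_train - R²_val|
    """
    return r2_val - abs(r2_train - r2_val)


# Hypothetical grid-search results: name -> (training R², validation R²).
candidates = {
    "small_net": (0.40, 0.35),  # modest fit, small gap
    "large_net": (0.90, 0.45),  # better validation fit, large gap (overfit)
}

# Pick the candidate with the highest penalized score.
best = max(candidates, key=lambda name: penalized_r2(*candidates[name]))
print(best)
```

Under this assumed form, a model with a higher raw validation R² can still lose to a smaller model if its training score is much higher, which is the overfitting guard the criterion is presumably meant to provide.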


Integration of cognitive tasks into artificial general intelligence test for large models

arXiv.org Artificial Intelligence

During the evolution of large models, performance evaluation is necessarily performed on the intermediate models to assess their capabilities, and on the well-trained model to ensure safety before practical application. However, current model evaluations mainly rely on specific tasks and datasets, lacking a unified framework for assessing the multidimensional intelligence of large models. In this perspective, we advocate for a comprehensive framework of artificial general intelligence (AGI) test, aimed at fulfilling the testing needs of large language models and multi-modal large models with enhanced capabilities. The AGI test framework bridges cognitive science and natural language processing to encompass the full spectrum of intelligence facets, including crystallized intelligence, a reflection of amassed knowledge and experience; fluid intelligence, characterized by problem-solving and adaptive reasoning; social intelligence, signifying comprehension and adaptation within multifaceted social scenarios; and embodied intelligence, denoting the ability to interact with its physical environment. To assess the multidimensional intelligence of large models, the AGI test consists of a battery of well-designed cognitive tests adopted from human intelligence tests, which are then naturally encapsulated in an immersive virtual community. We propose that the complexity of AGI testing tasks should increase commensurate with the advancements in large models. We underscore the necessity for the interpretation of test results to avoid false negatives and false positives. We believe that cognitive science-inspired AGI tests will effectively guide the targeted improvement of large models in specific dimensions of intelligence and accelerate the integration of large models into human society.


The minimal computational substrate of fluid intelligence

arXiv.org Artificial Intelligence

The quantification of cognitive powers rests on identifying a behavioural task that depends on them. Such dependence cannot be assured, for the powers a task invokes cannot be experimentally controlled or constrained a priori, resulting in unknown vulnerability to failure of specificity and generalisability. Evaluating a compact version of Raven's Advanced Progressive Matrices (RAPM), a widely used clinical test of fluid intelligence, we show that LaMa, a self-supervised artificial neural network trained solely on the completion of partially masked images of natural environmental scenes, achieves human-level test scores a prima vista, without any task-specific inductive bias or training. Compared with cohorts of healthy and focally lesioned participants, LaMa exhibits human-like variation with item difficulty, and produces errors characteristic of right frontal lobe damage under degradation of its ability to integrate global spatial patterns. LaMa's narrow training and limited capacity -- comparable to the nervous system of the fruit fly -- suggest RAPM may be open to computationally simple solutions that need not necessarily invoke abstract reasoning.


Naive Few-Shot Learning: Uncovering the fluid intelligence of machines

arXiv.org Artificial Intelligence

In this paper, we aimed to help bridge the gap between human fluid intelligence - the ability to solve novel tasks without prior training - and the performance of deep neural networks, which typically require extensive prior training. An essential cognitive component for solving intelligence tests, which in humans are used to measure fluid intelligence, is the ability to identify regularities in sequences. This motivated us to construct a benchmark task, which we term sequence consistency evaluation (SCE), whose solution requires the ability to identify regularities in sequences. Given the proven capabilities of deep networks, their ability to solve such tasks after extensive training is expected. Surprisingly, however, we show that naive (randomly initialized) deep learning models that are trained on a single SCE with a single optimization step can still solve non-trivial versions of the task relatively well. We extend our findings to solve, without any prior training, real-world anomaly detection tasks in the visual and auditory modalities. These results demonstrate the fluid-intelligent computational capabilities of deep networks. We discuss the implications of our work for constructing fluid-intelligent machines.


Novel deep learning method may help predict cognitive function

#artificialintelligence

Northwestern investigators have developed a deep learning-based method that can predict cognitive function capacity based on brain shape and structure, detailed in a study published in Scientific Reports. The method, which uses graph convolutional neural networks (gCNNs), may also reveal new insights into the relationship between brain morphology and different cognitive functions as well as the decline of brain function. "When we apply the rich capabilities of CNNs to graph representation of the brain, we can explore the brain as an image in a previously unexplored way," said S. Kathleen Bandt, MD, assistant professor of Neurological Surgery and a co-author of the study. Understanding how the relationship between brain structure and cognitive function changes throughout the life course has remained elusive. However, previous work suggests that fluid intelligence (the ability to solve problems and to think and reason abstractly) depends heavily on two regions of the brain: the prefrontal cortex and parietal cortex, both of which are involved in decision-making and sensory perception, among other functions.


Population modeling with machine learning can enhance measures of mental health

#artificialintelligence

Figure 1 – Figure supplement 1: Learning curves on the random split-half validation used for model building. To facilitate comparisons, we evaluated predictions of age, fluid intelligence and neuroticism from a complete set of socio-demographic variables without brain imaging using the coefficient of determination R² metric (y-axis) to compare results obtained from 100 to 3000 training samples (x-axis). The cross-validation (CV) distribution was obtained from 100 Monte Carlo splits. Across targets, performance started to plateau after around 1000 training samples with scores virtually identical to the final model used in subsequent analyses. These benchmarks suggest that inclusion of additional training samples would not have led to substantial improvements in performance.
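The learning-curve procedure in this caption (repeated random splits, R² scored on held-out data at increasing training sizes) can be sketched as follows. This is an illustration only, not the study's pipeline: it uses synthetic one-dimensional data, a least-squares line instead of the study's models, and much smaller sample and split counts. All names and values are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r2_score(ys, preds):
    """Coefficient of determination on held-out data."""
    my = statistics.fmean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def learning_curve(xs, ys, train_sizes, n_splits, test_size, seed=0):
    """Mean held-out R² per training size over random (Monte Carlo) splits."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    curve = {}
    for n_train in train_sizes:
        scores = []
        for _ in range(n_splits):
            rng.shuffle(indices)  # fresh random split each repetition
            train = indices[:n_train]
            test = indices[n_train:n_train + test_size]
            slope, intercept = fit_line([xs[i] for i in train],
                                        [ys[i] for i in train])
            preds = [slope * xs[i] + intercept for i in test]
            scores.append(r2_score([ys[i] for i in test], preds))
        curve[n_train] = statistics.fmean(scores)
    return curve


# Synthetic data (assumption): y = 2x + unit Gaussian noise.
data_rng = random.Random(42)
xs = [data_rng.gauss(0, 1) for _ in range(1000)]
ys = [2.0 * x + data_rng.gauss(0, 1) for x in xs]

curve = learning_curve(xs, ys, train_sizes=[20, 100, 500],
                       n_splits=20, test_size=300)
for n_train, score in curve.items():
    print(n_train, round(score, 3))
```

A plateau shows up as mean R² values that stop improving as the training size grows, which is the pattern the caption reports after around 1000 samples.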