cognitive level
EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education
Ma, Guoqing, Zhu, Jia, Guo, Hanghui, Shi, Weijie, Cui, Yue, Shen, Jiawei, Li, Zilong, Liang, Yidan
Large language models (LLMs) demonstrate significant potential for educational applications. However, their unscrutinized deployment poses risks to educational standards, underscoring the need for rigorous evaluation. We introduce EduEval, a comprehensive hierarchical benchmark for evaluating LLMs in Chinese K-12 education. This benchmark makes three key contributions: (1) Cognitive Framework: we propose the EduAbility Taxonomy, which unifies Bloom's Taxonomy and Webb's Depth of Knowledge to organize tasks across six cognitive dimensions: Memorization, Understanding, Application, Reasoning, Creativity, and Ethics; (2) Authenticity: our benchmark integrates real exam questions, classroom conversations, student essays, and expert-designed prompts to reflect genuine educational challenges; (3) Scale: EduEval comprises 24 distinct task types with over 11,000 questions spanning primary to high school levels. We evaluate 14 leading LLMs under both zero-shot and few-shot settings, revealing that while models perform well on factual tasks, they struggle with classroom dialogue classification and exhibit inconsistent results in creative content generation. Interestingly, several open-source models outperform proprietary systems on complex educational reasoning. Few-shot prompting shows varying effectiveness across cognitive dimensions, suggesting that different educational objectives require tailored approaches. These findings provide targeted benchmarking metrics for developing LLMs specifically optimized for diverse Chinese educational tasks.
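EduEval's evaluation harness is not reproduced in this listing; as a rough sketch of how a hierarchical benchmark can be scored per cognitive dimension, the Python below maps hypothetical task names onto the six EduAbility dimensions named in the abstract and aggregates per-dimension accuracy (all task names and sample results are illustrative assumptions).

```python
# Minimal sketch: score a hierarchical benchmark by cognitive dimension.
# The six dimensions come from the abstract; the task names and the sample
# results are hypothetical.
from collections import defaultdict

TAXONOMY = {
    "Memorization": ["poem_recitation", "fact_recall"],
    "Understanding": ["passage_main_idea"],
    "Application": ["math_word_problems"],
    "Reasoning": ["multi_step_proofs"],
    "Creativity": ["essay_continuation"],
    "Ethics": ["classroom_dilemmas"],
}

def dimension_accuracy(results):
    """results: list of (task_name, is_correct) pairs."""
    task_to_dim = {t: d for d, tasks in TAXONOMY.items() for t in tasks}
    hits, totals = defaultdict(int), defaultdict(int)
    for task, correct in results:
        dim = task_to_dim[task]
        totals[dim] += 1
        hits[dim] += int(correct)
    return {d: hits[d] / totals[d] for d in totals}

if __name__ == "__main__":
    sample = [("fact_recall", True), ("fact_recall", False),
              ("multi_step_proofs", False)]
    print(dimension_accuracy(sample))  # {'Memorization': 0.5, 'Reasoning': 0.0}
```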
The Harder The Better: Maintaining Supervised Fine-tuning Generalization with Less but Harder Data
Shang, Zhaoyang, Wei, Sibo, Guo, Jianbin, Zhou, Rui, Dong, Lifeng, Luo, Yin
Large Language Models (LLMs) excel in general tasks, but adapting them to specialized domains relies on high-quality supervised fine-tuning (SFT) data. Although existing methods can identify subsets of high-quality data and reduce training cost to some extent, their selection process still suffers from over-reliance on LLMs' internal knowledge, weak interpretability, and limited generalization. To address these limitations, we propose THTB (The Harder The Better), a cognitive-science-inspired framework for instruction data selection and annotation guidance. THTB prioritizes higher-level cognitive instructions by combining quality filtering with intrinsic and extrinsic hardness scoring, offering interpretable and quantifiable criteria for efficient SFT in both data selection and annotation guidance. Experiments show that THTB enables models trained on only 5% of the data to outperform full-dataset training, while achieving superior generalization compared with LLM-only selection. In addition, THTB provides effective annotation guidance in vertical domains, enabling a model trained on just 2% of the data to surpass models trained on much larger datasets, demonstrating strong potential for domain adaptation. Our code, datasets, and models are available at https://github.com/DYJG-research/THTB.
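The abstract does not give THTB's scoring formulas; a minimal sketch of the selection pipeline it describes (quality filtering, combined intrinsic and extrinsic hardness scoring, then keeping the hardest 5%) might look like the following, where both hardness heuristics are placeholder assumptions rather than the paper's criteria.

```python
# Sketch of THTB-style selection: quality-filter, score hardness, keep top 5%.
# Both hardness heuristics below are placeholders, not the paper's formulas.

def intrinsic_hardness(instruction: str) -> float:
    # Placeholder: longer, more clause-heavy instructions count as harder.
    return len(instruction.split()) / 100.0

def extrinsic_hardness(instruction: str, reference_corpus: list) -> float:
    # Placeholder: rare vocabulary relative to a reference corpus counts as harder.
    corpus_vocab = {w for text in reference_corpus for w in text.split()}
    words = instruction.split()
    unseen = sum(1 for w in words if w not in corpus_vocab)
    return unseen / max(len(words), 1)

def select_hardest(pool, reference_corpus, keep_ratio=0.05, min_quality=0.5):
    """pool: list of dicts with 'instruction' text and a 'quality' score."""
    filtered = [ex for ex in pool if ex["quality"] >= min_quality]
    scored = sorted(
        filtered,
        key=lambda ex: intrinsic_hardness(ex["instruction"])
        + extrinsic_hardness(ex["instruction"], reference_corpus),
        reverse=True,
    )
    keep = max(1, int(len(scored) * keep_ratio))
    return scored[:keep]
```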
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Asia > China > Tianjin Province > Tianjin (0.04)
- Asia > China > Beijing > Beijing (0.04)
Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models
There are two updating strategies: 1) a mimicking strategy that generates similar samples based on the original data, preserving their stylistic and contextual essence, and 2) an extending strategy that further expands existing samples at varying cognitive levels by adapting Bloom's taxonomy of educational objectives. Extensive experiments on the updated MMLU and BIG-Bench demonstrate the stability of the proposed strategies and find that the mimicking strategy can effectively alleviate issues of overestimation caused by benchmark leakage. In cases where the efficient mimicking strategy fails, our extending strategy still shows promising results.
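The paper's prompts are not included in this listing; a minimal sketch of how the two strategies could be driven by prompt templates is shown below, where the prompt wording and the generate() stub are assumptions rather than the authors' released code.

```python
# Sketch of the two updating strategies as prompt templates. The wording and
# the generate() stub are illustrative assumptions, not the paper's code.

MIMIC_PROMPT = (
    "Here is a benchmark question:\n{question}\n\n"
    "Write a new question that tests the same knowledge at the same "
    "difficulty and in the same style, but with different surface content."
)

EXTEND_PROMPT = (
    "Here is a benchmark question:\n{question}\n\n"
    "Rewrite it so that answering requires the '{level}' level of Bloom's "
    "taxonomy (remember, understand, apply, analyze, evaluate, or create)."
)

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM API client here")

def update_sample(question: str, strategy: str, level: str = "analyze") -> str:
    if strategy == "mimic":
        return generate(MIMIC_PROMPT.format(question=question))
    if strategy == "extend":
        return generate(EXTEND_PROMPT.format(question=question, level=level))
    raise ValueError(f"unknown strategy: {strategy}")
```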
- North America > United States > Mississippi (0.04)
- Asia > Singapore (0.04)
- North America > United States > Colorado > Weld County > Evans (0.04)
- Education (0.88)
- Information Technology (0.67)
- Leisure & Entertainment > Sports > Basketball (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (0.92)
Analyzing Information-Seeking Behaviors in a Hakka AI Chatbot: A Cognitive-Pragmatic Study
Lee, Chu-Hsuan, Chang, Chen-Chi, Lee, Hung-Shin, Hsu, Yun-Hsiang, Chen, Ching-Yuan
With many endangered languages at risk of disappearing, efforts to preserve them now rely more than ever on using technology alongside culturally informed teaching strategies. This study examines user behaviors in TALKA, a generative AI-powered chatbot designed for Hakka language engagement, by employing a dual-layered analytical framework grounded in Bloom's Taxonomy of cognitive processes and dialogue act categorization. We analyzed 7,077 user utterances, each carefully annotated according to six cognitive levels and eleven dialogue act types. These covered a variety of functions, such as asking for information, requesting translations, making cultural inquiries, and using language creatively. Pragmatic classifications further highlight how different types of dialogue acts, such as feedback, control commands, and social greetings, align with specific cognitive intentions. The results suggest that generative AI chatbots can support language learning in meaningful ways, especially when they are designed with an understanding of how users think and communicate. They may also help learners express themselves more confidently and connect with their cultural identity. The TALKA case provides empirical insights into how AI-mediated dialogue facilitates cognitive development, pragmatic negotiation, and socio-cultural affiliation in low-resource language learners. By focusing on AI-assisted language learning, this study offers new insights into how technology can support language preservation and educational practice.
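To make the dual-layered scheme concrete, the sketch below cross-tabulates Bloom's-level labels against dialogue-act labels for annotated utterances; the label inventories and sample data are illustrative stand-ins, not the study's annotation manual.

```python
# Sketch of the dual-layered annotation: each utterance carries one Bloom's
# level and one dialogue-act label, and the analysis cross-tabulates them.
# The label inventories and sample data are illustrative.
from collections import Counter

BLOOM_LEVELS = {"remember", "understand", "apply", "analyze", "evaluate", "create"}
DIALOGUE_ACTS = {"info_request", "translation_request", "cultural_inquiry",
                 "creative_use", "feedback", "control_command", "greeting"}

def cross_tab(annotations):
    """annotations: list of (bloom_level, dialogue_act) pairs."""
    for level, act in annotations:
        assert level in BLOOM_LEVELS and act in DIALOGUE_ACTS, (level, act)
    return Counter(annotations)

if __name__ == "__main__":
    sample = [("remember", "translation_request"),
              ("create", "creative_use"),
              ("remember", "translation_request")]
    for (level, act), n in cross_tab(sample).most_common():
        print(f"{level:10s} x {act:20s}: {n}")
```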
- Asia > Taiwan (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Singapore (0.04)
- Europe > Italy > Veneto > Venice (0.04)
Lightweight Prompt Engineering for Cognitive Alignment in Educational AI: A OneClickQuiz Case Study
Yaacoub, Antoun, Assaghir, Zainab, Da-Rugna, Jérôme
The rapid integration of Artificial Intelligence (AI) into educational technology promises to revolutionize content creation and assessment. However, the quality and pedagogical alignment of AI-generated content remain critical challenges. This paper investigates the impact of lightweight prompt engineering strategies on the cognitive alignment of AI-generated questions within OneClickQuiz, a Moodle plugin leveraging generative AI. We evaluate three prompt variants (a detailed baseline, a simpler version, and a persona-based approach) across the Knowledge, Application, and Analysis levels of Bloom's Taxonomy. Using an automated classification model from prior work together with human review, we find that explicit, detailed prompts are crucial for precise cognitive alignment. While simpler and persona-based prompts yield clear and relevant questions, they frequently misalign with the intended Bloom's levels, generating outputs that are either too complex or deviate from the desired cognitive objective. This study underscores the importance of strategic prompt engineering in fostering pedagogically sound AI-driven educational solutions and advises on optimizing AI for quality content generation in learning analytics and smart learning environments.
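The study's actual prompts are not reproduced in the abstract; the sketch below illustrates what the three variant styles being compared could look like (the wording is an assumption).

```python
# Sketch of the three prompt variants compared in the study; the wording is
# an assumption, since the actual prompts are not quoted in the abstract.

def detailed_prompt(topic: str, level: str) -> str:
    return (
        f"Generate one multiple-choice question on '{topic}' targeting the "
        f"'{level}' level of Bloom's Taxonomy. Definitions: Knowledge = recall "
        "of facts; Application = using a concept in a new situation; Analysis "
        "= breaking material apart to expose its structure. Provide four "
        "options and mark the correct one."
    )

def simple_prompt(topic: str, level: str) -> str:
    return f"Write a {level}-level multiple-choice question about {topic}."

def persona_prompt(topic: str, level: str) -> str:
    return (
        f"You are an experienced teacher writing an exam. Create a "
        f"multiple-choice question about {topic} at the {level} level."
    )

print(simple_prompt("photosynthesis", "Application"))
```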
- Europe > Croatia > Varaždin County > Varaždin (0.05)
- Europe > Switzerland (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation
Yang, Yue, Chen, MingKang, Liu, Qihua, Hu, Mengkang, Chen, Qiguang, Zhang, Gengrui, Hu, Shuyue, Zhai, Guangtao, Qiao, Yu, Wang, Yu, Shao, Wenqi, Luo, Ping
Recent advances in large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. However, whether LLMs possess genuine fluid intelligence (i.e., the ability to reason abstractly and generalize rules in novel situations) remains an open question. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. To address these limitations, we propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework. DRE-Bench consists of 36 abstract reasoning tasks organized across four cognitive levels, with each task featuring multiple dynamic variants that test the same underlying latent rule. This design enables fine-grained, interpretable, and reliable assessments of fluid intelligence. We evaluate a range of state-of-the-art LLMs, including both general LLMs (GPT-4o, Claude 3.7) and reasoning LLMs (o1, DeepSeek-R1, QwQ, Skywork-OR1). Experimental results reveal that although most LLMs achieve competent and robust performance in low-level cognition, they struggle with high-level cognition and exhibit limited generalization as task complexity grows. Our findings highlight the gap between current LLMs and true human-like fluid intelligence and offer a new path for systematically tracking reasoning progress in LLMs.
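A key design idea is that each task has many dynamic variants of one latent rule, so performance cannot come from memorized instances. The toy generator below illustrates the idea with a single assumed rule (reverse the sequence) and randomized surface symbols; it is not one of DRE-Bench's 36 tasks.

```python
# Toy "dynamic variant" generator: many surface instantiations of a single
# assumed latent rule (reverse the sequence), so solving it requires the rule
# rather than a memorized instance. Not one of DRE-Bench's actual tasks.
import random

def make_variant(seed: int):
    rng = random.Random(seed)
    symbols = rng.sample("ABCDEFGHJKLMNPQRSTUVWXYZ", k=5)
    demo = symbols[:3]
    prompt = (f"Example: {demo} -> {demo[::-1]}. "
              f"Apply the same rule to {symbols}.")
    return prompt, symbols[::-1]

if __name__ == "__main__":
    for seed in range(3):
        prompt, answer = make_variant(seed)
        print(prompt, "| expected:", answer)
```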
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > France (0.04)
Cognitive-Level Adaptive Generation via Capability-Aware Retrieval and Style Adaptation
Wang, Qingsong, Wu, Tao, Lin, Wang, Feng, Yueying, Yuan, Gongsheng, Yao, Chang, Chen, Jingyuan
Large Language Models (LLMs) have demonstrated strong performance in open-ended generation tasks. However, they often struggle to adapt content to users with differing cognitive capacities, leading to a phenomenon we term cognitive misalignment. This issue arises in two forms: knowledge-level misalignment, where content is too complex or too simplistic relative to user understanding, and presentation-style misalignment, where the structure or tone hinders effective comprehension. To address these challenges, we propose the Cognitive-Level Alignment Framework (CLAF), a general-purpose generation framework that aligns both knowledge complexity and presentation style with user cognition. CLAF integrates a capability-aware retrieval module based on a hierarchical knowledge graph and a style optimization module guided by Bloom's taxonomy and preference learning. Additionally, a knowledge-controllable generation component ensures consistency and relevance throughout the output. To support training and evaluation, we construct SCALE, a cognitively annotated dataset containing responses at multiple comprehension levels per query. Empirical results show that CLAF enhances the adaptability and informativeness of LLM outputs across a range of user profiles, offering a robust solution to cognitive-level alignment in real-world applications.
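As an illustration of the two alignment axes CLAF targets, the sketch below retrieves only facts at or below an assumed user capability level and attaches a level-matched style instruction; the toy knowledge graph, level scheme, and prompt wording are all assumptions, not the paper's implementation.

```python
# Sketch of CLAF's two alignment axes: retrieve only knowledge at or below the
# user's capability level, then add a level-matched style instruction. The toy
# knowledge graph, level scheme, and prompt wording are all assumptions.

KNOWLEDGE_GRAPH = [
    # (concept, difficulty_level, explanation)
    ("gravity", 1, "Things fall because Earth pulls on them."),
    ("gravity", 3, "Gravitational attraction follows an inverse-square law."),
]

STYLE_BY_LEVEL = {
    1: "Use short sentences, everyday examples, and no jargon.",
    3: "Use precise terminology and state the governing relationship.",
}

def build_prompt(query: str, user_level: int) -> str:
    facts = [expl for concept, level, expl in KNOWLEDGE_GRAPH
             if concept in query and level <= user_level]
    style = STYLE_BY_LEVEL.get(user_level, "Match a general audience.")
    return (f"Answer using only these facts: {facts}\n"
            f"Style requirements: {style}\n"
            f"Question: {query}")

print(build_prompt("explain gravity", user_level=1))
```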
- Education > Educational Setting > K-12 Education (0.94)
- Education > Educational Technology > Educational Software > Computer Based Training (0.46)
Assessing AI-Generated Questions' Alignment with Cognitive Frameworks in Educational Assessment
Yaacoub, Antoun, Da-Rugna, Jérôme, Assaghir, Zainab
This study evaluates the integration of Bloom's Taxonomy into OneClickQuiz, an Artificial Intelligence (AI) driven plugin for automating Multiple-Choice Question (MCQ) generation in Moodle. Bloom's Taxonomy provides a structured framework for categorizing educational objectives into hierarchical cognitive levels. Our research investigates whether incorporating this taxonomy can improve the alignment of AI-generated questions with specific cognitive objectives. We developed a dataset of 3,691 questions categorized according to Bloom's levels and employed several classification models, namely Multinomial Logistic Regression, Naive Bayes, Linear Support Vector Classification (SVC), and a Transformer-based model (DistilBERT), to evaluate their effectiveness in categorizing questions. Our results indicate that higher Bloom's levels generally correlate with greater question length, Flesch-Kincaid Grade Level (FKGL), and Lexical Density (LD), reflecting the increased complexity of higher cognitive demands. Multinomial Logistic Regression showed varying accuracy across Bloom's levels, performing best on "Knowledge" questions and less accurately on higher-order levels. Merging higher-level categories improved accuracy for complex cognitive tasks. Naive Bayes and Linear SVC also classified lower levels effectively but struggled with higher-order tasks. DistilBERT achieved the highest performance, significantly improving classification of both lower- and higher-order cognitive levels and reaching an overall validation accuracy of 91%. This study highlights the potential of integrating Bloom's Taxonomy into AI-driven assessment tools and underscores the advantages of advanced models like DistilBERT for enhancing educational content generation.
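As a concrete baseline on the classical side of this comparison, a TF-IDF plus multinomial logistic regression pipeline (scikit-learn) can be trained to predict a question's Bloom level; the toy training set below stands in for the paper's 3,691-question dataset, and the best-performing DistilBERT model is not reproduced here.

```python
# Baseline sketch of the classical side of the comparison: TF-IDF features
# with multinomial logistic regression predicting a question's Bloom level.
# The six training questions are a toy stand-in for the paper's dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

questions = [
    "Define photosynthesis.",                                  # Knowledge
    "List the planets of the solar system.",                   # Knowledge
    "Use Ohm's law to find the current through a resistor.",   # Application
    "Apply the quadratic formula to solve x^2 - 1 = 0.",       # Application
    "Compare the causes of the two world wars.",               # Analysis
    "Break this argument into premises and assess each one.",  # Analysis
]
labels = ["Knowledge", "Knowledge", "Application", "Application",
          "Analysis", "Analysis"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(questions, labels)
print(clf.predict(["State Newton's first law."]))  # likely 'Knowledge'
```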
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Europe > Switzerland (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Education > Assessment & Standards (1.00)
- Education > Educational Technology (0.94)
- Education > Educational Setting > K-12 Education (0.67)
- Education > Curriculum > Subject-Specific Education (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
ZPD-SCA: Unveiling the Blind Spots of LLMs in Assessing Students' Cognitive Abilities
Dong, Wenhan, Sun, Zhen, Zhao, Yuemeng, Peng, Zifan, Wu, Jun, Zheng, Jingyi, Liu, Yule, He, Xinlei, Wang, Yu, Wang, Ruiming, Huang, Xinyi, Mo, Lei
Large language models (LLMs) have demonstrated potential in educational applications, yet their capacity to accurately assess the cognitive alignment of reading materials with students' developmental stages remains insufficiently explored. This gap is particularly critical given the foundational educational principle of the Zone of Proximal Development (ZPD), which emphasizes the need to match learning resources with Students' Cognitive Abilities (SCA). Despite the importance of this alignment, there is a notable absence of comprehensive studies investigating LLMs' ability to evaluate reading comprehension difficulty across different student age groups, especially in the context of Chinese language education. To fill this gap, we introduce ZPD-SCA, a novel benchmark specifically designed to assess stage-level Chinese reading comprehension difficulty. The benchmark is annotated by 60 Special Grade teachers, a group that represents the top 0.15% of all in-service teachers nationwide. Experimental results reveal that LLMs perform poorly in zero-shot learning scenarios, with Qwen-max and GLM even falling below chance accuracy. When provided with in-context examples, LLMs' performance improves substantially, with some models achieving nearly double the accuracy of their zero-shot baselines. These results reveal that LLMs possess emerging abilities to assess reading difficulty, while also exposing limitations in their current training for educationally aligned judgment. Notably, even the best-performing models display systematic directional biases, suggesting difficulties in accurately aligning material difficulty with SCA. Furthermore, significant variation in model performance across genres underscores the complexity of the task. We envision that ZPD-SCA can provide a foundation for evaluating and improving LLMs in cognitively aligned educational applications.
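The benchmark's prompts are not given in the abstract; the sketch below shows how the two evaluated settings (zero-shot and with in-context examples) could be constructed for stage-level difficulty judgments, with stage names and passages as illustrative assumptions.

```python
# Sketch of the two evaluated settings: zero-shot vs. in-context examples for
# judging which school stage a passage suits. Stage names and passages are
# illustrative, not drawn from the ZPD-SCA data.

STAGES = ["lower primary", "upper primary", "junior high", "senior high"]

def build_prompt(passage: str, examples=None) -> str:
    lines = []
    for ex_passage, ex_stage in (examples or []):
        lines.append(f"Passage: {ex_passage}\nBest-suited stage: {ex_stage}\n")
    lines.append(f"Passage: {passage}\n"
                 f"Which stage is this passage best suited to? "
                 f"Choose one of: {', '.join(STAGES)}.")
    return "\n".join(lines)

zero_shot = build_prompt("A short fable about a fox and some grapes.")
few_shot = build_prompt(
    "A short fable about a fox and some grapes.",
    examples=[("A counting rhyme about ten little ducks.", "lower primary")],
)
print(few_shot)
```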
- Asia > China > Guangdong Province > Guangzhou (0.04)
- North America > United States (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Health & Medicine > Therapeutic Area (1.00)
- Education > Educational Setting > K-12 Education (1.00)
- Education > Assessment & Standards > Student Performance (0.70)