Education
Towards Understanding Ambiguity Resolution in Multimodal Inference of Meaning
Wang, Yufei, Kovashka, Adriana, Fernández, Loretta, Coutanche, Marc N., Wiener, Seth
We investigate a new setting for foreign language learning, where learners infer the meaning of unfamiliar words in a multimodal context of a sentence describing a paired image. We conduct studies with human participants using different image-text pairs. We analyze the features of the data (i.e., images and texts) that make it easier for participants to infer the meaning of a masked or unfamiliar word, and what language backgrounds of the participants correlate with success. We find only some intuitive features have strong correlations with participant performance, prompting the need for further investigating of predictive features for success in these tasks. We also analyze the ability of AI systems to reason about participant performance, and discover promising future directions for improving this reasoning ability.
InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation
Chen, Qiaosheng, Liu, Yang, Li, Lei, Chen, Kai, Guo, Qipeng, Cheng, Gong, Yuan, Fei
Large Language Models (LLMs) are increasingly capable of generating complete applications from natural language instructions, creating new opportunities in science and education. In these domains, interactive scientific demonstrations are particularly valuable for explaining concepts, supporting new teaching methods, and presenting research findings. Generating such demonstrations requires models to combine accurate scientific knowledge with the ability to implement interactive front-end code that behaves correctly and responds to user actions. This capability goes beyond the scope of existing benchmarks, which typically evaluate either knowledge question answering without grounding in code or static web code generation without scientific interactivity. To evaluate this integrated ability, we design a hybrid framework that combines programmatic functional testing to rigorously verify interaction logic with visually-grounded qualitative testing to assess rendered outputs against reference snapshots. Building on this framework, we present InteractScience, a benchmark consisting of a substantial set of carefully designed questions across five scientific domains, each paired with unit tests, reference snapshots, and checklists. We evaluate 30 leading open- and closed-source LLMs and report results that highlight ongoing weaknesses in integrating domain knowledge with interactive front-end coding. Our work positions InteractScience as the first benchmark to automatically measure this combined capability with realistic interactive operations, providing a foundation for advancing reliable and educationally useful scientific demonstration code generation. All code and data are publicly available at https://github.com/open-compass/InteractScience.
AI in Computational Thinking Education in Higher Education: A Systematic Literature Review
Rahimi, Ebrahim, Maathuis, Clara
Computational Thinking (CT) is a key skill set for students in higher education to thrive and adapt to an increasingly technology-driven future and workplace. While research on CT education has gained remarkable momentum in K12 over the past decade, it has remained under-explored in higher education, leaving higher education teachers with an insufficient overview, knowledge, and support regarding CT education. The proliferation and adoption of artificial intelligence (AI) by educational institutions have demonstrated promising potential to support instructional activities across many disciplines, including CT education. However, a comprehensive overview outlining the various aspects of integrating AI in CT education in higher education is lacking. To mitigate this gap, we conducted this systematic literature review study. The focus of our study is to identify initiatives applying AI in CT education within higher education and to explore various educational aspects of these initiatives, including the benefits and challenges of AI in CT education, instructional strategies employed, CT components covered, and AI techniques and models utilized. This study provides practical and scientific contributions to the CT education community, including an inventory of AI-based initiatives for CT education useful to educators, an overview of various aspects of integrating AI into CT education such as its benefits and challenges (e.g., AI potential to reshape CT education versus its potential to diminish students creativity) and insights into new and expanded perspectives on CT in light of AI (e.g., the decoding approach alongside the coding approach to CT).
Table Question Answering in the Era of Large Language Models: A Comprehensive Survey of Tasks, Methods, and Evaluation
Zhou, Wei, Ma, Bolei, Friedrich, Annemarie, Mesgar, Mohsen
Table Question Answering (TQA) aims to answer natural language questions about tabular data, often accompanied by additional contexts such as text passages. The task spans diverse settings, varying in table representation, question/answer complexity, modality involved, and domain. While recent advances in large language models (LLMs) have led to substantial progress in TQA, the field still lacks a systematic organization and understanding of task formulations, core challenges, and methodological trends, particularly in light of emerging research directions such as reinforcement learning. This survey addresses this gap by providing a comprehensive and structured overview of TQA research with a focus on LLM-based methods. We provide a comprehensive categorization of existing benchmarks and task setups. We group current modeling strategies according to the challenges they target, and analyze their strengths and limitations. Furthermore, we highlight underexplored but timely topics that have not been systematically covered in prior research. By unifying disparate research threads and identifying open problems, our survey offers a consolidated foundation for the TQA community, enabling a deeper understanding of the state of the art and guiding future developments in this rapidly evolving area.
Bias-Aware AI Chatbot for Engineering Advising at the University of Maryland A. James Clark School of Engineering
Kartholy, Prarthana P., Labor, Thandi M., Panchal, Neil N., Wang, Sean H., Owusu, Hillary N.
Selecting a college major is a difficult decision for many incoming freshmen. Traditional academic advising is often hindered by long wait times, intimidating environments, and limited personalization. AI Chatbots present an opportunity to address these challenges. However, AI systems also have the potential to generate biased responses, prejudices related to race, gender, socioeconomic status, and disability. These biases risk turning away potential students and undermining reliability of AI systems. This study aims to develop a University of Maryland (UMD) A. James Clark School of Engineering Program-specific AI chatbot. Our research team analyzed and mitigated potential biases in the responses. Through testing the chatbot on diverse student queries, the responses are scored on metrics of accuracy, relevance, personalization, and bias presence. The results demonstrate that with careful prompt engineering and bias mitigation strategies, AI chatbots can provide high-quality, unbiased academic advising support, achieving mean scores of 9.76 for accuracy, 9.56 for relevance, and 9.60 for personalization with no stereotypical biases found in the sample data. However, due to the small sample size and limited timeframe, our AI model may not fully reflect the nuances of student queries in engineering academic advising. Regardless, these findings will inform best practices for building ethical AI systems in higher education, offering tools to complement traditional advising and address the inequities faced by many underrepresented and first-generation college students.
The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping
Keleş, Onur, Özyürek, Aslı, Ortega, Gerardo, Gökgöz, Kadir, Ghaleb, Esam
Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.
Do LLMs Really Need 10+ Thoughts for "Find the Time 1000 Days Later"? Towards Structural Understanding of LLM Overthinking
Zhang, Xinliang Frederick, Mohananey, Anhad, Chronopoulou, Alexandra, Papalampidi, Pinelopi, Gupta, Somit, Munkhdalai, Tsendsuren, Wang, Lu, Upadhyay, Shyam
Models employing long chain-of-thought (CoT) reasoning have shown superior performance on complex reasoning tasks. Yet, this capability introduces a critical and often overlooked inefficiency -- overthinking -- models often engage in unnecessarily extensive reasoning even for simple queries, incurring significant computations without accuracy improvements. While prior work has explored solutions to mitigate overthinking, a fundamental gap remains in our understanding of its underlying causes. Most existing analyses are limited to superficial, profiling-based observations, failing to delve into LLMs' inner workings. This study introduces a systematic, fine-grained analyzer of LLMs' thought process to bridge the gap, TRACE. We first benchmark the overthinking issue, confirming that long-thinking models are five to twenty times slower on simple tasks with no substantial gains. We then use TRACE to first decompose the thought process into minimally complete sub-thoughts. Next, by inferring discourse relationships among sub-thoughts, we construct granular thought progression graphs and subsequently identify common thinking patterns for topically similar queries. Our analysis reveals two major patterns for open-weight thinking models -- Explorer and Late Landing. This finding provides evidence that over-verification and over-exploration are the primary drivers of overthinking in LLMs. Grounded in thought structures, we propose a utility-based definition of overthinking, which moves beyond length-based metrics. This revised definition offers a more insightful understanding of LLMs' thought progression, as well as practical guidelines for principled overthinking management.
FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline
Wu, Haotian, Jiang, Shufan, Chen, Mingyu, Feng, Yiyang, Lin, Hehai, Zou, Heqing, Shu, Yao, Qin, Chengwei
As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios. To address this gap, we introduce FURINA-Builder, a novel multi-agent collaboration pipeline that automatically constructs fully customizable RP benchmarks at any scale. It enables evaluation of arbitrary characters across diverse scenarios and prompt formats, as the first benchmark builder in RP area for adaptable assessment. FURINA-Builder simulates dialogues between a test character and other characters drawn from a well-constructed character-scene pool, while an LLM judge selects fine-grained evaluation dimensions and adjusts the test character's responses into final test utterances. Using this pipeline, we build FURINA-Bench, a new comprehensive role-playing benchmark featuring both established and synthesized test characters, each assessed with dimension-specific evaluation criteria. Human evaluation and preliminary separability analysis justify our pipeline and benchmark design. We conduct extensive evaluations of cutting-edge LLMs and find that o3 and DeepSeek-R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, established characters consistently outperform synthesized ones, with reasoning capabilities further amplifying this disparity. Interestingly, we observe that model scale does not monotonically reduce hallucinations. More critically, for reasoning LLMs, we uncover a novel trade-off: reasoning improves RP performance but simultaneously increases RP hallucinations. This trade-off extends to a broader Pareto frontier between RP performance and reliability for all LLMs. These findings demonstrate the effectiveness of FURINA-Builder and the challenge posed by FURINA-Bench.
Text-Based Approaches to Item Alignment to Content Standards in Large-Scale Reading & Writing Tests
Fu, Yanbin, Jiao, Hong, Zhou, Tianyi, Zhang, Nan, Li, Ming, Xu, Qingshu, Peters, Sydney, Lissitz, Robert W.
Yanbin Fu, Hong Jiao, Tianyi Zhou, Nan Zhang, Ming Li, Qingshu Xu, Sydney Peters, Robert W. Lissitz University of Maryland, College Park Abstract Aligning test items to content standards is a critical step in test development to collect validity evidence based on content. Item alignment has typically been conducted by human experts. This judgmental process can be subjective and time - consuming. This study investigated the performance of fine - tuned small language models (SLMs) for automated item alignment using data from a large - scale standardized reading and writing test for college admissions. Different SLMs were trained for alignment at both domain and skill levels respectively with 10 skills mapped to 4 content domains. The model performance was evaluated in multiple criteria on two testing datasets. The impact of types and sizes of the input data for training was investigated. Results showed that including more item text data led to substantially better model performance, surpassing the improvements induced by sample size inc rease alone. For comparison, supervised machine learning models were trained using the embeddings from the multilingual - E5 - lar ge - instruct model. The study results showed that fine - tuned SLMs consistently outperformed the embedding - based supervised machine learning models, particularly for the more fine - grained skill alignment. To better understand model mis classifications, multiple semantic similarity analysis including pairwise cosine similarity, Kullback - Leibler divergence of embedding distributions, and two - dimension projections of item embeddings were conducted.
"A 6 or a 9?": Ensemble Learning Through the Multiplicity of Performant Models and Explanations
Zuin, Gianlucca, Veloso, Adriano
Creating models from past observations and ensuring their effectiveness on new data is the essence of machine learning. However, selecting models that generalize well remains a challenging task. Related to this topic, the Rashomon Effect refers to cases where multiple models perform similarly well for a given learning problem. This often occurs in real-world scenarios, like the manufacturing process or medical diagnosis, where diverse patterns in data lead to multiple high-performing solutions. We propose the Rashomon Ensemble, a method that strategically selects models from these diverse high-performing solutions to improve generalization. By grouping models based on both their performance and explanations, we construct ensembles that maximize diversity while maintaining predictive accuracy. This selection ensures that each model covers a distinct region of the solution space, making the ensemble more robust to distribution shifts and variations in unseen data. We validate our approach on both open and proprietary collaborative real-world datasets, demonstrating up to 0.20+ AUROC improvements in scenarios where the Rashomon ratio is large. Additionally, we demonstrate tangible benefits for businesses in various real-world applications, highlighting the robustness, practicality, and effectiveness of our approach.