Education
AutoFocus-IL: VLM-based Saliency Maps for Data-Efficient Visual Imitation Learning without Extra Human Annotations
Gong, Litian, Bahrani, Fatemeh, Zhou, Yutai, Banayeeanzade, Amin, Li, Jiachen, Bıyık, Erdem
AutoFocus-IL is a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Although saliency regularization has emerged as a promising way to achieve this, existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that AutoFocus-IL not only outperforms standard behavior cloning but also surpasses state-of-the-art baselines that assume privileged access to human supervision, such as gaze data. Code, datasets, and trained policy videos are available at https://AutoFocus-IL.github.io/.
DiverseClaire: Simulating Students to Improve Introductory Programming Course Materials for All CS1 Learners
Wong, Wendy, Jiang, Yuchao, Li, Yuekang
Although CS programs are booming, introductory courses like CS1 still adopt a one-size-fits-all formats that can exacerbate cognitive load and discourage learners with autism, ADHD, dyslexia and other neurological conditions. These call for compassionate pedagogies and Universal Design For Learning (UDL) to create learning environments and materials where cognitive diversity is welcomed. To address this, we introduce DiverseClaire a pilot study, which simulates students including neurodiverse profiles using LLMs and diverse personas. By leveraging Bloom's Taxonomy and UDL, DiverseClaire compared UDL-transformed lecture slides with traditional formats. To evaluate DiverseClaire controlled experiments, we used the evaluation metric the average score. The findings revealed that the simulated neurodiverse students struggled with learning due to lecture slides that were in inaccessible formats. These results highlight the need to provide course materials in multiple formats for diverse learner preferences. Data from our pilot study will be made available to assist future CS1 instructors.
FedQS: Optimizing Gradient and Model Aggregation for Semi-Asynchronous Federated Learning
Li, Yunbo, Gui, Jiaping, Deng, Zhihang, Meng, Fanchao, Wu, Yue
Federated learning (FL) enables collaborative model training across multiple parties without sharing raw data, with semi-asynchronous FL (SAFL) emerging as a balanced approach between synchronous and asynchronous FL. However, SAFL faces significant challenges in optimizing both gradient-based (e.g., FedSGD) and model-based (e.g., FedAvg) aggregation strategies, which exhibit distinct trade-offs in accuracy, convergence speed, and stability. While gradient aggregation achieves faster convergence and higher accuracy, it suffers from pronounced fluctuations, whereas model aggregation offers greater stability but slower convergence and suboptimal accuracy. This paper presents FedQS, the first framework to theoretically analyze and address these disparities in SAFL. FedQS introduces a divide-and-conquer strategy to handle client heterogeneity by classifying clients into four distinct types and adaptively optimizing their local training based on data distribution characteristics and available computational resources. Extensive experiments on computer vision, natural language processing, and real-world tasks demonstrate that FedQS achieves the highest accuracy, attains the lowest loss, and ranks among the fastest in convergence speed, outperforming state-of-the-art baselines. Our work bridges the gap between aggregation strategies in SAFL, offering a unified solution for stable, accurate, and efficient federated learning. The code and datasets are available at https://github.com/bkjod/FedQS_.
How to Find Fantastic AI Papers: Self-Rankings as a Powerful Predictor of Scientific Impact Beyond Peer Review
Su, Buxin, Collina, Natalie, Wen, Garrett, Li, Didong, Cho, Kyunghyun, Fan, Jianqing, Zhao, Bingxin, Su, Weijie
Peer review in academic research aims not only to ensure factual correctness but also to identify work of high scientific potential that can shape future research directions. This task is especially critical in fast-moving fields such as artificial intelligence (AI), yet it has become increasingly difficult given the rapid growth of submissions. In this paper, we investigate an underexplored measure for identifying high-impact research: authors' own rankings of their multiple submissions to the same AI conference. Grounded in game-theoretic reasoning, we hypothesize that self-rankings are informative because authors possess unique understanding of their work's conceptual depth and long-term promise. To test this hypothesis, we conducted a large-scale experiment at a leading AI conference, where 1,342 researchers self-ranked their 2,592 submissions by perceived quality. Tracking outcomes over more than a year, we found that papers ranked highest by their authors received twice as many citations as their lowest-ranked counterparts; self-rankings were especially effective at identifying highly cited papers (those with over 150 citations). Moreover, we showed that self-rankings outperformed peer review scores in predicting future citation counts. Our results remained robust after accounting for confounders such as preprint posting time and self-citations. Together, these findings demonstrate that authors' self-rankings provide a reliable and valuable complement to peer review for identifying and elevating high-impact research in AI.
ConfTuner: Training Large Language Models to Express Their Confidence Verbally
Li, Yibo, Xiong, Miao, Wu, Jiaying, Hooi, Bryan
Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare, where accurate expressions of uncertainty are essential for reliability and trust. However, current LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as "overconfidence". Recent efforts have focused on calibrating LLMs' verbalized confidence: i.e., their expressions of confidence in text form, such as "I am 80% confident that...". Existing approaches either rely on prompt engineering or fine-tuning with heuristically generated uncertainty estimates, both of which have limited effectiveness and generalizability. Motivated by the notion of proper scoring rules for calibration in classical machine learning models, we introduce ConfTuner, a simple and efficient fine-tuning method that introduces minimal overhead and does not require ground-truth confidence scores or proxy confidence estimates. ConfTuner relies on a new loss function, tokenized Brier score, which we theoretically prove to be a proper scoring rule, intuitively meaning that it "correctly incentivizes the model to report its true probability of being correct". ConfTuner improves calibration across diverse reasoning tasks and generalizes to black-box models such as GPT-4o. Our results further show that better-calibrated confidence enables downstream gains in self-correction and model cascade, advancing the development of trustworthy LLM systems. The code is available at https://github.com/liushiliushi/ConfTuner.
Hestia: Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction
Lu, Cheng-You, Zhuang, Zhuoli, Le, Nguyen Thanh Trung, Xiao, Da, Chang, Yu-Cheng, Do, Thomas, Sridhar, Srinath, Lin, Chin-teng
Advances in 3D reconstruction and novel view synthesis have enabled efficient and photorealistic rendering. However, images for reconstruction are still either largely manual or constrained by simple preplanned trajectories. To address this issue, recent works propose generalizable next-best-view planners that do not require online learning. Nevertheless, robustness and performance remain limited across various shapes. Hence, this study introduces Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction (Hestia), which addresses the shortcomings of the reinforcement learning-based generalizable approaches for five-degree-of-freedom viewpoint prediction. Hestia systematically improves the planners through four components: a more diverse dataset to promote robustness, a hierarchical structure to manage the high-dimensional continuous action search space, a close-greedy strategy to mitigate spurious correlations, and a face-aware design to avoid overlooking geometry. Experimental results show that Hestia achieves non-marginal improvements, with at least a 4% gain in coverage ratio, while reducing Chamfer Distance by 50% and maintaining real-time inference. In addition, Hestia outperforms prior methods by at least 12% in coverage ratio with a 5-image budget and remains robust to object placement variations. Finally, we demonstrate that Hestia, as a next-best-view planner, is feasible for the real-world application. Our project page is https://johnnylu305.github.io/hestia web.
Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator
Luo, Beier, Wang, Shuoyuan, Li, Sharon, Wei, Hongxin
Post-training of large language models is essential for adapting pre-trained language models (PLMs) to align with human preferences and downstream tasks. While PLMs typically exhibit well-calibrated confidence, post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications. A major obstacle in calibrating PoLMs is the scarcity of labeled data for individual downstream tasks. To address this, we propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised method to optimize the parameters (e.g., temperature $τ$) in post-hoc confidence calibration. Our method is motivated by the under-confidence issue caused by prediction disagreement between the PLM and PoLM while aligning their confidence via temperature scaling. Theoretically, the PLM's confidence underestimates PoLM's prediction accuracy on disagreement examples, causing a larger $τ$ and producing under-confident predictions. DACA mitigates this by selectively using only agreement examples for calibration, effectively decoupling the influence of disagreement. In this manner, our method avoids an overly large $τ$ in temperature scaling caused by disagreement examples, improving calibration performance. Extensive experiments demonstrate the effectiveness of our method, improving the average ECE of open-sourced and API-based LLMs (e.g. GPT-4o) by up to 15.08$\%$ on common benchmarks.
Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)
Bi, Jing, Liang, Susan, Zhou, Xiaofei, Liu, Pinxin, Guo, Junjia, Tang, Yunlong, Song, Luchuan, Huang, Chao, Vosoughi, Ali, Sun, Guangyu, He, Jinxi, Wu, Jiarui, Yang, Shu, Zhang, Daoan, Chen, Chen, Wen, Lianggong Bruce, Liu, Zhang, Luo, Jiebo, Xu, Chenliang
Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks. Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic domains. However, effectively extending these capabilities into multimodal contexts-where models must integrate both visual and textual inputs-continues to be a significant challenge. Multimodal reasoning introduces complexities, such as handling conflicting information across modalities, which require models to adopt advanced interpretative strategies. Addressing these challenges involves not only sophisticated algorithms but also robust methodologies for evaluating reasoning accuracy and coherence. This paper offers a concise yet insightful overview of reasoning techniques in both textual and multimodal LLMs. Through a thorough and up-to-date comparison, we clearly formulate core reasoning challenges and opportunities, highlighting practical methods for post-training optimization and test-time inference. Our work provides valuable insights and guidance, bridging theoretical frameworks and practical implementations, and sets clear directions for future research.
Generalizable and Efficient Automated Scoring with a Knowledge-Distilled Multi-Task Mixture-of-Experts
Fang, Luyang, Wang, Tao, Ma, Ping, Zhai, Xiaoming
Automated scoring of written constructed responses typically relies on separate models per task, straining computational resources, storage, and maintenance in real-world education settings. We propose UniMoE-Guided, a knowledge-distilled multi-task Mixture-of-Experts (MoE) approach that transfers expertise from multiple task-specific large models (teachers) into a single compact, deployable model (student). The student combines (i) a shared encoder for cross-task representations, (ii) a gated MoE block that balances shared and task-specific processing, and (iii) lightweight task heads. Trained with both ground-truth labels and teacher guidance, the student matches strong task-specific models while being far more efficient to train, store, and deploy. Beyond efficiency, the MoE layer improves transfer and generalization: experts develop reusable skills that boost cross-task performance and enable rapid adaptation to new tasks with minimal additions and tuning. On nine NGSS-aligned science-reasoning tasks (seven for training/evaluation and two held out for adaptation), UniMoE-Guided attains performance comparable to per-task models while using $\sim$6$\times$ less storage than maintaining separate students, and $87\times$ less than the 20B-parameter teacher. The method offers a practical path toward scalable, reliable, and resource-efficient automated scoring for classroom and large-scale assessment systems.
SloMo-Fast: Slow-Momentum and Fast-Adaptive Teachers for Source-Free Continual Test-Time Adaptation
Iftee, Md Akil Raihan, Hossain, Mir Sazzat, Rajib, Rakibul Hasan, Iqbal, Tariq, Islam, Md Mofijul, Amin, M Ashraful, Ali, Amin Ahsan, Rahman, AKM Mahbubur
Continual Test-Time Adaptation (CTTA) is crucial for deploying models in real-world applications with unseen, evolving target domains. Existing CTTA methods, however, often rely on source data or prototypes, limiting their applicability in privacy-sensitive and resource-constrained settings. Additionally, these methods suffer from long-term forgetting, which degrades performance on previously encountered domains as target domains shift. To address these challenges, we propose SloMo-Fast, a source-free, dual-teacher CTTA framework designed for enhanced adaptability and generalization. It includes two complementary teachers: the Slow-Teacher, which exhibits slow forgetting and retains long-term knowledge of previously encountered domains to ensure robust generalization, and the Fast-Teacher rapidly adapts to new domains while accumulating and integrating knowledge across them. This framework preserves knowledge of past domains and adapts efficiently to new ones. We also introduce Cyclic Test-Time Adaptation (Cyclic-TTA), a novel CTTA benchmark that simulates recurring domain shifts. Our extensive experiments demonstrate that SloMo-Fast consistently outperforms state-of-the-art methods across Cyclic-TTA, as well as ten other CTTA settings, highlighting its ability to both adapt and generalize across evolving and revisited domains.