Education
ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
Wang, Zhilin, Jung, Jaehun, Lu, Ximing, Diao, Shizhe, Evans, Ellie, Zeng, Jiaqi, Molchanov, Pavlo, Choi, Yejin, Kautz, Jan, Dong, Yi
Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench and Code: https://github.com/NVlabs/ProfBench
A Justice Lens on Fairness and Ethics Courses in Computing Education: LLM-Assisted Multi-Perspective and Thematic Evaluation
Andrews, Kenya S., Kanubala, Deborah Dormah, Aruleba, Kehinde, Castro, Francisco Enrique Vicente, Revelo, Renata A
Course syllabi set the tone and expectations for courses, shaping the learning experience for both students and instructors. In computing courses, especially those addressing fairness and ethics in artificial intelligence (AI), machine learning (ML), and algorithmic design, it is imperative that we understand how approaches to navigating barriers to fair outcomes are being addressed. These expectations should be inclusive, transparent, and grounded in promoting critical thinking. Syllabus analysis offers a way to evaluate the coverage, depth, practices, and expectations within a course. Manual syllabus evaluation, however, is time-consuming and prone to inconsistency. To address this, we developed a justice-oriented scoring rubric and asked a large language model (LLM) to review syllabi through a multi-perspective role simulation. Using this rubric, we evaluated 24 syllabi from four perspectives: instructor, departmental chair, institutional reviewer, and external evaluator. We also prompted the LLM to identify thematic trends across the courses. Findings show that multi-perspective evaluation helps us note nuanced, role-specific priorities and leverage them to fill hidden gaps in the curriculum design of AI/ML and related computing courses focused on fairness and ethics. These insights offer concrete directions for improving the design and delivery of fairness, ethics, and justice content in such courses.
Towards Better Health Conversations: The Benefits of Context-seeking
Sayres, Rory, Hao, Yuexing, Ward, Abbi, Wang, Amy, Freeman, Beverly, Zhan, Serena, Ardila, Diego, Li, Jimmy, Lee, I-Ching, Iurchenko, Anna, Kou, Siyi, Badola, Kartikeya, Hu, Jimmy, Kumar, Bhawesh, Johnson, Keith, Vijay, Supriya, Krogue, Justin, Hassidim, Avinatan, Matias, Yossi, Webster, Dale R., Virmani, Sunny, Liu, Yun, Duong, Quang, Schaekermann, Mike
Navigating health questions can be daunting in the modern information landscape. Large language models (LLMs) may provide tailored, accessible information, but also risk being inaccurate, biased or misleading. We present insights from 4 mixed-methods studies (total N=163), examining how people interact with LLMs for their own health questions. Qualitative studies revealed the importance of context-seeking in conversational AIs to elicit specific details a person may not volunteer or know to share. Context-seeking by LLMs was valued by participants, even if it meant deferring an answer for several turns. Incorporating these insights, we developed a "Wayfinding AI" to proactively solicit context. In a randomized, blinded study, participants rated the Wayfinding AI as more helpful, relevant, and tailored to their concerns compared to a baseline AI. These results demonstrate the strong impact of proactive context-seeking on conversational dynamics, and suggest design patterns for conversational AI to help navigate health topics.
LLM Bazaar: A Service Design for Supporting Collaborative Learning with an LLM-Powered Multi-Party Collaboration Infrastructure
Wu, Zhen, Shi, Jiaxin, Murray, R. Charles, Rosé, Carolyn, Andres, Micah San
Providing technological support for collaborative and discussion-based learning has long been a focus in CSCL research (Gweon et al., 2006; Kollar et al., 2006; Kumar et al., 2007; Rosé and Ferschke, 2016; Naik et al., 2024). Open-source architectures like Bazaar (Adamson et al., 2014) have enabled implementation of a plethora of dynamic support interventions, even for face-to-face collaboration through multi-modal sensing (Wang et al., 2020), which can be used in a portable fashion for nearly anytime-anywhere collaboration support (Vitiello et al., 2023). Past studies highlight the benefits of interactive and context-sensitive support in group learning (Kumar et al., 2007; Kumar and Rosé, 2010). While static scaffolding like fixed prompts (Vogel et al., 2021) and scripted roles (Fischer et al., 2013) has been effective, contextualized interventions within specific conversational contexts (Ai et al., 2010; Cui et al., 2009) and support for student role taking (Gweon et al., 2007) have also shown positive outcomes. Past studies incorporating dynamic support agents in collaborative learning activities (Kumar et al., 2007; Kumar and Rosé, 2010; Rosé and Ferschke, 2016) have shown the effectiveness of discussion-based learning integrated with conversational support using dialog agents. Finally, Sankaranarayanan and colleagues (Sankaranarayanan et al., 2022a; Sankaranarayanan et al., 2022b) have shown the effectiveness of reflection-based learning for collaborative software development, showing that shifting students' focus more towards reflection than actual coding can increase conceptual learning without harming the ability to write code. The contribution of this design paper is the introduction of capabilities from Large Language Models (LLMs) (Vaswani, 2017) to enable new forms of collaborative support agents.
While recent studies demonstrate that this new generation of support agents can provide effective learning support, the new contribution of this paper is an extension to a publicly available, open-source platform that makes it easy to integrate LLM agents developed in the broader CSCL community, in order to facilitate the research needed to answer questions about how best to use new AI capabilities to support collaborative learning effectively. We provide code for the LLMbazaar extension, the illustrative instructional example described below, and instructions for obtaining support for using this resource, available on GitHub (Bazaar, 2025).
Graph Few-Shot Learning via Adaptive Spectrum Experts and Cross-Set Distribution Calibration
Liu, Yonghao, Wang, Yajun, Guo, Chunli, Pang, Wei, Li, Ximing, Giunchiglia, Fausto, Feng, Xiaoyue, Guan, Renchu
Graph few-shot learning has attracted increasing attention due to its ability to rapidly adapt models to new tasks with only limited labeled nodes. Despite the remarkable progress made by existing graph few-shot learning methods, several key limitations remain. First, most current approaches rely on predefined and unified graph filters (e.g., low-pass or high-pass filters) to globally enhance or suppress node frequency signals. Such fixed spectral operations fail to account for the heterogeneity of local topological structures inherent in real-world graphs. Moreover, these methods often assume that the support and query sets are drawn from the same distribution. However, under few-shot conditions, the limited labeled data in the support set may not sufficiently capture the complex distribution of the query set, leading to suboptimal generalization. To address these challenges, we propose GRACE, a novel Graph few-shot leaRning framework that integrates Adaptive spectrum experts with Cross-sEt distribution calibration techniques. Theoretically, the proposed approach enhances model generalization by adapting to both local structural variations and cross-set distribution calibration. Empirically, GRACE consistently outperforms state-of-the-art baselines across a wide range of experimental settings. Our code can be found here.
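To make concrete what a "predefined and unified" graph filter is, the following is a minimal sketch of one fixed low-pass and one fixed high-pass filter applied globally to a node signal. It illustrates the baseline behaviour the abstract criticizes and is not GRACE itself. The function names, the adjacency-list representation, and the specific filters, (I + D^-1 A)/2 and (I - D^-1 A)/2, are assumptions chosen for illustration.

```python
def neighbor_mean(adj, signal):
    """Mean of each node's neighbor values (D^-1 A x); isolated nodes keep their value."""
    return [
        sum(signal[j] for j in adj[i]) / len(adj[i]) if adj[i] else signal[i]
        for i in range(len(signal))
    ]


def low_pass(adj, signal):
    """Fixed low-pass filter (I + D^-1 A) / 2: keeps smooth components."""
    mean = neighbor_mean(adj, signal)
    return [0.5 * (x + m) for x, m in zip(signal, mean)]


def high_pass(adj, signal):
    """Fixed high-pass filter (I - D^-1 A) / 2: keeps rapidly varying components."""
    mean = neighbor_mean(adj, signal)
    return [0.5 * (x - m) for x, m in zip(signal, mean)]


# Toy path graph 0 - 1 - 2 - 3 (hypothetical data).
path = [[1], [0, 2], [1, 3], [2]]
smooth_signal = [2.0, 2.0, 2.0, 2.0]        # lowest-frequency signal
alternating_signal = [1.0, -1.0, 1.0, -1.0]  # highest-frequency signal on this graph

print(low_pass(path, smooth_signal), high_pass(path, smooth_signal))
print(low_pass(path, alternating_signal), high_pass(path, alternating_signal))
```

Because the same filter is applied to every node, a region where neighbors tend to share labels and a region where they do not are treated identically; this is the limitation that motivates adapting the spectrum to local structure.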
Benchmarking Large Language Models for Personalized Guidance in AI-Enhanced Learning
While Large Language Models (LLMs) are increasingly envisioned as intelligent assistants for personalized learning, systematic head-to-head evaluations in authentic learning scenarios remain scarce. This study presents an empirical comparison of three state-of-the-art LLMs on a tutoring task simulating a realistic learning setting. Using a dataset containing a student's responses to ten mixed-format questions with correctness labels, each model was asked to (i) analyze the quiz to identify underlying knowledge components, (ii) infer the student's mastery profile, and (iii) generate targeted guidance for improvement. To mitigate subjectivity and evaluator bias, Gemini was employed as a virtual judge to perform pairwise comparisons across multiple dimensions: accuracy, clarity, actionability, and appropriateness. Results analyzed via the Bradley-Terry model reveal that GPT-4o is generally preferred, producing feedback that is more informative and better structured than its counterparts, whereas DeepSeek-V3 and GLM-4.5 demonstrate intermittent strengths but lower consistency. These findings highlight the feasibility of deploying LLMs as advanced teaching assistants for individualized support and provide methodological insights for subsequent empirical research on LLM-driven personalized learning.
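As an illustration of how pairwise judge preferences can be turned into a ranking, here is a minimal sketch of fitting a Bradley-Terry model with the standard minorization-maximization update. The function name, the placeholder model labels, and the toy win counts are hypothetical and are not taken from the study; the fitting procedure the authors actually used is not stated in the abstract.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, iters=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    MM update: p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j), where W_i is
    the number of wins of i and n_ij the number of comparisons between
    i and j. Strengths are renormalized to sum to 1 after each pass.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))
    items = sorted(items)
    strength = {i: 1.0 / len(items) for i in items}
    for _ in range(iters):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom else strength[i]
        total = sum(updated.values())
        strength = {i: v / total for i, v in updated.items()}
    return strength


# Hypothetical judge verdicts: each pair of models is compared 10 times.
toy_comparisons = (
    [("model_a", "model_b")] * 7 + [("model_b", "model_a")] * 3
    + [("model_a", "model_c")] * 8 + [("model_c", "model_a")] * 2
    + [("model_b", "model_c")] * 6 + [("model_c", "model_b")] * 4
)
strengths = fit_bradley_terry(toy_comparisons)
for name, value in sorted(strengths.items(), key=lambda kv: -kv[1]):
    print(name, round(value, 3))
```

Under this model, the probability that model i is preferred over model j is p_i / (p_i + p_j), so the fitted strengths give a single ranking even when individual comparisons disagree.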
Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning
Tang, Zinan, Gao, Xin, Pei, Qizhi, Pan, Zhuoshi, Cai, Mengzhang, Wu, Jiang, He, Conghui, Wu, Lijun
Supervised Fine-Tuning (SFT) of Large Language Models (LLMs) fundamentally relies on high-quality training data. While data selection and data synthesis are two common strategies to improve data quality, existing approaches are often limited by static dataset curation that fails to adapt to evolving model capabilities. In this paper, we introduce Middo, a self-evolving Model-informed dynamic data optimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering/synthesis methods, our framework establishes a closed-loop optimization system: (1) A self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals - loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality); (2) An adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) This optimization process continuously evolves with model capability through dynamic learning principles. Experiments on multiple benchmarks demonstrate that Middo consistently enhances the quality of seed data and boosts LLM performance, improving accuracy by 7.15% on average while maintaining the original dataset scale. This work establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models. Our datasets, models, and code are publicly available at https://github.com/Word2VecT/Middo.
Who's Asking? Investigating Bias Through the Lens of Disability Framed Queries in LLMs
Hari, Vishnu, Panda, Kalpana, Panda, Srikant, Agarwal, Amit, Patel, Hitesh Laxmichand
Large Language Models (LLMs) routinely infer users' demographic traits from phrasing alone, which can result in biased responses, even when no explicit demographic information is provided. The role of disability cues in shaping these inferences remains largely uncharted. Thus, we present the first systematic audit of disability-conditioned demographic bias across eight state-of-the-art instruction-tuned LLMs ranging from 3B to 72B parameters. Using a balanced template corpus that pairs nine disability categories with six real-world business domains, we prompt each model to predict five demographic attributes - gender, socioeconomic status, education, cultural background, and locality - under both neutral and disability-aware conditions. Across a varied set of prompts, models deliver a definitive demographic guess in up to 97% of cases, exposing a strong tendency to make arbitrary inferences with no clear justification. Disability context heavily shifts predicted attribute distributions, and domain context can further amplify these deviations. We observe that larger models are simultaneously more sensitive to disability cues and more prone to biased reasoning, indicating that scale alone does not mitigate stereotype amplification. Our findings reveal persistent intersections between ableism and other demographic stereotypes, pinpointing critical blind spots in current alignment strategies. We release our evaluation framework and results to encourage disability-inclusive benchmarking and recommend integrating abstention calibration and counterfactual fine-tuning to curb unwarranted demographic inference. Code and data will be released on acceptance.
Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
Zhang, Youjia, Kim, Youngeun, Choi, Young-Geun, Kim, Hongyeob, Liu, Huiling, Hong, Sungeun
Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
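The closed-form inference mentioned above can be illustrated with a generic shared-covariance Gaussian classifier (linear discriminant analysis). This is a minimal sketch under stated assumptions, not the ADAPT method: the function names are hypothetical, the running-mean rule is one plausible reading of "gradually updated class means", and the CLIP-prior regularization and historical knowledge bank are omitted.

```python
import math


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def class_posteriors(x, means, shared_cov, priors):
    """Closed-form p(class | x) for Gaussians that share one covariance S.

    The quadratic term in x cancels across classes, leaving the linear score
    mu_k^T S^-1 x - 0.5 * mu_k^T S^-1 mu_k + log(prior_k), then a softmax.
    """
    scores = []
    for mu, prior in zip(means, priors):
        w = solve(shared_cov, mu)  # w = S^-1 mu_k
        score = sum(wi * xi for wi, xi in zip(w, x))
        score -= 0.5 * sum(wi * mi for wi, mi in zip(w, mu))
        scores.append(score + math.log(prior))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def update_mean(mean, count, x):
    """Running-mean update after one more feature vector is assigned to a class."""
    new_count = count + 1
    return [m + (xi - m) / new_count for m, xi in zip(mean, x)], new_count


# Toy 2-D example with two classes (hypothetical numbers).
means = [[0.0, 0.0], [2.0, 0.0]]
identity_cov = [[1.0, 0.0], [0.0, 1.0]]
priors = [0.5, 0.5]
print(class_posteriors([1.0, 0.0], means, identity_cov, priors))
print(class_posteriors([2.0, 0.0], means, identity_cov, priors))
```

No gradients or iterative optimization are involved: each prediction costs a few linear solves, and adapting to new test samples only requires updating the stored statistics.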
Actor-Free Continuous Control via Structurally Maximizable Q-Functions
Korkmaz, Yigit, Bhuwania, Urvi, Jain, Ayush, Bıyık, Erdem
Value-based algorithms are a cornerstone of off-policy reinforcement learning due to their simplicity and training stability. However, their use has traditionally been restricted to discrete action spaces, as they rely on estimating Q-values for individual state-action pairs. In continuous action spaces, evaluating the Q-value over the entire action space becomes computationally infeasible. To address this, actor-critic methods are typically employed, where a critic is trained on off-policy data to estimate Q-values, and an actor is trained to maximize the critic's output. Despite their popularity, these methods often suffer from instability during training. In this work, we propose a purely value-based framework for continuous control that revisits structural maximization of Q-functions, introducing a set of key architectural and algorithmic choices to enable efficient and stable learning. We evaluate the proposed actor-free Q-learning approach on a range of standard simulation tasks, demonstrating performance and sample efficiency on par with state-of-the-art baselines, without the cost of learning a separate actor. Particularly, in environments with constrained action spaces, where the value functions are typically non-smooth, our method with structural maximization outperforms traditional actor-critic methods with gradient-based maximization. We have released our code at https://github.com/USC-Lira/Q3C.