AITopics

Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs on processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench and Code: https://github.com/NVlabs/ProfBench

large language model, machine learning, natural language, (22 more...)

2510.18941

Country:

Europe (0.67)
North America > United States (0.28)

Genre: Research Report > New Finding (0.66)

Industry:

Banking & Finance (1.00)
Education (0.92)
Professional Services (0.67)
Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Andrews, Kenya S., Kanubala, Deborah Dormah, Aruleba, Kehinde, Castro, Francisco Enrique Vicente, Revelo, Renata A

A Justice Lens on Fairness and Ethics Courses in Computing Education: LLM-Assisted Multi-Perspective and Thematic Evaluation

Course syllabi set the tone and expectations for courses, shaping the learning experience for both students and instructors. In computing courses, especially those addressing fairness and ethics in artificial intelligence (AI), machine learning (ML), and algorithmic design, it is imperative that we understand how approaches to navigating barriers to fair outcomes are being addressed. These expectations should be inclusive, transparent, and grounded in promoting critical thinking. Syllabus analysis offers a way to evaluate the coverage, depth, practices, and expectations within a course. Manual syllabus evaluation, however, is time-consuming and prone to inconsistency. To address this, we developed a justice-oriented scoring rubric and asked a large language model (LLM) to review syllabi through a multi-perspective role simulation. Using this rubric, we evaluated 24 syllabi from four perspectives: instructor, departmental chair, institutional reviewer, and external evaluator. We also prompted the LLM to identify thematic trends across the courses. Findings show that multi-perspective evaluation aids us in noting nuanced, role-specific priorities, leveraging them to fill hidden gaps in curricula design of AI/ML and related computing courses focused on fairness and ethics. These insights offer concrete directions for improving the design and delivery of fairness, ethics, and justice content in such courses.

large language model, machine learning, natural language, (18 more...)

2510.18931

Country:

Africa (0.46)
North America > United States (0.46)

Genre:

Research Report (1.00)
Instructional Material > Course Syllabus & Notes (1.00)

Industry:

Education > Curriculum > Subject-Specific Education (1.00)
Education > Educational Setting > Higher Education (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)

Towards Better Health Conversations: The Benefits of Context-seeking

Sayres, Rory, Hao, Yuexing, Ward, Abbi, Wang, Amy, Freeman, Beverly, Zhan, Serena, Ardila, Diego, Li, Jimmy, Lee, I-Ching, Iurchenko, Anna, Kou, Siyi, Badola, Kartikeya, Hu, Jimmy, Kumar, Bhawesh, Johnson, Keith, Vijay, Supriya, Krogue, Justin, Hassidim, Avinatan, Matias, Yossi, Webster, Dale R., Virmani, Sunny, Liu, Yun, Duong, Quang, Schaekermann, Mike

Navigating health questions can be daunting in the modern information landscape. Large language models (LLMs) may provide tailored, accessible information, but also risk being inaccurate, biased or misleading. We present insights from 4 mixed-methods studies (total N=163), examining how people interact with LLMs for their own health questions. Qualitative studies revealed the importance of context-seeking in conversational AIs to elicit specific details a person may not volunteer or know to share. Context-seeking by LLMs was valued by participants, even if it meant deferring an answer for several turns. Incorporating these insights, we developed a "Wayfinding AI" to proactively solicit context. In a randomized, blinded study, participants rated the Wayfinding AI as more helpful, relevant, and tailored to their concerns compared to a baseline AI. These results demonstrate the strong impact of proactive context-seeking on conversational dynamics, and suggest design patterns for conversational AI to help navigate health topics.

large language model, machine learning, natural language, (19 more...)

2510.1888

Country: North America > United States (1.00)

Genre:

Research Report > Strength High (1.00)
Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
(8 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

LLM Bazaar: A Service Design for Supporting Collaborative Learning with an LLM-Powered Multi-Party Collaboration Infrastructure

Wu, Zhen, Shi, Jiaxin, Murray, R. Charles, Rosé, Carolyn, Andres, Micah San

Providing technological support for collaborative and discussion-based learning has long been a focus in CSCL research (Gweon et al., 2006; Kollar et al., 2006; Kumar et al., 2007; Rosé and Ferschke, 2016; Naik et al., 2024). Open-source architectures like Bazaar (Adamson et al., 2014) have enabled implementation of a plethora of dynamic support interventions, even for face-to-face collaboration through multi-modal sensing (Wang et al., 2020), which can be used in a portable fashion for nearly anytime-anywhere collaboration support (Vitiello et al., 2023). Past studies highlight the benefits of interactive and context-sensitive support in group learning (Kumar et al., 2007; Kumar and Rosé, 2010). While static scaffolding like fixed prompts (Vogel et al., 2021) and scripted roles (Fischer et al., 2013) has been effective, contextualized interventions within specific conversational contexts (Ai et al., 2010; Cui et al., 2009) or support for student role taking (Gweon et al., 2007) have also shown positive outcomes. Past studies incorporating dynamic support agents in collaborative learning activities (Kumar et al., 2007; Kumar and Rosé, 2010; Rosé and Ferschke, 2016) have shown the effectiveness of discussion-based learning integrated with conversational support using dialog agents. Finally, Sankaranarayanan and colleagues (Sankaranarayanan et al., 2022a; Sankaranarayanan et al., 2022b) have shown the effectiveness of reflection-based learning for collaborative software development, showing that shifting students' focus more towards reflection than actual coding can increase conceptual learning without harming the ability to write code. The contribution of this design paper is the introduction of capabilities from Large Language Models (LLMs) (Vaswani, 2017) to enable new forms of collaborative support agents.
While recent studies demonstrate that this new generation of support agents can provide effective learning support, the new contribution of this paper is an extension to a publicly available and open-source platform to easily integrate LLM agents developed in the broader CSCL community in order to facilitate needed research to answer questions about how best to use new AI capabilities to support collaborative learning effectively. We provide code for the LLMbazaar extension, the illustrative instructional example described below, and instructions for obtaining support for using this resource, available on GitHub (Bazaar, 2025).

large language model, machine learning, natural language, (17 more...)

doi: 10.22318/cscl2025.674934

2510.18877

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.15)

Genre:

Research Report (1.00)
Instructional Material (0.89)

Industry:

Education > Educational Setting (0.94)
Education > Educational Technology > Educational Software > Computer Based Training (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Graph Few-Shot Learning via Adaptive Spectrum Experts and Cross-Set Distribution Calibration

Liu, Yonghao, Wang, Yajun, Guo, Chunli, Pang, Wei, Li, Ximing, Giunchiglia, Fausto, Feng, Xiaoyue, Guan, Renchu

Graph few-shot learning has attracted increasing attention due to its ability to rapidly adapt models to new tasks with only limited labeled nodes. Despite the remarkable progress made by existing graph few-shot learning methods, several key limitations remain. First, most current approaches rely on predefined and unified graph filters (e.g., low-pass or high-pass filters) to globally enhance or suppress node frequency signals. Such fixed spectral operations fail to account for the heterogeneity of local topological structures inherent in real-world graphs. Moreover, these methods often assume that the support and query sets are drawn from the same distribution. However, under few-shot conditions, the limited labeled data in the support set may not sufficiently capture the complex distribution of the query set, leading to suboptimal generalization. To address these challenges, we propose GRACE, a novel Graph few-shot leaRning framework that integrates Adaptive spectrum experts with Cross-sEt distribution calibration techniques. Theoretically, the proposed approach enhances model generalization by adapting to both local structural variations and cross-set distribution calibration. Empirically, GRACE consistently outperforms state-of-the-art baselines across a wide range of experimental settings. Our code can be found here.
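
The fixed low-pass and high-pass graph filters mentioned in the abstract can be illustrated with a minimal sketch. This is not GRACE or its adaptive spectrum experts; it only shows the kind of global spectral operation the abstract contrasts against. The function names and the specific filter forms (I - L/2 and L/2 on the symmetric normalized Laplacian L) are assumptions chosen for illustration.

```python
import numpy as np


def normalized_laplacian(adjacency):
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    degree = adjacency.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degree)  # assumes every node has at least one edge
    scaled = inv_sqrt[:, None] * adjacency * inv_sqrt[None, :]
    return np.eye(len(adjacency)) - scaled


def low_pass(adjacency, signal):
    """Apply (I - L/2): keeps smooth components, damps oscillating ones."""
    return signal - 0.5 * (normalized_laplacian(adjacency) @ signal)


def high_pass(adjacency, signal):
    """Apply L/2: keeps oscillating components, damps smooth ones."""
    return 0.5 * (normalized_laplacian(adjacency) @ signal)


# Hypothetical toy graph: a 4-node ring.
ring = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
], dtype=float)
smooth = np.ones(4)                          # identical value on every node
rough = np.array([1.0, -1.0, 1.0, -1.0])     # flips sign across every edge
```

On this ring, the low-pass filter leaves the smooth signal unchanged and removes the alternating one, while the high-pass filter does the opposite. Both filters act identically on every node, which is the uniformity the abstract identifies as a limitation.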

artificial intelligence, dataset, machine learning, (15 more...)

2510.1214

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry:

Information Technology (0.67)
Government > Regional Government (0.46)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Benchmarking Large Language Models for Personalized Guidance in AI-Enhanced Learning

Yuan, Bo, Hu, Jiazi

While Large Language Models (LLMs) are increasingly envisioned as intelligent assistants for personalized learning, systematic head-to-head evaluations in authentic learning scenarios remain scarce. This study presents an empirical comparison of three state-of-the-art LLMs on a tutoring task simulating a realistic learning setting. Using a dataset containing a student's responses to ten mixed-format questions with correctness labels, each model was asked to (i) analyze the quiz to identify underlying knowledge components, (ii) infer the student's mastery profile, and (iii) generate targeted guidance for improvement. To mitigate subjectivity and evaluator bias, Gemini was employed as a virtual judge to perform pairwise comparisons across multiple dimensions: accuracy, clarity, actionability, and appropriateness. Results analyzed via the Bradley-Terry model reveal that GPT-4o is generally preferred, producing feedback that is more informative and better structured than its counterparts, whereas DeepSeek-V3 and GLM-4.5 demonstrate intermittent strengths but lower consistency. These findings highlight the feasibility of deploying LLMs as advanced teaching assistants for individualized support and provide methodological insights for subsequent empirical research on LLM-driven personalized learning.
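
For readers unfamiliar with the Bradley-Terry model named in the abstract, the following is a minimal sketch of how pairwise preference counts can be turned into a ranking. It assumes a simple iterative (minorization-maximization) fit and uses made-up counts for three hypothetical systems; it is not the authors' analysis code, and the numbers are not results from the paper.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from pairwise preference counts.

    wins[i][j] is the number of comparisons in which item i was preferred
    over item j. Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


def preference_probability(strengths, i, j):
    """Modelled probability that item i is preferred over item j."""
    return strengths[i] / (strengths[i] + strengths[j])


# Hypothetical counts: 10 judged comparisons per pair of systems.
example_wins = [
    [0, 7, 8],  # system_a preferred over system_b 7 times, system_c 8 times
    [3, 0, 6],  # system_b
    [2, 4, 0],  # system_c
]
example_strengths = fit_bradley_terry(example_wins)
```

The fitted strengths give a single scale on which the compared systems can be ordered, and `preference_probability` converts any pair of strengths back into a modelled win rate.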

large language model, machine learning, natural language, (19 more...)

2509.05346

Country: Asia > China (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Education > Educational Technology > Educational Software > Computer Based Training (0.90)
Education > Educational Setting (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning

Tang, Zinan, Gao, Xin, Pei, Qizhi, Pan, Zhuoshi, Cai, Mengzhang, Wu, Jiang, He, Conghui, Wu, Lijun

Supervised Fine-Tuning (SFT) of Large Language Models (LLMs) fundamentally relies on high-quality training data. While data selection and data synthesis are two common strategies to improve data quality, existing approaches often face limitations in static dataset curation that fail to adapt to evolving model capabilities. In this paper, we introduce Middo, a self-evolving Model-informed dynamic data optimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering/synthesis methods, our framework establishes a closed-loop optimization system: (1) A self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals - loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality); (2) An adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) This optimization process continuously evolves with model capability through dynamic learning principles. Experiments on multiple benchmarks demonstrate that Middo consistently enhances the quality of seed data and boosts LLM performance, improving accuracy by 7.15% on average while maintaining the original dataset scale. This work establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models. Our datasets, models, and code are publicly available at https://github.com/Word2VecT/Middo.

large language model, machine learning, natural language, (14 more...)

2508.21589

Country: Europe > Austria (0.28)

Genre: Research Report (0.83)

Industry:

Education (0.67)
Energy > Renewable > Geothermal > Geothermal Energy Systems and Facilities > Geothermal System for Power Generation > Advanced Geothermal System (AGS) (0.62)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Hari, Vishnu, Panda, Kalpana, Panda, Srikant, Agarwal, Amit, Patel, Hitesh Laxmichand

Who's Asking? Investigating Bias Through the Lens of Disability Framed Queries in LLMs

Large Language Models (LLMs) routinely infer users' demographic traits from phrasing alone, which can result in biased responses, even when no explicit demographic information is provided. The role of disability cues in shaping these inferences remains largely uncharted. Thus, we present the first systematic audit of disability-conditioned demographic bias across eight state-of-the-art instruction-tuned LLMs ranging from 3B to 72B parameters. Using a balanced template corpus that pairs nine disability categories with six real-world business domains, we prompt each model to predict five demographic attributes - gender, socioeconomic status, education, cultural background, and locality - under both neutral and disability-aware conditions. Across a varied set of prompts, models deliver a definitive demographic guess in up to 97% of cases, exposing a strong tendency to make arbitrary inferences with no clear justification. Disability context heavily shifts predicted attribute distributions, and domain context can further amplify these deviations. We observe that larger models are simultaneously more sensitive to disability cues and more prone to biased reasoning, indicating that scale alone does not mitigate stereotype amplification. Our findings reveal persistent intersections between ableism and other demographic stereotypes, pinpointing critical blind spots in current alignment strategies. We release our evaluation framework and results to encourage disability-inclusive benchmarking and recommend integrating abstention calibration and counterfactual fine-tuning to curb unwarranted demographic inference. Code and data will be released on acceptance.

artificial intelligence, large language model, natural language, (16 more...)

2508.15831

Country:

Asia (1.00)
Europe > Austria > Vienna (0.14)

Genre: Research Report > New Finding (0.66)

Industry:

Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
Health & Medicine > Therapeutic Area > Neurology (1.00)
Education (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment

Zhang, Youjia, Kim, Youngeun, Choi, Young-Geun, Kim, Hongyeob, Liu, Huiling, Hong, Sungeun

Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
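
The idea of closed-form, gradient-free inference from class means and a shared covariance can be sketched as follows. This is an illustration of the general Gaussian discriminant rule only, not the ADAPT method: the class name, the pseudo-label update, the uniform class prior, and the ridge term are all assumptions, and the CLIP-prior regularization and knowledge bank described in the abstract are omitted.

```python
import numpy as np


class SharedCovarianceGaussian:
    """Gaussian classifier with running class means and one shared covariance.

    Hypothetical sketch: every quantity is updated in closed form from the
    incoming test samples, so no backpropagation is involved.
    """

    def __init__(self, initial_means, ridge=1e-3):
        self.means = np.array(initial_means, dtype=float)  # shape (K, D)
        self.counts = np.ones(len(self.means))
        dim = self.means.shape[1]
        self.scatter = np.zeros((dim, dim))
        self.total = 0
        self.ridge = ridge  # keeps the covariance invertible (assumption)

    def logits(self, x):
        dim = self.means.shape[1]
        cov = self.scatter / max(self.total, 1) + self.ridge * np.eye(dim)
        precision = np.linalg.inv(cov)
        weights = self.means @ precision                     # row k is (P mu_k)^T
        bias = -0.5 * np.sum(weights * self.means, axis=1)   # -0.5 mu_k^T P mu_k
        return weights @ x + bias                            # uniform class prior

    def predict_and_update(self, x):
        x = np.asarray(x, dtype=float)
        k = int(np.argmax(self.logits(x)))
        # Pseudo-label update: move the predicted class mean toward x.
        self.counts[k] += 1
        self.means[k] += (x - self.means[k]) / self.counts[k]
        diff = x - self.means[k]
        self.scatter += np.outer(diff, diff)
        self.total += 1
        return k


def run_stream(initial_means, samples):
    """Classify samples one at a time, adapting after each prediction."""
    model = SharedCovarianceGaussian(initial_means)
    return [model.predict_and_update(s) for s in samples]
```

A call such as `run_stream([[0, 0], [4, 4]], [[0.2, -0.1], [3.8, 4.1], [0.1, 0.3]])` assigns each point to the nearer class while refining the means and covariance as the stream progresses. In a setting like the one the abstract describes, the initial means would come from some prior source rather than being hand-picked.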

artificial intelligence, bayesian inference, machine learning, (17 more...)

2508.15568

Genre: Research Report > New Finding (0.93)

Industry: Education > Educational Setting (0.46)

arXiv.org Machine Learning Oct-23-2025

Actor-Free Continuous Control via Structurally Maximizable Q-Functions

Korkmaz, Yigit, Bhuwania, Urvi, Jain, Ayush, Bıyık, Erdem

Value-based algorithms are a cornerstone of off-policy reinforcement learning due to their simplicity and training stability. However, their use has traditionally been restricted to discrete action spaces, as they rely on estimating Q-values for individual state-action pairs. In continuous action spaces, evaluating the Q-value over the entire action space becomes computationally infeasible. To address this, actor-critic methods are typically employed, where a critic is trained on off-policy data to estimate Q-values, and an actor is trained to maximize the critic's output. Despite their popularity, these methods often suffer from instability during training. In this work, we propose a purely value-based framework for continuous control that revisits structural maximization of Q-functions, introducing a set of key architectural and algorithmic choices to enable efficient and stable learning. We evaluate the proposed actor-free Q-learning approach on a range of standard simulation tasks, demonstrating performance and sample efficiency on par with state-of-the-art baselines, without the cost of learning a separate actor. Particularly, in environments with constrained action spaces, where the value functions are typically non-smooth, our method with structural maximization outperforms traditional actor-critic methods with gradient-based maximization. We have released our code at https://github.com/USC-Lira/Q3C.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

2510.18828

Genre: Research Report (0.64)

Industry: Education (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)