Instructional Material
MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research
Chen, Hui, Xiong, Miao, Lu, Yujie, Han, Wei, Deng, Ailin, He, Yufei, Wu, Jiaying, Li, Yibo, Liu, Yue, Hooi, Bryan
Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (e.g., in 80% of the cases) produce fabricated or invalidated experimental results, posing a major barrier to scientific reliability. We validate MLR-Judge through human evaluation, showing high agreement with expert reviewers, supporting its potential as a scalable tool for research evaluation. We open-source MLR-Bench to help the community benchmark, diagnose, and improve AI research agents toward trustworthy and transparent scientific discovery.
NAACL2025 Tutorial: Adaptation of Large Language Models
Ke, Zixuan, Ming, Yifei, Joty, Shafiq
This tutorial on adaptation of LLMs is designed to address the growing demand for models that go beyond the static capabilities of generic LLMs by providing an overview of dynamic, domain-specific, and task-adaptive LLM adaptation techniques. While general LLMs have demonstrated strong generalization across a variety of tasks, they often struggle to perform well in specialized domains such as finance, healthcare, and code generation for underrepresented languages. Additionally, their static nature limits their ability to evolve with the changing world, and they are often extremely large, making them impractical and costly to deploy at scale. As a result, the adaptation of LLMs has drawn much attention since the birth of LLMs and is of core importance both for industry, which focuses on serving its targeted users, and for academia, which can greatly benefit from small but powerful LLMs. To address this gap, this tutorial aims to provide an overview of LLM adaptation techniques. We start with an introduction to LLM adaptation, from both the data perspective and the model perspective. We then emphasize how the evaluation metrics and benchmarks differ from those used for other techniques. After establishing the problems, we explore various adaptation techniques, which we categorize into two main families. The first is parametric knowledge adaptation, which focuses on updating the parametric knowledge within LLMs. Additionally, we will discuss real-time adaptation techniques, including model editing, which allows LLMs to be updated dynamically in production environments. The second is semi-parametric knowledge adaptation, where the goal is to update LLM parameters to better leverage external knowledge or tools through techniques like retrieval-augmented generation (RAG) and agent-based systems.
VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
Lu, Dunjie, Xu, Yiheng, Wang, Junli, Wu, Haoyuan, Wang, Xinyuan, Wang, Zekun, Yang, Junlin, Su, Hongjin, Chen, Jixuan, Chen, Junda, Mao, Yuchen, Zhou, Jingren, Lin, Junyang, Hui, Binyuan, Yu, Tao
Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.
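As a quick check of the reported gains, the relative improvement can be recomputed from the absolute scores quoted above. This is a minimal sketch; the helper name `relative_improvement` is an illustrative assumption, not part of the VideoAgentTrek release.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a fraction of `baseline`."""
    return (improved - baseline) / baseline


# Scores quoted in the abstract (percent).
osworld_gain = relative_improvement(9.3, 15.8)     # task success rate
agentnet_gain = relative_improvement(64.1, 69.3)   # step accuracy

print(f"OSWorld-Verified relative gain: {osworld_gain:.1%}")
print(f"AgentNetBench relative gain:    {agentnet_gain:.1%}")
```

The OSWorld-Verified figure comes out at roughly 70%, matching the abstract; the AgentNetBench gain is about 8% in relative terms (5.2 points absolute).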
A Justice Lens on Fairness and Ethics Courses in Computing Education: LLM-Assisted Multi-Perspective and Thematic Evaluation
Andrews, Kenya S., Kanubala, Deborah Dormah, Aruleba, Kehinde, Castro, Francisco Enrique Vicente, Revelo, Renata A
Course syllabi set the tone and expectations for courses, shaping the learning experience for both students and instructors. In computing courses, especially those addressing fairness and ethics in artificial intelligence (AI), machine learning (ML), and algorithmic design, it is imperative that we understand how approaches to navigating barriers to fair outcomes are being addressed. These expectations should be inclusive, transparent, and grounded in promoting critical thinking. Syllabus analysis offers a way to evaluate the coverage, depth, practices, and expectations within a course. Manual syllabus evaluation, however, is time-consuming and prone to inconsistency. To address this, we developed a justice-oriented scoring rubric and asked a large language model (LLM) to review syllabi through a multi-perspective role simulation. Using this rubric, we evaluated 24 syllabi from four perspectives: instructor, departmental chair, institutional reviewer, and external evaluator. We also prompted the LLM to identify thematic trends across the courses. Findings show that multi-perspective evaluation helps surface nuanced, role-specific priorities, which can then be used to fill hidden gaps in the curriculum design of AI/ML and related computing courses focused on fairness and ethics. These insights offer concrete directions for improving the design and delivery of fairness, ethics, and justice content in such courses.
LLM Bazaar: A Service Design for Supporting Collaborative Learning with an LLM-Powered Multi-Party Collaboration Infrastructure
Wu, Zhen, Shi, Jiaxin, Murray, R. Charles, Rosé, Carolyn, Andres, Micah San
Providing technological support for collaborative and discussion-based learning has long been a focus in CSCL research (Gweon et al., 2006; Kollar et al., 2006; Kumar et al., 2007; Rosé and Ferschke, 2016; Naik et al., 2024). Open-source architectures like Bazaar (Adamson et al., 2014) have enabled implementation of a plethora of dynamic support interventions, even for face-to-face collaboration through multi-modal sensing (Wang et al., 2020), which can be used in a portable fashion for nearly anytime-anywhere collaboration support (Vitiello et al., 2023). Past studies highlight the benefits of interactive and context-sensitive support in group learning (Kumar et al., 2007; Kumar and Rosé, 2010). While static scaffolding like fixed prompts (Vogel et al., 2021) and scripted roles (Fischer et al., 2013) has been effective, contextualized interventions within specific conversational contexts (Ai et al., 2010; Cui et al., 2009) or support for student role taking (Gweon et al., 2007) have also shown positive outcomes. Past studies incorporating dynamic support agents in collaborative learning activities (Kumar et al., 2007; Kumar and Rosé, 2010; Rosé and Ferschke, 2016) have shown the effectiveness of discussion-based learning integrated with conversational support using dialog agents. Finally, Sankaranarayanan and colleagues (Sankaranarayanan et al., 2022a; Sankaranarayanan et al., 2022b) have shown the effectiveness of reflection-based learning for collaborative software development, showing that shifting students' focus more towards reflection than actual coding can increase conceptual learning without harming the ability to write code. The contribution of this design paper is the introduction of capabilities from Large Language Models (LLMs) (Vaswani, 2017) to enable new forms of collaborative support agents.
While recent studies demonstrate that this new generation of support agents can provide effective learning support, the new contribution of this paper is an extension to a publicly available, open-source platform that makes it easy to integrate LLM agents developed in the broader CSCL community. The aim is to facilitate the research needed to answer questions about how best to use new AI capabilities to support collaborative learning effectively. We provide code for the LLMbazaar extension, the illustrative instructional example described below, and instructions for obtaining support for using this resource, available on GitHub (Bazaar, 2025).
PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits
Bhuiya, Neeladri, Aggarwal, Madhav, Purwar, Diptanshu
Large Language Models (LLMs) are improving at an exceptional rate. With the advent of agentic workflows, multi-turn dialogue has become the de facto mode of interaction with LLMs for completing long and complex tasks. While LLM capabilities continue to improve, the models remain susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent can be subtly injected across the conversation to produce nefarious outcomes. While single-turn attacks have been extensively explored, adaptability, efficiency, and effectiveness remain key challenges for their multi-turn counterparts. To address these gaps, we present PLAGUE, a novel plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents. PLAGUE dissects the lifetime of a multi-turn attack into three carefully designed phases (Primer, Planner, and Finisher) that enable a systematic and information-rich exploration of the multi-turn attack family. Evaluations show that red-teaming agents designed using PLAGUE achieve state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models within a smaller or comparable query budget. In particular, PLAGUE enables an ASR (based on StrongReject) of 81.4% on OpenAI's o3 and 67.3% on Anthropic's Claude Opus 4.1, two models that are considered highly resistant to jailbreaks in the safety literature. Our work offers tools and insights to understand the importance of plan initialization, context optimization, and lifelong learning in crafting multi-turn attacks for a comprehensive model vulnerability evaluation.
FST.ai 2.0: An Explainable AI Ecosystem for Fair, Fast, and Inclusive Decision-Making in Olympic and Paralympic Taekwondo
Shariatmadar, Keivan, Osman, Ahmad, Ray, Ramin, Kim, Kisam
Fair, transparent, and explainable decision-making remains a critical challenge in Olympic and Paralympic combat sports. This paper presents FST.ai 2.0, an explainable AI ecosystem designed to support referees, coaches, and athletes in real time during Taekwondo competitions and training. The system integrates pose-based action recognition using graph convolutional networks (GCNs), epistemic uncertainty modeling through credal sets, and explainability overlays for visual decision support. A set of interactive dashboards enables human-AI collaboration in referee evaluation, athlete performance analysis, and Para-Taekwondo classification. Beyond automated scoring, FST.ai 2.0 incorporates modules for referee training, fairness monitoring, and policy-level analytics within the World Taekwondo ecosystem. Experimental validation on competition data demonstrates an 85% reduction in decision review time and 93% referee trust in AI-assisted decisions. The framework thus establishes a transparent and extensible pipeline for trustworthy, data-driven officiating and athlete assessment. By bridging real-time perception, explainable inference, and governance-aware design, FST.ai 2.0 represents a step toward equitable, accountable, and human-aligned AI in sports.
Samsung's Galaxy XR Mixed Reality Headset Is Here: Price, Release Date, Features
Samsung's Galaxy XR Mixed Reality Headset Undercuts Apple's Vision Pro by $1,700. This Android XR-powered headset comes with Google's Gemini assistant and once again asks you to step into virtual waters. It has been five years since Samsung and Google stopped supporting their respective mobile virtual reality headsets. For a second try, the companies have partnered up with a bolder vision in the mixed reality space, starting with the new Galaxy XR. Announced last year as Project Moohan, it's the first headset powered by Android XR, a new platform for smart glasses and headsets built on Android and Google's Gemini assistant from the ground up. The Galaxy XR is available today in the US and South Korea for $1,800.
Distributed Allocation and Resource Scheduling Algorithms Resilient to Link Failure
Doostmohammadian, Mohammadreza, Pequito, Sergio
Distributed resource allocation (DRA) is fundamental to modern networked systems, spanning applications from economic dispatch in smart grids to CPU scheduling in data centers. Conventional DRA approaches require reliable communication, yet real-world networks frequently suffer from link failures, packet drops, and communication delays due to environmental conditions, network congestion, and security threats. We introduce a novel resilient DRA algorithm that addresses these critical challenges, and our main contributions are as follows: (1) guaranteed constraint feasibility at all times, ensuring resource-demand balance even during algorithm termination or network disruption; (2) robust convergence despite sector-bound nonlinearities at nodes/links, accommodating practical constraints like quantization and saturation; and (3) optimal performance under merely uniformly-connected networks, eliminating the need for continuous connectivity. Unlike existing approaches that require persistent network connectivity and provide only asymptotic feasibility, our graph-theoretic solution leverages network percolation theory to maintain performance during intermittent disconnections. This makes it particularly valuable for mobile multi-agent systems where nodes frequently move out of communication range. Theoretical analysis and simulations demonstrate that our algorithm converges to optimal solutions despite heterogeneous time delays and substantial link failures, significantly advancing the reliability of distributed resource allocation in practical network environments.
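To illustrate what "constraint feasibility at all times" means in this setting, here is a minimal sketch of a generic Laplacian-gradient allocation scheme over a network with randomly failing links. It is not the authors' algorithm (it ignores delays, sector-bound nonlinearities, and the percolation-based analysis); the quadratic costs, the 4-node ring, the step size, and the drop probability are all illustrative assumptions. Each working link moves resource from the node with the higher marginal cost to the one with the lower, so the total allocation never changes, whichever links are down.

```python
import random

# Hypothetical problem: minimise sum_i (A[i] / 2) * (x_i - C[i])**2
# subject to sum_i x_i = 10. All numbers are illustrative assumptions.
A = [1.0, 2.0, 0.5, 1.5]
C = [0.0, 1.0, 2.0, -1.0]
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]  # undirected ring of 4 nodes


def gradients(x):
    """Marginal cost of each node at its current allocation."""
    return [a * (xi - c) for a, xi, c in zip(A, x, C)]


def run(steps=4000, eta=0.05, drop_prob=0.5, seed=0):
    rng = random.Random(seed)
    x = [10.0, 0.0, 0.0, 0.0]  # feasible start: allocations sum to 10
    total = sum(x)
    for _ in range(steps):
        g = gradients(x)  # all nodes use gradients from the same round
        for i, j in EDGES:
            if rng.random() < drop_prob:
                continue  # this link has failed for the current round
            flow = eta * (g[i] - g[j])
            x[i] -= flow  # what node i gives up ...
            x[j] += flow  # ... node j receives, so sum(x) is unchanged
        assert abs(sum(x) - total) < 1e-8  # feasible at every iteration
    return x


x_final = run()
g_final = gradients(x_final)
print("allocations:", [round(v, 4) for v in x_final])
print("marginal costs:", [round(v, 4) for v in g_final])
```

In this sketch the marginal costs equalise across nodes, which is the optimality condition for the equality-constrained problem, even though on average half of the links are down in any given round. Stopping the loop early still leaves a feasible allocation.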
The Integration of Artificial Intelligence in Undergraduate Medical Education in Spain: Descriptive Analysis and International Perspectives
Janeiro, Ana Enériz, Pereira, Karina Pitombeira, Mayol, Julio, Crespo, Javier, Carballo, Fernando, Cabello, Juan B., Ramos-Casals, Manel, Corbacho, Bibiana Pérez, Turnes, Juan
AI is transforming medical practice and redefining the competencies that future healthcare professionals need to master. Despite international recommendations, the integration of AI into Medicine curricula in Spain had not been systematically evaluated until now. We conducted a cross-sectional study (July-September 2025) of Spanish universities offering the official degree in Medicine, according to the Register of Universities, Centers and Degrees (Registro de Universidades, Centros y Títulos, RUCT). Curricula and publicly available institutional documentation were reviewed to identify courses and competencies related to AI in the 2025-2026 academic year. The analysis was performed using descriptive statistics. Of the 52 universities analyzed, ten (19.2%) offer specific AI courses, whereas 36 (69.2%) include no related content. Most of the identified courses are elective, with a credit load ranging from three to six ECTS, representing on average 1.17% of the total 360 credits of the degree. The University of Jaén is the only institution offering a compulsory course with AI content. The territorial analysis reveals marked disparities: Andalusia leads with 55.5% of its universities incorporating AI training, while several communities lack any initiative in this area. The integration of AI into the medical degree in Spain is incipient, fragmented, and uneven, with a low weight in ECTS. The limited training load and predominance of elective courses restrict the preparation of future physicians to practice in a healthcare environment increasingly mediated by AI. The findings support the establishment of minimum standards and national monitoring of indicators.
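The descriptive statistics quoted above can be recomputed directly from the counts in the abstract. This is a minimal sketch; the variable and function names are illustrative assumptions.

```python
TOTAL_UNIVERSITIES = 52   # universities analyzed
DEGREE_ECTS = 360         # total credits of the degree in Medicine


def share(count: int, total: int = TOTAL_UNIVERSITIES) -> float:
    """Percentage of `total` represented by `count`."""
    return 100.0 * count / total


with_ai_courses = share(10)   # universities offering specific AI courses
no_ai_content = share(36)     # universities with no related content
mean_ai_ects = 1.17 / 100 * DEGREE_ECTS  # average AI credit load in ECTS

print(f"Specific AI courses: {with_ai_courses:.1f}%")
print(f"No AI content:       {no_ai_content:.1f}%")
print(f"Average AI load:     {mean_ai_ects:.1f} ECTS")
```

An average of 1.17% of 360 credits is about 4.2 ECTS, which is consistent with the reported range of three to six ECTS per course.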