Education
iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference
Large Language Model (LLM) agent systems have advanced rapidly, driven by their strong generalization in zero-shot settings. To further enhance reasoning and accuracy on complex tasks, Multi-Agent Debate (MAD) has emerged as a promising framework that engages multiple LLM agents in structured debates to encourage diverse reasoning. However, triggering MAD for every query is inefficient, as it incurs substantial computational (token) cost and may even degrade accuracy by overturning correct single-agent answers. To address these limitations, we propose intelligent Multi-Agent Debate (iMAD), a token-efficient framework that selectively triggers MAD only when it is likely to be beneficial (i.e., correcting an initially wrong answer). To achieve this goal, iMAD learns generalizable model behaviors to make accurate debate decisions. Specifically, iMAD first prompts a single agent to produce a structured self-critique response, from which we extract 41 interpretable linguistic and semantic features capturing hesitation cues. Then, iMAD uses a lightweight debate-decision classifier, trained using our proposed FocusCal loss, to determine whether to trigger MAD, enabling robust debate decisions without test dataset-specific tuning. Through extensive experiments using six (visual) question answering datasets against five competitive baselines, we have shown that iMAD significantly reduces token usage (by up to 92%) while also improving final answer accuracy (by up to 13.5%).
AdvisingWise: Supporting Academic Advising in Higher Education Settings Through a Human-in-the-Loop Multi-Agent Framework
Jiang, Wendan, Wang, Shiyuan, Eltigani, Hiba, Haroon, Rukhshan, Faisal, Abdullah Bin, Dogar, Fahad
Academic advising is critical to student success in higher education, yet high student-to-advisor ratios limit advisors' capacity to provide timely support, particularly during peak periods. Recent advances in Large Language Models (LLMs) present opportunities to enhance the advising process. We present AdvisingWise, a multi-agent system that automates time-consuming tasks, such as information retrieval and response drafting, while preserving human oversight. AdvisingWise leverages authoritative institutional resources and adaptively prompts students about their academic backgrounds to generate reliable, personalized responses. All system responses undergo human advisor validation before delivery to students. We evaluate AdvisingWise through a mixed-methods approach: (1) expert evaluation on responses of 20 sample queries, (2) LLM-as-a-judge evaluation of the information retrieval strategy, and (3) a user study with 8 academic advisors to assess the system's practical utility. Our evaluation shows that AdvisingWise produces accurate, personalized responses. Advisors reported increasingly positive perceptions after using AdvisingWise, as their initial concerns about reliability and personalization diminished. We conclude by discussing the implications of human-AI synergy on the practice of academic advising.
Reference Grounded Skill Discovery
Rho, Seungeun, Trinh, Aaron, Xu, Danfei, Ha, Sehoon
Scaling unsupervised skill discovery algorithms to high-DoF agents remains challenging. As dimensionality increases, the exploration space grows exponentially, while the manifold of meaningful skills remains limited. Therefore, semantic meaningfulness becomes essential to effectively guide exploration in high-dimensional spaces. In this work, we present **Reference-Grounded Skill Discovery (RGSD)**, a novel algorithm that grounds skill discovery in a semantically meaningful latent space using reference data. RGSD first performs contrastive pretraining to embed motions on a unit hypersphere, clustering each reference trajectory into a distinct direction. This grounding enables skill discovery to simultaneously involve both imitation of reference behaviors and the discovery of semantically related diverse behaviors. On a simulated SMPL humanoid with $359$-D observations and $69$-D actions, RGSD successfully imitates skills such as walking, running, punching, and sidestepping, while also discover variations of these behaviors. In downstream locomotion tasks, RGSD leverages the discovered skills to faithfully satisfy user-specified style commands and outperforms imitation-learning baselines, which often fail to maintain the commanded style. Overall, our results suggest that lightweight reference-grounding offers a practical path to discovering semantically rich and structured skills in high-DoF systems.
Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
Apertus, Project, Hernรกndez-Cano, Alejandro, Hรคgele, Alexander, Huang, Allen Hao, Romanou, Angelika, Solergibert, Antoni-Joan, Pasztor, Barna, Messmer, Bettina, Garbaya, Dhia, ฤurech, Eduard Frank, Hakimi, Ido, Giraldo, Juan Garcรญa, Ismayilzada, Mete, Foroutan, Negar, Moalla, Skander, Chen, Tiancheng, Sabolฤec, Vinko, Xu, Yixuan, Aerni, Michael, AlKhamissi, Badr, Mariรฑas, Inรฉs Altemir, Amani, Mohammad Hossein, Ansaripour, Matin, Badanin, Ilia, Benoit, Harold, Boros, Emanuela, Browning, Nicholas, Bรถsch, Fabian, Bรถther, Maximilian, Canova, Niklas, Challier, Camille, Charmillot, Clement, Coles, Jonathan, Deriu, Jan, Devos, Arnout, Drescher, Lukas, Dzenhaliou, Daniil, Ehrmann, Maud, Fan, Dongyang, Fan, Simin, Gao, Silin, Gila, Miguel, Grandury, Marรญa, Hashemi, Diba, Hoyle, Alexander, Jiang, Jiaming, Klein, Mark, Kucharavy, Andrei, Kucherenko, Anastasiia, Lรผbeck, Frederike, Machacek, Roman, Manitaras, Theofilos, Marfurt, Andreas, Matoba, Kyle, Matrenok, Simon, Mendonรงa, Henrique, Mohamed, Fawzi Roberto, Montariol, Syrielle, Mouchel, Luca, Najem-Meyer, Sven, Ni, Jingwei, Oliva, Gennaro, Pagliardini, Matteo, Palme, Elia, Panferov, Andrei, Paoletti, Lรฉo, Passerini, Marco, Pavlov, Ivan, Poiroux, Auguste, Ponkshe, Kaustubh, Ranchin, Nathan, Rando, Javi, Sauser, Mathieu, Saydaliev, Jakhongir, Sayfiddinov, Muhammad Ali, Schneider, Marian, Schuppli, Stefano, Scialanga, Marco, Semenov, Andrei, Shridhar, Kumar, Singhal, Raghav, Sotnikova, Anna, Sternfeld, Alexander, Tarun, Ayush Kumar, Teiletche, Paul, Vamvas, Jannis, Yao, Xiaozhe, Zhao, Hao, Ilic, Alexander, Klimovic, Ana, Krause, Andreas, Gulcehre, Caglar, Rosenthal, David, Ash, Elliott, Tramรจr, Florian, VandeVondele, Joost, Veraldi, Livio, Rajman, Martin, Schulthess, Thomas, Hoefler, Torsten, Bosselut, Antoine, Jaggi, Martin, Schlag, Imanol
We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting `robots.txt` exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning
Verma, Abhigya, Puttagunta, Sriram, Subramanian, Seganrasan, Ramachandran, Sravan
GRAFT is a structured multimodal benchmark designed to probe how well LLMs handle instruction following, visual reasoning, and tasks requiring tight visual textual alignment. The dataset is built around programmatically generated charts and synthetically rendered tables, each paired with a carefully constructed, multi step analytical question that depends solely on what can be inferred from the image itself. Responses are formatted in structured outputs such as JSON or YAML, enabling consistent and fine grained evaluation of both reasoning processes and adherence to output specifications. The benchmark further introduces a taxonomy of reasoning operations ranging from comparison and trend identification to ranking, aggregation, proportional estimation, and anomaly detection to support a comprehensive assessment of model capabilities. Taken together, GRAFT provides a unified and scalable framework for evaluating multimodal LLMs on visually grounded, structured reasoning tasks, offering a more rigorous standard for future benchmarking efforts.
LLM-as-a-Supervisor: Mistaken Therapeutic Behaviors Trigger Targeted Supervisory Feedback
Xu, Chen, Lv, Zhenyu, Lan, Tian, Wang, Xianyang, Ji, Luyao, Cui, Leyang, Yang, Minqiang, Shen, Jian, Dong, Qunxi, Liu, Xiuling, Wang, Juan, Hu, Bin
Although large language models (LLMs) hold significant promise in psychotherapy, their direct application in patient-facing scenarios raises ethical and safety concerns. Therefore, this work shifts towards developing an LLM as a supervisor to train real therapists. In addition to the privacy of clinical therapist training data, a fundamental contradiction complicates the training of therapeutic behaviors: clear feedback standards are necessary to ensure a controlled training system, yet there is no absolute "gold standard" for appropriate therapeutic behaviors in practice. In contrast, many common therapeutic mistakes are universal and identifiable, making them effective triggers for targeted feedback that can serve as clearer evidence. Motivated by this, we create a novel therapist-training paradigm: (1) guidelines for mistaken behaviors and targeted correction strategies are first established as standards; (2) a human-in-the-loop dialogue-feedback dataset is then constructed, where a mistake-prone agent intentionally makes standard mistakes during interviews naturally, and a supervisor agent locates and identifies mistakes and provides targeted feedback; (3) after fine-tuning on this dataset, the final supervisor model is provided for real therapist training. The detailed experimental results of automated, human and downstream assessments demonstrate that models fine-tuned on our dataset MATE, can provide high-quality feedback according to the clinical guideline, showing significant potential for the therapist training scenario.
Animating Language Practice: Engagement with Stylized Conversational Agents in Japanese Learning
Rackauckas, Zackary, Hirschberg, Julia
We explore Jouzu, a Japanese language learning application that integrates large language models with anime-inspired conversational agents. Designed to address challenges learners face in practicing natural and expressive dialogue, Jouzu combines stylized character personas with expressive text-to-speech to create engaging conversational scenarios. We conducted a two-week in-the-wild deployment with 52 Japanese learners to examine how such stylized agents influence engagement and learner experience. Our findings show that participants interacted frequently and creatively, with advanced learners demonstrating greater use of expressive forms. Participants reported that the anime-inspired style made practice more enjoyable and encouraged experimenting with different registers. We discuss how stylization shapes willingness to engage, the role of affect in sustaining practice, and design opportunities for culturally grounded conversational AI in computer-assisted language learning (CALL). By framing our findings as an exploration of design and engagement, we highlight opportunities for generalization beyond Japanese contexts and contribute to international HCI scholarship.
Noise tolerance via reinforcement: Learning a reinforced quantum dynamics
The performance of quantum simulations heavily depends on the efficiency of noise mitigation techniques and error correction algorithms. Reinforcement has emerged as a powerful strategy to enhance the efficiency of learning and optimization algorithms. In this study, we demonstrate that a reinforced quantum dynamics can exhibit significant robustness against interactions with a noisy environment. We study a quantum annealing process where, through reinforcement, the system is encouraged to maintain its current state or follow a noise-free evolution. A learning algorithm is employed to derive a concise approximation of this reinforced dynamics, reducing the total evolution time and, consequently, the system's exposure to noisy interactions. This also avoids the complexities associated with implementing quantum feedback in such reinforcement algorithms. The efficacy of our method is demonstrated through numerical simulations of reinforced quantum annealing with one- and two-qubit systems under Pauli noise.
Unleashing Hour-Scale Video Training for Long Video-Language Understanding
Lin, Jingyang, Wu, Jialian, Sun, Ximeng, Wang, Ze, Liu, Jiang, Su, Yusheng, Yu, Xiaodong, Chen, Hao, Luo, Jiebo, Liu, Zicheng, Barsoum, Emad
Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LMMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates question-relevant and spatiotemporally informative semantics from the cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple representative long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.
Matryoshka Model Learning for Improved Elastic Student Models
Verma, Chetan, Timmaraju, Aditya Srinivas, Hsieh, Cho-Jui, Damle, Suyash, Bui, Ngot, Zhang, Yang, Chen, Wen, Liu, Xin, Jain, Prateek, Dhillon, Inderjit S
Industry-grade ML models are carefully designed to meet rapidly evolving serving constraints, which requires significant resources for model development. In this paper, we propose MatTA, a framework for training multiple accurate Student models using a novel Teacher-TA-Student recipe. TA models are larger versions of the Student models with higher capacity, and thus allow Student models to better relate to the Teacher model and also bring in more domain-specific expertise. Furthermore, multiple accurate Student models can be extracted from the TA model. Therefore, despite only one training run, our methodology provides multiple servable options to trade off accuracy for lower serving cost. We demonstrate the proposed method, MatTA, on proprietary datasets and models. Its practical efficacy is underscored by live A/B tests within a production ML system, demonstrating 20% improvement on a key metric. We also demonstrate our method on GPT-2 Medium, a public model, and achieve relative improvements of over 24% on SAT Math and over 10% on the LAMBADA benchmark.