Education
Incentivizing Truthful Language Models via Peer Elicitation Games
Chen, Baiting, Zhu, Tong, Han, Jiale, Li, Lexin, Li, Gang, Dai, Xiaowu
Large Language Models (LLMs) have demonstrated strong generative capabilities but remain prone to inconsistencies and hallucinations. We introduce Peer Elicitation Games (PEG), a training-free, game-theoretic framework for aligning LLMs through a peer elicitation mechanism involving a generator and multiple discriminators instantiated from distinct base models. Discriminators interact in a peer evaluation setting, where utilities are computed using a determinant-based mutual information score that provably incentivizes truthful reporting without requiring ground-truth labels. We establish theoretical guarantees showing that each agent, via online learning, achieves sublinear regret, in the sense that its cumulative performance approaches that of the best fixed truthful strategy in hindsight. Moreover, we prove last-iterate convergence to a truthful Nash equilibrium, ensuring that the actual policies used by agents converge to stable and truthful behavior over time. Empirical evaluations across multiple benchmarks demonstrate significant improvements in factual accuracy. These results position PEG as a practical approach for eliciting truthful behavior from LLMs without supervision or fine-tuning.
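To make the idea of a determinant-based score concrete, here is a minimal sketch in the style of determinant-based mutual information from the peer prediction literature. It is an assumption about the general shape of such a score, not the exact utility used in PEG: the function names, the two-way task split, and the binary label set are all hypothetical.

```python
import numpy as np

def joint_counts(reports_a, reports_b, num_labels):
    """Count how often agent A reports label i while agent B reports label j."""
    counts = np.zeros((num_labels, num_labels))
    for a, b in zip(reports_a, reports_b):
        counts[a, b] += 1
    return counts

def dmi_style_score(reports_a, reports_b, num_labels=2):
    """Hypothetical DMI-style score: split the tasks into two disjoint halves,
    build one joint count matrix per half, and multiply the determinants.
    No ground-truth labels are used, only the two agents' reports."""
    half = len(reports_a) // 2
    m1 = joint_counts(reports_a[:half], reports_b[:half], num_labels)
    m2 = joint_counts(reports_a[half:], reports_b[half:], num_labels)
    return float(np.linalg.det(m1) * np.linalg.det(m2))

# Two agents whose reports are informative about each other score above zero;
# an agent that always reports the same label scores zero.
informative = dmi_style_score([0, 1, 0, 1], [0, 1, 0, 1])
constant = dmi_style_score([0, 0, 0, 0], [0, 1, 0, 1])
```

In this sketch a constant (uninformative) reporter produces a rank-deficient count matrix, so its determinant and therefore its score vanish, which is the intuition behind rewarding informative reports without labels.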
Closing the Sim2Real Performance Gap in RL
Anand, Akhil S, Sawant, Shambhuraj, Hoffmann, Jasper, Reinhardt, Dirk, Gros, Sebastien
Sim2Real aims to train policies in high-fidelity simulation environments and transfer them effectively to the real world. Despite the development of accurate simulators and Sim2Real RL approaches, policies trained purely in simulation often suffer significant performance drops when deployed in real environments. This drop is referred to as the Sim2Real performance gap. Current Sim2Real RL methods optimize simulator accuracy and variability as proxies for real-world performance. However, these metrics do not necessarily correlate with the real-world performance of the policy, as established theoretically and empirically in the literature. We propose a novel framework to address this issue by directly adapting the simulator parameters based on real-world performance. We frame this problem as a bi-level RL framework: the inner-level RL trains a policy purely in simulation, and the outer-level RL adapts the simulation model and in-sim reward parameters to maximize real-world performance of the in-sim policy. We derive and validate in simple examples the mathematical tools needed to develop bi-level RL algorithms that close the Sim2Real performance gap.
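The bi-level structure described above can be written compactly as follows. The notation is an assumption introduced here for illustration and is not taken from the paper: \theta collects the simulation model and in-sim reward parameters, J_sim is the return measured in simulation, and J_real is the return measured in the real environment.

```latex
\begin{aligned}
\max_{\theta}\quad & J_{\mathrm{real}}\bigl(\pi^{\star}_{\theta}\bigr) \\
\text{s.t.}\quad & \pi^{\star}_{\theta} \in \arg\max_{\pi}\; J_{\mathrm{sim}}(\pi;\theta)
\end{aligned}
```

The inner problem never touches real-world data; only the outer objective does, which is why the simulator parameters are tuned for real-world return rather than for simulator accuracy.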
Towards Mining Effective Pedagogical Strategies from Learner-LLM Educational Dialogues
He, Liqun, Mavrikis, Manolis, Cukurova, Mutlu
Dialogue plays a crucial role in educational settings, yet existing evaluation methods for educational applications of large language models (LLMs) primarily focus on technical performance or learning outcomes, often neglecting learner-LLM interactions. To narrow this gap, this AIED Doctoral Consortium paper presents an ongoing study employing a dialogue analysis approach to identify effective pedagogical strategies from learner-LLM dialogues. The proposed approach involves dialogue data collection, dialogue act (DA) annotation, DA pattern mining, and predictive model building. Early insights are outlined as an initial step toward future research. The work underscores the need to evaluate LLM-based educational applications by focusing on dialogue dynamics and pedagogical strategies.
Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning
Cao, Min, Zhou, Xinyu, Jiang, Ding, Du, Bo, Ye, Mang, Zhang, Min
Text-to-image person retrieval (TIPR) aims to identify the target person using textual descriptions and faces the challenge of modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, restricting their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text, implicitly enhancing the modeling of local relations across languages and modalities, and a multi-dimensional global alignment module is integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets. The task is similar to the person re-identification task (Re-ID) [2], [3], [4], which involves identifying person images across cameras based on an image query. In contrast to the structured image query in Re-ID, the text query in TIPR takes the form of free, flexible characters, making it more accessible and offering substantial application potential in public safety domains. A key challenge in TIPR is the inherent modality gap between vision and language, driving research toward robust cross-modal alignment.
Global alignment methods align global text-image representations at the coarse-grained level via cross-modal matching loss functions (Figure 1(a)), while local alignment methods establish fine-grained associations between textual entities and image body parts (Figure 1(b)). Despite notable progress in this task, two critical issues remain to be addressed.
ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts
Tan, Zheyue, Li, Zhiyuan, Yuan, Tao, Zhou, Dong, Liu, Weilin, Zhuang, Yueqing, Li, Yadong, Niu, Guowei, Qin, Cheng, Yao, Zhuyu, Liu, Congyi, Xu, Haiyang, Li, Boxun, Dai, Guohao, Zhao, Bo, Wang, Yu
Mixture-of-Experts (MoE) architectures have emerged as a promising approach to scale Large Language Models (LLMs). MoE boosts efficiency by activating only a subset of experts per token. Recent works show that fine-grained experts substantially enrich the combinatorial flexibility of active experts and enhance model expressiveness. However, such a design is fundamentally limited by the layer-local routing mechanism: each layer is restricted to its own expert pool. This requires a careful trade-off between expert dimensionality and routing diversity given fixed parameter budgets. We describe ReXMoE, a novel MoE architecture that improves routing beyond the existing layer-local approaches by allowing routers to reuse experts across adjacent layers. ReXMoE decouples expert dimensionality from per-layer budgets, enabling richer expert combinations without sacrificing individual expert capacity or inflating overall parameters. To this end, we propose a new progressive scaling routing (PSR) strategy to gradually increase the candidate expert pool during training. As a result, ReXMoE improves both language modeling and downstream task performance. Extensive experiments on models ranging from 0.5B to 7B parameters across different architectures demonstrate that ReXMoE consistently improves performance under fixed architectural dimensions, confirming ReXMoE as a new design paradigm for parameter-efficient and scalable MoE-based LLMs.
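The routing idea can be illustrated with a small sketch. Everything below is a hypothetical reading of the abstract rather than the authors' implementation: the symmetric neighbor window (`reuse_span`), the linear growth schedule standing in for PSR, and all function names are assumptions.

```python
import numpy as np

def candidate_pool(layer, num_layers, experts_per_layer, reuse_span):
    """Return (layer, expert) ids a router at `layer` may choose from.
    reuse_span=0 recovers ordinary layer-local routing (assumed window shape)."""
    lo = max(0, layer - reuse_span)
    hi = min(num_layers - 1, layer + reuse_span)
    return [(l, e) for l in range(lo, hi + 1) for e in range(experts_per_layer)]

def reuse_span_schedule(step, total_steps, max_span):
    """Hypothetical stand-in for progressive scaling: grow the span
    linearly from 0 to max_span over the course of training."""
    frac = min(1.0, step / max(1, total_steps))
    return int(frac * max_span)

def route_top_k(logits, k):
    """Pick the k highest-scoring candidates and softmax-normalize their weights."""
    top = np.argsort(logits)[::-1][:k]
    weights = np.exp(logits[top] - logits[top].max())
    return top, weights / weights.sum()

# A router at layer 1 of a 4-layer model with 2 experts per layer:
local_pool = candidate_pool(1, 4, 2, reuse_span=0)   # own experts only
shared_pool = candidate_pool(1, 4, 2, reuse_span=1)  # plus both neighbors
top, weights = route_top_k(np.array([0.1, 2.0, -1.0, 1.0]), k=2)
```

Under these assumptions the pool grows from 2 to 6 candidates without adding any expert parameters, which is the sense in which reuse decouples routing diversity from the per-layer budget.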
EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs
Naeem, Numaan, Mekki, Abdellah El, Abdul-Mageed, Muhammad
Large language models (LLMs) are transforming education by answering questions, explaining complex concepts, and generating content across a wide range of subjects. Despite strong performance on academic benchmarks, they often fail to tailor responses to students' grade levels. This is a critical need in K-12 education, where age-appropriate vocabulary and explanation are essential for effective learning. Existing models frequently produce outputs that are too advanced or vague for younger learners, and there are no standardized benchmarks to evaluate their ability to adjust across cognitive and developmental stages. To address this gap, we introduce EduAdapt, a benchmark of nearly 48k grade-labeled QA pairs across nine science subjects, spanning Grades 1-12 and grouped into four grade levels. We evaluate a diverse set of open-source LLMs on EduAdapt and find that while larger models generally perform better, they still struggle with generating suitable responses for early-grade students (Grades 1-5). Our work presents the first dataset and evaluation framework for assessing grade-level adaptability in LLMs, aiming to foster more developmentally aligned educational AI systems through better training and prompting strategies. EduAdapt code and datasets are publicly available at https://github.com/NaumanNaeem/EduAdapt.
Implicit State Estimation via Video Replanning
Ko, Po-Chen, Mao, Jiayuan, Fu, Yu-Hsiang, Yeh, Hsien-Jeng, Chen, Chu-Rong, Ma, Wei-Chiu, Du, Yilun, Sun, Shao-Hua
Video-based representations have gained prominence in planning and decision-making due to their ability to encode rich spatiotemporal dynamics and geometric relationships. These representations enable flexible and generalizable solutions for complex tasks such as object manipulation and navigation. However, existing video planning frameworks often struggle to adapt to failures at interaction time due to their inability to reason about uncertainties in partially observed environments. To overcome these limitations, we introduce a novel framework that integrates interaction-time data into the planning process. Our approach updates model parameters online and filters out previously failed plans during generation. This enables implicit state estimation, allowing the system to adapt dynamically without explicitly modeling unknown state variables. We evaluate our framework through extensive experiments on a new simulated manipulation benchmark, demonstrating its ability to improve replanning performance and advance the field of video-based decision-making. Learning from videos has gained significant traction in decision-making, as videos capture rich visual and dynamic information while aligning with how humans acquire knowledge. These properties make them a powerful medium for specifying tasks and learning diverse skills across contexts. Recent work has shown the effectiveness of video-based frameworks in enabling robots to learn behaviors such as object manipulation (Li et al., 2024) and navigation (Zhang et al., 2024), highlighting the value of video as a flexible and expressive representation. This paper focuses on video as a planning representation. Given a goal and current observation, video planning systems generate imagined task executions and convert them into robot actions. Unlike symbolic or latent representations, videos naturally encode both perceptual and action information and generalize across tasks and environments. 
Prior works (Chang et al., 2020; Du et al., 2024a;b) leverage these properties to train universal agents using video-based predictions. Despite promising results, existing video planning frameworks suffer from a crucial limitation: they lack mechanisms to integrate past interactions with the environment and cannot effectively reason about uncertainty due to partial observability.
FineVision: Open Data Is All You Need
Wiedmann, Luis, Zohar, Orr, Mahla, Amir, Wang, Xiaohan, Li, Rui, Frere, Thibaud, von Werra, Leandro, Gosthipaty, Aritra Roy, Marafioti, Andrés
The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.
Visibility Allocation Systems: How Algorithmic Design Shapes Online Visibility and Societal Outcomes
Ionescu, Stefania, Forsberg, Robin, Lichtenegger, Elsa, Jaoua, Salima, Jaglan, Kshitijaa, Dorfler, Florian, Hannak, Aniko
Across application domains, we now rely extensively on algorithmic systems to engage with ever-expanding datasets of information. Despite their benefits, these systems are often complex (comprising many intricate tools, e.g., moderation, recommender systems, prediction models), of unknown structure (due to the lack of accompanying documentation), and prone to hard-to-predict yet potentially severe downstream consequences (due to their extensive use, systematic enactment of existing errors, and the many feedback loops they comprise). As such, understanding and evaluating these systems as a whole remains a challenge for both researchers and legislators. To aid ongoing efforts, we introduce a formal framework for such visibility allocation systems (VASs), which we define as (semi-)automated systems deciding which (processed) data to present to a human user. We review typical tools comprising VASs and define the associated computational problems they solve. By doing so, VASs can be decomposed into sub-processes and illustrated via data flow diagrams. Moreover, we survey metrics for evaluating VASs throughout the pipeline, thus aiding system diagnostics. Using forecasting-based recommendations in school choice as a case study, we demonstrate how our framework can support VAS evaluation. We also discuss how our framework can support ongoing AI-legislative efforts to locate obligations, quantify systemic risks, and enable adaptive compliance.
Taming Modality Entanglement in Continual Audio-Visual Segmentation
Hong, Yuyang, Yang, Qi, Zhang, Tao, Wang, Zili, Fu, Zhaojin, Ding, Kun, Fan, Bin, Xiang, Shiming
Recently, significant progress has been made in multi-modal continual learning, which aims to learn new tasks sequentially in multi-modal settings while preserving performance on previously learned ones. However, existing methods mainly focus on coarse-grained tasks, with limitations in addressing modality entanglement in fine-grained continual learning settings. To bridge this gap, we introduce a novel Continual Audio-Visual Segmentation (CAVS) task, aiming to continuously segment new classes guided by audio. Through comprehensive analysis, two critical challenges are identified: 1) multi-modal semantic drift, where a sounding object is labeled as background in sequential tasks; 2) co-occurrence confusion, where frequently co-occurring classes tend to be confused. In this work, a Collision-based Multi-modal Rehearsal (CMR) framework is designed to address these challenges. Specifically, for multi-modal semantic drift, a Multi-modal Sample Selection (MSS) strategy is proposed to select samples with high modal consistency for rehearsal. Meanwhile, for co-occurrence confusion, a Collision-based Sample Rehearsal (CSR) mechanism is designed, which increases the rehearsal frequency of samples from confusable classes during the training process. Moreover, we construct three audio-visual incremental scenarios to verify the effectiveness of our method. Comprehensive experiments demonstrate that our method significantly outperforms single-modal continual learning methods.