Education
Revisiting NLI: Towards Cost-Effective and Human-Aligned Metrics for Evaluating LLMs in Question Answering
Balamurali, Sai Shridhar, Cheng, Lu
Evaluating answers from state-of-the-art large language models (LLMs) is challenging: lexical metrics miss semantic nuances, whereas "LLM-as-Judge" scoring is computationally expensive. We re-evaluate a lightweight alternative -- off-the-shelf Natural Language Inference (NLI) scoring augmented by a simple lexical-match flag and find that this decades-old technique matches GPT-4o's accuracy (89.9%) on long-form QA, while requiring orders-of-magnitude fewer parameters. To test human alignment of these metrics rigorously, we introduce DIVER-QA, a new 3000-sample human-annotated benchmark spanning five QA datasets and five candidate LLMs. Our results highlight that inexpensive NLI-based evaluation remains competitive and offer DIVER-QA as an open resource for future metric research.
Enabling Automatic Self-Talk Detection via Earables
Lee, Euihyeok, Kim, Seonghyeon, Im, SangHun, Oh, Heung-Seon, Kang, Seungwoo
Self-talk-an internal dialogue that can occur silently or be spoken aloud-plays a crucial role in emotional regulation, cognitive processing, and motivation, yet has remained largely invisible and unmeasurable in everyday life. In this paper, we present MutterMeter, a mobile system that automatically detects vocalized self-talk from audio captured by earable microphones in real-world settings. Detecting self-talk is technically challenging due to its diverse acoustic forms, semantic and grammatical incompleteness, and irregular occurrence patterns, which differ fundamentally from assumptions underlying conventional speech understanding models. To address these challenges, MutterMeter employs a hierarchical classification architecture that progressively integrates acoustic, linguistic, and contextual information through a sequential processing pipeline, adaptively balancing accuracy and computational efficiency. We build and evaluate MutterMeter using a first-of-its-kind dataset comprising 31.1 hours of audio collected from 25 participants. Experimental results demonstrate that MutterMeter achieves robust performance with a macro-averaged F1 score of 0.84, outperforming conventional approaches, including LLM-based and speech emotion recognition models.
From Hubs to Deserts: Urban Cultural Accessibility Patterns with Explainable AI
Pranto, Protik Bose, Islam, Minhazul, Saha, Ripon Kumar, Rivera, Abimelec Mercado, Abbasov, Namig
Cultural infrastructures, such as libraries, museums, theaters, and galleries, support learning, civic life, health, and local economies, yet access is uneven across cities. We present a novel, scalable, and open-data framework to measure spatial equity in cultural access. We map cultural infrastructures and compute a metric called Cultural Infrastructure Accessibility Score (CIAS) using exponential distance decay at fine spatial resolution, then aggregate the score per capita and integrate socio-demographic indicators. Interpretable tree-ensemble models with SHapley Additive exPlanation (SHAP) are used to explain associations between accessibility, income, density, and tract-level racial/ethnic composition. Results show a pronounced core-periphery gradient, where non-library cultural infrastructures cluster near urban cores, while libraries track density and provide broader coverage. Non-library accessibility is modestly higher in higher-income tracts, and library accessibility is slightly higher in denser, lower-income areas.
Towards Personalized Quantum Federated Learning for Anomaly Detection
Rahman, Ratun, Shaham, Sina, Nguyen, Dinh C.
Anomaly detection has a significant impact on applications such as video surveillance, medical diagnostics, and industrial monitoring, where anomalies frequently depend on context and anomaly-labeled data are limited. Quantum federated learning (QFL) overcomes these concerns by distributing model training among several quantum clients, consequently eliminating the requirement for centralized quantum storage and processing. However, in real-life quantum networks, clients frequently differ in terms of hardware capabilities, circuit designs, noise levels, and how classical data is encoded or preprocessed into quantum states. These differences create inherent heterogeneity across clients - not just in their data distributions, but also in their quantum processing behaviors. As a result, training a single global model becomes ineffective, especially when clients handle imbalanced or non-identically distributed (non-IID) data. To address this, we propose a new framework called personalized quantum federated learning (PQFL) for anomaly detection. PQFL enhances local model training at quantum clients using parameterized quantum circuits and classical optimizers, while introducing a quantum-centric personalization strategy that adapts each client's model to its own hardware characteristics and data representation. Extensive experiments show that PQFL significantly improves anomaly detection accuracy under diverse and realistic conditions. Compared to state-of-the-art methods, PQFL reduces false errors by up to 23%, and achieves gains of 24.2% in AUROC and 20.5% in AUPR, highlighting its effectiveness and scalability in practical quantum federated settings.
Exploring the Psychometric Validity of AI-Generated Student Responses: A Study on Virtual Personas' Learning Motivation
This study explores whether large language models (LLMs) can simulate valid student responses for educational measurement. Using GPT - 4o, 2000 virtual student personas were generated. Each persona completed the Academic Motivation Scale (AMS). Factor analys es(EFA and CFA) and clustering showed GPT - 4o reproduced the AMS structure and distinct motivational subgroups.
Towards Adaptive Humanoid Control via Multi-Behavior Distillation and Reinforced Fine-Tuning
Zhao, Yingnan, Wang, Xinmiao, Wang, Dewei, Liu, Xinzhe, Lu, Dan, Han, Qilong, Liu, Peng, Bai, Chenjia
Humanoid robots are promising to learn a diverse set of human-like locomotion behaviors, including standing up, walking, running, and jumping. However, existing methods predominantly require training independent policies for each skill, yielding behavior-specific controllers that exhibit limited generalization and brittle performance when deployed on irregular terrains and in diverse situations. To address this challenge, we propose Adaptive Humanoid Control (AHC) that adopts a two-stage framework to learn an adaptive humanoid locomotion controller across different skills and terrains. Specifically, we first train several primary locomotion policies and perform a multi-behavior distillation process to obtain a basic multi-behavior controller, facilitating adaptive behavior switching based on the environment. Then, we perform reinforced fine-tuning by collecting online feedback in performing adaptive behaviors on more diverse terrains, enhancing terrain adaptability for the controller. We conduct experiments in both simulation and real-world experiments in Unitree G1 robots. The results show that our method exhibits strong adaptability across various situations and terrains.
DynaSolidGeo: A Dynamic Benchmark for Genuine Spatial Mathematical Reasoning of VLMs in Solid Geometry
Wu, Changti, Lian, Shijie, Liu, Zihao, Zhang, Lei, Yang, Laurence Tianruo, Chen, Kai
However, most existing multimodal mathematical reasoning benchmarks focus primarily on 2D plane geometry, rely on static datasets prone to data contamination and memorization, and evaluate models solely by final answers, overlooking the reasoning process. T o address these limitations, we introduce DynaSolidGeo, the first dynamic benchmark for evaluating genuine spatial reasoning in Vision-Language Models (VLMs). Constructed through a semi-automatic annotation pipeline, DynaSolidGeo contains 503 expert-curated seed questions that can, in principle, dynamically generate an unbounded number of diverse multimodal text-visual instances. Beyond answer accuracy, we incorporate process evaluation based on expert-annotated reasoning chains to measure logical validity and causal coherence. Experiments across representative open-source and closed-source VLMs reveal large performance gaps, severe degradation in dynamic settings, and poor performance on tasks requiring high-level spatial intelligence, such as mental rotation and visualization. The code and dataset are available at DynaSolidGeo.
Re:Member: Emotional Question Generation from Personal Memories
Rackauckas, Zackary, Minematsu, Nobuaki, Hirschberg, Julia
We present Re:Member, a system that explores how emotionally expressive, memory-grounded interaction can support more engaging second language (L2) learning. By drawing on users' personal videos and generating stylized spoken questions in the target language, Re:Member is designed to encourage affective recall and conversational engagement. The system aligns emotional tone with visual context, using expressive speech styles such as whispers or late-night tones to evoke specific moods. It combines WhisperX-based transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotional synthesis within a modular generation pipeline. Designed as a stylized interaction probe, Re:Member highlights the role of affect and personal media in learner-centered educational technologies.
Policy Transfer for Continuous-Time Reinforcement Learning: A (Rough) Differential Equation Approach
This paper studies policy transfer, one of the well-known transfer learning techniques adopted in large language models, for two classes of continuous-time reinforcement learning problems. In the first class of continuous-time linear-quadratic systems with Shannon's entropy regularization (a.k.a. LQRs), we fully exploit the Gaussian structure of their optimal policy and the stability of their associated Riccati equations. In the second class where the system has possibly non-linear and bounded dynamics, the key technical component is the stability of diffusion SDEs which is established by invoking the rough path theory. Our work provides the first theoretical proof of policy transfer for continuous-time RL: an optimal policy learned for one RL problem can be used to initialize the search for a near-optimal policy in a closely related RL problem, while maintaining the convergence rate of the original algorithm. To illustrate the benefit of policy transfer for RL, we propose a novel policy learning algorithm for continuous-time LQRs, which achieves global linear convergence and local super-linear convergence. As a byproduct of our analysis, we derive the stability of a concrete class of continuous-time score-based diffusion models via their connection with LQRs.
MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages
Whitehouse, Chenxi, Ruder, Sebastian, Lin, Tony, Kurylo, Oksana, Takagi, Haruka, Lam, Janice, Busetto, Nicolò, Diaz, Denise, Guzmán, Francisco
Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.Dataset https://huggingface.co/datasets/facebook/menlo In order for LLMs to be most useful across the globe, they need to be able to provide high-quality responses in many languages. Responses should be relevant (Zhuang et al., 2024), factually accurate (Jacovi et al., 2025), and natural (Marchisio et al., 2024; Guo et al., 2025), among other considerations. Ultimately, for interaction in any language to be seamless, responses need to be indistinguishable from those of a native speaker (Novikova et al., 2016; Liu et al., 2021). Language proficiency in humans has traditionally been evaluated via standardized tests (Jamieson et al., 2000). While such tests have been applied to evaluating LLMs (Anil et al., 2023; Mayor-Rocher et al., 2024; Lothritz & Cabot, 2025), they are difficult to scale and do not readily correspond to real-world conversations.