Education
Quantum generative model on bicycle-sharing system and an application
Nemoto, Fumio, Koike, Nobuyuki, Sato, Daichi, Kawaai, Yuuta, Ohzeki, Masayuki
Recently, bicycle-sharing systems have been implemented in numerous cities, becoming integral to daily life. However, a prevalent issue arises when intensive commuting demand leads to bicycle shortages in specific areas and at particular times. To address this challenge, we employ a novel quantum machine learning model that analyzes time series data by fitting quantum time evolution to observed sequences. This model enables us to capture actual trends in bicycle counts at individual ports and identify correlations between different ports. Using the trained model, we simulate how proactively adding bicycles to high-demand ports affects the total number of rentals across the system. Given that the core of this method lies in a Monte Carlo simulation, it is anticipated to have a wide range of industrial applications.
GenQuest: An LLM-based Text Adventure Game for Language Learners
Wang, Qiao, Labib, Adnan, Swier, Robert, Hofmeyr, Michael, Yuan, Zheng
GenQuest is a generative text adventure game that leverages Large Language Models (LLMs) to facilitate second language learning through immersive, interactive storytelling. The system engages English as a Foreign Language (EFL) learners in a collaborative "choose-your-own-adventure" style narrative, dynamically generated in response to learner choices. Game mechanics such as branching decision points and story milestones are incorporated to maintain narrative coherence while allowing learner-driven plot development. Key pedagogical features include content generation tailored to each learner's proficiency level, and a vocabulary assistant that provides in-context explanations of learner-queried text strings, ranging from words and phrases to sentences. Findings from a pilot study with university EFL students in China indicate promising vocabulary gains and positive user perceptions. Also discussed are suggestions from participants regarding the narrative length and quality, and the request for multi-modal content such as illustrations.
Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness
Banayeeanzade, Amin, Tak, Ala N., Bahrani, Fatemeh, Bolourani, Anahita, Blas, Leonardo, Ferrara, Emilio, Gratch, Jonathan, Karimireddy, Sai Praneeth
The ability to control LLMs' emulated emotional states and personality traits is essential for enabling rich, human-centered interactions in socially interactive settings. We introduce PsySET, a Psychologically-informed benchmark to evaluate LLM Steering Effectiveness and Trustworthiness across the emotion and personality domains. Our study spans four models from different LLM families paired with various steering strategies, including prompting, fine-tuning, and representation engineering. Our results indicate that prompting is consistently effective but limited in intensity control, whereas vector injections achieve finer controllability while slightly reducing output quality. Moreover, we explore the trustworthiness of steered LLMs by assessing safety, truthfulness, fairness, and ethics, highlighting potential side effects and behavioral shifts. Notably, we observe idiosyncratic effects; for instance, even a positive emotion like joy can degrade robustness to adversarial factuality, lower privacy awareness, and increase preferential bias. Meanwhile, anger predictably elevates toxicity yet strengthens leakage resistance. Our framework establishes the first holistic evaluation of emotion and personality steering, offering insights into its interpretability and reliability for socially interactive applications.
GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
Patwardhan, Tejal, Dias, Rachel, Proehl, Elizabeth, Kim, Grace, Wang, Michele, Watkins, Olivia, Fishman, Simón Posada, Aljubeh, Marwan, Thacker, Phoebe, Fauconnet, Laurance, Kim, Natalie S., Chao, Patrick, Miserendino, Samuel, Chabot, Gildas, Li, David, Sharman, Michael, Barr, Alexandra, Glaese, Amelia, Tworek, Jerry
We introduce GDPval, a benchmark evaluating AI model capabilities on real-world economically valuable tasks. GDPval covers the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience. We find that frontier model performance on GDPval is improving roughly linearly over time, and that the current best frontier models are approaching industry experts in deliverable quality. We analyze the potential for frontier models, when paired with human oversight, to perform GDPval tasks cheaper and faster than unaided experts. We also demonstrate that increased reasoning effort, increased task context, and increased scaffolding improve model performance on GDPval. Finally, we open-source a gold subset of 220 tasks and provide a public automated grading service at evals.openai.com to facilitate future research in understanding real-world model capabilities.
Measuring Language Model Hallucinations Through Distributional Correctness
Common evaluation paradigms for language models focus on scoring single responses through accuracy metrics or proper scoring rules, failing to capture the full richness of a model's belief state. Recent work illustrates that language models hallucinate in part because they are optimised to be good test-takers under binary scoring schemes that reward any answer over abstention. While this insight naturally leads to penalty-based approaches, they ignore crucial distinctions in how models distribute uncertainty, for example between hedging toward incorrect answers and hedging toward "I don't know" responses. A novel evaluation metric, the Distributional Correctness Score (DCS), is introduced to address this problem, namely the failure to consider a model's entire probability distribution over answer choices. DCS naturally distinguishes between harmful overconfidence in wrong answers and uncertainty expressed through abstention, providing scores in an interpretable default range. Through theoretical analysis and illustrative examples, DCS is demonstrated to offer a more nuanced and aligned evaluation paradigm that incentivises models to express genuine uncertainty rather than guessing. Adapting 12 existing evaluation benchmarks to DCS's variants and measuring performance on six language models reveals that for half of the tested benchmarks scores are negative across all tested models, indicating significant tendencies towards hallucination.
Evaluation of language models has commonly focused on whether they produce 'correct' or desired outputs in response to given inputs or instructions, as measured using accuracy or probability-based scoring rules that account for confidence in model predictions. However, the paradigm of focusing on a single answer fundamentally misses a critical aspect of evaluating performance: how models distribute their beliefs across the space of possible responses, including the possibility of abstaining from answering in conditions of uncertainty.
Recent work (Kalai et al., 2025) provides compelling evidence that language model 'hallucinations' persist in part due to the socio-technical problem of flawed evaluation metrics. Under traditional binary scoring, where correct answers receive a positive score (maximally 1 for perfect correctness), any response like "I don't know" (IDK) receives 0, and incorrect answers also receive 0, the optimal strategy for any rational agent is to always guess rather than abstain, even when confidence in the guess is minimal. This creates a systematic bias in our evaluation paradigms toward overconfident responses and offers a socio-technical explanation for why language models persist in making confident assertions about uncertain information, i.e., 'hallucinate'.
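The incentive described above can be checked with a few lines of arithmetic. The following is a minimal sketch of the binary scoring scheme only (correct = 1, incorrect = 0, IDK = 0); the function name and the example probabilities are illustrative assumptions, and the sketch does not implement DCS.

```python
def expected_binary_score(p_correct: float, abstain: bool) -> float:
    """Expected score under binary scoring: correct = 1, incorrect = 0, IDK = 0.

    p_correct is the (hypothetical) probability that a guess is right.
    """
    if abstain:
        return 0.0  # "I don't know" always scores 0
    # A guess scores 1 with probability p_correct and 0 otherwise.
    return p_correct * 1.0 + (1.0 - p_correct) * 0.0


if __name__ == "__main__":
    for p in (0.0, 0.05, 0.5):
        guess = expected_binary_score(p, abstain=False)
        idk = expected_binary_score(p, abstain=True)
        print(f"p_correct={p:.2f}  guess={guess:.2f}  abstain={idk:.2f}")
```

Guessing is never worse than abstaining under this scheme and is strictly better whenever the guess has any chance of being right, which is the bias the passage describes.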
Influence branching for learning to solve mixed-integer programs online
Strang, Paul, Alès, Zacharie, Bissuel, Côme, Juan, Olivier, Kedad-Sidhoum, Safia, Rachelson, Emmanuel
On the occasion of the 20th Mixed Integer Program Workshop's computational competition, this work introduces a new approach for learning to solve MIPs online. Influence branching, a new graph-oriented variable selection strategy, is applied throughout the first iterations of the branch-and-bound algorithm. This branching heuristic is optimized online with Thompson sampling, which ranks the best graph representations of the MIP's structure according to computational speedup over SCIP. We achieve results comparable to state-of-the-art online learning methods. Moreover, our results indicate that our method generalizes well to more general online frameworks, where variations in constraint matrix, constraint vector and objective coefficients can all occur and where more samples are available.
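To illustrate how Thompson sampling can rank a small set of candidates online, here is a minimal Beta-Bernoulli sketch. It is not the authors' implementation: treating each candidate graph representation as a bandit arm, using a binary reward ("this solve was faster than the baseline"), and the success probabilities below are all assumptions made for illustration.

```python
import random


def thompson_sampling(true_success_probs, rounds, seed=0):
    """Beta-Bernoulli Thompson sampling over a fixed set of arms.

    true_success_probs: hypothetical chance that each arm (candidate
    representation) beats the baseline on a given instance.
    Returns how many times each arm was selected.
    """
    rng = random.Random(seed)
    n_arms = len(true_success_probs)
    alpha = [1] * n_arms  # 1 + observed successes (uniform Beta(1, 1) prior)
    beta = [1] * n_arms   # 1 + observed failures
    pulls = [0] * n_arms

    for _ in range(rounds):
        # Draw one plausible success rate per arm from its posterior.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)

        # Simulated feedback; in practice this would come from a solver run.
        reward = 1 if rng.random() < true_success_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1

    return pulls


if __name__ == "__main__":
    print(thompson_sampling([0.2, 0.5, 0.8], rounds=2000))
```

As evidence accumulates, the posterior of the best arm concentrates and it is selected most often, while weaker arms are still tried occasionally.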
Diffusion-Assisted Distillation for Self-Supervised Graph Representation Learning with MLPs
Ahn, Seong Jin, Kim, Myoung-Ho
For large-scale applications, there is growing interest in replacing Graph Neural Networks (GNNs) with lightweight Multi-Layer Perceptrons (MLPs) via knowledge distillation. However, distilling GNNs for self-supervised graph representation learning into MLPs is more challenging, because the performance of self-supervised learning depends more on the model's inductive bias than that of supervised learning does. This motivates us to design a new distillation method to bridge the large capacity gap between GNNs and MLPs in self-supervised graph representation learning. In this paper, we propose Diffusion-Assisted Distillation for Self-supervised Graph representation learning with MLPs (DAD-SGM). The proposed method employs a denoising diffusion model as a teacher assistant to better distill the knowledge from the teacher GNN into the student MLP. This approach enhances the generalizability and robustness of MLPs in self-supervised graph representation learning. Extensive experiments demonstrate that DAD-SGM distills the knowledge of self-supervised GNNs more effectively than state-of-the-art GNN-to-MLP distillation methods.
Impact Statement: This paper presents Diffusion-Assisted Distillation for Self-supervised Graph representation learning with MLPs (DAD-SGM), a novel framework that addresses the performance gap between GNNs and MLPs in self-supervised graph learning. Our approach first trains an assistant denoising diffusion model that learns to predict noise from noisy outputs of the GNN teacher.
MASC: Boosting Autoregressive Image Generation with a Manifold-Aligned Semantic Clustering
He, Lixuan, Zheng, Shikang, Zhang, Linfeng
Autoregressive (AR) models have shown great promise in image generation, yet they face a fundamental inefficiency stemming from their core component: a vast, unstructured vocabulary of visual tokens. This conventional approach treats tokens as a flat vocabulary, disregarding the intrinsic structure of the token embedding space where proximity often correlates with semantic similarity. This oversight results in a highly complex prediction task, which hinders training efficiency and limits final generation quality. To resolve this, we propose Manifold-Aligned Semantic Clustering (MASC), a principled framework that constructs a hierarchical semantic tree directly from the codebook's intrinsic structure. MASC employs a novel geometry-aware distance metric and a density-driven agglomerative construction to model the underlying manifold of the token embeddings. By transforming the flat, high-dimensional prediction task into a structured, hierarchical one, MASC introduces a beneficial inductive bias that significantly simplifies the learning problem for the AR model. MASC is designed as a plug-and-play module, and our extensive experiments validate its effectiveness: it accelerates training by up to 57% and significantly improves generation quality, reducing the FID of LlamaGen-XL from 2.87 to 2.58. MASC elevates existing AR frameworks to be highly competitive with state-of-the-art methods, establishing that structuring the prediction space is as crucial as architectural innovation for scalable generative modeling.
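The idea of replacing one flat prediction with a structured, hierarchical one can be sketched with a generic two-level factorization, p(token) = p(cluster) * p(token | cluster). This is only an illustration of hierarchical prediction in general: the two-level depth, the function names, and the toy logits are assumptions, and MASC's geometry-aware distance and tree construction are not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_token_probs(cluster_logits, within_cluster_logits):
    """Two-level prediction: pick a cluster, then a token inside it.

    cluster_logits: one score per cluster.
    within_cluster_logits: for each cluster, one score per member token.
    Returns nested lists of token probabilities, grouped by cluster.
    """
    cluster_probs = softmax(cluster_logits)
    return [
        [p_cluster * p_token for p_token in softmax(member_logits)]
        for p_cluster, member_logits in zip(cluster_probs, within_cluster_logits)
    ]


if __name__ == "__main__":
    # Toy vocabulary of 5 tokens split into clusters of size 3 and 2.
    probs = hierarchical_token_probs([0.5, -0.5], [[1.0, 0.0, -1.0], [0.2, 0.2]])
    print(probs)
    print(sum(p for cluster in probs for p in cluster))
```

Each decision is made over a much smaller set of options than the full vocabulary, and the product of the two stages still defines a valid distribution over all tokens.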
CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling
Tang, Zhengyang, Ye, Zihan, Huang, Chenyu, Huang, Xuhan, Li, Chengpeng, Li, Sihang, Chen, Guanhua, Yan, Ming, Wang, Zizhuo, Zha, Hongyuan, Liu, Dayiheng, Wang, Benyou
Large Reasoning Models (LRMs) have demonstrated strong capabilities in complex multi-step reasoning, opening new opportunities for automating optimization modeling. However, existing domain adaptation methods, originally designed for earlier instruction-tuned models, often fail to exploit the advanced reasoning patterns of modern LRMs. In particular, we show that direct fine-tuning on traditional non-reflective datasets leads to limited gains. To fully leverage LRMs' inherent reasoning abilities, we propose CALM (Corrective Adaptation with Lightweight Modification), a framework that progressively refines LRMs within their native reasoning modes for optimization modeling tasks. In CALM, an expert intervener identifies reasoning flaws and provides concise corrective hints, which the LRM incorporates to produce improved reasoning trajectories. These interventions modify fewer than 2.6% of generated tokens, but generate high-quality data for soft adaptation through supervised fine-tuning. The adapted model is then further improved through reinforcement learning. Building on CALM, we develop STORM (Smart Thinking Optimization Reasoning Model), a 4B-parameter LRM that achieves a new state-of-the-art average accuracy of 68.9% across five popular optimization modeling benchmarks, matching the performance of a 671B LRM. These results demonstrate that dynamic, hint-based data synthesis both preserves and amplifies the native reasoning patterns of modern LRMs, offering a more effective and scalable path towards expert-level performance on challenging optimization modeling tasks.
Adaptive Federated Learning via Dynamical System Model
Agarwal, Aayushya, Pileggi, Larry, Joshi, Gauri
Hyperparameter selection is critical for stable and efficient convergence of heterogeneous federated learning, where clients differ in computational capabilities, and data distributions are non-IID. Tuning hyperparameters is a manual and computationally expensive process as the hyperparameter space grows combinatorially with the number of clients. To address this, we introduce an end-to-end adaptive federated learning method in which both clients and central agents adaptively select their local learning rates and momentum parameters. Our approach models federated learning as a dynamical system, allowing us to draw on principles from numerical simulation and physical design. Through this perspective, selecting momentum parameters equates to critically damping the system for fast, stable convergence, while learning rates for clients and central servers are adaptively selected to satisfy accuracy properties from numerical simulation. The result is an adaptive, momentum-based federated learning algorithm in which the learning rates for clients and servers are dynamically adjusted and controlled by a single, global hyperparameter. By designing a fully integrated solution for both adaptive client updates and central agent aggregation, our method is capable of handling key challenges of heterogeneous federated learning, including objective inconsistency and client drift. Importantly, our approach achieves fast convergence while being insensitive to the choice of the global hyperparameter, making it well-suited for rapid prototyping and scalable deployment. Compared to state-of-the-art adaptive methods, our framework is shown to deliver superior convergence for heterogeneous federated learning while eliminating the need for hyperparameter tuning for both client and server updates.
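The link between momentum and critical damping can be illustrated on a toy problem. The sketch below is an analogy, not the authors' algorithm: it assumes a one-dimensional quadratic objective f(x) = k x^2 / 2, models momentum-based descent as the damped system x'' + c x' + k x = 0, and compares how long different damping values c take to settle. The function name, step size, and tolerance are illustrative choices.

```python
import math


def settling_time(damping, stiffness=1.0, dt=0.01, t_end=100.0, tol=1e-2):
    """Last time at which |x| exceeds tol for x'' + damping*x' + stiffness*x = 0.

    Starts from x = 1, v = 0 and integrates with semi-implicit Euler steps.
    """
    x, v = 1.0, 0.0
    last_outside = 0.0
    for step in range(1, round(t_end / dt) + 1):
        v += dt * (-damping * v - stiffness * x)  # update velocity first
        x += dt * v                               # then position
        if abs(x) > tol:
            last_outside = step * dt
    return last_outside


if __name__ == "__main__":
    stiffness = 1.0
    critical = 2.0 * math.sqrt(stiffness)  # critical damping for unit mass
    for label, c in [("underdamped", 0.2), ("critical", critical), ("overdamped", 10.0)]:
        print(f"{label:12s} c={c:5.2f}  settles after t={settling_time(c, stiffness):.2f}")
```

With too little damping the iterate oscillates around the minimum, with too much it creeps toward it, and the critically damped setting settles fastest of the three in this toy system.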