Education
Training and Testing with Multiple Splits: A Central Limit Theorem for Split-Sample Estimators
As predictive algorithms grow in popularity, using the same dataset to both train and test a new model has become routine across research, policy, and industry. Sample-splitting attains valid inference on model properties by using separate subsamples to estimate the model and to evaluate it. However, this approach has two drawbacks, since each task uses only part of the data, and different splits can lead to widely different estimates. Averaging across multiple splits, I develop an inference approach that uses more data for training, uses the entire sample for testing, and improves reproducibility. I address the statistical dependence from reusing observations across splits by proving a new central limit theorem for a large class of split-sample estimators under arguably mild and general conditions. Importantly, I make no restrictions on model complexity or convergence rates. I show that confidence intervals based on the normal approximation are valid for many applications, but may undercover in important cases of interest, such as comparing the performance between two models. I develop a new inference approach for such cases, explicitly accounting for the dependence across splits. Moreover, I provide a measure of reproducibility for p-values obtained from split-sample estimators. Finally, I apply my results to two important problems in development and public economics: predicting poverty and learning heterogeneous treatment effects in randomized experiments. I show that my inference approach with repeated cross-fitting achieves better power than previous alternatives, often enough to find statistical significance that would otherwise be missed.
Sample Complexity of Distributionally Robust Off-Dynamics Reinforcement Learning with Online Interaction
He, Yiting, Liu, Zhishuai, Wang, Weixin, Xu, Pan
Off-dynamics reinforcement learning (RL), where training and deployment transition dynamics are different, can be formulated as learning in a robust Markov decision process (RMDP) where uncertainties in transition dynamics are imposed. Existing literature mostly assumes access to generative models allowing arbitrary state-action queries or pre-collected datasets with a good state coverage of the deployment environment, bypassing the challenge of exploration. In this work, we study a more realistic and challenging setting where the agent is limited to online interaction with the training environment. To capture the intrinsic difficulty of exploration in online RMDPs, we introduce the supremal visitation ratio, a novel quantity that measures the mismatch between the training dynamics and the deployment dynamics. We show that if this ratio is unbounded, online learning becomes exponentially hard. We propose the first computationally efficient algorithm that achieves sublinear regret in online RMDPs with $f$-divergence based transition uncertainties. We also establish matching regret lower bounds, demonstrating that our algorithm achieves optimal dependence on both the supremal visitation ratio and the number of interaction episodes. Finally, we validate our theoretical results through comprehensive numerical experiments.
Introducing LongCat-Flash-Thinking: A Technical Report
Meituan LongCat Team, null, Gui, Anchun, Li, Bei, Tao, Bingyang, Zhou, Bole, Chen, Borun, Zhang, Chao, Zhang, Chao, Han, Chengcheng, Yang, Chenhui, Zhang, Chi, Peng, Chong, Zhang, Chuyu, Chen, Cong, Li, Fengcun, Xu, Gang, Lin, Guoyuan, Jiang, Hao, Liang, Hao, Fu, Haomin, Ma, Haoxiang, Liu, Hong, Hao, Hongyan, Tang, Hongyin, Zang, Hongyu, Ni, Hongzhi, Su, Hui, Liu, Jiahao, Li, Jiahuan, Liu, Jialin, Zhang, Jianfei, Xu, Jianhao, Wang, Jianing, Sun, Jiaqi, Zhang, Jiaqi, Shi, Jiarong, Yang, Jiawei, Wang, Jingang, Ding, Jinrui, Kuang, Jun, Xu, Jun, He, Ke, Zhang, Kefeng, Wang, Keheng, He, Keqing, Wei, Li, Shi, Liang, Qiu, Lin, Kong, Lingbin, Liu, Lingchuan, Guo, Linsen, An, Longfei, Xia, Mai, Zhou, Meng, Zhu, Mengshen, Pei, Peng, Jia, Pengcheng, Gu, Qi, Guo, Qi, Huang, Qiong, Chen, Quan, Weng, Quanchi, Weng, Rongxiang, Shao, Ruichen, Li, Rumei, Lei, Shanglin, Du, Shuai, Liu, Shuaikang, Zhou, Shuang, Hu, Shuhao, Xu, Siyu, Gong, Songshan, Liang, Tao, Hu, Tianhao, He, Wei, Shi, Wei, Wang, Wei, Wu, Wei, Zhuo, Wei, Tang, Weifeng, Shi, Wenjie, Zhu, Wenlong, Su, Xi, Liu, Xiangcheng, Xi, Xiangyu, Huang, Xiangzhou, Liu, Xiao, Jiang, Xiaochen, Shi, Xiaowei, Shi, Xiaowen, Li, Xiaoyu, Chen, Xin, Zhao, Xinyue, Huang, Xuan, Zhang, Xuemiao, Cao, Xuezhi, Cai, Xunliang, Zhang, Yajie, Chen, Yang, Liu, Yang, Liu, Yang, Zheng, Yang, Wang, Yaoming, Huo, Yaqi, Sun, Yerui, Lu, Yifan, Li, Yiyang, Xiao, Youshao, Lei, Yuanzhe, Xie, Yuchen, Sun, Yueqing, Zhang, Yufei, Wei, Yuhuai, Qian, Yulei, Zhao, Yunke, Ding, Yuqing, Jiang, Yuwei, Yang, Zhaohua, Chen, Zhengyu, Liu, Zhijian, Xia, Zhikang, Su, Zhongda, Li, Ziran, Wang, Ziwen, Zhuang, Ziyuan, Wang, Zongyu, Yang, Zunyuan
We present LongCat-Flash-Thinking, an efficient 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model. Its advanced capabilities are cultivated through a meticulously crafted training process, beginning with long Chain-of-Thought (CoT) data cold-start and culminating in large-scale Reinforcement Learning (RL). We first employ a well-designed cold-start training strategy, which significantly enhances the reasoning potential and equips the model with specialized skills in both formal and agentic reasoning. Then, a core innovation is our domain-parallel training scheme, which decouples optimization across distinct domains (e.g., STEM, Code, Agentic) and subsequently fuses the resulting expert models into a single, nearly Pareto-optimal model. This entire process is powered by our Dynamic ORchestration for Asynchronous rollout (DORA) system, a large-scale RL framework that delivers a greater than threefold training speedup over synchronous methods on tens of thousands of accelerators. As a result, LongCat-Flash-Thinking achieves state-of-the-art performance among open-source models on a suite of complex reasoning tasks. The model exhibits exceptional efficiency in agentic reasoning, reducing average token consumption by 64.5% (from 19, 653 to 6, 965) on AIME-25, without degrading task accuracy. We release LongCat-Flash-Thinking to promote further advances in reasoning systems and agentic AI research.
What Matters in Data for DPO?
Pan, Yu, Cai, Zhongze, Chen, Guanting, Zhong, Huaiyang, Wang, Chonghuan
Direct Preference Optimization (DPO) has emerged as a simple and effective approach for aligning large language models (LLMs) with human preferences, bypassing the need for a learned reward model. Despite its growing adoption, a fundamental question remains open: what characteristics of preference data are most critical for DPO performance? In this work, we provide a systematic study of how preference data distribution influences DPO, from both theoretical and empirical perspectives. We show that the quality of chosen responses plays a dominant role in optimizing the DPO objective, while the quality of rejected responses may have relatively limited impact. Our theoretical analysis characterizes the optimal response distribution under DPO and reveals how contrastiveness between responses helps primarily by improving the chosen samples. We further study an online DPO setting and show it effectively reduces to supervised fine-tuning on the chosen responses. Extensive experiments across diverse tasks confirm our findings: improving the quality of chosen responses consistently boosts performance regardless of the quality of the rejected responses. We also investigate the benefit of mixing the on-policy data. Our results interpret the mechanism behind some widely adopted strategies and offer practical insights for constructing high-impact preference datasets for LLM alignment.
Graph Learning
Xia, Feng, Peng, Ciyuan, Ren, Jing, Febrinanto, Falih Gozi, Luo, Renqiang, Saikrishna, Vidya, Yu, Shuo, Kong, Xiangjie
Graph learning has rapidly evolved into a critical subfield of machine learning and artificial intelligence (AI). Its development began with early graph-theoretic methods, gaining significant momentum with the advent of graph neural networks (GNNs). Over the past decade, progress in scalable architectures, dynamic graph modeling, multimodal learning, generative AI, explainable AI (XAI), and responsible AI has broadened the applicability of graph learning to various challenging environments. Graph learning is significant due to its ability to model complex, non-Euclidean relationships that traditional machine learning struggles to capture, thus better supporting real-world applications ranging from drug discovery and fraud detection to recommender systems and scientific reasoning. However, challenges like scalability, generalization, heterogeneity, interpretability, and trustworthiness must be addressed to unlock its full potential. This survey provides a comprehensive introduction to graph learning, focusing on key dimensions including scalable, temporal, multimodal, generative, explainable, and responsible graph learning. We review state-of-the-art techniques for efficiently handling large-scale graphs, capturing dynamic temporal dependencies, integrating heterogeneous data modalities, generating novel graph samples, and enhancing interpretability to foster trust and transparency. We also explore ethical considerations, such as privacy and fairness, to ensure responsible deployment of graph learning models. Additionally, we identify and discuss emerging topics, highlighting recent integration of graph learning and other AI paradigms and offering insights into future directions. This survey serves as a valuable resource for researchers and practitioners seeking to navigate the rapidly evolving landscape of graph learning.
FedFACT: A Provable Framework for Controllable Group-Fairness Calibration in Federated Learning
Zhang, Li, Han, Zhongxuan, Feng, Xiaohua, Zhang, Jiaming, Li, Yuyuan, Chen, Chaochao
With the emerging application of Federated Learning (FL) in decision-making scenarios, it is imperative to regulate model fairness to prevent disparities across sensitive groups (e.g., female, male). Current research predominantly focuses on two concepts of group fairness within FL: Global Fairness (overall model disparity across all clients) and Local Fairness (the disparity within each client). However, the non-decomposable, non-differentiable nature of fairness criteria poses two fundamental, unresolved challenges for fair FL: (i) Harmonizing global and local fairness, especially in multi-class setting; (ii) Enabling a controllable, optimal accuracy-fairness trade-off. To tackle these challenges, we propose a novel controllable federated group-fairness calibration framework, named FedFACT. FedFACT identifies the Bayes-optimal classifiers under both global and local fairness constraints, yielding models with minimal performance decline while guaranteeing fairness. Building on the characterization of the optimal fair classifiers, we reformulate fair federated learning as a personalized cost-sensitive learning problem for in-processing and a bi-level optimization for post-processing. Theoretically, we provide convergence and generalization guarantees for FedFACT to approach the near-optimal accuracy under given fairness levels. Extensive experiments on multiple datasets across various data heterogeneity demonstrate that FedFACT consistently outperforms baselines in balancing accuracy and global-local fairness.
AI Literacy Assessment Revisited: A Task-Oriented Approach Aligned with Real-world Occupations
Bogart, Christopher, Warrier, Aparna, Agarwal, Arav, Higashi, Ross, Zhang, Yufan, Flot, Jesse, Savelka, Jaromir, Burte, Heather, Sakr, Majd
As artificial intelligence (AI) systems become ubiquitous in professional contexts, there is an urgent need to equip workers, often with backgrounds outside of STEM, with the skills to use these tools effectively as well as responsibly, that is, to be AI literate. However, prevailing definitions and therefore assessments of AI literacy often emphasize foundational technical knowledge, such as programming, mathematics, and statistics, over practical knowledge such as interpreting model outputs, selecting tools, or identifying ethical concerns. This leaves a noticeable gap in assessing someone's AI literacy for real-world job use. We propose a work-task-oriented assessment model for AI literacy which is grounded in the competencies required for effective use of AI tools in professional settings. We describe the development of a novel AI literacy assessment instrument, and accompanying formative assessments, in the context of a US Navy robotics training program. The program included training in robotics and AI literacy, as well as a competition with practical tasks and a multiple choice scenario task meant to simulate use of AI in a job setting. We found that, as a measure of applied AI literacy, the competition's scenario task outperformed the tests we adopted from past research or developed ourselves. We argue that when training people for AI-related work, educators should consider evaluating them with instruments that emphasize highly contextualized practical skills rather than abstract technical knowledge, especially when preparing workers without technical backgrounds for AI-integrated roles.
"I Like That You Have to Poke Around": Instructors on How Experiential Approaches to AI Literacy Spark Inquiry and Critical Thinking
Warrier, Aparna Maya, Agarwal, Arav, Savelka, Jaromir, Bogart, Christopher, Burte, Heather
As artificial intelligence (AI) increasingly shapes decision-making across domains, there is a growing need to support AI literacy among learners beyond computer science. However, many current approaches rely on programming-heavy tools or abstract lecture-based content, limiting accessibility for non-STEM audiences. This paper presents findings from a study of AI User, a modular, web-based curriculum that teaches core AI concepts through interactive, no-code projects grounded in real-world scenarios. The curriculum includes eight projects; this study focuses on instructor feedback on Projects 5-8, which address applied topics such as natural language processing, computer vision, decision support, and responsible AI. Fifteen community college instructors participated in structured focus groups, completing the projects as learners and providing feedback through individual reflection and group discussion. Using thematic analysis, we examined how instructors evaluated the design, instructional value, and classroom applicability of these experiential activities. Findings highlight instructors' appreciation for exploratory tasks, role-based simulations, and real-world relevance, while also surfacing design trade-offs around cognitive load, guidance, and adaptability for diverse learners. This work extends prior research on AI literacy by centering instructor perspectives on teaching complex AI topics without code. It offers actionable insights for designing inclusive, experiential AI learning resources that scale across disciplines and learner backgrounds.
Steering Language Models with Weight Arithmetic
Fierro, Constanza, Roger, Fabien
Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose contrastive weight steering, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight-space by subtracting the weight deltas from two small fine-tunes -- one that induces the desired behavior and another that induces its opposite -- and then add or remove this direction to modify the model's weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general capabilities. We also show that, in the context of task-specific fine-tuning, weight steering can partially mitigate undesired behavioral drift: it can reduce sycophancy and under-refusals introduced during fine-tuning while preserving task performance gains. Finally, we provide preliminary evidence that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an "evil" weight direction, suggesting that it may be possible to monitor the evolution of weights during training and detect rare misaligned behaviors that never manifest during training or evaluations.
AI Literacy for Community Colleges: Instructors' Perspectives on Scenario-Based and Interactive Approaches to Teaching AI
Warrier, Aparna Maya, Agarwal, Arav, Savelka, Jaromir, Bogart, Christopher A, Burte, Heather
This research category full paper investigates how community college instructors evaluate interactive, no-code AI literacy resources designed for non-STEM learners. As artificial intelligence becomes increasingly integrated into everyday technologies, AI literacy - the ability to evaluate AI systems, communicate with them, and understand their broader impacts - has emerged as a critical skill across disciplines. Yet effective, scalable approaches for teaching these concepts in higher education remain limited, particularly for students outside STEM fields. To address this gap, we developed AI User, an interactive online curriculum that introduces core AI concepts through scenario - based activities set in real - world contexts. This study presents findings from four focus groups with instructors who engaged with AI User materials and participated in structured feedback activities. Thematic analysis revealed that instructors valued exploratory tasks that simulated real - world AI use cases and fostered experimentation, while also identifying challenges related to scaffolding, accessibility, and multi-modal support. A ranking task for instructional support materials showed a strong preference for interactive demonstrations over traditional educational materials like conceptual guides or lecture slides. These findings offer insights into instructor perspectives on making AI concepts more accessible and relevant for broad learner audiences. They also inform the design of AI literacy tools that align with diverse teaching contexts and support critical engagement with AI in higher education.