Overview
A Survey on LLM Mid-Training
Tu, Chengying, Zhang, Xuemiao, Weng, Rongxiang, Li, Rumei, Zhang, Chen, Bai, Yang, Yan, Hongfei, Wang, Jingang, Cai, Xunliang
Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with a particular emphasis on the emergence of mid-training as a vital stage that bridges pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing specified capabilities such as mathematics, coding, reasoning, and long-context extension, while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks that encompass data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights, supporting future research and innovation in the advancement of LLMs.
Foundation model development has shifted from monolithic pre-training approaches to sophisticated multi-stage optimization frameworks (Ibrahim et al., 2024; Blakeney et al., 2024; Feng et al., 2024; Zhang et al., 2025a;b). While general pre-training establishes fundamental competencies through exposure to diverse large-scale corpora, contemporary research demonstrates that subsequent optimization phases systematically amplify specialized capabilities such as mathematics, reasoning, coding, agentic tasks, and long-context extension (Grattafiori et al., 2024; Parmar et al., 2024; OLMo et al., 2025). This evolution reflects a growing consensus that general pre-training may not effectively or sufficiently cultivate the capabilities required in specialized domains, particularly those that demand sustained access to high-quality data sources.
The demonstrated potential of intermediate optimization phases has catalyzed their formalization as a distinct developmental stage, which is now gradually being recognized as the mid-training stage. Mid-training is positioned as the critical bridge between general pre-training and post-training stages, characterized by intermediate computational demands and targeted large-scale data utilization. The mid-training stage has proven its capacity to balance capabilities in both directions: looking forward, it builds the potential for specialized capabilities through curriculum-guided exposure to domain-specific data; looking backward, it preserves general competencies via a reserved ratio of general data. While pre-training focuses on establishing foundational capabilities, mid-training aims to preserve these foundations while amplifying targeted competencies.
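The balance described above, domain-specific data introduced on a curriculum while a share of general data is held in reserve, can be illustrated with a small data-mixing sketch. This is only an illustration under stated assumptions: the function names `general_ratio_at` and `build_mixture`, the linear schedule, and all numeric values are hypothetical and are not taken from the survey.

```python
import random

def general_ratio_at(step, total_steps, start_ratio, end_ratio):
    """Hypothetical linear curriculum: anneal the general-data share
    from start_ratio to end_ratio over total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start_ratio + (end_ratio - start_ratio) * frac

def build_mixture(domain_docs, general_docs, general_ratio, n_samples, seed=0):
    """Draw n_samples documents so that about general_ratio of them are
    general-domain (the reserved share) and the rest are domain-specific."""
    if not 0.0 <= general_ratio <= 1.0:
        raise ValueError("general_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    n_general = round(n_samples * general_ratio)
    n_domain = n_samples - n_general
    batch = [("general", rng.choice(general_docs)) for _ in range(n_general)]
    batch += [("domain", rng.choice(domain_docs)) for _ in range(n_domain)]
    rng.shuffle(batch)  # interleave the two sources within the batch
    return batch
```

A training loop could call `general_ratio_at` once per step and pass the result to `build_mixture`; the end ratio is what keeps a floor of general data in every batch, which is the mechanism the text credits with preserving foundational competencies.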
Reflections from Research Roundtables at the Conference on Health, Inference, and Learning (CHIL) 2025
Alsentzer, Emily, Charpignon, Marie-Laure, Chen, Bill, D'Souza, Niharika, Fries, Jason, Jiang, Yixing, Kashyap, Aparajita, Kim, Chanwoo, Lee, Simon, Mandyam, Aishwarya, Mbilinyi, Ashery, Mehandru, Nikita, Nagesh, Nitish, Nuwagira, Brighton, Pierson, Emma, Pillai, Arvind, Sano, Akane, Syeda-Mahmood, Tanveer, Yadav, Shashank, Adhanom, Elias, Afza, Muhammad Umar, Archer, Amelia, Bedi, Suhana, Bikia, Vasiliki, Chang, Trenton, Chen, George H., Chen, Winston, Chiang, Erica, Choi, Edward, Ciora, Octavia, Dozie-Nnamah, Paz, Elsharief, Shaza, Engelhard, Matthew, Eshragh, Ali, Feng, Jean, Fessel, Josh, Fleming, Scott, Fong, Kei Sen, Frost, Thomas, Gadgil, Soham, Gichoya, Judy, Hershkovich, Leeor, Im, Sujeong, Jain, Bhavya, Jeanselme, Vincent, Jia, Furong, Jin, Qixuan, Jin, Yuxuan, Kapash, Daniel, Kapoor, Geetika, Kiafar, Behdokht, Kleiner, Matthias, Kraft, Stefan, Kumar, Annika, Kyung, Daeun, Liang, Zhongyuan, Lin, Joanna, Liu, Qianchu, Liu, Chang, Luan, Hongzhou, Lunt, Chris, López, Leopoldo Julían Lechuga, McDermott, Matthew B. A., Noroozizadeh, Shahriar, O'Brien, Connor, Oh, YongKyung, Ota, Mixail, Pfohl, Stephen, Pi, Meagan, Pias, Tanmoy Sarkar, Rocheteau, Emma, Sethi, Avishaan, Shirakawa, Toru, Silver, Anita, Simha, Neha, Stankeviciute, Kamile, Sunog, Max, Szolovits, Peter, Tang, Shengpu, Tang, Jialu, Tierney, Aaron, Valdovinos, John, Wallace, Byron, Wang, Will Ke, Washington, Peter, Weiss, Jeremy, Wolfe, Daniel, Wong, Emily, Yun, Hye Sun, Zhang, Xiaoman, Zhang, Xiao Yu Cindy, Jeong, Hayoung, Thakoor, Kaveri A.
The 6th annual Conference on Health, Inference, and Learning (CHIL 2025), hosted by the Association for Health Learning and Inference (AHLI), was held in person on June 25-27, 2025, at the University of California, Berkeley, in Berkeley, California, USA. As part of this year's program, we hosted Research Roundtables to catalyze collaborative, small-group dialogue around critical, timely topics at the intersection of machine learning and healthcare. Each roundtable was moderated by a team of senior and junior chairs who fostered open exchange, intellectual curiosity, and inclusive engagement. The sessions emphasized rigorous discussion of key challenges, creative exploration of emerging opportunities, and collective ideation toward actionable directions in the field. Overall, the Research Roundtables brought together a diverse mix of participants, including academic researchers, clinicians, industry professionals, and policy experts. In total, eight roundtables were held across two 30-minute sessions, with a brief transition break to allow participants to join multiple discussions.
Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning
Chen, Xinghao, Zhao, Anhao, Xia, Heming, Lu, Xuan, Wang, Hanlin, Chen, Yanjun, Zhang, Wei, Wang, Jian, Li, Wenjie, Shen, Xiaoyu
Large Language Models (LLMs) have shown impressive performance on complex tasks through Chain-of-Thought (CoT) reasoning. However, conventional CoT relies on explicitly verbalized intermediate steps, which constrains its broader applicability, particularly in abstract reasoning tasks beyond language. To address this, there has been growing research interest in latent CoT reasoning, where the reasoning process is embedded within latent spaces. By decoupling reasoning from explicit language generation, latent CoT offers the promise of richer cognitive representations and facilitates more flexible, faster inference. This paper aims to present a comprehensive overview of this emerging paradigm and establish a systematic taxonomy. We analyze recent advances in methods, categorizing them from token-wise horizontal approaches to layer-wise vertical strategies. We then provide in-depth discussions of these methods, highlighting their design principles, applications, and remaining challenges. We hope that our survey provides a structured foundation for advancing this promising direction in LLM reasoning. The relevant papers will be regularly updated at https://github.com/EIT-NLP/Awesome-Latent-CoT.
Editing Across Languages: A Survey of Multilingual Knowledge Editing
Durrani, Nadir, Mousi, Basel, Dalvi, Fahim
While Knowledge Editing has been extensively studied in monolingual settings, it remains underexplored in multilingual contexts. This survey systematizes recent research on Multilingual Knowledge Editing (MKE), a growing subdomain of model editing focused on ensuring factual edits generalize reliably across languages. We present a comprehensive taxonomy of MKE methods, covering parameter-based, memory-based, fine-tuning, and hypernetwork approaches. We survey available benchmarks, summarize key findings on method effectiveness and transfer patterns, identify challenges in cross-lingual propagation, and highlight open problems related to language anisotropy, evaluation coverage, and edit scalability. Our analysis consolidates a rapidly evolving area and lays the groundwork for future progress in editable language-aware LLMs.
MARFT: Multi-Agent Reinforcement Fine-Tuning
Liao, Junwei, Wen, Muning, Wang, Jun, Zhang, Weinan
LLM-based Multi-Agent Systems (LaMAS) have demonstrated remarkable capabilities in addressing complex, agentic tasks, from generating high-quality presentation slides to conducting sophisticated scientific research. Meanwhile, reinforcement learning (RL) has been widely recognized for its effectiveness in enhancing agent intelligence, but limited research has investigated the fine-tuning of LaMAS using foundational RL techniques. Moreover, the direct application of multi-agent reinforcement learning (MARL) methods to LaMAS introduces significant challenges, stemming from the unique characteristics and mechanisms inherent to LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes a novel paradigm termed Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce a brand-new MG called Flex-MG, which aligns with LaMAS optimization in real-world applications, and a universal algorithmic framework tailored specifically for LaMAS, outlining the conceptual foundations, key distinctions, and practical implementation strategies. We review the evolution from RL to reinforcement fine-tuning (RFT), setting the stage for a parallel analysis in the multi-agent domain. In the context of LaMAS, we elucidate critical differences between MARL and MARFT. These differences motivate a transition toward a LaMAS-oriented formulation of RFT. Central to this work is a robust and scalable MARFT framework. We detail the core algorithm and provide a complete, open-source implementation to facilitate adoption and further research. The latter sections of the paper explore real-world application perspectives and open challenges in MARFT. By bridging theoretical underpinnings with practical methodologies, this work serves as a roadmap for researchers seeking to advance MARFT toward resilient and adaptive solutions in agentic systems. Our implementation of the proposed framework is publicly available at: https://github.com/jwliao-ai/MARFT.
Trustworthy AI Must Account for Interactions
Trustworthy AI encompasses many aspirational aspects for aligning AI systems with human values, including fairness, privacy, robustness, explainability, and uncertainty quantification. Ultimately the goal of Trustworthy AI research is to achieve all aspects simultaneously. However, efforts to enhance one aspect often introduce unintended trade-offs that negatively impact others. In this position paper, we review notable approaches to these five aspects and systematically consider every pair, detailing the negative interactions that can arise. For example, applying differential privacy to model training can amplify biases, undermining fairness. Drawing on these findings, we take the position that current research practices of improving one or two aspects in isolation are insufficient. Instead, research on Trustworthy AI must account for interactions between aspects and adopt a holistic view across all relevant axes at once. To illustrate our perspective, we provide guidance on how practitioners can work towards integrated trust, examples of how interactions affect the financial industry, and alternative views.
PolyG: Adaptive Graph Traversal for Diverse GraphRAG Questions
Liu, Renjie, Jiang, Haitian, Yan, Xiao, Tang, Bo, Li, Jinyang
GraphRAG enhances large language models (LLMs) to generate quality answers for user questions by retrieving related facts from external knowledge graphs. However, current GraphRAG methods are primarily evaluated on and overly tailored for knowledge graph question answering (KGQA) benchmarks, which are biased towards a few specific question patterns and do not reflect the diversity of real-world questions. To better evaluate GraphRAG methods, we propose a complete four-class taxonomy to categorize the basic patterns of knowledge graph questions and use it to create PolyBench, a new GraphRAG benchmark encompassing a comprehensive set of graph questions. With the new benchmark, we find that existing GraphRAG methods fall short in effectiveness (i.e., quality of the generated answers) and/or efficiency (i.e., response time or token usage) because they adopt either a fixed graph traversal strategy or free-form exploration by LLMs for fact retrieval. However, different question patterns require distinct graph traversal strategies and context formation. To facilitate better retrieval, we propose PolyG, an adaptive GraphRAG approach that decomposes and categorizes questions according to our proposed question taxonomy. Built on top of a unified interface and execution engine, PolyG dynamically prompts an LLM to generate a graph database query to retrieve the context for each decomposed basic question. Compared with SOTA GraphRAG methods, PolyG achieves a higher win rate in generation quality while keeping response latency and token cost low. Our code and benchmark are open-source at https://github.com/Liu-rj/PolyG.
A Two Level Neural Approach Combining Off-Chip Prediction with Adaptive Prefetch Filtering
Jamet, Alexandre Valentin, Vavouliotis, Georgios, Jiménez, Daniel A., Alvarez, Lluc, Casas, Marc
To alleviate the performance and energy overheads of contemporary applications with large data footprints, we propose the Two Level Perceptron (TLP) predictor, a neural mechanism that effectively combines predicting whether an access will be off-chip with adaptive prefetch filtering at the first-level data cache (L1D). TLP is composed of two connected microarchitectural perceptron predictors, named First Level Predictor (FLP) and Second Level Predictor (SLP). FLP performs accurate off-chip prediction by using several program features based on virtual addresses and a novel selective delay component. The novelty of SLP lies in leveraging off-chip prediction to drive L1D prefetch filtering by using physical addresses and the FLP prediction as features. TLP constitutes the first hardware proposal targeting both off-chip prediction and prefetch filtering using a multi-level perceptron hardware approach. TLP only requires 7KB of storage. To demonstrate the benefits of TLP, we compare its performance with state-of-the-art approaches using off-chip prediction and prefetch filtering on a wide range of single-core and multi-core workloads. Our experiments show that TLP reduces the average DRAM transactions by 30.7% and 17.7%, as compared to a baseline using state-of-the-art cache prefetchers but no off-chip prediction mechanism, across the single-core and multi-core workloads, respectively, while recent work significantly increases DRAM transactions. As a result, TLP achieves geometric mean performance speedups of 6.2% and 11.8% across single-core and multi-core workloads, respectively. In addition, our evaluation demonstrates that TLP is effective independently of the L1D prefetching logic.
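The two-level arrangement described in this abstract, a first perceptron whose output becomes an input feature of a second, can be sketched in software as follows. This is a toy model under loud assumptions, not the authors' hardware design: the class and method names, the feature choices (program counter and page number), the table size, the weight range, and the training margin are all hypothetical, and the selective delay component is omitted entirely.

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages, used only to derive a page-number feature

class HashedPerceptron:
    """Generic hashed perceptron: one weight table per feature; the
    prediction is the sign of the summed weights."""

    def __init__(self, n_features, table_size=256, train_margin=8, w_max=15):
        self.table_size = table_size
        self.train_margin = train_margin  # keep training while |score| is below this
        self.w_max = w_max                # saturating weight bound
        self.tables = [[0] * table_size for _ in range(n_features)]

    def score(self, features):
        return sum(t[f % self.table_size] for t, f in zip(self.tables, features))

    def predict(self, features):
        return self.score(features) > 0

    def train(self, features, outcome):
        s = self.score(features)
        # Update on a misprediction or when confidence is still low.
        if (s > 0) != outcome or abs(s) < self.train_margin:
            delta = 1 if outcome else -1
            for t, f in zip(self.tables, features):
                i = f % self.table_size
                t[i] = max(-self.w_max, min(self.w_max, t[i] + delta))

class TwoLevelPredictor:
    """First level predicts off-chip accesses from virtual-address features;
    second level decides on prefetch filtering using physical-address
    features plus the first-level prediction (feature sets are assumptions)."""

    def __init__(self):
        self.flp = HashedPerceptron(n_features=2)
        self.slp = HashedPerceptron(n_features=3)

    def predict_off_chip(self, pc, vaddr):
        return self.flp.predict([pc, vaddr >> PAGE_SHIFT])

    def _slp_features(self, pc, vaddr, paddr):
        flp_pred = self.predict_off_chip(pc, vaddr)
        return [paddr >> PAGE_SHIFT, pc, int(flp_pred)]

    def should_filter_prefetch(self, pc, vaddr, paddr):
        return self.slp.predict(self._slp_features(pc, vaddr, paddr))

    def train_flp(self, pc, vaddr, went_off_chip):
        self.flp.train([pc, vaddr >> PAGE_SHIFT], went_off_chip)

    def train_slp(self, pc, vaddr, paddr, should_have_filtered):
        # The training label is a stand-in; how the real design derives it
        # is not stated in the abstract.
        self.slp.train(self._slp_features(pc, vaddr, paddr), should_have_filtered)
```

The point of the sketch is the data flow: the second-level feature vector contains the first-level decision, so the filter can learn to treat prefetches differently depending on whether the triggering access is expected to go off-chip.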
Advancing Cognitive Science with LLMs
Cognitive science faces ongoing challenges in knowledge synthesis and conceptual clarity, in part due to its multifaceted and interdisciplinary nature. Recent advances in artificial intelligence, particularly the development of large language models (LLMs), offer tools that may help to address these issues. This review examines how LLMs can support areas where the field has historically struggled, including establishing cross-disciplinary connections, formalizing theories, developing clear measurement taxonomies, achieving generalizability through integrated modeling frameworks, and capturing contextual and individual variation. We outline the current capabilities and limitations of LLMs in these domains, including potential pitfalls. Taken together, we conclude that LLMs can serve as tools for a more integrative and cumulative cognitive science when used judiciously to complement, rather than replace, human expertise.
Effectiveness of LLMs in Temporal User Profiling for Recommendation
Sabouri, Milad, Mansoury, Masoud, Lin, Kun, Mobasher, Bamshad
Effectively modeling the dynamic nature of user preferences is crucial for enhancing recommendation accuracy and fostering transparency in recommender systems. Traditional user profiling often overlooks the distinction between transitory short-term interests and stable long-term preferences. This paper examines the capability of leveraging Large Language Models (LLMs) to capture these temporal dynamics, generating richer user representations through distinct short-term and long-term textual summaries of interaction histories. Our observations suggest that while LLMs tend to improve recommendation quality in domains with more active user engagement, their benefits appear less pronounced in sparser environments. This disparity likely stems from the varying distinguishability of short-term and long-term preferences across domains; the approach shows greater utility where these temporal interests are more clearly separable (e.g., Movies&TV) compared to domains with more stable user profiles (e.g., Video Games). This highlights a critical trade-off between enhanced performance and computational costs, suggesting context-dependent LLM application. Beyond predictive capability, this LLM-driven approach offers intrinsic potential for interpretability through its natural language profiles and attention weights. This work contributes insights into the practical capability and inherent interpretability of LLM-driven temporal user profiling, outlining new research directions for developing adaptive and transparent recommender systems.