Problem Solving
Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework
Xu, Zenan, Qiu, Zexuan, Huang, Guanhua, Li, Kun, Li, Siheng, Zhang, Chenchen, Li, Kejiao, Yi, Qi, Jiang, Yuhao, Zhou, Bo, Lian, Fengzong, Kang, Zhanhui
Recent advances in large language models (LLMs) have accelerated progress toward artificial general intelligence, with inference-time scaling emerging as a key technique. Contemporary approaches leverage either sequential reasoning (iteratively extending chains of thought) or parallel reasoning (generating multiple solutions simultaneously) to scale inference. However, both paradigms face fundamental limitations: sequential scaling typically relies on arbitrary token budgets for termination, leading to inefficiency or premature cutoff; while parallel scaling often lacks coordination among parallel branches and requires intrusive fine-tuning to perform effectively. In light of these challenges, we aim to design a flexible test-time collaborative inference framework that exploits the complementary strengths of both sequential and parallel reasoning paradigms. Towards this goal, the core challenge lies in developing an efficient and accurate intrinsic quality metric to assess model responses during collaborative inference, enabling dynamic control and early termination of the reasoning trace. To address this challenge, we introduce semantic entropy (SE), which quantifies the semantic diversity of parallel model responses and serves as a robust indicator of reasoning quality due to its strong negative correlation with accuracy...
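To make the idea of semantic entropy concrete, the following is a minimal sketch, not the authors' implementation. It assumes that semantically equivalent responses can be grouped by exact match of a normalized final answer (a real system would likely use a learned equivalence check), and the names `semantic_entropy`, `should_stop`, and the threshold value are hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(answers, normalize=str.strip):
    """Shannon entropy (in nats) over clusters of equivalent answers.

    Assumption: two answers are 'semantically equivalent' when their
    normalized strings match exactly. This is a stand-in for a real
    semantic clustering step.
    """
    clusters = Counter(normalize(a) for a in answers)
    total = sum(clusters.values())
    return -sum((c / total) * math.log(c / total) for c in clusters.values())


def should_stop(answers, threshold=0.5):
    """Hypothetical termination rule: stop once the branches mostly agree."""
    return semantic_entropy(answers) <= threshold


# Four parallel branches that agree yield zero entropy, so reasoning can stop.
agreeing = ["42", "42", " 42", "42"]
# Branches that disagree yield higher entropy, so another round is warranted.
disagreeing = ["42", "41", " 42", "40"]
```

Low entropy means the parallel branches have converged on one answer, which is the condition the abstract associates with higher accuracy; high entropy signals that further rounds may still be useful.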
Towards Solving More Challenging IMO Problems via Decoupled Reasoning and Proving
Liang, Zhenwen, Song, Linfeng, Li, Yang, Yang, Tao, Zhang, Feng, Mi, Haitao, Yu, Dong
Automated Theorem Proving (ATP) in formal languages is a foundational challenge for AI. While Large Language Models (LLMs) have driven remarkable progress, a significant gap remains between their powerful informal reasoning capabilities and their weak formal proving performance. Recent studies show that the informal accuracy exceeds 80% while formal success remains below 8% on benchmarks like PutnamBench. We argue this gap persists because current state-of-the-art provers, by tightly coupling reasoning and proving, are trained with paradigms that inadvertently punish deep reasoning in favor of shallow, tactic-based strategies. To bridge this fundamental gap, we propose a novel framework that decouples high-level reasoning from low-level proof generation. Our approach utilizes two distinct, specialized models: a powerful, general-purpose Reasoner to generate diverse, strategic subgoal lemmas, and an efficient Prover to rigorously verify them. This modular design liberates the model's full reasoning potential and bypasses the pitfalls of end-to-end training. We evaluate our method on a challenging set of post-2000 IMO problems, a problem set on which no prior open-source prover has reported success. Our decoupled framework successfully solves 5 of these problems, demonstrating a significant step towards automated reasoning on exceptionally difficult mathematical challenges. To foster future research, we release our full dataset of generated and verified lemmas for a wide range of IMO problems, available at https://tencent-imo.github.io/ .
Humanoid World Models: Open World Foundation Models for Humanoid Robotics
Ali, Muhammad Qasim, Sridhar, Aditya, Matiana, Shahbuland, Wong, Alex, Al-Sharman, Mohammad
Humanoid robots, with their human-like form, are uniquely suited for interacting in environments built for people. However, enabling humanoids to reason, plan, and act in complex open-world settings remains a challenge. World models, models that predict the future outcome of a given action, can support these capabilities by serving as a dynamics model in long-horizon planning and generating synthetic data for policy learning. We introduce Humanoid World Models (HWM), a family of lightweight, open-source models that forecast future egocentric video conditioned on humanoid control tokens. We train two types of generative models, Masked Transformers and Flow-Matching, on 100 hours of humanoid demonstrations. Additionally, we explore architectural variants with different attention mechanisms and parameter-sharing strategies. Our parameter-sharing techniques reduce model size by 33-53% with minimal impact on performance or visual fidelity. HWMs are designed to be trained and deployed in practical academic and small-lab settings, such as 1-2 GPUs.
CoRE: Enhancing Metacognition with Label-free Self-evaluation in LRMs
Li, Haoxi, Bai, Sikai, Zhang, Jie, Guo, Song
Large reasoning models (LRMs) have demonstrated impressive capabilities in domains like mathematics and program synthesis. Despite their strong performance, LRMs often exhibit overthinking -- excessive and redundant reasoning steps that introduce inefficiencies during inference. This phenomenon raises an important question for LRM self-evaluation: How can a model autonomously assess the correctness of its own reasoning trajectory without external labels? To address this, we propose Chain-of-Reasoning Embedding (CoRE), a series of hidden states in latent space to enable label-free self-evaluation on intermediate reasoning steps of LRMs, so as to enhance metacognition abilities for improved reasoning efficiency. By analyzing the geometric properties of the CoRE trajectories, we reveal that redundant reasoning usually presents cyclical fluctuations, which correspond to repetitive and unconscious reflection/exploration. Leveraging this insight, we further introduce a training-free, label-free self-evaluation framework, CoRE-Eval, to detect such patterns and dynamically determine whether to terminate reasoning early. Extensive experiments on mathematical reasoning benchmarks (GSM8K, MATH-500, and AIME) and across model sizes from 7B to 32B demonstrate that CoRE-Eval reduces chain-of-thought length by 13.7% to 33.2% while improving answer accuracy by around 10%, achieving 70.0% accuracy on the challenging AIME benchmark with the 32B model.
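The abstract does not specify how cyclical fluctuations are detected, so the following is only an illustrative sketch of one simple way to flag a trajectory that returns to a direction it visited earlier. It is not CoRE-Eval. The function `first_revisit`, the `min_gap` parameter, and the similarity threshold are all hypothetical, and the toy 2-D vectors stand in for real hidden states.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def first_revisit(trajectory, min_gap=2, threshold=0.95):
    """Return the index of the first state that is nearly parallel to a
    state at least `min_gap` steps earlier, or None if there is none.

    Assumption: a 'cycle' is approximated by a near-duplicate direction.
    """
    for i in range(len(trajectory)):
        for j in range(0, i - min_gap + 1):
            if _cosine(trajectory[i], trajectory[j]) >= threshold:
                return i
    return None


# A toy trajectory that loops back toward its starting direction at step 4.
looping = [[1, 0], [0, 1], [-1, 0], [0, -1], [1, 0.01]]
# A toy trajectory that keeps moving to new directions.
progressing = [[1, 0], [1, 1], [0, 1]]
```

A detector this simple would also fire on trajectories that merely move slowly, so it should be read as a way to picture the geometric intuition rather than as a usable stopping rule.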
BlueLM-2.5-3B Technical Report
Xiong, Baojiao, Chen, Boheng, Wang, Chengzhi, Luo, Daxiong, Xu, Dongsheng, Liu, Dongyang, Yang, Fan, Li, Fangyuan, Teng, Fei, Wang, Feng, Qin, Fukang, Peng, Fuquan, Tan, Guanxin, Wang, Guozhi, Yu, Haibo, Gao, Haohao, Liu, Heng, Yang, Hongbo, Zou, Hongjian, Shen, Houzheng, Meng, Hu, Li, Huan, Tan, Hui, Chen, Jiali, Chen, Jianzhao, Zhu, Jinliang, Wang, Kai, Wu, Lei, Liu, Liangbing, Bian, Liuyang, He, Liyan, Liu, Long, Li, Peiwen, Shi, Penggang, Ding, Qi, Hu, Rui, Cao, Shuai, Ren, Shuai, Peng, Shuang, Xie, Teng, Chen, Weiji, Xiang, Weilin, Wu, Weixin, Yin, Xi, Chen, Xiaoxin, Chen, Xu, Wen, Yafei, Hu, Yan, Yang, Yanzhou, Xie, Yina, Chen, Yinghao, Liao, Yixuan, Geng, Yu, Ouyang, Yuanjiang, Yang, Yuanzhuo, He, Yuehua, Peng, Yushuai, Wang, Zhaoxiong, Wang, Zheng, Zhou, Zhibo, Wu, Ziyang
We present BlueLM-2.5-3B, a compact and unified dense Multimodal Large Language Model (MLLM) designed for efficient edge-device deployment, offering strong general-purpose and reasoning capabilities. To the best of our knowledge, this is the first 3B-scale MLLM to support both thinking and non-thinking modes, while also enabling explicit control over thinking token budget. BlueLM-2.5-3B is developed through diversified data curation, key data resampling, hybrid heterogeneous reinforcement learning, and a high-performance training infrastructure. Our model achieves superior multimodal capacity while preserving competitive pure-text performance with only 2.9 billion parameters. We conduct comprehensive evaluations across a broad range of multimodal and text-only benchmarks. In thinking mode, BlueLM-2.5-3B achieves comparable performance to Qwen3-4B on text-only benchmarks, and trails the larger Kimi-VL-A3B-16B by only about 5% on average across multimodal evaluations. In non-thinking mode, it outperforms Qwen2.5-VL-3B on the majority of multimodal benchmarks. Additionally, BlueLM-2.5-3B exhibits exceptional data efficiency. All of the aforementioned performance is achieved with substantially less total training data than Qwen2.5-VL-3B and Qwen3-4B. We hope our work contributes to the advancement of high-performance, on-device MLLMs and provides meaningful insights to the research community.
Automated Reasoning for Vulnerability Management by Design
For securing systems, it is essential to manage their vulnerability posture and design appropriate security controls. Vulnerability management makes it possible to address vulnerabilities proactively by incorporating pertinent security controls into system designs. Current vulnerability management approaches do not support systematic reasoning about the vulnerability postures of system designs. To effectively manage vulnerabilities and design security controls, we propose a formally grounded automated reasoning mechanism. We integrate the mechanism into an open-source security design tool and demonstrate its application through an illustrative example driven by real-world challenges. The automated reasoning mechanism allows system designers to identify vulnerabilities that are applicable to a specific system design, explicitly specify vulnerability mitigation options, declare selected controls, and thus systematically manage vulnerability postures.
NDAI-NeuroMAP: A Neuroscience-Specific Embedding Model for Domain-Specific Retrieval
Patel, Devendra, Jain, Aaditya, Verma, Jayant, Rajput, Divyansh, Mahala, Sunil, Khapare, Ketki Suresh, Kalla, Jayateja
The exponential growth in neuroscience research output and clinical data necessitates the development of specialized natural language processing models tailored to this domain. Contemporary embedding models, while demonstrating superior performance on general-purpose benchmarks, exhibit suboptimal efficacy when applied to neuroscience-specific tasks due to their broad training objectives and limited exposure to domain-specific terminologies and conceptual relationships. This limitation significantly constrains the development of advanced applications including patient-centric retrieval-augmented generation (RAG) systems and comprehensive electronic health record (EHR) mining for neurological healthcare applications. To address this critical gap, we present NDAI-NeuroMAP, the first neuroscience-domain-specific dense vector embedding model engineered for high-precision information retrieval tasks. Our methodology encompasses the curation of an extensive domain-specific training corpus comprising 500,000 carefully constructed triplets (query-positive-negative configurations), augmented with 250,000 neuroscience-specific definitional entries and 250,000 structured knowledge-graph triplets derived from authoritative neurological ontologies. We employ a sophisticated fine-tuning approach utilizing the FremyCompany/BioLORD-2023 foundation model, implementing a multi-objective optimization framework combining contrastive learning with triplet-based metric learning paradigms. Comprehensive evaluation on a held-out test dataset comprising approximately 24,000 neuroscience-specific queries demonstrates substantial performance improvements over state-of-the-art general-purpose and biomedical embedding models. These empirical findings underscore the critical importance of domain-specific embedding architectures for neuroscience-oriented RAG systems and related clinical natural language processing applications. 
The landscape of natural language processing (NLP) has evolved profoundly over the past decade, driven by advances in neural embedding architectures. These models, which transform text into dense, high-dimensional vectors, now support diverse tasks spanning cross-lingual translation to large-scale information retrieval. Early methods, such as the seminal Word2Vec [1] and GloVe [2], introduced static word embeddings that successfully captured semantic relationships through distributional statistics, but failed to account for context, producing identical vectors for terms like "bank" regardless of meaning. Contextualized embedding architectures subsequently overcame these limitations.
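The triplet-based metric learning mentioned in the NDAI-NeuroMAP abstract can be illustrated with a standard triplet margin loss. This is a generic sketch rather than the authors' training code: the use of cosine distance, the margin value, and the toy 2-D embeddings are assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(query, positive, negative, margin=0.2):
    """Standard triplet margin loss on (query, positive, negative).

    The loss is zero once the positive is closer to the query than the
    negative by at least `margin` (an assumed hyperparameter).
    """
    d_pos = cosine_distance(query, positive)
    d_neg = cosine_distance(query, negative)
    return max(0.0, d_pos - d_neg + margin)


# Well-separated triplet: the positive matches the query, the negative is orthogonal.
good = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# Badly ordered triplet: the negative is closer to the query than the positive.
bad = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

Minimizing this quantity over many triplets pulls each query toward its positive passage and pushes it away from its negative, which is the behavior a retrieval-oriented embedding model needs.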
Domain Knowledge in Artificial Intelligence: Using Conceptual Modeling to Increase Machine Learning Accuracy and Explainability
Storey, V. C., Parsons, J., Castellanos, A., Tremblay, M., Lukyanenko, R., Maass, W., Castillo, A.
Machine learning enables the extraction of useful information from large, diverse datasets. However, despite many successful applications, machine learning continues to suffer from performance and transparency issues. These challenges can be partially attributed to the limited use of domain knowledge by machine learning models. This research proposes using the domain knowledge represented in conceptual models to improve the preparation of the data used to train machine learning models. We develop and demonstrate a method, called Conceptual Modeling for Machine Learning (CMML), which comprises guidelines for data preparation in machine learning and is based on conceptual modeling constructs and principles. To assess the impact of CMML on machine learning outcomes, we first applied it to two real-world problems to evaluate its impact on model performance. We then solicited an assessment by data scientists on the applicability of the method. These results demonstrate the value of CMML for improving machine learning outcomes.
Towards Unified Neurosymbolic Reasoning on Knowledge Graphs
Lin, Qika, Xu, Fangzhi, Lu, Hao, He, Kai, Mao, Rui, Liu, Jun, Cambria, Erik, Feng, Mengling
Knowledge Graph (KG) reasoning has received significant attention in the fields of artificial intelligence and knowledge engineering, owing to its ability to autonomously deduce new knowledge and consequently enhance the availability and precision of downstream applications. However, current methods predominantly concentrate on a single form of neural or symbolic reasoning, failing to effectively integrate the inherent strengths of both approaches. Furthermore, the current prevalent methods primarily focus on addressing a single reasoning scenario, presenting limitations in meeting the diverse demands of real-world reasoning tasks. Unifying the neural and symbolic methods, as well as diverse reasoning scenarios in one model is challenging as there is a natural representation gap between symbolic rules and neural networks, and diverse scenarios exhibit distinct knowledge structures and specific reasoning objectives. To address these issues, we propose a unified neurosymbolic reasoning framework, namely Tunsr, for KG reasoning. Tunsr first introduces a consistent structure of reasoning graph that starts from the query entity and constantly expands subsequent nodes by iteratively searching posterior neighbors. Based on it, a forward logic message-passing mechanism is proposed to update both the propositional representations and attentions, as well as first-order logic (FOL) representations and attentions of each node. In this way, Tunsr conducts the transformation of merging multiple rules by merging possible relations at each step. Finally, the FARI algorithm is proposed to induce FOL rules by constantly performing attention calculations over the reasoning graph. Extensive experimental results on 19 datasets of four reasoning scenarios (transductive, inductive, interpolation, and extrapolation) demonstrate the effectiveness of Tunsr.
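The reasoning-graph construction described in the Tunsr abstract, which starts at the query entity and repeatedly expands to neighboring nodes, can be pictured with the sketch below. It covers only the expansion step; the message passing, attention updates, and the FARI rule induction are not modeled. The function name, the interpretation of "posterior neighbors" as tail entities of outgoing edges, and the toy triples are assumptions.

```python
def expand_reasoning_graph(triples, query_entity, max_steps=2):
    """Layer-by-layer expansion from a query entity over (head, relation, tail) triples.

    Returns the set of entities reached at each step and the edges followed,
    each tagged with the step at which it was traversed.
    """
    neighbors = {}
    for head, relation, tail in triples:
        neighbors.setdefault(head, []).append((relation, tail))

    layers = [{query_entity}]
    edges = []
    for step in range(max_steps):
        frontier = set()
        for node in sorted(layers[-1]):  # sorted for deterministic edge order
            for relation, tail in neighbors.get(node, []):
                edges.append((step, node, relation, tail))
                frontier.add(tail)
        if not frontier:  # nothing left to expand
            break
        layers.append(frontier)
    return layers, edges


# Hypothetical toy knowledge graph.
kg = [
    ("alice", "works_at", "acme"),
    ("acme", "located_in", "paris"),
    ("paris", "capital_of", "france"),
]
layers, edges = expand_reasoning_graph(kg, "alice", max_steps=2)
```

Each path through the returned edges corresponds to a chain of relations, for example `works_at` followed by `located_in`, which is the kind of structure a first-order rule body would be read from.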
NRSeg: Noise-Resilient Learning for BEV Semantic Segmentation via Driving World Models
Li, Siyu, Teng, Fei, Cao, Yihong, Yang, Kailun, Li, Zhiyong, Wang, Yaonan
Our approach is motivated by the potential of leveraging noisy synthetic data from driving world models to enhance BEV semantic segmentation. The proposed method investigates a noise-resilient learning framework designed for handling synthetic data with inherent noise. The generated data from different world models exhibits inconsistent road structures at identical viewpoints. Abstract: Bird's Eye View (BEV) semantic segmentation is an indispensable perception task in end-to-end autonomous driving systems. Unsupervised and semi-supervised learning for BEV tasks, as pivotal for real-world applications, underperform due to the homogeneous distribution of the labeled data. In this work, we explore the potential of synthetic data from driving world models to enhance the diversity of labeled data for robustifying BEV segmentation. Yet, our preliminary findings reveal that generation noise in synthetic data compromises efficient BEV model learning. To fully harness the potential of synthetic data from world models, this paper proposes NRSeg, a noise-resilient learning framework for BEV semantic segmentation. Specifically, a Perspective-Geometry Consistency Metric (PGCM) is proposed to quantitatively evaluate the guidance capability of generated data for model learning. This metric originates from the alignment measure between the perspective road mask of generated data and the mask projected from the BEV labels. This work was supported in part by the National Natural Science Foundation of China (No. U21A20518, No. 61976086, and No. 62473139) and in part by the Open Research Project of the State Key Laboratory of Industrial Control Technology, China (Grant No. ICT2025B20).
Wang are with the School of Robotics and the National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University, Changsha 410082, China (email: kailun.yang@hnu.edu.cn; Cao is with the Key Laboratory of Big Data Research and Application for Basic Education, Hunan Normal University, Changsha 410006, China.
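The NRSeg abstract describes PGCM only as an alignment measure between two road masks and does not give a formula. As a hedged illustration of what such an alignment score could look like, the sketch below uses intersection-over-union (IoU) between two binary masks. IoU is an assumed stand-in, not the paper's metric, and the mask values are invented for the example.

```python
def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two equally sized binary masks (lists of rows).

    Returns 1.0 when both masks are empty, since they then agree trivially.
    """
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return intersection / union if union else 1.0


# Hypothetical road mask segmented from a generated perspective image.
generated_mask = [
    [1, 1, 0],
    [0, 1, 0],
]
# Hypothetical road mask obtained by projecting BEV labels into the same view.
projected_mask = [
    [1, 0, 0],
    [0, 1, 1],
]
score = mask_iou(generated_mask, projected_mask)
```

Under this reading, a generated sample whose road layout agrees with the projected labels would score near 1 and could be trusted more during training, while a low score would mark the sample as noisy.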