Problem Solving
KeyWorld: Key Frame Reasoning Enables Effective and Efficient World Models
Li, Sibo, Hao, Qianyue, Shang, Yu, Li, Yong
Robotic world models are a promising paradigm for forecasting future environment states, yet their inference speed and the physical plausibility of generated trajectories remain critical bottlenecks, limiting their real-world applications. This stems from the redundancy of the prevailing frame-to-frame generation approach, where the model conducts costly computation on similar frames, as well as from its neglect of the semantic importance of key transitions. To address this inefficiency, we propose KeyWorld, a framework that improves text-conditioned robotic world models by concentrating transformer computation on a few semantic key frames while employing a lightweight convolutional model to fill in the intermediate frames. Specifically, KeyWorld first identifies significant transitions by iteratively simplifying the robot's motion trajectories, obtaining the ground truth key frames. Then, a DiT model is trained to reason and generate these physically meaningful key frames from textual task descriptions. Evaluations on the LIBERO benchmark demonstrate that KeyWorld achieves a 5.68× acceleration compared to the frame-to-frame generation baseline, and focusing on the motion-aware key frames further contributes to the physical validity of the generated videos, especially on complex tasks. Our approach highlights a practical path toward deploying world models in real-time robotic control and other domains requiring both efficient and effective world models. Code is released at https://anonymous.4open.science/r/Keyworld-E43D. Robotic world models are generative frameworks that predict future environment states based on an initial observation and a conditioning input (Ding et al., 2024; Agarwal et al., 2025).
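The abstract above says key frames come from iteratively simplifying the robot's motion trajectory but does not name the algorithm. As an illustration only, the sketch below assumes a Ramer-Douglas-Peucker style simplification over end-effector positions. The function names, the `tolerance` parameter, and the choice of RDP are all assumptions, not the authors' implementation.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (any dimension)."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(c * c for c in ab)
    if denom == 0.0:
        # Degenerate segment: a and b coincide.
        return math.sqrt(sum(c * c for c in ap))
    # Projection parameter of p onto a-b, clamped to the segment.
    t = max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    return math.sqrt(sum((pi - (ai + t * c)) ** 2 for pi, ai, c in zip(p, a, ab)))


def key_frame_indices(trajectory, tolerance):
    """Indices kept by RDP simplification (hypothetical key-frame selector).

    trajectory: list of positions, one per video frame.
    tolerance:  maximum deviation allowed before a frame is kept.
    """
    if not trajectory:
        return []
    keep = {0, len(trajectory) - 1}
    stack = [(0, len(trajectory) - 1)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo < 2:
            continue
        dists = [
            point_segment_distance(trajectory[i], trajectory[lo], trajectory[hi])
            for i in range(lo + 1, hi)
        ]
        worst = max(range(len(dists)), key=dists.__getitem__)
        if dists[worst] > tolerance:
            idx = lo + 1 + worst
            keep.add(idx)  # frame deviates most from the straight path
            stack.append((lo, idx))
            stack.append((idx, hi))
    return sorted(keep)


# An L-shaped motion: only the start, the corner, and the end are kept.
path = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
print(key_frame_indices(path, tolerance=0.1))  # [0, 2, 4]
```

Under this assumption, frames on straight segments of the motion are dropped and left to the lightweight interpolation model, while frames where the motion changes direction are kept as key frames.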
Feature Augmentation of GNNs for ILPs: Local Uniqueness Suffices
Han, Qingyu, Li, Qian, Yang, Linxin, Chen, Qian, Shi, Qingjiang, Sun, Ruoyu
Integer Linear Programs (ILPs) are central to real-world optimizations but notoriously difficult to solve. Learning to Optimize (L2O) has emerged as a promising paradigm, with Graph Neural Networks (GNNs) serving as the standard backbone. However, standard anonymous GNNs are limited in expressiveness for ILPs, and the common enhancement of augmenting nodes with globally unique identifiers (UIDs) typically introduces spurious correlations that severely harm generalization. To address this tradeoff, we propose a parsimonious Local-UID scheme based on d-hop uniqueness coloring, which ensures identifiers are unique only within each node's d-hop neighborhood. Building on this scheme, we introduce ColorGNN, which incorporates color information via color-conditioned embeddings, and ColorUID, a lightweight feature-level variant. We prove that for d-layer networks, Local-UIDs achieve the expressive power of Global-UIDs while offering stronger generalization. Extensive experiments show that our approach (i) yields substantial gains on three ILP benchmarks, (ii) exhibits strong OOD generalization on linear programming datasets, and (iii) further improves a general graph-level task when paired with a state-of-the-art method.
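To make the Local-UID idea above concrete, here is a minimal greedy sketch. It assumes "d-hop uniqueness" means that all nodes inside any single node's d-hop neighborhood (the node included) receive distinct colors. The greedy strategy and the names `d_hop_ball` and `local_uid_coloring` are illustrative assumptions, not the paper's algorithm.

```python
from collections import deque


def d_hop_ball(adj, source, d):
    """Set of nodes within d hops of source (source included), via BFS."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == d:
            continue
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return seen


def local_uid_coloring(adj, d):
    """Greedy coloring with distinct colors inside every d-hop ball.

    adj: dict mapping each node to a list of neighbors (undirected graph).
    """
    balls = {v: d_hop_ball(adj, v, d) for v in adj}
    colors = {}
    for u in adj:
        # Nodes that share at least one d-hop ball with u.
        conflicts = set()
        for v in balls[u]:
            conflicts |= balls[v]
        used = {colors[w] for w in conflicts if w in colors}
        c = 0
        while c in used:
            c += 1
        colors[u] = c
    return colors


# Path graph 0-1-2-3-4-5 with d = 1.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
print(local_uid_coloring(path, d=1))  # {0: 0, 1: 1, 2: 2, 3: 0, 4: 1, 5: 2}
```

On this six-node path, three local identifiers suffice where a global scheme would need six, which is the parsimony the abstract refers to.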
StyleBench: Evaluating thinking styles in Large Language Models
Guo, Junyu, Gu, Shangding, Jin, Ming, Spanos, Costas, Lavaei, Javad
The effectiveness of Large Language Models (LLMs) is heavily influenced by the reasoning strategies, or styles of thought, employed in their prompts. However, the interplay between these reasoning styles, model architecture, and task type remains poorly understood. To address this, we introduce StyleBench, a comprehensive benchmark for systematically evaluating reasoning styles across diverse tasks and models. We assess five representative reasoning styles--Chain-of-Thought (CoT), Tree-of-Thought (ToT), Algorithm-of-Thought (AoT), Sketch-of-Thought (SoT), and Chain-of-Draft (CoD)--on five reasoning tasks, using 15 open-source models from major families (LLaMA, Qwen, Mistral, Gemma, GPT-OSS, Phi, and DeepSeek) ranging from 270M to 120B parameters. Our large-scale analysis reveals that no single style is universally optimal. We demonstrate that strategy efficacy is highly contingent on both model scale and task type: search-based methods (AoT, ToT) excel in open-ended problems but require large-scale models, while concise styles (SoT, CoD) achieve radical efficiency gains on well-defined tasks. Furthermore, we identify key behavioral patterns: smaller models frequently fail to follow output instructions and default to guessing, while reasoning robustness emerges as a function of scale. Our findings offer a crucial roadmap for selecting optimal reasoning strategies based on specific constraints. We open-source the benchmark at https://github.com/JamesJunyuGuo/Style_Bench. Large Language Models (LLMs) have demonstrated impressive capabilities across a diverse range of tasks, including mathematical reasoning, code generation, and complex question answering (Imani et al., 2023; Wang & Chen, 2023; Tan et al., 2023). A key insight from prior work is that their performance on challenging problems is not merely a function of scale, but is critically dependent on the methods used to guide reasoning (Huang & Yang, 2025).
This has spurred the development of sophisticated prompting techniques designed to structure the model's internal reasoning process. Notable among these are Chain-of-Thought (CoT) (Wei et al., 2022), which decomposes problems into sequential steps, and more advanced paradigms like Tree-of-Thought (ToT) (Yao et al., 2023), which explores multiple reasoning paths in parallel, and ReasonFlux (Yang et al., 2025b), which employs high-level templates to explore potential solutions. Performance remains highly sensitive to prompt phrasing and frequently necessitates iterative feedback to achieve robust results (Sel et al., 2023). In response, recent work has sought to automate reasoning strategy selection.
Generalist Robot Manipulation beyond Action Labeled Data
Spiridonov, Alexander, Zaech, Jan-Nico, Nikolov, Nikolay, Van Gool, Luc, Paudel, Danda Pani
Recent advances in generalist robot manipulation leverage pre-trained Vision-Language Models (VLMs) and large-scale robot demonstrations to tackle diverse tasks in a zero-shot manner. A key challenge remains: scaling high-quality, action-labeled robot demonstration data, which existing methods rely on for robustness and generalization. To address this, we propose a method that benefits from videos without action labels - featuring humans and/or robots in action - enhancing open-vocabulary performance and enabling data-efficient learning of new tasks. Our method extracts dense, dynamic 3D point clouds at the hand or gripper location and uses a proposed 3D dynamics predictor for self-supervision. This predictor is then tuned to an action predictor using a smaller labeled dataset for action alignment. We show that our method not only learns from unlabeled human and robot demonstrations - improving downstream generalist robot policies - but also enables robots to learn new tasks without action labels (i.e., out-of-action generalization) in both real-world and simulated settings.
Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning
Wang, Guoxin, Zhao, Jun, Liu, Xinyi, Liu, Yanbo, Cao, Xuyang, Li, Chao, Liu, Zhuoyun, Sun, Qintian, Zhou, Fangru, Xing, Haoqiang, Yang, Zhenhong
Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.
CogAtom: From Cognitive Atoms to Olympiad-level Mathematical Reasoning in Large Language Models
Chen, Zhuofan, He, Jiyuan, Zhang, Yichi, Hu, Xing, Wen, Haoxing, Bai, Jun, Rong, Wenge
Mathematical reasoning poses significant challenges for Large Language Models (LLMs) due to its demand for multi-step reasoning and abstract conceptual integration. While recent test-time scaling techniques rely heavily on high-quality, challenging problems, the scarcity of Olympiad-level math problems remains a bottleneck. We introduce CogAtom, a novel cognitive atom-based framework for synthesizing mathematically rigorous and cognitively diverse problems. Unlike prior approaches, CogAtom models problem construction as a process of selecting and recombining fundamental reasoning units, cognitive atoms, extracted from human-authored solutions. A diversity-promoting random walk algorithm enables exploration of the cognitive atom space, while a constraint-based recombination mechanism ensures logical soundness and structural validity. The combinatorial nature of the graph structure provides a near-infinite space of reasoning paths, and the walk algorithm systematically explores this space to achieve large-scale synthesis of high-quality problems; meanwhile, by controlling the number of cognitive atoms, we can precisely adjust problem difficulty, ensuring diversity, scalability, and controllability of the generated problems. Experimental results demonstrate that CogAtom outperforms existing methods in accuracy, reasoning depth, and diversity, generating problems that closely match the difficulty of AIME while exceeding it in structural variation. Our work offers a cognitively grounded pathway toward scalable, high-quality math problem generation. Our code is publicly available at https://github.com/Icarus-1111/CogAtom.
A Chain-of-thought Reasoning Breast Ultrasound Dataset Covering All Histopathology Categories
Yu, Haojun, Li, Youcheng, Niu, Zihan, Zhang, Nan, Gong, Xuantong, Li, Huan, Zou, Zhiying, Qi, Haifeng, Cao, Zhenxiao, Lan, Zijie, Yuan, Xingjian, He, Jiating, Zhang, Haokai, Zhang, Shengtao, Wang, Zicheng, Wang, Dong, Zhao, Ziwei, Chen, Congying, Wang, Yong, Qin, Wangyan, Zhu, Qingli, Wang, Liwei
Breast ultrasound (BUS) is an essential tool for diagnosing breast lesions, with millions of examinations per year. However, publicly available high-quality BUS benchmarks for AI development are limited in data scale and annotation richness. In this work, we present BUS-CoT, a BUS dataset for chain-of-thought (CoT) reasoning analysis, which contains 11,439 images of 10,019 lesions from 4,838 patients and covers all 99 histopathology types. To facilitate research on incentivizing CoT reasoning, we construct the reasoning processes based on observation, feature, diagnosis and pathology labels, annotated and verified by experienced experts. Moreover, by covering lesions of all histopathology types, we aim to facilitate robust AI systems in rare cases, which can be error-prone in clinical practice.
CayleyPy Growth: Efficient growth computations and hundreds of new conjectures on Cayley graphs (Brief version)
Chervov, A., Fedoriaka, D., Konstantinova, E., Naumov, A., Kiselev, I., Sheveleva, A., Koltsov, I., Lytkin, S., Smolensky, A., Soibelman, A., Levkovich-Maslyuk, F., Grimov, R., Volovich, D., Isakov, A., Kostin, A., Litvinov, M., Vilkin-Krom, N., Bidzhiev, A., Krasnyi, A., Evseev, M., Geraseva, E., Grunwald, L., Galkin, S., Koldunov, E., Diner, S., Chevychelov, A., Kudasheva, E., Sychev, A., Kravchenko, A., Kogan, Z., Natyrova, A., Shishina, L., Cheldieva, L., Zamkovoy, V., Kovalenko, D., Papulov, O., Kudashev, S., Shiltsov, D., Turtayev, R., Nikitina, O., Mamayeva, D., Nikolenko, S., Obozov, M., Titarenko, A., Dolgorukova, A., Aparnev, A., Debeaupuis, O., C., S. Alami, Isambert, H.
This is the third paper of the CayleyPy project applying artificial intelligence to problems in group theory. We announce the first public release of CayleyPy, an open source Python library for computations with Cayley and Schreier graphs. Compared with systems such as GAP and Sage, CayleyPy handles much larger graphs and performs several orders of magnitude faster. Using CayleyPy we obtained about 200 new conjectures on Cayley and Schreier graphs, focused on diameters and growth. For many Cayley graphs of symmetric groups Sn we observe quasi-polynomial diameter formulas: a small set of quadratic or linear polynomials indexed by n mod s. We conjecture that this is a general phenomenon, giving efficient diameter computation despite the problem being NP-hard. We propose a refinement of the Babai type conjecture on diameters of Sn: n^2/2 + 4n upper bounds in the undirected case, compared to previous O(n^2) bounds. We also provide explicit generator families, related to involutions in a square with whiskers pattern, conjectured to maximize the diameter; search confirms this for all n up to 15. We further conjecture an answer to a question posed by V M Glushkov in 1968 on directed Cayley graphs generated by a cyclic shift and a transposition. For nilpotent groups we conjecture an improvement of J S Ellenberg's results on upper unitriangular matrices over Z/pZ, showing linear dependence of diameter on p. Moreover, some conjectures are LLM friendly, naturally stated as sorting problems verifiable by algorithms or Python code. To benchmark path finding we created more than 10 Kaggle datasets. CayleyPy works with arbitrary permutation or matrix groups and includes over 100 predefined generators. Our growth computation code outperforms GAP and Sage up to 1000 times in speed and size.
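For readers unfamiliar with the quantities mentioned above, the following plain-Python sketch computes the growth (number of group elements at each word-metric distance from the identity) and the diameter of a small Cayley graph by breadth-first search. It deliberately does not use the CayleyPy API. The helper names `growth` and `adjacent_transpositions` are hypothetical, and the approach only scales to tiny groups.

```python
def growth(generators):
    """Layer sizes of the BFS from the identity in a permutation Cayley graph.

    generators: list of permutations, each a tuple g of length n;
    applying g to a state produces tuple(state[i] for i in g).
    """
    n = len(generators[0])
    identity = tuple(range(n))
    seen = {identity}
    layer = [identity]
    sizes = []
    while layer:
        sizes.append(len(layer))
        next_layer = []
        for state in layer:
            for g in generators:
                new = tuple(state[i] for i in g)
                if new not in seen:
                    seen.add(new)
                    next_layer.append(new)
        layer = next_layer
    return sizes


def adjacent_transpositions(n):
    """Generators swapping positions i and i+1 for i = 0..n-2."""
    gens = []
    for i in range(n - 1):
        p = list(range(n))
        p[i], p[i + 1] = p[i + 1], p[i]
        gens.append(tuple(p))
    return gens


sizes = growth(adjacent_transpositions(4))
print(sizes)           # [1, 3, 5, 6, 5, 3, 1]
print(len(sizes) - 1)  # diameter: 6
```

For adjacent transpositions the distance to the identity equals the number of inversions, so the diameter of Sn is n(n-1)/2, which matches the value 6 for n = 4. The same BFS works for directed generator sets such as a cyclic shift plus a transposition, since it only follows edges forward.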
World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation
Jiang, Zhennan, Liu, Kai, Qin, Yuxin, Tian, Shuai, Zheng, Yupeng, Zhou, Mingcai, Yu, Chao, Li, Haoran, Zhao, Dongbin
Robotic manipulation policies are commonly initialized through imitation learning, but their performance is limited by the scarcity and narrow coverage of expert data. Reinforcement learning can refine policies to alleviate this limitation, yet real-robot training is costly and unsafe, while training in simulators suffers from the sim-to-real gap. Recent advances in generative models have demonstrated remarkable capabilities in real-world simulation, with diffusion models in particular excelling at generation. This raises the question of how diffusion model-based world models can be combined to enhance pre-trained policies in robotic manipulation. In this work, we propose World4RL, a framework that employs diffusion-based world models as high-fidelity simulators to refine pre-trained policies entirely in imagined environments for robotic manipulation. Unlike prior works that primarily employ world models for planning, our framework enables direct end-to-end policy optimization. World4RL is designed around two principles: pre-training a diffusion world model that captures diverse dynamics on multi-task datasets and refining policies entirely within a frozen world model to avoid online real-world interactions. We further design a two-hot action encoding scheme tailored for robotic manipulation and adopt diffusion backbones to improve modeling fidelity. Extensive simulation and real-world experiments demonstrate that World4RL provides high-fidelity environment modeling and enables consistent policy refinement, yielding significantly higher success rates compared to imitation learning and other baselines. More visualization results are available at https://world4rl.github.io/.
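The abstract above does not spell out its two-hot action encoding. As background, the sketch below shows the generic two-hot idea: a continuous scalar is spread over the two nearest of a fixed set of evenly spaced bins, so the encoding is sparse yet decodes back to the exact value. The bin layout, range, and function names are assumptions for illustration and may differ from the scheme used in World4RL.

```python
def two_hot(value, low, high, num_bins):
    """Encode a scalar as weights on its two neighboring bins.

    Bins are evenly spaced on [low, high]; num_bins must be at least 2.
    Values outside the range are clamped.
    """
    value = min(max(value, low), high)
    step = (high - low) / (num_bins - 1)
    pos = (value - low) / step           # fractional bin position
    lower = min(int(pos), num_bins - 2)  # index of the left neighbor
    frac = pos - lower
    vec = [0.0] * num_bins
    vec[lower] = 1.0 - frac
    vec[lower + 1] = frac
    return vec


def decode(vec, low, high):
    """Recover the scalar as the weighted sum of bin values."""
    step = (high - low) / (len(vec) - 1)
    return sum(w * (low + i * step) for i, w in enumerate(vec))


code = two_hot(0.25, low=-1.0, high=1.0, num_bins=5)
print(code)                     # [0.0, 0.0, 0.5, 0.5, 0.0]
print(decode(code, -1.0, 1.0))  # 0.25
```

Each action dimension would be encoded independently under this assumption. Compared with one-hot discretization, the two-hot form avoids rounding the action to the nearest bin.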
Teaching Audio Models to Reason: A Unified Framework for Source- and Layer-wise Distillation
Yang, Runyan, Si, Yuke, Gao, Yingying, Feng, Junlan, Deng, Chao, Zhang, Shilei
While large audio language models excel at tasks like ASR and emotion recognition, they still struggle with complex reasoning due to the modality gap between audio and text as well as the lack of structured intermediate supervision. To address this, we propose a unified knowledge distillation framework to transfer reasoning capabilities from a high-capacity textual teacher model to a student audio model while preserving its acoustic competence. Our method introduces two key dimensions: source-wise distillation, which leverages both textual and acoustic teachers to provide complementary modality-specific supervision; and layer-wise distillation, which aligns teacher signals with appropriate student layers to improve transfer efficiency. This dual-dimensional strategy enables fine-grained control over the distillation process, effectively bridging the gap between symbolic reasoning and speech representations. Experimental results show significant improvements in audio reasoning performance, demonstrating the effectiveness of our framework as a reasoning transfer solution for audio modeling.