Collaborating Authors

Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents

Yang, Zonghan, Wang, Shengjie, Fu, Kelin, He, Wenyang, Xiong, Weimin, Liu, Yibo, Miao, Yibo, Gao, Bofei, Wang, Yejie, Ma, Yingwei, Li, Yanhao, Liu, Yue, Hu, Zhenxing, Zhang, Kaitai, Wang, Shuyi, Chen, Huarong, Sung, Flood, Liu, Yang, Gao, Yang, Yang, Zhilin, Liu, Tianyu

arXiv.org Artificial Intelligence

A contiguous chunk of lines to search for in the existing source code
4. The dividing line: =======
5. The lines to replace into the source code
6. The end of the replace block: >>>>>>> REPLACE

Here is an example:

'''python
### mathweb/flask/app.py
<<<<<<< SEARCH
from flask import Flask
=======
import math
from flask import Flask
>>>>>>> REPLACE
'''

Please note that the *SEARCH/REPLACE* edit REQUIRES PROPER INDENTATION. If you would like to add the line ' print(x)', you must fully write that out, with all those spaces before the code! Wrap the *SEARCH/REPLACE* edit in blocks '''python...'''. The summary of the key differences between the trajectories should be in the thinking part.
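As a rough illustration of how an edit in this SEARCH/REPLACE format could be applied to a file's contents, the following is a minimal sketch and not the tooling used by the paper. The helper name `apply_search_replace`, the regular expression, and the rule that the SEARCH text must occur exactly once are all assumptions made for this example.

```python
import re

# Hypothetical parser for the block markers described in the excerpt.
# Assumes both the SEARCH and REPLACE sections are non-empty.
BLOCK_RE = re.compile(
    r"<<<<<<< SEARCH\n(.*?)\n=======\n(.*?)\n>>>>>>> REPLACE",
    re.DOTALL,
)


def apply_search_replace(source: str, edit: str) -> str:
    """Apply every SEARCH/REPLACE block found in `edit` to `source`."""
    blocks = BLOCK_RE.findall(edit)
    if not blocks:
        raise ValueError("no SEARCH/REPLACE block found")
    for search, replace in blocks:
        # The SEARCH text must match verbatim, indentation included.
        # Requiring a unique match is an assumption of this sketch.
        if source.count(search) != 1:
            raise ValueError("SEARCH text must occur exactly once")
        source = source.replace(search, replace, 1)
    return source


source = "from flask import Flask\n\napp = Flask(__name__)\n"
edit = (
    "<<<<<<< SEARCH\n"
    "from flask import Flask\n"
    "=======\n"
    "import math\n"
    "from flask import Flask\n"
    ">>>>>>> REPLACE"
)
patched = apply_search_replace(source, edit)
print(patched)
```

Because the match is verbatim, a SEARCH section with the wrong indentation fails to apply, which is consistent with the excerpt's insistence on proper indentation.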


AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning

Zhang, Jiayi, Peng, Yiran, Kong, Fanqi, Yang, Cheng, Wu, Yifan, Yu, Zhaoyang, Xiang, Jinyu, Ruan, Jianhao, Wang, Jinlin, Song, Maojia, Liu, HongZhang, Tang, Xiangru, Liu, Bang, Wu, Chenglin, Luo, Yuyu

arXiv.org Artificial Intelligence

Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decreases as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is available at https://github.com/FoundationAgents/AutoEnv.


Plantain: Plan-Answer Interleaved Reasoning

Liang, Anthony, Berant, Jonathan, Fisch, Adam, Goyal, Abhimanyu, Krishna, Kalpesh, Eisenstein, Jacob

arXiv.org Artificial Intelligence

Reasoning models often spend a significant amount of time thinking before they generate a visible response. In the meantime, they do not give the user any hints as to whether their reasoning is on the right track, and do not give the user any recourse to stop and correct them if their reasoning is flawed. This creates a frustrating, but unfortunately common, experience: the user's time is wasted while the model reasons from a false premise that could have easily been corrected. In contrast, human speakers typically perform lightweight, incremental grounding acts to ensure that participants in the conversation are on the same page; here we ask whether language models can learn to leverage a similar type of behavior. With this motivation, we propose interleaved reasoning (IR), in which the model alternates between thinking and surfacing intermediate responses, as an alternative to the standard "think-then-answer" approach. By providing useful information to the user earlier, IR reduces perceived latency, the time a user waits for an initial output, without compromising the quality of the final response. We further introduce a specialization of interleaved reasoning, Plantain (Plan-Thought-Answer Interleaving), where the first intermediate response is an explicit, step-by-step plan for executing the task. This plan-first strategy allows for user intervention and early feedback for subsequent reasoning steps. We demonstrate that Plantain yields a ~6% improvement in pass@1 across several challenging math reasoning and coding benchmarks, while reducing time-to-first-response by over 60% relative to think-then-answer baselines.


Rethinking Intermediate Representation for VLM-based Robot Manipulation

Tang, Weiliang, Gao, Jialin, Pan, Jia-Hui, Wang, Gang, Li, Li Erran, Liu, Yunhui, Ding, Mingyu, Heng, Pheng-Ann, Fu, Chi-Wing

arXiv.org Artificial Intelligence

Vision-Language Models (VLMs) are an important component for enabling robust robot manipulation. Yet, using them to translate human instructions into an action-resolvable intermediate representation often requires a tradeoff between VLM-comprehensibility and generalizability. Inspired by context-free grammar, we design the Semantic Assembly representation named SEAM, by decomposing the intermediate representation into vocabulary and grammar. Doing so leads us to a concise vocabulary of semantically-rich operations and a VLM-friendly grammar for handling diverse unseen tasks. In addition, we design a new open-vocabulary segmentation paradigm with a retrieval-augmented few-shot learning strategy to localize fine-grained object parts for manipulation, achieving the shortest inference time among all state-of-the-art parallel works. Also, we formulate new metrics for action-generalizability and VLM-comprehensibility, demonstrating the compelling performance of SEAM over mainstream representations on both aspects.


The Good, The Bad, and The Hybrid: A Reward Structure Showdown in Reasoning Models Training

Sahoo, Subramanyam

arXiv.org Artificial Intelligence

Reward design is central to reinforcement learning from human feedback (RLHF) and alignment research. In this work, we propose a unified framework to study hard, continuous, and hybrid reward structures for fine-tuning large language models (LLMs) on mathematical reasoning tasks. Using Qwen3-4B with LoRA fine-tuning on the GSM8K dataset, we formalize and empirically evaluate reward formulations that incorporate correctness, perplexity, reasoning quality, and consistency. We introduce an adaptive hybrid reward scheduler that transitions between discrete and continuous signals, balancing exploration and stability. Our results show that hybrid reward structures improve convergence speed and training stability over purely hard or continuous approaches, offering insights for alignment via adaptive reward modeling.
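To make the idea of blending discrete and continuous reward signals concrete, here is a minimal sketch under stated assumptions. It is not the paper's scheduler: the abstract does not specify the mixing rule, the direction of the transition, or the continuous terms. The linear schedule, the numeric-closeness score standing in for the continuous signal, and all function names are hypothetical.

```python
def hard_reward(prediction: str, target: str) -> float:
    """Discrete signal: 1.0 for an exact answer match, otherwise 0.0."""
    return 1.0 if prediction.strip() == target.strip() else 0.0


def continuous_reward(prediction: str, target: str) -> float:
    """Continuous signal in (0, 1]: closeness of numeric answers.

    Illustrative stand-in only; the abstract lists perplexity, reasoning
    quality, and consistency as ingredients but gives no formulas.
    """
    try:
        p, t = float(prediction), float(target)
    except ValueError:
        return 0.0
    return 1.0 / (1.0 + abs(p - t))


def mixing_weight(step: int, total_steps: int) -> float:
    """Weight on the hard reward, rising linearly from 0 to 1 (assumed)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def hybrid_reward(prediction: str, target: str, step: int, total_steps: int) -> float:
    """Blend the two signals with a step-dependent weight."""
    w = mixing_weight(step, total_steps)
    return w * hard_reward(prediction, target) + (1.0 - w) * continuous_reward(prediction, target)


# A near-miss answer earns partial credit early in training and none at the end.
early = hybrid_reward("41", "42", step=0, total_steps=100)
middle = hybrid_reward("41", "42", step=50, total_steps=100)
late = hybrid_reward("41", "42", step=100, total_steps=100)
print(early, middle, late)
```

In this sketch the dense signal dominates early and the strict correctness check dominates late; an adaptive scheduler would presumably set the weight from training statistics instead of the step count.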