 Agents


SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks, e.g., code completion, bug fixing, and document generation. However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides all instances with a runnable environment and its developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set made a 7B model comparable to GPT-4o on the hard split, underscoring the value of its high-quality training data. Code is available at https://github.com/DorothyDUUU/SWE-Dev.
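The abstract reports Pass@3 but does not say how it is computed. As a minimal sketch, assuming the widely used unbiased pass@k estimator (draw n samples per task, count the c samples that pass all unit tests, and estimate the chance that at least one of k samples drawn from them passes), the metric could look like this. The function names `pass_at_k` and `benchmark_pass_at_k` are illustrative and are not taken from the SWE-Dev code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: samples generated, c: samples that pass every unit test,
    k: evaluation budget. Returns 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; results holds (n, c) per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With n = k = 3 the estimate reduces to "did any of the three attempts pass", so a benchmark score is then simply the fraction of tasks solved by at least one of three attempts.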


ML-Master: Towards AI-for-AI via Integration of Exploration and Reasoning

arXiv.org Artificial Intelligence

As AI capabilities advance toward and potentially beyond human-level performance, a natural transition emerges where AI-driven development becomes more efficient than human-centric approaches. A promising pathway toward this transition lies in AI-for-AI (AI4AI), which leverages AI techniques to automate and optimize the design, training, and deployment of AI systems themselves. While LLM-based agents have shown the potential to realize AI4AI, they are often unable to fully leverage the experience accumulated by agents during the exploration of solutions in the reasoning process, leading to inefficiencies and suboptimal performance. To address this limitation, we propose ML-Master, a novel AI4AI agent that seamlessly integrates exploration and reasoning by employing a selectively scoped memory mechanism. This approach allows ML-Master to efficiently combine diverse insights from parallel solution trajectories with analytical reasoning, guiding further exploration without overwhelming the agent with excessive context. We evaluate ML-Master on MLE-Bench, where it achieves a 29.3% average medal rate, significantly surpassing existing methods, particularly in medium-complexity tasks, while accomplishing this superior performance within a strict 12-hour time constraint, half the 24-hour limit used by previous baselines. These results demonstrate ML-Master's potential as a powerful tool for advancing AI4AI.


M-Predictive Spliner: Enabling Spatiotemporal Multi-Opponent Overtaking for Autonomous Racing

arXiv.org Artificial Intelligence

Unrestricted multi-agent racing presents a significant research challenge, requiring decision-making at the limits of a robot's operational capabilities. While previous approaches have either ignored spatiotemporal information in the decision-making process or been restricted to single-opponent scenarios, this work enables arbitrary multi-opponent head-to-head racing while considering the opponents' future intent. The proposed method employs a KF-based multi-opponent tracker to effectively perform opponent ReID by associating them across observations. Simultaneously, spatial and velocity GPR is performed on all observed opponent trajectories, providing predictive information to compute the overtaking maneuvers. This approach has been experimentally validated on a physical 1:10 scale autonomous racing car, achieving an overtaking success rate of up to 91.65% and demonstrating an average 10.13%-point improvement in safety at the same speed as the previous SotA. These results highlight its potential for high-performance autonomous racing.


RecBayes: Recurrent Bayesian Ad Hoc Teamwork in Large Partially Observable Domains

arXiv.org Artificial Intelligence

This paper proposes RecBayes, a novel approach for ad hoc teamwork under partial observability, a setting where agents are deployed on-the-fly to environments where pre-existing teams operate. RecBayes never requires, at any stage, access to the states of the environment or the actions of its teammates. We show that by relying on a recurrent Bayesian classifier trained using past experiences, an ad hoc agent is effectively able to identify known teams and tasks being performed from observations alone. Unlike recent approaches such as PO-GPL (Gu et al., 2021) and FEAT (Rahman et al., 2023), which require at some stage fully observable states of the environment, actions of teammates, or both, or approaches such as ATPO (Ribeiro et al., 2023), which require the environments to be small enough to be modelled tabularly (up to 4.8K states and 1.7K observations in that work), RecBayes handles arbitrarily large spaces while never relying on either states or teammates' actions. Our results in benchmark domains from the multi-agent systems literature, adapted for partial observability and scaled up to 1M states and 2^125 observations, show that RecBayes is effective at identifying known teams and tasks being performed from partial observations alone, and as a result, is able to assist the teams in solving the tasks effectively.


Learning to flock in open space by avoiding collisions and staying together

arXiv.org Artificial Intelligence

The synchronized flight of bird flocks, exemplified by starling murmurations, is perhaps the most striking example of collective behavior in natural systems, and has fascinated scholars for a long time [1]. Evolutionary biologists, for instance, have long debated the advantages of living in groups [2], which should offer increased protection from predation by diluting the individual risk and possibly confusing the attackers by the sheer size of the assembly. Flocking behavior involves a high degree of order in the individual directions of motion [3], and has been reproduced by minimal models of self-propelling particles (SPPs), such as Craig Reynolds' Boids [4] or the celebrated Vicsek model [5], which has long captivated the attention of statistical physicists and played a pivotal role in the birth of the active matter research field. The essential ingredient of these models is the tendency of individual particles to align their direction of motion with those of their local neighbours. In systems with finite density, such as toy models with periodic boundary conditions, this is enough to promote long-range order (even in two spatial dimensions, due to the non-equilibrium nature of self-propelled particles). In open systems, consisting of a finite number of individuals in an open, infinite space, purely alignment-based interactions are however not enough to maintain group cohesion.
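The alignment rule described above can be sketched as one update step of a Vicsek-style model in a periodic box. This is a minimal illustration under stated assumptions, not the method of the paper: each particle adopts the mean heading of all particles within a fixed radius (itself included), adds uniform angular noise, and then moves at constant speed. The names `vicsek_step` and `polar_order`, and the choice to move particles with their updated headings, are assumptions made for this sketch.

```python
import math
import random


def vicsek_step(pos, theta, box, radius, speed, eta, rng):
    """One Vicsek-style update in a periodic square box of side `box`.

    pos: list of (x, y); theta: list of headings in radians;
    eta: noise amplitude, noise is uniform in [-eta/2, eta/2].
    """
    n = len(pos)
    new_theta = []
    for i in range(n):
        sum_cos = sum_sin = 0.0
        for j in range(n):
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            # Minimum-image convention for periodic boundaries.
            dx -= box * round(dx / box)
            dy -= box * round(dy / box)
            if dx * dx + dy * dy <= radius * radius:
                sum_cos += math.cos(theta[j])
                sum_sin += math.sin(theta[j])
        # Mean neighbour direction plus angular noise.
        noise = rng.uniform(-eta / 2.0, eta / 2.0)
        new_theta.append(math.atan2(sum_sin, sum_cos) + noise)
    new_pos = [
        ((x + speed * math.cos(t)) % box, (y + speed * math.sin(t)) % box)
        for (x, y), t in zip(pos, new_theta)
    ]
    return new_pos, new_theta


def polar_order(theta):
    """Polar order parameter: 1 for perfect alignment, near 0 for disorder."""
    n = len(theta)
    mean_cos = sum(math.cos(t) for t in theta) / n
    mean_sin = sum(math.sin(t) for t in theta) / n
    return math.hypot(mean_cos, mean_sin)
```

Removing the modulo wrap gives the open-space setting, where nothing in this rule pulls distant particles back together; that is the cohesion problem the passage points to.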


A Novel Indicator for Quantifying and Minimizing Information Utility Loss of Robot Teams

arXiv.org Artificial Intelligence

The timely exchange of information among robots within a team is vital, but it can be constrained by limited wireless capacity. The inability to deliver information promptly can result in estimation errors that impact collaborative efforts among robots. In this paper, we propose a new metric termed Loss of Information Utility (LoIU) to quantify the freshness and utility of information critical for cooperation. The metric enables robots to prioritize information transmissions within bandwidth constraints. We also propose the estimation of LoIU using belief distributions and accordingly optimize both transmission schedule and resource allocation strategy for device-to-device transmissions to minimize the time-average LoIU within a robot team. A semi-decentralized Multi-Agent Deep Deterministic Policy Gradient framework is developed, where each robot functions as an actor responsible for scheduling transmissions among its collaborators while a central critic periodically evaluates and refines the actors in response to mobility and interference. Simulations validate the effectiveness of our approach, demonstrating an enhancement of information freshness and utility by 98%, compared to alternative methods.


Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation

arXiv.org Artificial Intelligence

However, existing research predominantly concentrates on the security of general large language models, lacking specialized methodologies for establishing safety benchmarks and input moderation tailored to embodied agents. To bridge this gap, this paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents. This framework encompasses the entire pipeline, including taxonomy definition, dataset curation, moderator architecture, model training, and rigorous evaluation. Notably, we introduce EAsafety-Bench, a meticulously crafted safety benchmark engineered to facilitate both the training and stringent assessment of moderators specifically designed for embodied agents. Furthermore, we propose Pinpoint, an innovative prompt-decoupled input moderation scheme that harnesses a masked attention mechanism to effectively isolate and mitigate the influence of functional prompts on moderation tasks. Extensive experiments conducted on diverse benchmark datasets and models validate the feasibility and efficacy of the proposed approach. The results demonstrate that our methodologies achieve an impressive average detection accuracy of 94.58%, surpassing the performance of existing state-of-the-art techniques, alongside an exceptional moderation processing time of merely 0.002 seconds per instance. The source code and datasets can be found at https://github.com/ZihanY


Engineering Resilience: An Energy-Based Approach to Sustainable Behavioural Interventions

arXiv.org Artificial Intelligence

Addressing complex societal challenges, such as improving public health, fostering honesty in workplaces, or encouraging eco-friendly behaviour, requires effective nudges to influence human behaviour at scale. Intervention science seeks to design such nudges within complex societal systems. While interventions primarily aim to shift the system toward a desired state, less attention is given to the sustainability of that state, which we define in terms of resilience: the system's ability to retain the desired state even under perturbations. In this work, we offer a more holistic perspective on intervention design by incorporating a nature-inspired postulate, i.e., that lower energy states tend to exhibit greater resilience, as a regularization mechanism within intervention optimization to ensure that the resulting state is also sustainable. Using a simple agent-based simulation where commuters are nudged to choose eco-friendly options (e.g., cycles) over individually attractive but less eco-friendly ones (e.g., cars), we demonstrate how embedding the lower-energy postulate into intervention design induces resilience. The system energy is defined in terms of the motivators that drive its agents' behaviour. By inherently ensuring that agents are not pushed into actions that contradict their motivators, the energy-based approach helps design effective interventions that contribute to resilient behavioural states.


Learning Dexterous Object Handover

arXiv.org Artificial Intelligence

Object handover is an important skill that we use daily when interacting with other humans. To deploy robots in collaborative settings, such as houses, being able to receive and hand over objects safely and efficiently becomes a crucial skill. In this work, we demonstrate the use of Reinforcement Learning (RL) for dexterous object handover between two multi-finger hands. Key to this task is the use of a novel reward function based on dual quaternions to minimize the rotation distance, which outperforms other rotation representations such as Euler angles and rotation matrices. The robustness of the trained policy is experimentally evaluated by testing w.r.t. objects that are not included in the training distribution, and perturbations during the handover process. The results demonstrate that the trained policy successfully performs this task, achieving a total success rate of 94% in the best-case scenario after 100 experiments, thereby showing the robustness of our policy with novel objects. In addition, the best-case performance of the trained policy decreases by only 13.8% when the other robot moves during the handover, proving that our policy is also robust to this type of perturbation, which is common in real-world object handovers. Code and videos can be found here. With the recent focus on humanoid robots, service robots, and human-robot collaboration, several efforts have been made to teach robots how to perform dexterous manipulation tasks, such as collaborative assembly, package manipulation in logistics, and household chores.
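The abstract does not give the form of its dual-quaternion reward, so the following is only a sketch of the underlying idea of a rotation distance. It assumes orientations are stored as quaternions in (w, x, y, z) order and computes the geodesic angle between two of them; this covers only the rotational part that a dual quaternion carries and is not the authors' reward function. The name `quat_rotation_distance` is hypothetical.

```python
import math


def quat_rotation_distance(q1, q2):
    """Geodesic angle in radians between two rotations given as quaternions.

    q1, q2: (w, x, y, z) tuples; they are normalised here, so inputs need
    not be exactly unit length. Returns a value in [0, pi].
    """
    dot = sum(a * b for a, b in zip(q1, q2))
    norm1 = math.sqrt(sum(a * a for a in q1))
    norm2 = math.sqrt(sum(b * b for b in q2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("quaternions must be non-zero")
    # abs() handles the double cover: q and -q are the same rotation.
    cos_half = min(1.0, abs(dot) / (norm1 * norm2))
    return 2.0 * math.acos(cos_half)
```

A reward term could then penalise this angle between the object's current and target orientation, for example by using its negative; that shaping choice is likewise an assumption for illustration.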


Language-Informed Synthesis of Rational Agent Models for Grounded Theory-of-Mind Reasoning On-The-Fly

arXiv.org Artificial Intelligence

Drawing real-world social inferences usually requires taking into account information from multiple modalities. Language is a particularly powerful source of information in social settings, especially in novel situations where language can provide both abstract information about the environment dynamics and concrete specifics about an agent that cannot be easily visually observed. In this paper, we propose Language-Informed Rational Agent Synthesis (LIRAS), a framework for drawing context-specific social inferences that integrate linguistic and visual inputs. LIRAS frames multimodal social reasoning as a process of constructing structured but situation-specific agent and environment representations, leveraging multimodal language models to parse language and visual inputs into unified symbolic representations, over which a Bayesian inverse planning engine can be run to produce granular probabilistic judgments. On a range of existing and new social reasoning tasks derived from cognitive science experiments, we find that our model (instantiated with a comparatively lightweight VLM) outperforms ablations and state-of-the-art models in capturing human judgments across all domains.